Experimenting with JBIG2 Support for pdfTeX

Introduction

The following is just an informal write-up of my private experiments:

Adobe Systems has defined a new filter, /JBIG2Decode, in its newest PDF format, version 1.4, which allows decoding of image data encoded according to the JBIG2 standard. It seems that Adobe Acroread version 5.0 is the first version to support this feature.

The JBIG2 encoding is for bi-level images only, e.g. scanned text, where it is said to give very high compression ratios, both lossy and lossless. It is especially well suited to compressing multi-page documents, because a global page can hold information shared by all pages. This rather new standard is being worked out by the JBIG Committee. The latest JBIG2 draft standard is available from here as a PDF file.

JBIG2 Data Streams

I don't yet have any program that produces JBIG2 files, but some sample data streams are available from here. There is also a small but working ASCII JBIG2 example in section 3.3.6 of the PDF Reference, which can be typed in and converted to binary, e.g. with some awk tool. It produces two letters 'C', stacked one above the other.
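Instead of awk, the conversion of a typed-in hex example to binary can also be done in a few lines of C. The following is only a minimal sketch: the function name hex_to_bin is made up for illustration, and it assumes that the input contains nothing but hex digits and whitespace (any other character, such as a stream end marker, is reported as an error).

```c
#include <ctype.h>
#include <stddef.h>

/* Convert ASCII hex text (whitespace allowed between digits) to bytes.
   Returns the number of bytes written, or -1 on a non-hex character,
   an odd number of digits, or an output buffer that is too small.
   hex_to_bin is a hypothetical helper, not part of pdfTeX. */
long hex_to_bin(const char *hex, unsigned char *out, size_t outsize)
{
    size_t n = 0;
    int have_high = 0;      /* 1 while a high nibble waits for its partner */
    unsigned high = 0;

    for (; *hex != '\0'; hex++) {
        unsigned char c = (unsigned char) *hex;
        unsigned v;

        if (isspace(c))
            continue;
        if (!isxdigit(c))
            return -1;
        v = isdigit(c) ? (unsigned) (c - '0')
                       : (unsigned) (tolower(c) - 'a' + 10);
        if (!have_high) {
            high = v;
            have_high = 1;
        } else {
            if (n >= outsize)
                return -1;
            out[n++] = (unsigned char) ((high << 4) | v);
            have_high = 0;
        }
    }
    return have_high ? -1 : (long) n;
}
```

The resulting bytes can then be written to a file with fwrite() and used as a test image.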

The Driver

I have experimented with JBIG2 image inclusion in PDF streams generated by pdfTeX, as part of the teTeX bundle, using the freshest beta version at that time (teTeX-src-beta-20020530.tar.gz). pdfTeX already allows JPEG image inclusion, so I could start from the source file writejpg.c. The experimental driver is writejbig2.c. I put it into the pdftexdir directory of the teTeX tree on my Linux PC (Debian 2.2r6), together with the other drivers. A few other files required patching, just to add JBIG2 things analogous to the already existing JPEG things. Here is the list of new/patched files, all in the subdirectory pdftexdir:

writejbig2.c The JBIG2 driver.
writejbig2.c.bz2 The same JBIG2 driver, all DEBUG info removed for better legibility, compressed with bzip2.
writeimg.c JBIG2 additions to readimage(), writeimage(), and deleteimage().
image.h Added struct JBIG2_IMAGE_INFO, macro IMAGE_TYPE_JBIG2, and macro jbig2_ptr(N).
Makefile Added target writejbig2.o. These changes must be made by hand, after the Makefile has been automatically generated by configure.
foo.pdf A file generated by pdfTeX (without compression) from the Datastream Example and Test Sequence, Appendix H of the JBIG2 draft standard. Caveat: Some viewers might crash (some monitors may survive).

The JBIG2 picture files must have the file name extension '.jb2' or '.jbig2'.
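A check of this kind could look roughly like the sketch below. The function name has_jbig2_extension is hypothetical, and the comparison is assumed to be case-sensitive; how the actual driver decides this is not shown here.

```c
#include <string.h>

/* Return 1 if the file name ends in ".jb2" or ".jbig2", else 0.
   Hypothetical helper; assumes a case-sensitive comparison. */
int has_jbig2_extension(const char *name)
{
    const char *dot = strrchr(name, '.');   /* last dot in the name */

    if (dot == NULL)
        return 0;
    return strcmp(dot, ".jb2") == 0 || strcmp(dot, ".jbig2") == 0;
}
```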

Experimenting

The driver as it stands only allows inclusion of one page, preset to number 1. I could test the driver only on the roughly 28 available JBIG2 files, which are of the following types:

non-striped sequential
non-striped random-access
striped random-access

The driver could process all three types. The fresh Linux Acroread, Version x86 linux 5.05 Apr 25 2002, chokes on only one file from the above-mentioned set, 042_13.jbig2, with the message 'Bad error code'. I don't know why. Another problem is that a horizontal black hairline shows above the included images, and it also appears in print. I don't know the origin of this, but similar lines sometimes also appear when JPEG and other types of pictures are included.

Open Points

Avoiding the above-mentioned hairline.

There is no check of .jbig2 file validity. Program pdfTeX might crash completely on a corrupted file (not tested).
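A first, very cheap validity check would be to look at the file header before anything else. The sketch below assumes the file header layout of the JBIG2 draft standard: an 8-byte ID string, followed by a flags byte whose lowest bit is set for sequential and cleared for random-access organisation. The function and constant names are invented for this example, and the check says nothing about whether the segments that follow are intact.

```c
#include <stddef.h>
#include <string.h>

enum { JBIG2_BAD = -1, JBIG2_RANDOM_ACCESS = 0, JBIG2_SEQUENTIAL = 1 };

/* Check the 8-byte ID string at the start of a .jbig2 file and report
   the file organisation from bit 0 of the flags byte that follows.
   Hypothetical helper; buf holds the first len bytes of the file. */
int jbig2_file_organisation(const unsigned char *buf, size_t len)
{
    static const unsigned char id[8] =
        { 0x97, 0x4A, 0x42, 0x32, 0x0D, 0x0A, 0x1A, 0x0A };

    if (len < 9 || memcmp(buf, id, 8) != 0)
        return JBIG2_BAD;
    return (buf[8] & 0x01) ? JBIG2_SEQUENTIAL : JBIG2_RANDOM_ACCESS;
}
```

A driver could refuse the file as soon as this returns JBIG2_BAD, instead of running into undefined territory later.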

Real JBIG2 multi-page inclusion would be nice to have, utilizing the full JBIG2 compression power by using the same global page information for several image objects from the same JBIG2 file.

Determining the segment data length (section 7.2.7 of the JBIG2 draft standard) by detecting two-byte sequences is not supported.
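For orientation, the rule in that section can be sketched as follows, under these assumptions: the segment is an immediate generic region whose length field holds 0xFFFFFFFF; bit 0 of the eighteenth byte of the segment data is the MMR flag; the data ends with 0x00 0x00 (MMR) or 0xFF 0xAC (arithmetic coding), followed by a four-byte row count. The function name is hypothetical, and the sketch simply takes the first occurrence of the end sequence after the eighteenth byte, which should be checked against the standard before relying on it.

```c
#include <stddef.h>

/* Find the length of the data part of an immediate generic region
   segment whose header gives the length as unknown (0xFFFFFFFF).
   data points at the start of the segment data, avail is the number
   of bytes available from there. Returns the data length including
   the trailing 4-byte row count, or 0 if no end sequence is found.
   Hypothetical helper, simplified: first match wins. */
size_t jbig2_unknown_data_length(const unsigned char *data, size_t avail)
{
    unsigned char m0, m1;
    size_t i;

    if (avail < 18)
        return 0;

    /* Index 17 is the eighteenth byte: generic region flags, bit 0 = MMR. */
    if (data[17] & 0x01) {
        m0 = 0x00; m1 = 0x00;       /* MMR coding */
    } else {
        m0 = 0xFF; m1 = 0xAC;       /* arithmetic coding */
    }

    /* The end sequence must be followed by 4 bytes of row count. */
    for (i = 18; i + 6 <= avail; i++) {
        if (data[i] == m0 && data[i + 1] == m1)
            return i + 2 + 4;
    }
    return 0;
}
```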

To do more, I would have to understand the JBIG2 standard :-)

Lots of fprintf(stderr,) statements in the driver code. Well, it's experimental.

It is unclear to me in which funny way Pascal variables are related to C variables in the web2c system. The underscores seem to have a magic role there...

End Remark

For now, this is the result of about one week's evenings/nights of experimenting. It was just fun to dig (for the first time) a little into PDF objects and the pdfTeX driver interface, to see that additions like a new image driver can be done rather easily through pdfTeX's very practical driver interface, without changing the pdftex.web code, and finally to see Acroread even showing the result on paper.

This page was first put online on 2 June 2002.

News

13 Nov. 2002: pdfcrypting removed. Tested with teTeX version teTeX-beta-20021112.
08 Dec. 2002: bug in page 0 stream writing repaired. Strategy for multiple page inclusion from the same JBIG2 file: When writing the 1st image, create a fresh PDF object for page 0, and include any page 0 segments from the complete file (even if these segments are not needed for the image). When writing the next image, check by filename comparison whether the PDF object for page 0 of this JBIG2 file has already been written. The driver can only remember the file name of the directly preceding JBIG2 image (but images of other types might come in between). If such a page 0 PDF object exists, reference it. Else create a fresh one. Tested with teTeX version teTeX-beta-20021116. It seems that XPDF 2.0 can display the three pages of foo.pdf.
09 Dec. 2002: JBIG2 segment page numbers > 0 are now set to 1, see PDF Reference.
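
The page 0 strategy from the 08 Dec. 2002 entry can be illustrated by a small sketch. This is not the code from writejbig2.c: the names page0_object_for, last_jbig2_name and last_page0_objnum are invented, and PDF object numbers are assumed to be plain ints handed in by the caller. It only shows the "remember the direct predecessor" logic.

```c
#include <string.h>

#define MAX_NAME 1024

/* File name and page 0 object number of the JBIG2 image written last.
   An empty name means that nothing has been remembered yet. */
static char last_jbig2_name[MAX_NAME];
static int last_page0_objnum;

/* Return the PDF object number to reference for page 0 of file name.
   new_objnum is the number the caller would use for a fresh object.
   *fresh is set to 1 if the caller must write that object now,
   and to 0 if the object of the preceding JBIG2 image is reused. */
int page0_object_for(const char *name, int new_objnum, int *fresh)
{
    if (last_jbig2_name[0] != '\0' && strcmp(last_jbig2_name, name) == 0) {
        *fresh = 0;
        return last_page0_objnum;
    }
    /* Different file (or first image): remember it, forget the old one. */
    strncpy(last_jbig2_name, name, MAX_NAME - 1);
    last_jbig2_name[MAX_NAME - 1] = '\0';
    last_page0_objnum = new_objnum;
    *fresh = 1;
    return new_objnum;
}
```

Because only one name is kept, alternating between two JBIG2 files writes a fresh page 0 object each time; a small table keyed by file name would avoid that at the cost of a little more bookkeeping.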