By Patrick Gundlach

Debugging PDF files

Categories: Development, PDF

While developing the speedata Publisher, I have to create PDF instructions to, for example, draw shapes, create accessibility data structures and embed files. For boxes and glue, I have to create a PDF file from scratch. Once in a while I make mistakes and the PDF file cannot be displayed in the viewer. Then I need to look into the PDF file and check manually where the problem is. Adobe Acrobat, for example, shows an error message.

Binary or not binary

What kind of file format is PDF? Is it a binary or a text file format? I’d say it is both. Normally PDF files are compressed, and text editors or text viewers like “less” either refuse to show the file or show garbled data.

If you click on “Open Anyway” in the editor, you only see garbled data, which is not helpful. Luckily there are tools to decompress the file. My favorite (a command line tool) is qpdf. The syntax for decompressing everything is

qpdf --qdf --object-streams=disable myfile.pdf uncompressed.pdf

Now the PDF looks much more friendly:

%PDF-1.6
%����
%QDF-1.0

%% Original object ID: 17 0
1 0 obj
<<
  /Lang (en)
  /PageMode /UseNone
  /Pages 3 0 R
  /Type /Catalog
>>
endobj
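The compressed file looks garbled in an editor because objects and streams are typically stored with Flate (zlib) compression. As a minimal illustration, independent of qpdf, the following Python sketch compresses a catalog dictionary the way a /FlateDecode stream stores its data and then restores it. The sample dictionary is an assumption made for this example; locating streams in a real file needs a proper parser.

```python
import zlib

# A made-up catalog dictionary, as it might be stored inside an object stream.
readable = b"<< /Lang (en) /PageMode /UseNone /Pages 3 0 R /Type /Catalog >>"

# /FlateDecode data is zlib/deflate compressed, which zlib.compress produces.
compressed = zlib.compress(readable)

# These are the bytes a text editor has to cope with: not readable as text.
print(compressed[:20])

# Decompressing restores the plain-text PDF syntax.
restored = zlib.decompress(compressed)
print(restored.decode("latin-1"))
```

The second print shows the original dictionary again, which is what qpdf does for every compressed stream in the file.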

You can now see all the objects of the PDF file and check them for errors. Sometimes there are very subtle errors such as:

1 0 obj
<<
  /Lang (en)
  /PageMode UseNone
  /Pages 3 0 R
  /Type /Catalog
>>
endobj

Have you spotted the mistake? The name UseNone in line 4 is missing the / in front.

qpdf would have issued a warning in this case:

 WARNING: publisher.pdf object stream 3
 (object 17 0, offset 1157):
 unknown token while reading object; treating as string
 qpdf: operation succeeded with warnings;
 resulting file may have some problems
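A check of this kind can be approximated with a few lines of code. The following Python sketch is not how qpdf works internally. It is a simplified illustration that tokenizes a decompressed object line by line and flags bare words that are neither numbers, names, strings nor known keywords. The function name `find_bare_words` and the keyword list are assumptions made for this example, and the sketch ignores stream contents, hex strings and nested parentheses.

```python
import re

# Bare words that may legitimately appear in an object body.
KEYWORDS = {"obj", "endobj", "R", "true", "false", "null", "stream", "endstream"}
DELIMITERS = {"<<", ">>", "[", "]"}

# One alternative per token type, tried from top to bottom.
TOKEN = re.compile(
    r"""
      %.*                        # comment up to the end of the line
    | \((?:[^()\\]|\\.)*\)       # literal string (nesting is not handled)
    | <<|>>|\[|\]                # dictionary and array delimiters
    | /[^\s/\[\]<>()%]*          # name such as /UseNone
    | [^\s/\[\]<>()%]+           # number, keyword or bare word
    """,
    re.VERBOSE,
)

NUMBER = re.compile(r"[+-]?(?:\d+\.?\d*|\.\d+)")


def is_valid(tok):
    """True for comments, strings, names, delimiters, keywords and numbers."""
    return (
        tok[0] in "%(/"
        or tok in DELIMITERS
        or tok in KEYWORDS
        or NUMBER.fullmatch(tok) is not None
    )


def find_bare_words(text):
    """Return (line number, token) for every token that is not valid."""
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for tok in TOKEN.findall(line):
            if not is_valid(tok):
                problems.append((lineno, tok))
    return problems


sample = """1 0 obj
<<
  /Lang (en)
  /PageMode UseNone
  /Pages 3 0 R
  /Type /Catalog
>>
endobj
"""
print(find_bare_words(sample))
```

For the broken catalog object above, the sketch flags `UseNone` on line 4, the same token qpdf warns about.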

Other tools

But there are other tools to check the PDF:

pdfcpu

pdfcpu validate publisher.pdf

finds this error as well:

validating(mode=relaxed) publisher.pdf ...
decodeObjectStreamObjects: problem decoding object stream 3
: strconv.ParseFloat: parsing "UseNone": invalid syntax

VeraPDF

VeraPDF comes in two flavors: a command line tool and the same with a graphical user interface.

VeraPDF is good for checking against a specific PDF standard such as PDF/A-1, but in this simple case (a missing slash in front of UseNone) it failed to find the problem. It did report other problems, because the PDF is not PDF/A-1 compliant.

Adobe Acrobat

I don’t need to explain what Adobe Acrobat is. It is a very feature-rich PDF editor that you can currently subscribe to for 19.99 USD. It can also check the syntax and validate against PDF standards.

In the case of the missing slash (see above) this tool is not helpful.

Logic errors

If there is a logic problem in the file format, Acrobat is a bit more helpful. For example, the object for a hyperlink should look like this:

<<
  /A <<
    /S /URI
    /Type /Action
    /URI (https://www.speedata.de)
  >>
  /Border [ 0 0 0 ]
  /Rect [ 28.346 801.543 107.326 813.543 ]
  /Subtype /Link
  /Type /Annot
>>

When you have a wrong type (/Type /XAction for example), the syntax of the PDF file is correct, but the logical structure is not.

Adobe Acrobat complains and shows exactly where the problem is.

Also pdfcpu can find the logic error:

$ pdfcpu validate publisher.pdf
validating(mode=relaxed) publisher.pdf ...
validation error (obj#:8): pdfcpu: validateNameEntry:
dict=actionDict entry=Type invalid dict entry: XAction
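A logic check like this operates on the parsed object rather than on the raw text. The following Python sketch is a hedged illustration and not pdfcpu's implementation. It assumes the link annotation has already been parsed into nested Python dictionaries with names kept as strings like "/Action". The function name `check_link_annotation` and the message texts are hypothetical.

```python
def check_link_annotation(annot):
    """Return a list of human-readable problems found in a link annotation."""
    problems = []

    # /Type is optional in an annotation, but if present it must be /Annot.
    if annot.get("/Type", "/Annot") != "/Annot":
        problems.append(f"annotation: /Type must be /Annot, found {annot['/Type']}")
    if annot.get("/Subtype") != "/Link":
        problems.append("annotation: /Subtype must be /Link")

    action = annot.get("/A")
    if action is not None:
        # /Type is optional in an action dictionary, but if present it must be /Action.
        if action.get("/Type", "/Action") != "/Action":
            problems.append(
                f"action dictionary: /Type must be /Action, found {action['/Type']}"
            )
        # A URI action needs the target in its /URI entry.
        if action.get("/S") == "/URI" and "/URI" not in action:
            problems.append("action dictionary: URI action without /URI entry")
    return problems


link = {
    "/A": {"/S": "/URI", "/Type": "/XAction", "/URI": "https://www.speedata.de"},
    "/Border": [0, 0, 0],
    "/Rect": [28.346, 801.543, 107.326, 813.543],
    "/Subtype": "/Link",
    "/Type": "/Annot",
}
for problem in check_link_annotation(link):
    print(problem)
```

The point of the sketch is that every entry is syntactically fine, so only a rule about allowed values can catch the wrong type.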

Checking for conformance

Adobe Acrobat and veraPDF can check for conformance to some PDF standards.

veraPDF only validates conformance to the PDF/A (archiving) and PDF/UA (accessibility) standards, whereas Adobe Acrobat also checks conformance to many other PDF standards such as PDF/X (graphics exchange).

Two specialized validators are worth mentioning. The PDF accessibility checker (Windows only) tests for PDF/UA compliance.

The ZUGFeRD validation portal validates against the electronic invoice standard used in the EU.

PDF 2.0

Although the PDF 2.0 standard is a few years old now, very few validators can handle it. Currently only pdfcpu can validate PDF 2.0 files, and it comes with a disclaimer that this support is still limited.

Conclusion

Usually I use a variety of tools to validate my PDF files. First I decompress the file with qpdf, then I look into the file with a text editor. If I can’t find an error while looking at the text, I open the file in Adobe Acrobat and try the preflight tool with the “syntax check” profile. pdfcpu is also handy. veraPDF has the advantage that it reports the source of the error together with a reference to the part of the ISO standard that describes the correct way.

For special cases I use the PDF accessibility checker or the ZUGFeRD validation portal.
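The command line part of this routine can be scripted. The following Python sketch only wraps the two commands shown earlier in this article (the qpdf decompression call and `pdfcpu validate`). The function names and the file names are assumptions made for this example, and tools that are not installed are skipped rather than treated as errors.

```python
import shutil
import subprocess


def build_commands(pdf_path, qdf_path):
    """Return the two command lines from this article as argument lists."""
    return [
        ["qpdf", "--qdf", "--object-streams=disable", pdf_path, qdf_path],
        ["pdfcpu", "validate", pdf_path],
    ]


def run_checks(pdf_path, qdf_path):
    """Run each installed tool and collect (command, exit code, output)."""
    results = []
    for cmd in build_commands(pdf_path, qdf_path):
        if shutil.which(cmd[0]) is None:
            results.append((cmd, None, f"{cmd[0]} not found on PATH"))
            continue
        proc = subprocess.run(cmd, capture_output=True, text=True)
        results.append((cmd, proc.returncode, proc.stdout + proc.stderr))
    return results


if __name__ == "__main__":
    # Hypothetical file names; adjust them to the file under test.
    for cmd, code, output in run_checks("publisher.pdf", "uncompressed.pdf"):
        print(" ".join(cmd), "->", code)
        print(output)
```

The warnings and validation errors of both tools end up in one place, and the decompressed file is ready for a look in the text editor.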