By Patrick Gundlach

What is PDF? Part 6 - Tagged PDF

Categories: Development

This sixth part of the mini-series about PDF covers tagged PDF, which is used for accessible PDF documents.

Part 1 – PDF syntax and file structure
Part 2 – Fonts
Part 3 - Vector graphics
Part 4 - Interactive features
Part 5 - Metadata
Part 6 - Tagged PDF

Please note that all of these examples were created manually. If you wish to experiment with them, https://github.com/speedata/fixxref provides a small program that supports manual PDF editing.

What is tagged PDF?

Tagged PDF enriches a PDF with semantic structure such as sections, paragraphs, tables of contents, captions and similar elements. The structure types are called roles and are nested hierarchically: a document can contain several sections, each section can contain several paragraphs, and so on.

The difficulty is that PDF is about visual appearance and not about semantic structure. The hard work is therefore to tell the PDF renderer (or any other PDF processor) which visual part belongs to which semantic structure.

Since these “inside PDF” posts are about examples, I create a two-line document with a level 1 heading and a paragraph:

You might be able to recognize which line is the heading and which line is the paragraph, but the computer would have difficulties. Therefore I add a structure to the PDF so it has the following semantics:

And all these items will be part of the document without further division into sections or parts:

There are a few tasks now:

  1. Add a root structure object
  2. Create a Document / H1 / P hierarchy
  3. Add markup to the two lines
  4. Connect the markup to the hierarchy from item 2
  5. Add information to the Page object that allows the viewer (or a screen reader) to quickly find the structure objects for that page

Add a root structure object

The PDF consuming application must be able to find the structure tree, which is linked from the document catalog:

1 0 obj
<<
    /Type /Catalog
    /Pages 2 0 R
    /Metadata 6 0 R
    /MarkInfo << /Marked true >>
    /Lang (en)
    /StructTreeRoot 7 0 R
    /ViewerPreferences <<
        /DisplayDocTitle true
    >>
>>
endobj

The structure root (object 7) looks like this:

7 0 obj
<<
  /Type /StructTreeRoot
  /K 8 0 R
  /ParentTree 11 0 R
>>
endobj

It has one child (/K, object 8) and a number tree which has information about all marked-content objects appearing on the pages:

11 0 obj
<<
    /Nums [ 0 [ 9 0 R 10 0 R ] ]
>>
endobj

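The entry with key 0 is an array that refers to objects 9 and 10. The position in the array is the marked-content identifier, so identifier 0 belongs to object 9 and identifier 1 to object 10. The key 0 is what the /StructParents entry in the Page dictionary points to.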
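The lookup a PDF processor performs with this number tree can be sketched in a few lines of Python. This is only a minimal sketch under assumptions: the number tree is modelled as a flat Python list that mirrors /Nums, references such as "9 0 R" are reduced to plain object numbers, and the function name struct_elem_for is made up for illustration.

```python
# Flat model of /Nums from object 11: key 0 maps to an array that is
# indexed by marked-content identifier. References such as "9 0 R" are
# reduced to object numbers (an assumption of this sketch).
NUMS = [0, [9, 10]]


def struct_elem_for(nums, struct_parents, mcid):
    """Return the object number of the structure element that owns `mcid`
    on the page whose /StructParents value is `struct_parents`."""
    # /Nums is a flat list of alternating keys and values.
    for key, value in zip(nums[0::2], nums[1::2]):
        if key == struct_parents:
            return value[mcid]
    raise KeyError(f"no parent tree entry for key {struct_parents}")


print(struct_elem_for(NUMS, 0, 0))  # 9, the first structure element
print(struct_elem_for(NUMS, 0, 1))  # 10, the second structure element
```

A real number tree can be split into several nodes with /Kids and /Limits entries; the flat list is enough for a one-page document like this one.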

Creating a document structure

A typical structure element looks like this:

8 0 obj
<<
    /Type /StructElem
    /K [ 9 0 R  10 0 R]
    /S /Document
    /T (tagged PDF demo)
    /P 7 0 R
>>
endobj

It has two children (objects 9 and 10) and the role “Document” (/S). Its parent (/P) is object 7, and it carries an optional title (/T).

The leaves in the structure tree have a different entry for /K:

9 0 obj
<<
    /Type /StructElem
    /K 0
    /Pg 3 0 R
    /P 8 0 R
    /S /H1
    /T (A short story)
>>
endobj

This leaf represents the heading. It refers to the marked-content identifier 0 (the /K entry), which is described in the next section.
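To see how the pieces fit together, the structure tree of this document can be walked recursively. The following Python sketch is an illustration, not a PDF parser: objects 7 to 10 are typed in by hand as dictionaries, references are written as ("ref", n) tuples, every /K value is written as a list, and the names OBJECTS and outline are hypothetical.

```python
# Objects 7 to 10 of the example, modelled as plain dictionaries.
# Integers in "K" stand for marked-content identifiers,
# ("ref", n) stands for the indirect reference "n 0 R".
OBJECTS = {
    7: {"Type": "StructTreeRoot", "K": [("ref", 8)]},
    8: {"Type": "StructElem", "S": "Document", "K": [("ref", 9), ("ref", 10)]},
    9: {"Type": "StructElem", "S": "H1", "K": [0]},
    10: {"Type": "StructElem", "S": "P", "K": [1]},
}


def outline(objects, num, depth=0):
    """Return indented lines describing the structure tree below `num`."""
    node = objects[num]
    # Structure elements have a role (/S); the root only has a /Type.
    label = node.get("S", node["Type"])
    lines = ["  " * depth + label]
    for kid in node["K"]:
        if isinstance(kid, int):
            lines.append("  " * (depth + 1) + f"MCID {kid}")
        else:
            lines.extend(outline(objects, kid[1], depth + 1))
    return lines


print("\n".join(outline(OBJECTS, 7)))
```

The printed outline shows the root, the Document element below it, and the H1 and P leaves with their marked-content identifiers.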

Adding markup to the text

The marked-content identifier (MCID) 0 can be found in the page's content stream (object 4):

4 0 obj
<<
    /Length 252
>>
stream

/H1<</MCID 0>>
BDC
  BT
    /F1 14 Tf
    10 100 Td
    12 TL
    (A short story) Tj
  ET
EMC
...

Content is marked up with the BDC and EMC operators. BDC takes two operands: the tag name and a properties dictionary. In this case everything up to the matching EMC belongs to the marked-content ID 0.
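A rough way to check the markup is to pull the tag, the identifier and the text out of the stream again. The Python sketch below does this with regular expressions for the two marked-content sequences of this document's page stream (comments removed). It is an approximation under assumptions: it ignores string escapes, nested parentheses and nested marked content, which a real content stream parser has to handle. The names STREAM, PATTERN and marked_content are made up for this example.

```python
import re

# The page content stream of this document, without the comments.
STREAM = """
/H1<</MCID 0>>
BDC
  BT
    /F1 14 Tf
    10 100 Td
    12 TL
    (A short story) Tj
  ET
EMC
/P<</MCID 1>>
BDC
  BT
    /F1 12 Tf
    10 90 Td
    12 TL
    (Once upon a time)'
  ET
EMC
"""

# Tag name, properties dictionary with an MCID, BDC, body, EMC.
PATTERN = re.compile(
    r"/(\w+)\s*<<\s*/MCID\s+(\d+)\s*>>\s*BDC(.*?)EMC", re.DOTALL
)


def marked_content(stream):
    """Return (tag, mcid, text) for every BDC ... EMC sequence."""
    result = []
    for tag, mcid, body in PATTERN.findall(stream):
        # Collect literal strings; escapes and nested parentheses are ignored.
        text = "".join(re.findall(r"\(([^)]*)\)", body))
        result.append((tag, int(mcid), text))
    return result


for entry in marked_content(STREAM):
    print(entry)
```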

Connecting the text object to the structure element

Actually, this is already done by the /K 0 entry in object 9 above. In real-world documents a structure element can have more than one child, in which case /K is an array. You can even mix numbers representing marked-content IDs with references to StructElem objects.
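Because /K can be a single value or a mixed array, code that reads it usually normalizes it first. Here is a small Python sketch of that idea. It assumes references are given as strings like "9 0 R", the function name kids is made up, and object 12 in the last call is hypothetical and not part of this document. Other kinds of children that /K can hold, such as marked-content reference dictionaries and object reference dictionaries, are left out.

```python
def kids(k):
    """Normalize a /K value to a list of ("mcid", n) or ("elem", n) pairs."""
    items = k if isinstance(k, list) else [k]
    result = []
    for item in items:
        if isinstance(item, int):
            # A bare integer is a marked-content identifier.
            result.append(("mcid", item))
        else:
            # Anything else is treated as a reference written as "9 0 R".
            result.append(("elem", int(item.split()[0])))
    return result


print(kids(0))                     # object 9: a single MCID
print(kids(["9 0 R", "10 0 R"]))   # object 8: two structure elements
print(kids([0, "12 0 R", 1]))      # hypothetical mixed case
```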

Add information for the page

The last step is necessary to associate the items on the page with the Page object.

The Page object should have a pointer to a number tree entry:

3 0 obj
<<
    /Type /Page
    /MediaBox [ 0 0 200 200 ]
    /Contents 4 0 R
    /Parent 2 0 R
    /Resources << /Font << /F1 5 0 R  >>  >>
    /StructParents 0
>>
endobj

Here the entry /StructParents points to the entry 0 in the number tree. See the /StructTreeRoot object which contains this number tree (object 11).
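Since the same relation is stored in two directions (structure element to MCID, and page to structure element through the number tree), it is easy to get them out of sync when editing by hand. The following Python sketch checks that both directions agree. It is a simplified model, not a validator: the objects are typed in as dictionaries, references are plain object numbers, each element has exactly one MCID as in this document, and the names PAGE, PARENT_TREE, ELEMS and check_page are hypothetical.

```python
# Simplified model of the example: object numbers instead of references.
PAGE = {"num": 3, "StructParents": 0, "mcids": [0, 1]}
PARENT_TREE = {0: [9, 10]}  # /Nums of object 11 as a dictionary
ELEMS = {
    9: {"S": "H1", "K": 0, "Pg": 3},
    10: {"S": "P", "K": 1, "Pg": 3},
}


def check_page(page, parent_tree, elems):
    """Return a list of problems found when resolving the page's MCIDs."""
    parents = parent_tree.get(page["StructParents"])
    if parents is None:
        return [f"no parent tree entry {page['StructParents']}"]
    problems = []
    for mcid in page["mcids"]:
        if mcid >= len(parents):
            problems.append(f"MCID {mcid} has no parent tree slot")
            continue
        num = parents[mcid]
        elem = elems[num]
        if elem["K"] != mcid:
            problems.append(f"object {num} does not list MCID {mcid}")
        if elem["Pg"] != page["num"]:
            problems.append(f"object {num} points to another page")
    return problems


print(check_page(PAGE, PARENT_TREE, ELEMS))  # an empty list means consistent
```

Swapping the two references in the number tree, for example, makes the check report both elements.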

Now the PDF has the desired structure:

Is this document now fully compliant?

No, it is not, because ISO 14289-1:2012 (PDF/UA-1) requires fonts to be embedded. Therefore the PDF accessibility checker complains:

Other than that, everything is just fine! Yay!
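The missing font embedding can be spotted in the font dictionary itself: an embedded simple font has a /FontDescriptor with a /FontFile, /FontFile2 or /FontFile3 entry, and object 5 has no descriptor at all. A minimal Python sketch of that test, assuming the font dictionary is already available as a Python dict and covering simple fonts only (the name is_embedded is made up, and this is not how any particular checker is implemented):

```python
FONT_FILE_KEYS = ("FontFile", "FontFile2", "FontFile3")


def is_embedded(font):
    """Return True if a simple font dictionary carries a font program."""
    descriptor = font.get("FontDescriptor", {})
    return any(key in descriptor for key in FONT_FILE_KEYS)


# Object 5 of the example: a standard font without a font descriptor.
helvetica = {"Type": "Font", "Subtype": "Type1", "BaseFont": "Helvetica"}
print(is_embedded(helvetica))  # False
```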

The full example

You can find the source code online.

%PDF-1.7
%··

1 0 obj
<<
    /Type /Catalog
    /Pages 2 0 R
    /Metadata 6 0 R
    /MarkInfo << /Marked true >>
    /Lang (en)
    /StructTreeRoot 7 0 R
    /ViewerPreferences <<
        /DisplayDocTitle true
    >>
>>
endobj

2 0 obj
<<
    /Type /Pages
    /Kids [ 3 0 R ]
    /Count 1
>>
endobj

3 0 obj
<<
    /Type /Page
    /MediaBox [ 0 0 200 200 ]
    /Contents 4 0 R
    /Parent 2 0 R
    /Resources << /Font << /F1 5 0 R  >>  >>
    % entry 0 in the num index in object 11
    /StructParents 0
>>
endobj

4 0 obj
<<
    /Length 252
>>
stream
% role = H1, markup content id 0
/H1<</MCID 0>>
BDC
  BT
    /F1 14 Tf
    10 100 Td
    12 TL
    (A short story) Tj
  ET
EMC
% role = P, markup content id 1
/P<</MCID 1>>
BDC
  BT
    /F1 12 Tf
    10 90 Td
    12 TL
    (Once upon a time)'
  ET
EMC
endstream
endobj

5 0 obj
<<
    /Type     /Font
    /Subtype  /Type1
    /BaseFont /Helvetica
>>
endobj

6 0 obj
<<
    /Type /Metadata
    /Subtype /XML
    /Length 1478
>>
stream
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
       <x:xmpmeta xmlns:x="adobe:ns:meta/">
      <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
        <rdf:Description rdf:about="" xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/">
          <xmpMM:DocumentID>uuid:e1fd5f92-3cb5-4e33-ced8-cb914873004d</xmpMM:DocumentID>
          <xmpMM:InstanceID>uuid:343be305-e7e9-4eb9-ce04-3735d12f2fb2</xmpMM:InstanceID>
        </rdf:Description>
        <rdf:Description rdf:about="" xmlns:pdfuaid="http://www.aiim.org/pdfua/ns/id/">
          <pdfuaid:part>1</pdfuaid:part>
        </rdf:Description>
        <rdf:Description rdf:about="" xmlns:xmp="http://ns.adobe.com/xap/1.0/">
           <xmp:CreateDate>2024-04-25T20:59:11+02:00</xmp:CreateDate>
           <xmp:ModifyDate>2024-04-25T20:59:11+02:00</xmp:ModifyDate>
           <xmp:MetadataDate>2024-04-25T20:59:11+02:00</xmp:MetadataDate>
           <xmp:CreatorTool>manual creation</xmp:CreatorTool>
        </rdf:Description>
        <rdf:Description rdf:about="" xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
          <pdf:Producer>text editor</pdf:Producer>
        </rdf:Description>
        <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>
            <rdf:Alt>
              <rdf:li xml:lang="x-default">tagged PDF demo</rdf:li>
            </rdf:Alt>
          </dc:title>
        </rdf:Description>
      </rdf:RDF>
    </x:xmpmeta>

<?xpacket end="w"?>
endstream
endobj

7 0 obj
<<
  /Type /StructTreeRoot
  % child
  /K 8 0 R
  % a number tree
  /ParentTree 11 0 R
>>
endobj

% The first child (the document node)
8 0 obj
<<
    /K [ 9 0 R  10 0 R]
    /Type /StructElem
    /S /Document
    /T (tagged PDF demo)
    /P 7 0 R
>>
endobj

9 0 obj
<<
    /Type /StructElem
    /K 0                 % mcid 0
    /Pg 3 0 R
    /P 8 0 R
    /S /H1
    /T (A short story)
>>
endobj

10 0 obj
<<
    /Type /StructElem
    /K 1                 % mcid 1
    /Pg 3 0 R
    /P 8 0 R
    /S /P
    /T (once upon a time)
>>
endobj

11 0 obj
<<
    %  page with index 0 has the “mcid” items
    %  in objects 9 and 10
    /Nums [ 0 [ 9 0 R 10 0 R ] ]
>>
endobj
xref
0 12
0000000000 65535 f
0000000016 00000 n
0000000231 00000 n
0000000303 00000 n
0000000526 00000 n
0000000833 00000 n
0000000921 00000 n
0000002493 00000 n
0000002638 00000 n
0000002763 00000 n
0000002902 00000 n
0000003044 00000 n
trailer <<
    /Size 12
    /Root 1 0 R
>>
startxref
3176
%%EOF
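The cross-reference table at the end of the listing stores the byte offset of every object, which is tedious to maintain by hand. The Python sketch below shows the kind of bookkeeping involved. It is not the implementation of fixxref, just an illustration under assumptions: all objects have generation 0, object numbers are contiguous starting at 1, and no stream happens to contain a line that looks like an object header. The function name xref_table is made up.

```python
import re


def xref_table(pdf: bytes) -> bytes:
    """Build a classic xref table for all 'N 0 obj' headers in `pdf`."""
    offsets = {
        int(m.group(1)): m.start(1)
        for m in re.finditer(rb"^(\d+) 0 obj\b", pdf, re.MULTILINE)
    }
    size = max(offsets) + 1
    # Entry 0 is the head of the free list; every entry is 20 bytes long.
    lines = [b"xref\n", b"0 %d\n" % size, b"0000000000 65535 f \n"]
    for num in range(1, size):
        lines.append(b"%010d 00000 n \n" % offsets[num])
    return b"".join(lines)


sample = b"%PDF-1.7\n1 0 obj\n<<>>\nendobj\n2 0 obj\n<<>>\nendobj\n"
print(xref_table(sample).decode("ascii"))
```

The value after startxref would then be the byte offset at which the xref keyword itself starts.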