PDF stage 3.3: embedded TrueType → @font-face + reverse map #550

Merged
andiwand merged 11 commits into main from pdf-stage-3.3-truetype-fontface on Jun 23, 2026
andiwand (Member) commented Jun 22, 2026

Stage 3.3 of the in-house PDF renderer (DecoderEngine::odr): the first
end-to-end PDF display win. Embedded TrueType programs are now rendered in the
actual font instead of a system fallback, and their Unicode is recovered for
selection.

What lands

  • Embedded font on the PDF font. pdf::Font carries the SFNT parsed from
    /FontFile2 (a simple TrueType font, or a composite CIDFontType2) through the
    abstract::Font interface as embedded_font, plus a /CIDToGIDMap. Unsupported
    flavors (/FontFile3 CFF → 3.4, /FontFile Type1 → 3.5) leave embedded_font
    null and keep the legacy fallback path.
  • Code → glyph. Font::glyph_for_code: composite Identity-H/V (CID = code)
    → GID via /CIDToGIDMap (Identity default, or explicit stream); simple
    TrueType best-effort per ISO 32000-1 9.6.6.4 (embedded cmap on the byte code,
    then on the code's Unicode, then code-as-GID).
  • Embedded-font reverse map. Font::to_unicode gains a final fallback
    (code → glyph → code_point_for_glyph), closing the stage-1 extraction gap for
    fonts with neither /ToUnicode nor a usable /Encoding. Flows through
    extract_text with no page-text changes.
  • HTML: dual layer. Per embedded font, one @font-face (re-encoded to the
    PUA via the 3.1 pipeline, once, after extraction so it can't disturb the
    reverse-map lookups). Each run emits a transparent selectable span carrying the
    real Unicode with a visible PUA glyph layer nested inside it (in the
    embedded font, user-select:none) — so display is glyph-exact and copy/search
    still work.
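
The composite branch of the code → glyph step can be sketched as follows. This is a minimal illustration, not the code in this PR: `CidToGidMap` and `glyph_for_cid` are hypothetical names, and the only fact relied on is the PDF convention that an explicit /CIDToGIDMap stream stores one big-endian 16-bit glyph index per CID.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in for /CIDToGIDMap: no stream means the name /Identity,
// otherwise a stream holding two bytes (high byte first) per CID.
struct CidToGidMap {
  std::optional<std::vector<std::uint8_t>> stream;
};

// With Identity-H/V the CID equals the character code, so only the
// CID -> GID step remains. Returns nullopt when the CID is out of range.
std::optional<std::uint16_t> glyph_for_cid(const CidToGidMap &map,
                                           std::uint32_t code) {
  if (code > 0xFFFF) {
    return std::nullopt;  // CIDs are 16-bit in this sketch
  }
  if (!map.stream.has_value()) {
    return static_cast<std::uint16_t>(code);  // Identity: GID = CID
  }
  const std::vector<std::uint8_t> &bytes = *map.stream;
  const std::size_t offset = 2 * static_cast<std::size_t>(code);
  if (offset + 1 >= bytes.size()) {
    return std::nullopt;  // CID lies beyond the explicit map
  }
  return static_cast<std::uint16_t>((bytes[offset] << 8) | bytes[offset + 1]);
}
```

Returning an empty optional for an out-of-range CID is an assumption of the sketch; the real `Font::glyph_for_code` may choose a different fallback.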

Refinements

  • Font memoization. parse_font now caches Font elements by object
    reference (mirroring the XObject cache), so a font shared across pages is
    parsed once. Without it the @font-face dedup (keyed on Font*) re-inlined
    each font's base64 program per page — Core_v5.1.pdf's document.html ballooned
    to 1.4 GB / 11,873 inlined fonts; now 28 fonts, parsed once.
  • HTML size. Restructured the dual layer to fit GitHub's 100 MB limit while
    keeping uncompressed, one-run-per-line output (so the committed reference stays
    diffable): nest the glyph layer (placement classes live only on the parent and
    are inherited), emit each run inline (one line, not an open/text/close triple),
    and drop the per-glyph aria-hidden attribute (an ARIA attribute can't be
    deduped into a class). The painted layer sets .gv{color:#000} explicitly,
    since the nested glyph child would otherwise inherit the text layer's
    color:transparent.
    Core_v5.1.pdf's document.html: 138 MB → 88 MB.
  • OTS table synthesis. OTS (the font sanitizer in Chrome/Firefox/Safari)
    rejects a web font missing post or OS/2, and PDF-embedded TrueType fonts
    routinely omit both — so every embedded glyph layer rendered as tofu.
    SfntFont::write() now synthesizes a minimal format-3.0 post and a
    version-4 OS/2 (neutral weight/width, metrics from the bounding box, ranges
    from the cmap) when absent; existing tables pass through verbatim. Verified
    end-to-end in headless Chrome — previously-boxed runs now render real glyphs.

Decision captured

Dual-layer (vs. PUA-display-now/select-later) was chosen for selectable text in
this PR; the trade-off is up to ~2× text spans, mitigated by the nesting/inline
restructuring above.

Tests

  • pdf_font_program.cpp (new): glyph_for_code (Identity / CIDToGIDMap /
    simple-cmap / no-font) and reverse-map to_unicode (incl. /ToUnicode
    precedence). All assertion-based, inline SFNT.
  • sfnt_transform.cpp: post- and OS/2-table synthesis (added vs. passed
    through; field/offset checks).
  • Verified end-to-end in headless Chrome on style-various-1.pdf: every
    @font-face accepted by OTS, real glyphs rendered (headings, bold/italic,
    bullets, alignment), no tofu.
  • Focused PDF/font/SFNT suites green (193 tests).

Deferred (follow-ups)

  • Non-identity simple-TrueType glyph-selection edge cases; CFF/Type1 (3.4/3.5).
  • Baseline placement (still box-top).

🤖 Generated with Claude Code

Wire embedded TrueType programs into the PDF HTML so glyphs render in the
actual font instead of a fallback, and recover their Unicode for selection.

- pdf_document_element/document: Font carries the embedded program
  (abstract::Font from /FontFile2), a /CIDToGIDMap, and Font::glyph_for_code
  (composite Identity + explicit CIDToGIDMap; simple TrueType best-effort per
  ISO 32000-1 9.6.6.4). Font::to_unicode gains an embedded-font reverse-map
  fallback (code -> glyph -> code_point_for_glyph), closing the stage-1 gap.
- pdf_document_parser: load /FontFile2 from the (descendant CIDFont's)
  /FontDescriptor and /CIDToGIDMap; unsupported flavors (/FontFile3, /FontFile)
  leave program null and keep the fallback path.
- html/pdf_file: per-font @font-face (re-encode to PUA once, after extraction)
  and a dual-layer emission — a visible PUA glyph layer in the embedded font
  plus a transparent selectable layer carrying the real Unicode.
- Tests: pdf_font_program covers glyph_for_code (Identity / CIDToGIDMap /
  simple-cmap / no-program) and the reverse-map to_unicode (incl. /ToUnicode
  precedence). Verified end-to-end on style-various-1.pdf (10 @font-face).

Design: src/odr/internal/pdf/STAGE_3.3_DESIGN.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
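
The reverse-map fallback described above amounts to a lookup chain. The sketch below is only an illustration under stated assumptions: `FontTables` and its four maps are hypothetical stand-ins for the parsed /ToUnicode CMap, the /Encoding, `glyph_for_code`, and `code_point_for_glyph`. The order /ToUnicode, then /Encoding, then the embedded font is inferred from the description ("final fallback", "/ToUnicode precedence").

```cpp
#include <cstdint>
#include <map>
#include <optional>

// Hypothetical, flattened view of the data a font could consult.
struct FontTables {
  std::map<std::uint32_t, char32_t> to_unicode_cmap;      // from /ToUnicode
  std::map<std::uint32_t, char32_t> encoding;             // from /Encoding
  std::map<std::uint32_t, std::uint16_t> code_to_glyph;   // glyph_for_code
  std::map<std::uint16_t, char32_t> glyph_to_code_point;  // embedded cmap, reversed
};

// Resolve a character code to Unicode; the first source with an answer wins.
std::optional<char32_t> to_unicode(const FontTables &font, std::uint32_t code) {
  if (auto it = font.to_unicode_cmap.find(code);
      it != font.to_unicode_cmap.end()) {
    return it->second;
  }
  if (auto it = font.encoding.find(code); it != font.encoding.end()) {
    return it->second;
  }
  // Final fallback: code -> glyph -> code point via the embedded font.
  if (auto glyph = font.code_to_glyph.find(code);
      glyph != font.code_to_glyph.end()) {
    if (auto cp = font.glyph_to_code_point.find(glyph->second);
        cp != font.glyph_to_code_point.end()) {
      return cp->second;
    }
  }
  return std::nullopt;
}
```
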
@andiwand andiwand marked this pull request as ready for review June 22, 2026 16:09

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cb18d44320

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/odr/internal/html/pdf_file.cpp Outdated
andiwand and others added 10 commits June 22, 2026 18:25
Rename `Font::program` to `embedded_font` (and `load_font_program` ->
`load_embedded_font`, `sample_program` -> `sample_font`) for a clearer
name now that the field holds the parsed embedded SFNT. Scrub the
"stage 3.x" scaffolding language from comments and AGENTS.md now that
3.3 is done, and delete the completed STAGE_3.3_DESIGN.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Fonts are shared by indirect reference across every page that uses them.
Without memoization each page's /Resources parsed a fresh Font element, and
since the HTML writer dedups embedded @font-face rules by Font pointer, a font
program got re-inlined (base64) once per page — Core_v5.1.pdf ballooned to
~1.4 GB. Cache parsed fonts so every page resolves to the one element; the file
drops to ~145 MB.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
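
A cache keyed on the indirect object reference is enough to give every page the same Font pointer. The sketch below is a hedged illustration of that idea only: `ObjectRef`, `Font`, and `FontCache` are hypothetical types, not the ones used by `parse_font`.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <memory>
#include <utility>

// Hypothetical stand-ins: an indirect reference (object number, generation)
// and a parsed font element.
using ObjectRef = std::pair<std::uint32_t, std::uint16_t>;
struct Font {
  ObjectRef source;
};

class FontCache {
public:
  // Return the single Font element for `ref`, parsing on first use only.
  Font *get_or_parse(const ObjectRef &ref) {
    auto it = m_fonts.find(ref);
    if (it == m_fonts.end()) {
      ++m_parse_count;  // stands in for the actual (expensive) parse
      it = m_fonts.emplace(ref, std::make_unique<Font>(Font{ref})).first;
    }
    return it->second.get();
  }

  std::size_t parse_count() const { return m_parse_count; }

private:
  // unique_ptr keeps each Font at a stable address, so a pointer-keyed
  // @font-face dedup sees one key per font regardless of page count.
  std::map<ObjectRef, std::unique_ptr<Font>> m_fonts;
  std::size_t m_parse_count = 0;
};
```
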
OTS (the font sanitizer in Chrome/Firefox) requires a `post` table and rejects
the whole font when it is absent — the browser then drops the @font-face and
renders tofu. PDF-embedded TrueType fonts routinely omit `post` (the viewer
needs no glyph names), and SfntFont::write() copied every table through
verbatim, so a re-encoded @font-face font inherited the omission. About half the
embedded fonts in Core_v5.1.pdf rendered as empty boxes for this reason.

SfntFont::write() now appends a minimal format-3.0 `post` table (header only, no
glyph names) when the source lacks one; an existing `post` is still passed
through verbatim. Verified end-to-end in headless Chrome: the previously-boxed
runs now render their real glyphs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
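
For reference, a header-only format-3.0 `post` table is 32 bytes: a 16.16 version of 0x00030000 followed by italic angle, underline position and thickness, the fixed-pitch flag, and four memory hints. The sketch below builds one with every field after the version left at zero, which is an assumption of this example; `make_minimal_post_table` is a hypothetical name, and `SfntFont::write()` may fill some fields differently.

```cpp
#include <array>
#include <cstdint>

// Build a minimal format-3.0 `post` table (big-endian, no glyph names).
// Layout: version(4) italicAngle(4) underlinePosition(2)
// underlineThickness(2) isFixedPitch(4) and four uint32 memory hints(16).
std::array<std::uint8_t, 32> make_minimal_post_table() {
  std::array<std::uint8_t, 32> table{};  // all fields zero-initialized
  table[0] = 0x00;                       // version 3.0 = 0x00030000
  table[1] = 0x03;
  table[2] = 0x00;
  table[3] = 0x00;
  return table;
}
```
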
The embedded-font dual layer emitted the visible glyph span and the
transparent selectable span as siblings, each restating the run's
placement/size/spacing classes, each pretty-printed across three lines,
and each glyph span carrying an `aria-hidden="true"` attribute. On a
large document this is most of the output: Core_v5.1.pdf's document.html
was 138 MB, over GitHub's 100 MB limit.

Three changes, all preserving uncompressed, one-run-per-line output (so
the committed reference stays diffable):

- Nest the glyph layer inside the text span. The child is absolutely
  positioned at the run origin (`.t`) and inherits font-size, spacing,
  and the parent transform, so the placement classes live only on the
  parent and the glyph child carries just `g`/paint/`ff`. (~-15 MB)
- Mark the spans inline so the whole run, including the nested glyph
  layer, stays on one line instead of an open/text/close triple. Smaller
  and a more legible diff. (~-18 MB whitespace)
- Drop `aria-hidden`: it is an ARIA attribute, not stylable, so it can't
  be deduped into a class; the only lever is removing it. (~-15.6 MB)

Nesting makes the glyph child inherit the `.i` text layer's
`color:transparent`, so the painted layer now sets `.gv{color:#000}`
explicitly; invisible render modes still use `.i`.

Core_v5.1.pdf's document.html: 138 MB -> 88 MB.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
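
The nested, single-line shape of a run can be sketched as a small string builder. This is an illustration only: `emit_run` is a hypothetical helper, the class names `i`, `g`, and `gv` are taken from the commit message above, the exact attribute layout is a guess, and both text arguments are assumed to be HTML-escaped already.

```cpp
#include <string>

// Emit one text run on a single line: the outer span carries the placement
// classes and the real Unicode (transparent, selectable); the nested span
// carries the PUA-encoded glyphs in the embedded font (painted, not selectable).
std::string emit_run(const std::string &placement_classes,
                     const std::string &unicode_text,
                     const std::string &pua_text,
                     const std::string &font_class) {
  return "<span class=\"" + placement_classes + " i\">" + unicode_text +
         "<span class=\"g gv " + font_class + "\">" + pua_text +
         "</span></span>";
}
```

Because the placement classes appear once on the parent, the child only repeats the short glyph, paint, and font-face classes.
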
The companion to the `post` fix. OTS (the font sanitizer in Chrome,
Firefox, and Safari) also lists `OS/2` among the tables a web font must
carry and rejects the whole font when it is absent — so the @font-face is
dropped and the glyph layer renders tofu. With `post` synthesized but
`OS/2` still missing, OTS now failed on the next required table
("OTS parsing error: OS/2: missing required table"), so the embedded-font
glyphs were still invisible in real browsers.

SfntFont::write() now appends a minimal version-4 `OS/2` when the source
lacks one: neutral weight/width (regular, medium), sub/superscript and
strikeout defaults scaled to the font's em, vertical metrics from the
bounding box (falling back to 0.8/0.2 em when degenerate), and the
character-range bounds from the cmap. OTS reconciles the remaining fields
(e.g. the fsSelection style bits against head). An existing `OS/2` is
passed through verbatim.

Verified end-to-end in headless Chrome: style-various-1.pdf, previously
all-tofu, now renders its real glyphs with no OTS error.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
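
The vertical-metrics fallback can be sketched in a few lines. This is a hedged example rather than the PR's code: `vertical_metrics_from_bbox` is a hypothetical name, "degenerate" is assumed to mean a bounding box whose top is not above its bottom, and the descender is returned as a signed value below the baseline.

```cpp
// Ascender/descender in font units, descender negative below the baseline.
struct VerticalMetrics {
  int ascender;
  int descender;
};

// Derive vertical metrics from the font bounding box; fall back to
// 0.8 em / 0.2 em when the box is degenerate (assumed: y_max <= y_min).
VerticalMetrics vertical_metrics_from_bbox(int y_min, int y_max,
                                           int units_per_em) {
  if (y_max <= y_min) {
    return {units_per_em * 4 / 5, -(units_per_em / 5)};
  }
  return {y_max, y_min};
}
```
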
@andiwand andiwand enabled auto-merge (squash) June 23, 2026 16:07
@andiwand andiwand merged commit 5d01839 into main Jun 23, 2026
11 checks passed
@andiwand andiwand deleted the pdf-stage-3.3-truetype-fontface branch June 23, 2026 16:37