PDF stage 3.3: embedded TrueType → @font-face + reverse map #550
Merged
Wire embedded TrueType programs into the PDF HTML so glyphs render in the actual font instead of a fallback, and recover their Unicode for selection.
- pdf_document_element/document: Font carries the embedded program (abstract::Font from /FontFile2), a /CIDToGIDMap, and Font::glyph_for_code (composite Identity + explicit CIDToGIDMap; simple TrueType best-effort per ISO 32000-1 9.6.6.4). Font::to_unicode gains an embedded-font reverse-map fallback (code -> glyph -> code_point_for_glyph), closing the stage-1 gap.
- pdf_document_parser: load /FontFile2 from the (descendant CIDFont's) /FontDescriptor and /CIDToGIDMap; unsupported flavors (/FontFile3, /FontFile) leave program null and keep the fallback path.
- html/pdf_file: per-font @font-face (re-encode to PUA once, after extraction) and a dual-layer emission: a visible PUA glyph layer in the embedded font plus a transparent selectable layer carrying the real Unicode.
- Tests: pdf_font_program covers glyph_for_code (Identity / CIDToGIDMap / simple-cmap / no-program) and the reverse-map to_unicode (incl. /ToUnicode precedence).
Verified end-to-end on style-various-1.pdf (10 @font-face).
Design: src/odr/internal/pdf/STAGE_3.3_DESIGN.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
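For readers unfamiliar with the composite path, here is a minimal sketch of the CID → GID step, assuming an Identity-H/V encoding (CID = code). The names `CidToGidMap` and `glyph_for_cid` are hypothetical and not taken from the PR; the stream layout (one big-endian 16-bit GID at byte offset 2 × CID) follows ISO 32000-1.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in for the parsed /CIDToGIDMap entry: an empty optional
// models the /Identity name (also the default when the key is absent), a
// byte vector models the decoded stream.
using CidToGidMap = std::optional<std::vector<std::uint8_t>>;

// Maps a CID to a glyph index. The stream form stores one big-endian 16-bit
// GID per CID at byte offset 2 * CID. Returns 0 (.notdef) when the CID lies
// past the end of the stream (an assumption of this sketch).
std::uint16_t glyph_for_cid(std::uint16_t cid, const CidToGidMap &map) {
  if (!map.has_value()) {
    return cid; // Identity: GID = CID
  }
  const std::size_t offset = static_cast<std::size_t>(cid) * 2;
  if (offset + 1 >= map->size()) {
    return 0;
  }
  return static_cast<std::uint16_t>(((*map)[offset] << 8) |
                                    (*map)[offset + 1]);
}
```

With a four-byte stream `00 00 01 02`, CID 1 resolves to GID 0x0102 and CID 2 falls off the end and resolves to `.notdef`.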
Rename `Font::program` to `embedded_font` (and `load_font_program` -> `load_embedded_font`, `sample_program` -> `sample_font`) for a clearer name now that the field holds the parsed embedded SFNT. Scrub the "stage 3.x" scaffolding language from comments and AGENTS.md now that 3.3 is done, and delete the completed STAGE_3.3_DESIGN.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Fonts are shared by indirect reference across every page that uses them. Without memoization each page's /Resources parsed a fresh Font element, and since the HTML writer dedups embedded @font-face rules by Font pointer, a font program got re-inlined (base64) once per page — Core_v5.1.pdf ballooned to ~1.4 GB. Cache parsed fonts so every page resolves to the one element; the file drops to ~145 MB. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
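The memoization can be sketched roughly as below. This is an illustration only: `ObjectReference`, `Font`, and `FontCache` are simplified, hypothetical types, and the real parser's cache may be shaped differently. The point is that two lookups of the same indirect reference yield the same `Font*`, so a pointer-keyed dedup downstream sees one font.

```cpp
#include <map>
#include <memory>
#include <tuple>

// Hypothetical minimal types; the real parser's names and shapes may differ.
struct ObjectReference {
  unsigned id;
  unsigned generation;
  bool operator<(const ObjectReference &other) const {
    return std::tie(id, generation) < std::tie(other.id, other.generation);
  }
};

struct Font {
  ObjectReference source;
};

class FontCache {
public:
  // Returns the one shared Font for `ref`, "parsing" it on first use only.
  Font *get_or_parse(const ObjectReference &ref) {
    auto it = m_fonts.find(ref);
    if (it == m_fonts.end()) {
      ++m_parse_count; // stands in for the actual font parsing work
      it = m_fonts.emplace(ref, std::make_unique<Font>(Font{ref})).first;
    }
    return it->second.get();
  }

  int parse_count() const { return m_parse_count; }

private:
  std::map<ObjectReference, std::unique_ptr<Font>> m_fonts;
  int m_parse_count = 0;
};
```

Storing the fonts behind `std::unique_ptr` keeps the returned pointers stable as the map grows, which matters when the pointer itself is used as a dedup key.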
OTS (the font sanitizer in Chrome/Firefox) requires a `post` table and rejects the whole font when it is absent — the browser then drops the @font-face and renders tofu. PDF-embedded TrueType fonts routinely omit `post` (the viewer needs no glyph names), and SfntFont::write() copied every table through verbatim, so a re-encoded @font-face font inherited the omission. About half the embedded fonts in Core_v5.1.pdf rendered as empty boxes for this reason. SfntFont::write() now appends a minimal format-3.0 `post` table (header only, no glyph names) when the source lacks one; an existing `post` is still passed through verbatim. Verified end-to-end in headless Chrome: the previously-boxed runs now render their real glyphs. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
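As an illustration of what a header-only format-3.0 `post` table looks like on the wire, here is a small sketch. The helper names are hypothetical and the zeroed field values are an assumption of this sketch, not necessarily what `SfntFont::write()` emits; the 32-byte layout itself is the standard `post` header.

```cpp
#include <cstdint>
#include <vector>

// Append big-endian integers, as used throughout SFNT tables.
void put_u16(std::vector<std::uint8_t> &out, std::uint16_t value) {
  out.push_back(static_cast<std::uint8_t>(value >> 8));
  out.push_back(static_cast<std::uint8_t>(value & 0xFF));
}

void put_u32(std::vector<std::uint8_t> &out, std::uint32_t value) {
  put_u16(out, static_cast<std::uint16_t>(value >> 16));
  put_u16(out, static_cast<std::uint16_t>(value & 0xFFFF));
}

// Builds a 32-byte `post` header with version 3.0, which declares that no
// glyph names follow. All other fields are zeroed here (an assumption).
std::vector<std::uint8_t> make_minimal_post() {
  std::vector<std::uint8_t> out;
  put_u32(out, 0x00030000); // version 3.0
  put_u32(out, 0);          // italicAngle (16.16 fixed)
  put_u16(out, 0);          // underlinePosition
  put_u16(out, 0);          // underlineThickness
  put_u32(out, 0);          // isFixedPitch
  put_u32(out, 0);          // minMemType42
  put_u32(out, 0);          // maxMemType42
  put_u32(out, 0);          // minMemType1
  put_u32(out, 0);          // maxMemType1
  return out;
}
```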
The embedded-font dual layer emitted the visible glyph span and the
transparent selectable span as siblings, each restating the run's
placement/size/spacing classes, each pretty-printed across three lines,
and each glyph span carrying an `aria-hidden="true"` attribute. On a
large document this is most of the output: Core_v5.1.pdf's document.html
was 138 MB, over GitHub's 100 MB limit.
Three changes, all preserving uncompressed, one-run-per-line output (so
the committed reference stays diffable):
- Nest the glyph layer inside the text span. The child is absolutely
positioned at the run origin (`.t`) and inherits font-size, spacing,
and the parent transform, so the placement classes live only on the
parent and the glyph child carries just `g`/paint/`ff`. (~-15 MB)
- Mark the spans inline so the whole run, including the nested glyph
layer, stays on one line instead of an open/text/close triple. Smaller
and a more legible diff. (~-18 MB whitespace)
- Drop `aria-hidden`: it is an ARIA attribute, not stylable, so it can't
be deduped into a class; the only lever is removing it. (~-15.6 MB)
Nesting makes the glyph child inherit the `.i` text layer's
`color:transparent`, so the painted layer now sets `.gv{color:#000}`
explicitly; invisible render modes still use `.i`.
Core_v5.1.pdf's document.html: 138 MB -> 88 MB.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The companion to the `post` fix. OTS (the font sanitizer in Chrome and Firefox) also lists `OS/2` among the tables a web font must carry and rejects the whole font when it is absent, so the @font-face is dropped and the glyph layer renders tofu. With `post` synthesized but `OS/2` still missing, OTS now failed on the next required table ("OTS parsing error: OS/2: missing required table"), so the embedded-font glyphs were still invisible in real browsers. SfntFont::write() now appends a minimal version-4 `OS/2` when the source lacks one: neutral weight/width (regular, medium), sub/superscript and strikeout defaults scaled to the font's em, vertical metrics from the bounding box (falling back to 0.8/0.2 em when degenerate), and the character-range bounds from the cmap. OTS reconciles the remaining fields (e.g. the fsSelection style bits against head). An existing `OS/2` is passed through verbatim. Verified end-to-end in headless Chrome: style-various-1.pdf, previously all-tofu, now renders its real glyphs with no OTS error. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
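The vertical-metrics fallback could look roughly like the sketch below. Two things are assumptions here and not stated in the commit: "degenerate" is taken to mean a bounding box whose top is not above its bottom, and the descender is expressed as a negative value. `VerticalMetrics` and `vertical_metrics` are hypothetical names.

```cpp
#include <cstdint>

struct VerticalMetrics {
  std::int16_t ascender;
  std::int16_t descender; // negative below the baseline (assumed convention)
};

// Derives ascender/descender from the font bounding box (yMin/yMax from
// `head`). When the box is degenerate, assumed here to mean yMax <= yMin,
// falls back to 0.8 em above and 0.2 em below the baseline.
VerticalMetrics vertical_metrics(std::int16_t y_min, std::int16_t y_max,
                                 std::uint16_t units_per_em) {
  if (y_max <= y_min) {
    return {static_cast<std::int16_t>(units_per_em * 4 / 5),
            static_cast<std::int16_t>(-(units_per_em / 5))};
  }
  return {y_max, y_min};
}
```

For a 1000-unit em with an all-zero bounding box this yields 800 and -200; a usable box such as (-250, 900) passes through unchanged.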
Stage 3.3 of the in-house PDF renderer (`DecoderEngine::odr`): the first end-to-end PDF display win. Embedded TrueType programs are now rendered in the actual font instead of a system fallback, and their Unicode is recovered for selection.
What lands
- `pdf::Font` carries the SFNT parsed from `/FontFile2` (a simple TrueType font, or a composite `CIDFontType2`) through the `abstract::Font` interface as `embedded_font`, plus a `/CIDToGIDMap`. Unsupported flavors (`/FontFile3` CFF → 3.4, `/FontFile` Type1 → 3.5) leave `embedded_font` null and keep the legacy fallback path.
- `Font::glyph_for_code`: composite `Identity-H/V` (CID = code) → GID via `/CIDToGIDMap` (Identity default, or explicit stream); simple TrueType best-effort per ISO 32000-1 9.6.6.4 (embedded `cmap` on the byte code, then on the code's Unicode, then code-as-GID).
- `Font::to_unicode` gains a final fallback (code → glyph → `code_point_for_glyph`), closing the stage-1 extraction gap for fonts with neither `/ToUnicode` nor a usable `/Encoding`. Flows through `extract_text` with no page-text changes.
- Per-font `@font-face` (re-encoded to the PUA via the 3.1 pipeline, once, after extraction so it can't disturb the reverse-map lookups). Each run emits a transparent selectable span carrying the real Unicode with a visible PUA glyph layer nested inside it (in the embedded font, `user-select:none`), so display is glyph-exact and copy/search still work.
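The precedence of the reverse-map fallback can be sketched as follows. This is a simplified model and not the PR's implementation: `FontSketch` and its three lookup tables are hypothetical stand-ins for `/ToUnicode`, `glyph_for_code`, and `code_point_for_glyph`, and the intermediate `/Encoding` step is left out for brevity.

```cpp
#include <cstdint>
#include <map>
#include <optional>

// Hypothetical model of the three lookups involved.
struct FontSketch {
  std::map<std::uint32_t, char32_t> to_unicode_cmap;     // from /ToUnicode
  std::map<std::uint32_t, std::uint16_t> code_to_glyph;  // glyph_for_code
  std::map<std::uint16_t, char32_t> glyph_to_code_point; // embedded cmap, reversed
};

// /ToUnicode wins when it has an entry; otherwise fall back to
// code -> glyph -> code point through the embedded font.
std::optional<char32_t> to_unicode(const FontSketch &font, std::uint32_t code) {
  if (auto it = font.to_unicode_cmap.find(code);
      it != font.to_unicode_cmap.end()) {
    return it->second;
  }
  auto glyph = font.code_to_glyph.find(code);
  if (glyph == font.code_to_glyph.end()) {
    return std::nullopt;
  }
  auto code_point = font.glyph_to_code_point.find(glyph->second);
  if (code_point == font.glyph_to_code_point.end()) {
    return std::nullopt;
  }
  return code_point->second;
}
```

A code present in both tables resolves through `/ToUnicode`; a code present only in the embedded font resolves through the glyph; a code present in neither yields no value.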
Refinements
- `parse_font` now caches `Font` elements by object reference (mirroring the XObject cache), so a font shared across pages is parsed once. Without it the `@font-face` dedup (keyed on `Font*`) re-inlined each font's base64 program per page: Core_v5.1.pdf's `document.html` ballooned to 1.4 GB / 11,873 inlined fonts; now 28 fonts, parsed once.
- Three changes shrink the output, all keeping uncompressed, one-run-per-line output (so the committed reference stays diffable): nest the glyph layer (placement classes live only on the parent and are inherited), emit each run inline (one line, not an open/text/close triple), and drop the per-glyph `aria-hidden` attribute (an ARIA attribute can't be deduped into a class). The painted layer sets `.gv{color:#000}` explicitly since nesting would otherwise inherit the text layer's `transparent`. Core_v5.1.pdf's `document.html`: 138 MB → 88 MB.
- OTS (the font sanitizer in Chrome/Firefox) rejects a web font missing `post` or `OS/2`, and PDF-embedded TrueType fonts routinely omit both, so every embedded glyph layer rendered as tofu. `SfntFont::write()` now synthesizes a minimal format-3.0 `post` and a version-4 `OS/2` (neutral weight/width, metrics from the bounding box, ranges from the cmap) when absent; existing tables pass through verbatim. Verified end-to-end in headless Chrome: previously-boxed runs now render real glyphs.
Decision captured
Dual-layer (vs. PUA-display-now/select-later) was chosen for selectable text in
this PR; the trade-off is up to ~2× text spans, mitigated by the nesting/inline
restructuring above.
Tests
- `pdf_font_program.cpp` (new): `glyph_for_code` (Identity / `CIDToGIDMap` / simple-cmap / no-font) and reverse-map `to_unicode` (incl. `/ToUnicode` precedence). All assertion-based, inline SFNT.
- `sfnt_transform.cpp`: `post`- and `OS/2`-table synthesis (added vs. passed through; field/offset checks).
- style-various-1.pdf: every `@font-face` accepted by OTS, real glyphs rendered (headings, bold/italic, bullets, alignment), no tofu.
Deferred (follow-ups)
🤖 Generated with Claude Code