Skip to content

feature(extract-core): output page byte ranges#12

Open
ClemDoum wants to merge 1 commit into
mainfrom
feature(extract-python)/page-by-ranges
Open

feature(extract-core): output page byte ranges#12
ClemDoum wants to merge 1 commit into
mainfrom
feature(extract-python)/page-by-ranges

Conversation

@ClemDoum

@ClemDoum ClemDoum commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Descriptions

Output mardown page bytes range, preliminary to support ICIJ/datashare#2229 in the datashare-python's extract-worker

Changes

extract-core

Changed

  • introduced Pages(total: int, bytes_ranges: list[tuple[int, int]]) and replaced ConversionOutput.pages: PageIndexes = []) by ConversionOutput.pages: Pages

extract-python

Added

  • added the utils.write_pages helper to serialize markdown documents and yield pages bytes ranges

@ClemDoum ClemDoum self-assigned this Jun 24, 2026
@ClemDoum ClemDoum force-pushed the feature(extract-python)/page-by-ranges branch from 28ef803 to 8d5032d Compare June 24, 2026 13:26
@ClemDoum ClemDoum force-pushed the feature(extract-python)/page-by-ranges branch from 8d5032d to b531afa Compare June 24, 2026 13:38
@ClemDoum ClemDoum marked this pull request as ready for review June 24, 2026 13:38
@ClemDoum ClemDoum requested a review from pirhoo June 24, 2026 13:39

@pirhoo pirhoo left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the swift implem! I've made a few suggestions.

import markdown2
import pypdfium2
from extract_core import BaseModel, OutputFormat, PageIndexes
from extract_core import BaseModel, OutputFormat, PageRanges

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

important: I think this PageRangesdoesnt exist anymore. If I understood correctly it's replaced by Page.

) -> list[dict[str, tuple[int, int]]]:
all_pages = [
PageIndexes.model_validate_json(
PageRanges.model_validate_json(

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

important: same here, we need:

Pages.model_validate_json(
    (root / compared / "artifacts" / "pages.json").read_text()
).byte_ranges

with md_path.open("wb") as f:
pages = write_pages(pages, page_sep, f)
# Clean up the tmp page file before move everything to the end destination
current_page_path.unlink()

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hint: this will crash if the current page doesn't exist. This can be adjusted:

Suggested change
current_page_path.unlink()
current_page_path.unlink(missing_ok=True)

pages_byte_sizes = []
sentinel = object()
while True:
content = next(pages, sentinel) if next_page is None else next_page

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hint: write_pages uses next_page is None as the "no lookahead yet" marker. The only re-entry to the next(pages) branch is when next_page is None, which can only recur if a page value is literally None.

You can rewrite with a sentinel, something like:

def write_pages(pages: Iterable[str | None], page_sep: str, out: BinaryIO) -> Pages:
    it = iter(pages)
    sentinel = object()
    sizes = []
    prev = next(it, sentinel)
    while prev is not sentinel:
        cur = next(it, sentinel)
        content = prev or "" # This is the trick
        if cur is not sentinel:
            content += page_sep
        sizes.append(out.write(content.encode()))
        prev = cur
    return Pages.from_pages_bytes_sizes(sizes)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants