feature(extract-core): output page byte ranges by ClemDoum · Pull Request #12 · ICIJ/extract-python

ClemDoum · 2026-06-24T13:22:14Z

Descriptions

Output markdown page byte ranges, a preliminary step to support ICIJ/datashare#2229 in datashare-python's extract-worker.

Changes

`extract-core`

Changed

Introduced Pages(total: int, bytes_ranges: list[tuple[int, int]]) and replaced ConversionOutput.pages: PageIndexes = [] with ConversionOutput.pages: Pages
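To make the shape of the new model concrete, here is a minimal sketch of how cumulative byte ranges could be derived from per-page byte sizes. It uses a plain dataclass as a stand-in for the real pydantic model. The half-open `[start, end)` convention and the body of `from_pages_bytes_sizes` are assumptions for illustration, not the code in this PR.

```python
from dataclasses import dataclass, field


@dataclass
class Pages:
    """Stand-in for the model introduced in extract-core (assumed shape)."""

    total: int = 0
    bytes_ranges: list[tuple[int, int]] = field(default_factory=list)

    @classmethod
    def from_pages_bytes_sizes(cls, sizes: list[int]) -> "Pages":
        # Assumption: ranges are half-open [start, end) offsets into the output file.
        ranges = []
        start = 0
        for size in sizes:
            ranges.append((start, start + size))
            start += size
        return cls(total=len(sizes), bytes_ranges=ranges)


pages = Pages.from_pages_bytes_sizes([10, 0, 7])
print(pages.bytes_ranges)  # [(0, 10), (10, 10), (10, 17)]
```

Under this convention an empty page shows up as a zero-length range, so page indexes stay aligned with the source document.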

`extract-python`

Added

Added the utils.write_pages helper to serialize markdown documents and yield page byte ranges

pirhoo

Thanks for the swift implem! I've made a few suggestions.

pirhoo · 2026-06-24T15:04:02Z

 import markdown2
 import pypdfium2
-from extract_core import BaseModel, OutputFormat, PageIndexes
+from extract_core import BaseModel, OutputFormat, PageRanges


important: I think PageRanges doesn't exist anymore. If I understood correctly, it's replaced by Pages.

pirhoo · 2026-06-24T15:04:30Z

 ) -> list[dict[str, tuple[int, int]]]:
    all_pages = [
-        PageIndexes.model_validate_json(
+        PageRanges.model_validate_json(


important: same here, we need:

    Pages.model_validate_json(
        (root / compared / "artifacts" / "pages.json").read_text()
    ).byte_ranges
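For readers unfamiliar with how these ranges are consumed, here is a sketch of reading a `pages.json` artifact and slicing one page out of the markdown bytes. It uses the standard `json` module rather than the pydantic model. The `bytes_ranges` key (taken from the PR description), the `doc.md` file name, the `read_page` helper, and the half-open range convention are all assumptions.

```python
import json
import tempfile
from pathlib import Path


def read_page(artifacts_dir: Path, page_index: int) -> str:
    """Return the text of one page using the byte ranges stored in pages.json."""
    pages = json.loads((artifacts_dir / "pages.json").read_text())
    start, end = pages["bytes_ranges"][page_index]
    # Slice on bytes, not characters: ranges are byte offsets into the file.
    data = (artifacts_dir / "doc.md").read_bytes()
    return data[start:end].decode()


with tempfile.TemporaryDirectory() as tmp:
    artifacts = Path(tmp)
    first, second = "café\n---\n", "page two"
    body = (first + second).encode()
    (artifacts / "doc.md").write_bytes(body)
    split = len(first.encode())  # "é" takes two bytes in UTF-8
    (artifacts / "pages.json").write_text(
        json.dumps({"total": 2, "bytes_ranges": [[0, split], [split, len(body)]]})
    )
    second_page = read_page(artifacts, 1)
    print(second_page)  # page two
```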

pirhoo · 2026-06-24T15:08:08Z

+            with md_path.open("wb") as f:
+                pages = write_pages(pages, page_sep, f)
+        # Clean up the tmp page file before move everything to the end destination
+        current_page_path.unlink()


hint: this will crash if the current page doesn't exist. This can be adjusted:

Suggested change

-        current_page_path.unlink()
+        current_page_path.unlink(missing_ok=True)
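A small sketch of the difference between the two calls, using a temporary directory. The `current_page.md` file name is hypothetical, and `missing_ok` requires Python 3.8 or later.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    current_page_path = Path(tmp) / "current_page.md"  # hypothetical file name

    # The file was never created, so a bare unlink() raises.
    try:
        current_page_path.unlink()
        raised = False
    except FileNotFoundError:
        raised = True

    # With missing_ok=True the call is a no-op instead.
    current_page_path.unlink(missing_ok=True)
    still_missing = not current_page_path.exists()
```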

pirhoo · 2026-06-24T15:24:09Z

+    pages_byte_sizes = []
+    sentinel = object()
+    while True:
+        content = next(pages, sentinel) if next_page is None else next_page


hint: write_pages uses next_page is None as the "no lookahead yet" marker. After the first iteration, the next(pages) branch is only re-entered when next_page is None, which can only happen if a page value is literally None, so the marker and a None page are conflated.

You can rewrite with a sentinel, something like:

    def write_pages(pages: Iterable[str | None], page_sep: str, out: BinaryIO) -> Pages:
        it = iter(pages)
        sentinel = object()
        sizes = []
        prev = next(it, sentinel)
        while prev is not sentinel:
            cur = next(it, sentinel)
            content = prev or ""  # This is the trick
            if cur is not sentinel:
                content += page_sep
            sizes.append(out.write(content.encode()))
            prev = cur
        return Pages.from_pages_bytes_sizes(sizes)
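To check the behaviour of the suggested lookahead loop, here is a runnable variant that returns the raw byte sizes instead of a `Pages` object, so it needs nothing outside the standard library. The `write_pages_sizes` name and the `"\n---\n"` separator are assumptions for the example.

```python
import io
from typing import BinaryIO, Iterable, Optional


def write_pages_sizes(
    pages: Iterable[Optional[str]], page_sep: str, out: BinaryIO
) -> list[int]:
    """Same lookahead loop as the suggestion, returning sizes (stand-in for Pages)."""
    it = iter(pages)
    sentinel = object()
    sizes = []
    prev = next(it, sentinel)
    while prev is not sentinel:
        cur = next(it, sentinel)
        content = prev or ""  # a None page is written as an empty page
        if cur is not sentinel:
            content += page_sep  # separator only between pages, not after the last
        sizes.append(out.write(content.encode()))
        prev = cur
    return sizes


buffer = io.BytesIO()
sizes = write_pages_sizes(iter(["first", None, "last"]), "\n---\n", buffer)
print(sizes)  # [10, 5, 4]
print(buffer.getvalue())  # b'first\n---\n\n---\nlast'
```

Because the end of the iterator is detected with a private `object()` rather than `None`, a `None` page in the middle of the stream no longer changes the control flow.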

ClemDoum self-assigned this Jun 24, 2026

ClemDoum force-pushed the feature(extract-python)/page-by-ranges branch from 28ef803 to 8d5032d Compare June 24, 2026 13:26

feature(extract-core): output page byte ranges

b531afa

ClemDoum force-pushed the feature(extract-python)/page-by-ranges branch from 8d5032d to b531afa Compare June 24, 2026 13:38

ClemDoum marked this pull request as ready for review June 24, 2026 13:38

ClemDoum requested a review from pirhoo June 24, 2026 13:39

pirhoo requested changes Jun 24, 2026

ClemDoum wants to merge 1 commit into main from feature(extract-python)/page-by-ranges
