feat: add data licensing source registry #337

Open
duncanleo wants to merge 7 commits into main from codex/data-licensing-attribution-plan

duncanleo (Member) commented Jun 28, 2026

Summary

Adds the Phase 1 data licensing foundation for MRTDown data:

  • documents the MRTDown-authored data license boundary in LICENSE-DATA.md
  • adds data/rights/source-registry.json for source rights and attribution policy
  • adds shared rights and attribution schemas in @mrtdown/core
  • adds deterministic source-registry matching helpers in @mrtdown/fs
  • documents the source registry and licensing boundary in README.md

Impact

Downstream consumers get a clearer CC-BY-4.0 boundary for MRTDown-authored data while third-party evidence source material remains explicitly carved out. Later validation and attribution generation can now build on stable schemas and deterministic source-rule resolution.

Validation

  • npm run test:core
  • npm run test:fs
  • npm run build:packages
  • npm run lint
  • npm run check:boundaries
  • npm run check:docs

The fs test suite also parses the canonical source registry and confirms all current canonical and fixture evidence rows resolve to exactly one source rule.
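
As a rough illustration of what "resolves to exactly one source rule" means, the sketch below matches an evidence URL against a list of rules and reports no-match and ambiguous cases as failures. It assumes a much simpler rule shape than the real registry: `SourceRuleSketch`, `ResolutionSketch`, and `resolveRuleSketch` are hypothetical names for illustration, not the schemas in @mrtdown/core or the helpers in @mrtdown/fs.

```typescript
// Hypothetical, simplified rule shape: each rule lists the hostnames it covers.
interface SourceRuleSketch {
  id: string;
  hosts: string[]; // lowercase hostnames
}

// Hypothetical result union: success carries the single matching rule,
// failure carries a reason the caller can report.
type ResolutionSketch =
  | { ok: true; rule: SourceRuleSketch }
  | { ok: false; reason: "invalid-url" | "no-match" | "ambiguous" };

function resolveRuleSketch(
  rules: readonly SourceRuleSketch[],
  sourceUrl: string,
): ResolutionSketch {
  let hostname: string;
  try {
    hostname = new URL(sourceUrl).hostname.toLowerCase();
  } catch {
    // new URL() throws on strings that are not absolute URLs.
    return { ok: false, reason: "invalid-url" };
  }

  const [first, ...rest] = rules.filter((rule) => rule.hosts.includes(hostname));
  if (first === undefined) return { ok: false, reason: "no-match" };
  // More than one match is an error rather than "first rule wins", so the
  // outcome does not depend on the order of rules in the registry.
  if (rest.length > 0) return { ok: false, reason: "ambiguous" };
  return { ok: true, rule: first };
}
```

Treating ambiguity as a failure is what makes the resolution deterministic in this sketch: two overlapping rules surface as a validation error instead of silently picking one.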

Summary by CodeRabbit

  • New Features
    • Introduced a data licensing & attribution system with a source registry, rights resolution, and generated attribution artifacts.
    • Added public-export redaction for non-public evidence and added rights to export manifests.
    • Extended validation and CLI to support --scope rights.
  • Documentation
    • Added/expanded data licensing, attribution plan, and repository layout/licensing details.
  • Bug Fixes
    • Tightened crowd-report URL validation to the MRTDown reports domain.
  • Tests
    • Added rights schema/rule-matching/validation tests; updated fixtures, ingest/triage tests, and fixture generation.

coderabbitai Bot commented Jun 28, 2026

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 00cc4dbb-da7b-4208-84b8-0a0d947d6ca4

📥 Commits

Reviewing files that changed from the base of the PR and between baa9eca and e326fbd.

📒 Files selected for processing (6)
  • data/rights/source-registry.json
  • packages/fs/src/index.test.ts
  • packages/fs/src/manifest.ts
  • packages/fs/src/publicExport.ts
  • packages/fs/src/rights.test.ts
  • scripts/generate-fixtures.mjs
📝 Walkthrough

Adds a data licensing document, a canonical source-rights registry, schema and resolution logic, validation/export wiring, fixture generation, and updated crowd-report URL requirements across ingest and triage paths.

Changes

Data Licensing and Source Registry

  • License and plan docs
    Files: LICENSE-DATA.md, README.md, docs/plans/README.md, docs/plans/active/data-licensing-attribution.md
    Summary: Adds the data license file, README licensing notes, the active plan entry, and the full data licensing and attribution plan document.

  • Registry JSON and rights schemas
    Files: data/rights/source-registry.json, packages/core/src/schema/Rights.ts, packages/core/src/index.ts, packages/core/src/schema/Rights.test.ts
    Summary: Adds the canonical source registry data, the rights and attribution schemas, the schema export, and tests for registry and attribution parsing.

  • Rights rule matching and resolution
    Files: packages/fs/src/rights.ts, packages/fs/src/index.ts, packages/fs/src/rights.test.ts
    Summary: Adds the rule-resolution helpers for matching evidence to registry rules, the public resolution union, the package barrel export, and tests for selector matching, archive handling, ambiguity, and failure reasons.

  • Validation wiring and export generation
    Files: packages/fs/src/constants.ts, packages/fs/src/validate.ts, packages/fs/src/publicExport.ts, packages/fs/src/index.test.ts, packages/fs/src/manifest.ts, packages/fs/src/pagesIndex.ts, scripts/generate-fixtures.mjs, scripts/build-pages-artifact.mjs
    Summary: Extends filesystem constants and validation scope handling for rights data, loads the source registry during validation, validates evidence rights during reference checks, redacts non-public evidence during export, writes the registry into generated fixtures, and adds validation and manifest tests.

  • Crowd-report URL alignment
    Files: fixtures/ingest/crowd-report.json, packages/ingest-contracts/README.md, packages/ingest-contracts/src/index.ts, packages/ingest-contracts/src/index.test.ts, packages/triage/src/util/ingestContent/helpers/formatContentTextForIngest.test.ts, packages/triage/src/util/ingestContent/helpers/getEvidenceProvenanceForIngestContent.test.ts, packages/triage/src/util/ingestContent/index.test.ts
    Summary: Updates the crowd-report ingest contract, its tests, the sample fixture, triage provenance tests, and the ingest content tests to use the MRTDown reports host.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

  • foldaway/mrtdown-data#222: Directly related through the ManifestSchema extension that adds the top-level rights field.
  • foldaway/mrtdown-data#223: Directly related through the validateDataRoot and validation-scope wiring extended here for the new rights scope.
  • foldaway/mrtdown-data#226: Directly related through the Pages artifact build flow updated here with redaction and license copying.

Poem

🐇 I hop by the registry, neat and bright,
Rights in a bundle, all tucked in tight.
Reports wear their home-domain badge with pride,
And archive trails still know where they’ve bide.
A license for data, a path made clear,
Hoppity-hop, the credits are here!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning
    Explanation: Docstring coverage is 0.00%, which is insufficient. The required threshold is 80.00%.
    Resolution: Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Description Check: ✅ Passed
    Explanation: Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title check: ✅ Passed
    Explanation: The title clearly matches the main change: introducing a data licensing source registry and related licensing foundation.
  • Linked Issues check: ✅ Passed
    Explanation: Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed
    Explanation: Check skipped because no linked issues were found for this pull request.

@duncanleo duncanleo marked this pull request as ready for review June 28, 2026 04:07

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/plans/active/data-licensing-attribution.md`:
- Around line 142-156: The documented taxonomy is missing the live generic-web
category, so update the recurring categories list to include it alongside the
existing entries. Use the existing generic-web symbol from Rights.ts and the
source-registry mappings for lta-website and web-archive-snapshot as the
reference point, and make sure the plan reflects the same category set used by
the schema and registry.

In `@packages/core/src/schema/Rights.ts`:
- Around line 36-51: The SourceRegistryRuleMatchSchema in Rights.ts currently
allows evidenceType-only matches, but the rights matcher in
packages/fs/src/rights.ts cannot use them when sourceUrl is missing or invalid.
Fix this by either tightening SourceRegistryRuleMatchSchema to require a URL
selector alongside evidenceType, or updating the rights matching logic
(especially the selector evaluation path) so evidenceType-only rules are still
checked without relying on sourceUrl parsing.

In `@packages/fs/src/rights.ts`:
- Around line 46-49: The host matching in rights.ts uses sourceUrl.host, which
includes the port and can cause false mismatches for rules like x.com versus
x.com:8443. Update the matching logic in the sourceUrlHost check to use
sourceUrl.hostname instead, keeping the normalizeHost comparison in place so the
existing rule matching behavior in the rights matching flow remains unchanged.
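
The host versus hostname distinction in the last finding comes from the standard `URL` API: `host` keeps a non-default port, while `hostname` never includes one. The sketch below shows the difference. `normalizeHostSketch` and `hostMatchesSketch` are hypothetical stand-ins written for this example; the actual `normalizeHost` in packages/fs/src/rights.ts may behave differently.

```typescript
// Hypothetical stand-in for the normalizeHost helper mentioned above:
// trims whitespace and lowercases so comparisons are case-insensitive.
function normalizeHostSketch(host: string): string {
  return host.trim().toLowerCase();
}

// Compares a rule's host against the URL's hostname (port excluded).
function hostMatchesSketch(ruleHost: string, sourceUrl: URL): boolean {
  return normalizeHostSketch(sourceUrl.hostname) === normalizeHostSketch(ruleHost);
}

const withPort = new URL("https://x.com:8443/status/1");

// host includes the explicit non-default port; hostname does not.
console.log(withPort.host); // "x.com:8443"
console.log(withPort.hostname); // "x.com"

// A rule written as "x.com" matches only when hostname is used.
console.log(hostMatchesSketch("x.com", withPort)); // true
console.log(normalizeHostSketch(withPort.host) === "x.com"); // false
```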

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 884cd8b4-75a3-4727-954c-d1a22d9db990

📥 Commits

Reviewing files that changed from the base of the PR and between e0266c7 and d17174f.

📒 Files selected for processing (11)
  • LICENSE-DATA.md
  • README.md
  • data/rights/source-registry.json
  • docs/plans/README.md
  • docs/plans/active/data-licensing-attribution.md
  • packages/core/src/index.ts
  • packages/core/src/schema/Rights.test.ts
  • packages/core/src/schema/Rights.ts
  • packages/fs/src/index.ts
  • packages/fs/src/rights.test.ts
  • packages/fs/src/rights.ts

Comment thread docs/plans/active/data-licensing-attribution.md
Comment thread packages/core/src/schema/Rights.ts
Comment thread packages/fs/src/rights.ts Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d17174f119

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread data/rights/source-registry.json
Comment thread data/rights/source-registry.json
Comment thread data/rights/source-registry.json

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f6124b04d6

Comment thread packages/fs/src/rights.ts Outdated
Comment thread packages/fs/src/validate.ts
Comment thread data/rights/source-registry.json

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/fs/src/index.test.ts`:
- Around line 256-264: The tests are parsing and rewriting evidence.ndjson as a
single JSON document instead of using the already-parsed bundle evidence from
listIssueBundles(). Update the relevant assertions in the evidence-related test
cases to mutate bundle.evidence directly, then write the file back in NDJSON
form while preserving all existing rows. Use the listIssueBundles() result and
the existing EvidenceSchema/EvidenceSchema.parse flow as the entry points to
locate the affected test logic.
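
For context on why this matters: an NDJSON file holds one JSON value per line, so passing the whole file to `JSON.parse` fails as soon as there is more than one row. The helpers below are a minimal sketch of a line-by-line round trip. `parseNdjsonSketch` and `serializeNdjsonSketch` are hypothetical names for this example and are not part of the repository, which reads evidence through listIssueBundles() and EvidenceSchema.

```typescript
// Parses NDJSON text into rows, one JSON value per non-empty line.
function parseNdjsonSketch(text: string): unknown[] {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as unknown);
}

// Serializes rows back to NDJSON: one compact JSON value per line,
// with a trailing newline after the last row.
function serializeNdjsonSketch(rows: readonly unknown[]): string {
  return rows.map((row) => JSON.stringify(row)).join("\n") + "\n";
}

const ndjsonSample = '{"id":"a"}\n{"id":"b"}\n';
const ndjsonRows = parseNdjsonSketch(ndjsonSample);

// Mutate a single row, then write every row back so none are dropped.
const updatedRows = ndjsonRows.map((row, index) =>
  index === 0 ? { ...(row as Record<string, unknown>), text: "edited" } : row,
);
const rewritten = serializeNdjsonSketch(updatedRows);
```

In a test, the same idea applies to the already-parsed bundle evidence: change the one row under test and serialize the full list, rather than overwriting the file with a single JSON document.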

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 00ac726e-886c-410a-88e4-9c2e937360e7

📥 Commits

Reviewing files that changed from the base of the PR and between d17174f and dd7ba45.

📒 Files selected for processing (16)
  • data/rights/source-registry.json
  • docs/plans/active/data-licensing-attribution.md
  • fixtures/ingest/crowd-report.json
  • packages/core/src/schema/Rights.ts
  • packages/fs/src/constants.ts
  • packages/fs/src/index.test.ts
  • packages/fs/src/rights.test.ts
  • packages/fs/src/rights.ts
  • packages/fs/src/validate.ts
  • packages/ingest-contracts/README.md
  • packages/ingest-contracts/src/index.test.ts
  • packages/ingest-contracts/src/index.ts
  • packages/triage/src/util/ingestContent/helpers/formatContentTextForIngest.test.ts
  • packages/triage/src/util/ingestContent/helpers/getEvidenceProvenanceForIngestContent.test.ts
  • packages/triage/src/util/ingestContent/index.test.ts
  • scripts/generate-fixtures.mjs
✅ Files skipped from review due to trivial changes (2)
  • packages/triage/src/util/ingestContent/helpers/getEvidenceProvenanceForIngestContent.test.ts
  • packages/ingest-contracts/README.md
🚧 Files skipped from review as they are similar to previous changes (3)
  • data/rights/source-registry.json
  • docs/plans/active/data-licensing-attribution.md
  • packages/core/src/schema/Rights.ts

Comment thread packages/fs/src/index.test.ts Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dd7ba45bf3

Comment thread packages/fs/src/validate.ts Outdated
Comment thread packages/fs/src/validate.ts
Comment thread packages/fs/src/constants.ts
Comment thread LICENSE-DATA.md

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9f78e88d33

Comment thread LICENSE-DATA.md
Comment thread packages/core/src/schema/Manifest.ts
Comment thread packages/core/src/schema/Manifest.ts

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/fs/src/publicExport.ts`:
- Around line 31-33: The fallback in publicExport currently fails open because
unresolved rights checks (`!result.ok`) return the original evidence unchanged;
update the export flow in `publicExport()` so inconclusive results do not pass
through full text. Either redact the evidence when `resolveSourceRegistryRule()`
is not ok or abort the export before `validateDataRoot()`, and keep the allow
path limited to `result.rule.publicExportAllowed` only.
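
The fail-closed option described here can be sketched as follows. This is an illustration under assumed types, not the code in packages/fs/src/publicExport.ts: `EvidenceRowSketch`, `RuleResultSketch`, `REDACTED_TEXT`, and `redactForPublicExportSketch` are hypothetical, and only the `ok` flag and `rule.publicExportAllowed` field are taken from the comment above.

```typescript
// Hypothetical evidence row; the real EvidenceSchema has more fields.
interface EvidenceRowSketch {
  id: string;
  text: string;
  sourceUrl: string;
}

// Hypothetical result shape mirroring the ok / rule.publicExportAllowed
// fields named in the review comment.
type RuleResultSketch =
  | { ok: true; rule: { publicExportAllowed: boolean } }
  | { ok: false; reason: string };

// Placeholder written in place of the original text (an assumption).
const REDACTED_TEXT = "[redacted]";

function redactForPublicExportSketch(
  evidence: EvidenceRowSketch,
  result: RuleResultSketch,
): EvidenceRowSketch {
  // The only path that keeps full text: a resolved rule that allows export.
  if (result.ok && result.rule.publicExportAllowed) {
    return evidence;
  }
  // Unresolved results and disallowed rules both fall through to redaction.
  return { ...evidence, text: REDACTED_TEXT };
}
```

Writing the allow path as the single early return means any result shape added later is redacted by default unless it is explicitly allowed.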

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 65638d2f-d980-40e8-9968-fa7992046e7f

📥 Commits

Reviewing files that changed from the base of the PR and between dd7ba45 and baa9eca.

📒 Files selected for processing (12)
  • .github/workflows/pages-deploy.yml
  • .github/workflows/pages-preview.yml
  • packages/cli/src/args.ts
  • packages/cli/src/index.test.ts
  • packages/core/src/schema/Manifest.ts
  • packages/fs/src/index.test.ts
  • packages/fs/src/index.ts
  • packages/fs/src/manifest.ts
  • packages/fs/src/pagesIndex.ts
  • packages/fs/src/publicExport.ts
  • packages/fs/src/validate.ts
  • scripts/build-pages-artifact.mjs
💤 Files with no reviewable changes (1)
  • packages/fs/src/validate.ts
✅ Files skipped from review due to trivial changes (1)
  • .github/workflows/pages-deploy.yml
🚧 Files skipped from review as they are similar to previous changes (2)
  • packages/fs/src/index.ts
  • packages/fs/src/index.test.ts

Comment thread packages/fs/src/publicExport.ts

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: baa9eca838

Comment thread packages/fs/src/manifest.ts Outdated
Comment thread data/rights/source-registry.json Outdated