
Legal full-text access ladder (Unpaywall/CORE/author-request) for #183 #189

Merged
realmarcin merged 7 commits into main from claude/fulltext-access-layer on Jul 5, 2026

@realmarcin

Contributor

Adds scripts/fulltext_access.py, a legal, no-gray-area full-text access ladder for the paywalled-record residual (#183): Europe PMC OA → Unpaywall → CORE → author-request draft. It is wired into the add-growth-conditions skill so that closed-access records get a legal access attempt (Unpaywall/CORE OA copies) plus a ready-to-send author-request email instead of just being skipped. This directly implements the stack recommended by the deep-research report on nonprofit OA access: Unpaywall, Europe PMC, and CORE as the core sources, author requests for the residual, and Sci-Hub excluded. CORE requires a free CORE_API_KEY (public-research orgs qualify). Related: #183.

Copilot AI review requested due to automatic review settings July 3, 2026 20:37
…183

scripts/fulltext_access.py: given a PMID/DOI, tries in order of legality+cost —
Europe PMC OA (fullTextXML) -> Unpaywall best OA location (free) -> CORE (needs
$CORE_API_KEY, free for public-research orgs) -> and if none, prints an
author-request email draft. No gray-area sources (Sci-Hub excluded by design).
Verified: OA paper -> europepmc_oa URL; closed-access paper -> NO_LEGAL_OA +
author-request draft.
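
The ordering described in this commit message can be sketched as a chain of resolver callables tried in sequence. This is a minimal illustration, not the code in scripts/fulltext_access.py: `run_ladder`, the stub resolvers, the `unpaywall` label, and the `(label, url)` return shape are all hypothetical. Only the step order and the `europepmc_oa` / `NO_LEGAL_OA` labels are taken from the text above.

```python
from typing import Callable, Optional

# A resolver takes (pmid, doi) and returns a full-text URL, or None on a miss.
Resolver = Callable[[Optional[str], Optional[str]], Optional[str]]


def run_ladder(
    pmid: Optional[str],
    doi: Optional[str],
    steps: list[tuple[str, Resolver]],
) -> tuple[str, Optional[str]]:
    """Try each legal source in order; return (label, url) for the first hit."""
    for label, resolver in steps:
        url = resolver(pmid, doi)
        if url:
            return label, url
    # No source had a legal copy; the caller would then draft an author request.
    return "NO_LEGAL_OA", None


# Stub resolvers standing in for the real network calls (hypothetical).
def stub_epmc(pmid: Optional[str], doi: Optional[str]) -> Optional[str]:
    return None  # pretend Europe PMC has no OA full text


def stub_unpaywall(pmid: Optional[str], doi: Optional[str]) -> Optional[str]:
    return f"https://example.org/oa/{doi}" if doi else None


steps = [("europepmc_oa", stub_epmc), ("unpaywall", stub_unpaywall)]
print(run_ladder(None, "10.1234/abc", steps))
```

Keeping the steps in a plain list makes the legality and cost ordering explicit and lets a later source (for example CORE, when a key is present) be appended without touching the loop.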

Wires into the add-growth-conditions skill workflow (step 3) so closed-access
records (the #183 residual) get a legal access attempt + a ready author-request
instead of just "paywalled, skip". Directly informed by the deep-research report
on nonprofit OA access.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@realmarcin force-pushed the claude/fulltext-access-layer branch from d49fd88 to 8b3d068 on July 3, 2026 20:38

Copilot AI left a comment

Pull request overview

Adds a new helper script and updates the add-growth-conditions curation skill to pursue legally available full text (OA copies) for closed-access records before falling back to an author-request draft, supporting enrichment work described in #183.

Changes:

  • Added scripts/fulltext_access.py to locate legal full text via Europe PMC OA → Unpaywall → CORE, else generate an author-request email draft.
  • Updated .claude/skills/add-growth-conditions/skill.md to instruct curators/agents to use the new ladder and to avoid gray-area sources.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File | Description
scripts/fulltext_access.py | Implements the legal full-text access ladder and author-request draft output for closed-access residuals.
.claude/skills/add-growth-conditions/skill.md | Documents how to use the new ladder in the growth-conditions enrichment workflow.


Comment on lines +34 to +35
EMAIL = os.environ.get("UNPAYWALL_EMAIL", "marcinjoachimiak@gmail.com")
UA = {"User-Agent": f"CommunityMech-fulltext/1.0 (mailto:{EMAIL})"}
Comment on lines +74 to +78
def try_unpaywall(doi: str | None) -> str | None:
    if not doi:
        return None
    d = _get_json(f"https://api.unpaywall.org/v2/{doi}?email={EMAIL}")
    if not d or not d.get("is_oa"):
Comment on lines +38 to +48
def _get_json(url: str, headers: dict | None = None):
    if not url.startswith("https://"):  # only fixed https API hosts are queried
        return None
    req = urllib.request.Request(
        url, headers={**UA, **(headers or {})}
    )  # noqa: S310 (https-only, checked above)
    try:
        with urllib.request.urlopen(req, timeout=30) as r:  # noqa: S310
            return json.load(r)
    except Exception:
        return None
Comment on lines +51 to +52
def epmc_core(pmid: str | None, doi: str | None) -> dict:
    """Europe PMC core record: pmcid, OA status, DOI/PMID, title, corresponding author."""
Comment on lines +104 to +106
return (
f"To: <corresponding author of: {authors}>\n"
f"Subject: Request for Methods details — {title[:70]}\n\n"
Comment on lines +52 to +55
3. Locate legal full text via the access ladder — `scripts/fulltext_access.py
--pmid <pmid>` (or `--doi <doi>`). It tries, in order: **Europe PMC OA**
(`fullTextXML`) → **Unpaywall** best OA location (free) → **CORE** (set
`CORE_API_KEY`) → and if none, prints an **author-request email draft**.
realmarcin and others added 5 commits July 3, 2026 14:51
…fallback

Ran scripts/fulltext_access.py over the 52 records lacking growth conditions:
- 30/52 have a legal OA copy — 17 NEWLY accessible via Unpaywall (green/gold OA
  the earlier Europe-PMC-only sweep never checked), 13 already-seen europepmc_oa
  (natural/endosymbiont/computational — no cultivation conditions to gain).
- 22 have no legal OA -> author-request email drafts in reports/author_requests/.
Summary + lists in reports/fulltext_access_183.md.

Also: add a CrossRef title/author fallback to the author-request draft so DOI-only
records not in Europe PMC (e.g. older or in-press DOIs) still get a populated,
send-ready request instead of an empty template.
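
The CrossRef fallback described above could look roughly like the following. It is a sketch under assumptions, not the PR's code: `crossref_title_authors` is a hypothetical helper, and it assumes the JSON shape returned by CrossRef's `/works/{doi}` endpoint, where `message.title` is a list of strings and `message.author` is a list of objects carrying `given`/`family` (or `name` for group authors). The sample record is invented for illustration.

```python
def crossref_title_authors(record: dict) -> tuple[str, str]:
    """Extract (title, authors) from a CrossRef /works/{doi} JSON response."""
    msg = record.get("message") or {}
    titles = msg.get("title") or []
    title = titles[0] if titles else ""
    names = []
    for author in msg.get("author") or []:
        # Personal authors carry given/family; group authors carry only "name".
        full = " ".join(p for p in (author.get("given"), author.get("family")) if p)
        names.append(full or author.get("name", ""))
    return title, ", ".join(n for n in names if n)


# Invented sample in the assumed CrossRef shape.
sample = {
    "message": {
        "title": ["A defined consortium"],
        "author": [{"given": "A.", "family": "Smith"}, {"name": "Some Consortium"}],
    }
}
print(crossref_title_authors(sample))
```

Returning empty strings on a missing or malformed record lets the draft generator decide whether to emit a placeholder rather than failing on DOI-only records.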

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Applied the access ladder to the 9 cultivable records among the 17 Unpaywall
hits; 6 gained real growth conditions from OA full text the earlier
Europe-PMC-only sweep had missed:
- 000144 Altered Schaedler Flora: BHI/Schaedler's+serum, anaerobic, gnotobiotic (OUP PDF)
- 000147 Trichodesmium/Alteromonas: YBC-II seawater, 26 °C, 14:10 light (PMC mirror)
- 000098 Watermelon SynCom8: NB medium, 37 °C, 170 rpm, greenhouse pots (BMC PDF)
- 000139 Aalborg EBPR: wastewater sludge + CONTINUOUS BioDenipho plant (OUP mirror)
- 000043 Maize root: 1/2 MS agar gnotobiotic, 30 °C, pH 6.0, 16:8 light (PMC mirror)
- 000075 Trichoderma lactate: Mandels medium + FED_BATCH membrane bioreactor (EPFL PDF)
All source-backed with verbatim evidence snippets; validate + validate-terms pass.
3 no-change (278/284/292): Unpaywall "OA" was a false positive (DOI resolver /
repository metadata page, not real full text).

Also harden try_unpaywall: reject a best_oa_location that is just the doi.org
resolver (Unpaywall marks some Elsevier bronze/hybrid records is_oa but the
location redirects to the paywall) — verified it now returns NO_LEGAL_OA for
those instead of a false ACCESS.
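
The hardening described above, rejecting a best OA location that is only the DOI resolver, can be sketched as a host check on the returned URL. This is an illustration under assumptions rather than the merged `try_unpaywall`: `usable_oa_url` and `DOI_RESOLVER_HOSTS` are hypothetical names, and the `url_for_pdf` / `url` keys are assumed from Unpaywall's documented `best_oa_location` object.

```python
from typing import Optional
from urllib.parse import urlparse

# Hosts that only redirect to the publisher page (assumed list, extend as needed).
DOI_RESOLVER_HOSTS = {"doi.org", "dx.doi.org"}


def usable_oa_url(unpaywall_record: Optional[dict]) -> Optional[str]:
    """Return an OA URL from an Unpaywall record, or None if it is only a DOI redirect."""
    if not unpaywall_record or not unpaywall_record.get("is_oa"):
        return None
    loc = unpaywall_record.get("best_oa_location") or {}
    # Prefer a direct PDF link when Unpaywall provides one.
    url = loc.get("url_for_pdf") or loc.get("url")
    if not url:
        return None
    host = (urlparse(url).hostname or "").lower()
    if host in DOI_RESOLVER_HOSTS:
        # is_oa was set, but the "location" just bounces to the publisher page.
        return None
    return url


print(usable_oa_url({"is_oa": True, "best_oa_location": {"url": "https://doi.org/10.1016/x"}}))
```

A host check catches only the resolver case. The repository-metadata-page false positives mentioned above would need a separate content check.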

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… residuals

scripts/access_sweep.py — goes one layer past the per-record ladder for the
NO_LEGAL_OA residuals: queries OpenAlex (ALL oa locations, not just Unpaywall's
single best_oa_location) + the bioRxiv/medRxiv API (catches 2025+ 10.64898-prefix
preprints OpenAlex still lags on) + CORE when $CORE_API_KEY is set. Emits
reports/missing_pdfs.md: recovered OA URLs where any exist, else the publisher
landing URL as a last-resort pointer for a curator with institutional access.
Result: 1/22 recovered (000281 bioRxiv preprint — OA but Cloudflare-gated to
bots), 21/22 genuinely paywalled with landing URLs listed. Sci-Hub and other
gray-area mirrors deliberately excluded (sibling CultureMech gates them off too).
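
To illustrate the difference between reading every OA location and reading only a single best one, here is a small sketch that walks an OpenAlex work record. It is not taken from scripts/access_sweep.py: `all_oa_urls` is a hypothetical helper, the sample record is invented, and the `locations`, `is_oa`, `pdf_url`, and `landing_page_url` field names are assumed from the OpenAlex work object.

```python
def all_oa_urls(work: dict) -> list[str]:
    """Collect every OA URL from an OpenAlex work's `locations`, PDFs first."""
    pdfs: list[str] = []
    landings: list[str] = []
    for loc in work.get("locations") or []:
        if not loc.get("is_oa"):
            continue  # skip paywalled locations entirely
        if loc.get("pdf_url"):
            pdfs.append(loc["pdf_url"])
        elif loc.get("landing_page_url"):
            landings.append(loc["landing_page_url"])
    # Deduplicate while preserving order (dict keys keep insertion order).
    return list(dict.fromkeys(pdfs + landings))


# Invented sample in the assumed OpenAlex shape.
work = {
    "locations": [
        {"is_oa": False, "landing_page_url": "https://publisher.example/paywalled"},
        {"is_oa": True, "landing_page_url": "https://repo.example/item/1", "pdf_url": None},
        {"is_oa": True, "pdf_url": "https://repo.example/item/2.pdf"},
    ]
}
print(all_oa_urls(work))
```

Listing PDFs ahead of landing pages is a design choice for the curator's convenience; a landing page marked OA can still turn out to be a metadata-only page.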

scripts/author_request_index.py — consolidates the 22 author-request drafts into
reports/author_requests_index.md with resolved title/journal/likely-corresponding
author/landing URL. Corresponding-author emails are NOT machine-resolvable (they
live only in the paywalled full text; CrossRef/EPMC don't expose them), so the
package is built for a human to open each landing URL, grab the email, and send.
Nothing is sent by this script.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…full text

The access sweep recovered this record's source as an OA bioRxiv preprint
(doi:10.64898/2026.05.29.728681); added a growth_media block from the Methods:
- eGAM (enhanced Gifu Anaerobic Medium) for routine culturing, adjusted M9
  minimal medium for metabolic labelling; 37 °C; anaerobic (5% CO2, 5% H2, N2
  balance) in a Coy anaerobic vinyl chamber; 48 h strain cultures pooled into
  the six-member SynCom (P. vulgatus, B. breve, B. infantis, E. coli,
  L. gasseri, R. gnavus), abundance-equalised to OD 0.2 at 5 h.
Three verbatim IN_VITRO evidence snippets. Schema + terms validation pass.

Clears the one open item from reports/missing_pdfs.md (the sole OA-recovered
residual); the remaining 21 stay genuinely paywalled.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…rately)

Policy: kb/communities/ holds only defined communities. Undefined ones
(donor-derived faecal/environmental inocula, enrichments, unresolved consortia)
are not added to the resource but tracked in reports/undefined_communities.md so
the observation + cultivation conditions are preserved.

First entry: the donor-derived infant faecal fermentation from the 000281 paper
(Cryptobiotix bioreactors, M0017 medium, 5 g/L prebiotics, 37 °C / 24 h anaerobic)
— the undefined backdrop to the defined SynCom already curated as 000281.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@realmarcin merged commit 7933177 into main on Jul 5, 2026
4 checks passed
@realmarcin deleted the claude/fulltext-access-layer branch on July 5, 2026 07:21