Legal full-text access ladder (Unpaywall/CORE/author-request) for #183 #189
Merged
…183 scripts/fulltext_access.py: given a PMID/DOI, tries sources in order of legality and cost: Europe PMC OA (fullTextXML) -> Unpaywall best OA location (free) -> CORE (needs $CORE_API_KEY, free for public-research orgs). If none succeeds, it prints an author-request email draft. No gray-area sources (Sci-Hub excluded by design).

Verified: OA paper -> europepmc_oa URL; closed-access paper -> NO_LEGAL_OA + author-request draft.

Wires into the add-growth-conditions skill workflow (step 3) so closed-access records (the #183 residual) get a legal access attempt and a ready author-request instead of just "paywalled, skip". Directly informed by the deep-research report on nonprofit OA access.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
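The ordering described in this commit message can be sketched as a simple first-hit-wins loop. This is a minimal illustration, not the script's actual code: `run_ladder`, the `Rung` type, and the `"unpaywall"` label are hypothetical names; only `europepmc_oa` and `NO_LEGAL_OA` appear in the commit message.

```python
from __future__ import annotations

from typing import Callable, Optional

# A rung takes (pmid, doi) and returns a full-text URL, or None if it has nothing.
# Hypothetical type, introduced only for this sketch.
Rung = Callable[[Optional[str], Optional[str]], Optional[str]]


def run_ladder(
    pmid: Optional[str],
    doi: Optional[str],
    rungs: list[tuple[str, Rung]],
) -> tuple[str, Optional[str]]:
    """Try each rung in order and return (label, url) for the first hit."""
    for label, rung in rungs:
        url = rung(pmid, doi)
        if url:
            return label, url
    # No rung produced a URL: the caller would print the author-request draft.
    return "NO_LEGAL_OA", None
```

Passing the rungs in as a list keeps the legality-and-cost ordering explicit in one place and lets each source be stubbed out when testing without network access.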
Pull request overview
Adds a new helper script and updates the add-growth-conditions curation skill to pursue legally available full text (OA copies) for closed-access records before falling back to an author-request draft, supporting enrichment work described in #183.
Changes:
- Added `scripts/fulltext_access.py` to locate legal full text via Europe PMC OA → Unpaywall → CORE, else generate an author-request email draft.
- Updated `.claude/skills/add-growth-conditions/skill.md` to instruct curators/agents to use the new ladder and to avoid gray-area sources.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| scripts/fulltext_access.py | Implements the legal full-text access ladder and author-request draft output for closed-access residuals. |
| .claude/skills/add-growth-conditions/skill.md | Documents how to use the new ladder in the growth-conditions enrichment workflow. |
Comment on lines +34 to +35

    EMAIL = os.environ.get("UNPAYWALL_EMAIL", "marcinjoachimiak@gmail.com")
    UA = {"User-Agent": f"CommunityMech-fulltext/1.0 (mailto:{EMAIL})"}

Comment on lines +74 to +78

    def try_unpaywall(doi: str | None) -> str | None:
        if not doi:
            return None
        d = _get_json(f"https://api.unpaywall.org/v2/{doi}?email={EMAIL}")
        if not d or not d.get("is_oa"):

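The hunk above interpolates the DOI and email into the URL unescaped. DOIs may contain characters such as `<`, `#`, or spaces that change the meaning of a URL, so one possible hardening is to percent-encode both parts. The helper name `unpaywall_url` is hypothetical; the endpoint shape is the one shown in the diff.

```python
from urllib.parse import quote, urlencode


def unpaywall_url(doi: str, email: str) -> str:
    """Build an Unpaywall v2 lookup URL with the DOI and email percent-encoded."""
    # Keep "/" literal because every DOI has one between prefix and suffix;
    # everything else outside the unreserved set is escaped.
    encoded_doi = quote(doi, safe="/")
    # urlencode escapes "@" and any other reserved characters in the email.
    query = urlencode({"email": email})
    return f"https://api.unpaywall.org/v2/{encoded_doi}?{query}"
```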
Comment on lines +38 to +48

    def _get_json(url: str, headers: dict | None = None):
        if not url.startswith("https://"):  # only fixed https API hosts are queried
            return None
        req = urllib.request.Request(
            url, headers={**UA, **(headers or {})}
        )  # noqa: S310 (https-only, checked above)
        try:
            with urllib.request.urlopen(req, timeout=30) as r:  # noqa: S310
                return json.load(r)
        except Exception:
            return None

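The inline comment in this hunk says only fixed https API hosts are queried, while the check itself tests only the scheme. A minimal sketch of a check that also pins the host follows. The `ALLOWED_HOSTS` set and the `is_allowed_url` name are assumptions for illustration; the real script may query a different set of hosts.

```python
from urllib.parse import urlsplit

# Assumed host list for the services named in this PR (Europe PMC, Unpaywall,
# CORE, CrossRef). Adjust to whatever the script actually calls.
ALLOWED_HOSTS = {
    "www.ebi.ac.uk",
    "api.unpaywall.org",
    "api.core.ac.uk",
    "api.crossref.org",
}


def is_allowed_url(url: str) -> bool:
    """Return True only for https URLs whose host is on the allow-list."""
    parts = urlsplit(url)
    return parts.scheme == "https" and parts.hostname in ALLOWED_HOSTS
```

Parsing the URL rather than matching on a string prefix means a path or query that merely contains an allowed host name does not pass.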
Comment on lines +51 to +52

    def epmc_core(pmid: str | None, doi: str | None) -> dict:
        """Europe PMC core record: pmcid, OA status, DOI/PMID, title, corresponding author."""

Comment on lines +104 to +106

        return (
            f"To: <corresponding author of: {authors}>\n"
            f"Subject: Request for Methods details — {title[:70]}\n\n"

Comment on lines +52 to +55

    3. Locate legal full text via the access ladder — `scripts/fulltext_access.py
       --pmid <pmid>` (or `--doi <doi>`). It tries, in order: **Europe PMC OA**
       (`fullTextXML`) → **Unpaywall** best OA location (free) → **CORE** (set
       `CORE_API_KEY`) → and if none, prints an **author-request email draft**.

…fallback

Ran scripts/fulltext_access.py over the 52 records lacking growth conditions:
- 30/52 have a legal OA copy: 17 NEWLY accessible via Unpaywall (green/gold OA the earlier Europe-PMC-only sweep never checked), 13 already-seen europepmc_oa (natural/endosymbiont/computational, so no cultivation conditions to gain).
- 22 have no legal OA -> author-request email drafts in reports/author_requests/.

Summary + lists in reports/fulltext_access_183.md.

Also: add a CrossRef title/author fallback to the author-request draft so DOI-only records not in Europe PMC (e.g. older or in-press DOIs) still get a populated, send-ready request instead of an empty template.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
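The CrossRef fallback mentioned above could look roughly like the following. This is a sketch under stated assumptions, not the committed code: the function name `crossref_title_authors` is hypothetical, and it assumes the JSON shape returned by the CrossRef REST API's `/works/{doi}` route, where the record sits under `message`, `title` is a list of strings, and each `author` entry carries `given` and `family` fields.

```python
from __future__ import annotations


def crossref_title_authors(payload: dict) -> tuple[str, str]:
    """Extract (title, comma-joined author names) from a CrossRef works response."""
    message = payload.get("message") or {}
    titles = message.get("title") or []
    title = titles[0] if titles else ""
    names = []
    for author in message.get("author") or []:
        # Some entries (e.g. consortia) lack given/family; skip the empty ones.
        name = " ".join(p for p in (author.get("given"), author.get("family")) if p)
        if name:
            names.append(name)
    return title, ", ".join(names)
```

Returning empty strings rather than raising keeps the draft generator able to fall back to a generic template when CrossRef has no record either.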
Applied the access ladder to the 9 cultivable records among the 17 Unpaywall hits; 6 gained real growth conditions from OA full text the earlier Europe-PMC-only sweep had missed:
- 000144 Altered Schaedler Flora: BHI/Schaedler's+serum, anaerobic, gnotobiotic (OUP PDF)
- 000147 Trichodesmium/Alteromonas: YBC-II seawater, 26 °C, 14:10 light (PMC mirror)
- 000098 Watermelon SynCom8: NB medium, 37 °C, 170 rpm, greenhouse pots (BMC PDF)
- 000139 Aalborg EBPR: wastewater sludge + CONTINUOUS BioDenipho plant (OUP mirror)
- 000043 Maize root: 1/2 MS agar gnotobiotic, 30 °C, pH 6.0, 16:8 light (PMC mirror)
- 000075 Trichoderma lactate: Mandels medium + FED_BATCH membrane bioreactor (EPFL PDF)

All source-backed with verbatim evidence snippets; validate + validate-terms pass.

3 no-change (278/284/292): Unpaywall "OA" was a false positive (DOI resolver / repository metadata page, not real full text).

Also harden try_unpaywall: reject a best_oa_location that is just the doi.org resolver (Unpaywall marks some Elsevier bronze/hybrid records is_oa but the location redirects to the paywall). Verified it now returns NO_LEGAL_OA for those instead of a false ACCESS.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
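The resolver rejection described in this commit can be sketched as a host check on the location URL. This is an illustration only: `usable_oa_url` and `RESOLVER_HOSTS` are hypothetical names, and the sketch assumes the `url_for_pdf` and `url` fields that Unpaywall documents on its OA location objects.

```python
from __future__ import annotations

from urllib.parse import urlsplit

# Hosts that only redirect to the publisher page (assumed list for this sketch).
RESOLVER_HOSTS = {"doi.org", "dx.doi.org"}


def usable_oa_url(best_oa_location: dict | None) -> str | None:
    """Return the location's URL unless it is missing or merely a DOI resolver link."""
    if not best_oa_location:
        return None
    url = best_oa_location.get("url_for_pdf") or best_oa_location.get("url")
    if not url:
        return None
    host = (urlsplit(url).hostname or "").lower()
    if host in RESOLVER_HOSTS:
        # A resolver link may land on the paywall, so treat it as no access.
        return None
    return url
```

A host check does not catch the repository-metadata-page false positives also mentioned above; those would still need a content check or manual review.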
… residuals

scripts/access_sweep.py goes one layer past the per-record ladder for the NO_LEGAL_OA residuals: it queries OpenAlex (ALL oa locations, not just Unpaywall's single best_oa_location) + the bioRxiv/medRxiv API (catches 2025+ 10.64898-prefix preprints OpenAlex still lags on) + CORE when $CORE_API_KEY is set. Emits reports/missing_pdfs.md: recovered OA URLs where any exist, else the publisher landing URL as a last-resort pointer for a curator with institutional access.

Result: 1/22 recovered (000281 bioRxiv preprint, OA but Cloudflare-gated to bots), 21/22 genuinely paywalled with landing URLs listed. Sci-Hub and other gray-area mirrors deliberately excluded (sibling CultureMech gates them off too).

scripts/author_request_index.py consolidates the 22 author-request drafts into reports/author_requests_index.md with resolved title/journal/likely-corresponding author/landing URL. Corresponding-author emails are NOT machine-resolvable (they live only in the paywalled full text; CrossRef/EPMC don't expose them), so the package is built for a human to open each landing URL, grab the email, and send. Nothing is sent by this script.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
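Collecting every OA location from an OpenAlex record, rather than a single best one, might look like this. It is a sketch that assumes the OpenAlex work object exposes a `locations` list whose entries carry `is_oa`, `pdf_url`, and `landing_page_url`; the function name `oa_urls_from_openalex` is hypothetical and the committed sweep may differ.

```python
from __future__ import annotations


def oa_urls_from_openalex(work: dict) -> list[str]:
    """Return de-duplicated URLs for every location flagged open access."""
    urls: list[str] = []
    for location in work.get("locations") or []:
        if not location.get("is_oa"):
            continue
        # Prefer a direct PDF link; fall back to the landing page.
        url = location.get("pdf_url") or location.get("landing_page_url")
        if url and url not in urls:
            urls.append(url)
    return urls
```

An empty list corresponds to the genuinely paywalled case, where the report lists the publisher landing URL instead.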
…full text

The access sweep recovered this record's source as an OA bioRxiv preprint (doi:10.64898/2026.05.29.728681); added a growth_media block from the Methods:
- eGAM (enhanced Gifu Anaerobic Medium) for routine culturing, adjusted M9 minimal medium for metabolic labelling; 37 °C; anaerobic (5% CO2, 5% H2, N2 balance) in a Coy anaerobic vinyl chamber; 48 h strain cultures pooled into the six-member SynCom (P. vulgatus, B. breve, B. infantis, E. coli, L. gasseri, R. gnavus), abundance-equalised to OD 0.2 at 5 h.

Three verbatim IN_VITRO evidence snippets. Schema + terms validation pass. Clears the one open item from reports/missing_pdfs.md (the sole OA-recovered residual); the remaining 21 stay genuinely paywalled.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…rately)

Policy: kb/communities/ holds only defined communities. Undefined ones (donor-derived faecal/environmental inocula, enrichments, unresolved consortia) are not added to the resource but tracked in reports/undefined_communities.md so the observation + cultivation conditions are preserved.

First entry: the donor-derived infant faecal fermentation from the 000281 paper (Cryptobiotix bioreactors, M0017 medium, 5 g/L prebiotics, 37 °C / 24 h anaerobic), the undefined backdrop to the defined SynCom already curated as 000281.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Adds `scripts/fulltext_access.py`, a legal, no-gray-area full-text access ladder for the paywalled-record residual (#183): Europe PMC OA → Unpaywall → CORE → author-request draft. Wired into the `add-growth-conditions` skill so closed-access records get a legal access attempt (Unpaywall/CORE OA copies) plus a ready-to-send author-request email, instead of just being skipped. Directly implements the recommended stack from the deep-research report on nonprofit OA access (Unpaywall + Europe PMC + CORE as the core; author requests for the residual; Sci-Hub excluded). CORE requires a free `CORE_API_KEY` (public-research orgs qualify). Related: #183.