fix(medcat-trainer): Dont use addons in the pipeline for document processing #564

Closed
alhendrickson wants to merge 4 commits into main from fix/trainer/relcat

Conversation

alhendrickson (Collaborator) commented Jun 24, 2026

MedCAT addons are not needed when calling the prepare-documents API.

This fixes the error below, along with any other failure raised inside an addon that isn't used:

INFO 2026-06-24 10:21:43,721 pipeline.py l:329:Running component ner:cat_ner for 16730 of text (94412370605648)
INFO 2026-06-24 10:21:43,738 cdb.py l:45:Resetting subnames
INFO 2026-06-24 10:21:47,591 pipeline.py l:329:Running component linking:medcat2_linker for 16730 of text (94412370605648)
Token indices sequence length is longer than the specified maximum sequence length for this model (4925 > 512). Running this sequence through the model will result in indexing errors
ERROR 2026-06-24 10:21:48,080 rel_dataset.py l:348:document id : 0 failed to process relation
Traceback (most recent call last):
  File "/home/.venv/lib/python3.12/site-packages/medcat/components/addons/relation_extraction/rel_dataset.py", line 343, in _create_relation_validation
    assert ent1_token_start_pos

This is definitely messy, since it mutates the cached CAT, but the same thing already seems to be done in other places, e.g. cat.config.components.linking.filters.cuis = cuis

alhendrickson changed the title from "fix(medcat-trainer): Dont load addons for document processing" to "fix(medcat-trainer): Dont use addons in the pipeline for document processing" Jun 24, 2026

mart-r (Collaborator) commented Jun 24, 2026

This isn't the best (or at least not a complete) way of doing this.
There's also an addons config option in the model pack (config.components.addons), and that is what's used to build the pipe.
So what you've done here would remove the addons from the pipe while leaving their component configs in place.

The other potential issue is that the current process can change the order of the addons. This shouldn't matter. But it might. Especially with MetaCAT reusing some of the data paths to avoid duplicate computations.

Ideally we'd want to fix this on the core lib side:

  1. create an API to remove addon(s)
     1.1. potentially remove all of a type, or all but a type
  2. create an API to clear addons
  3. add an API to include/exclude addons at load time

With that said, my preferred method to remove addons would generally be something like this:

from medcat.cat import CAT

cat = CAT.load_model_pack(path)
cat.config.components.addons.clear()  # NOTE: can keep a copy if needed
cat._recreate_pipe()  # rebuild the pipe from the now addon-free config

That way we ensure that everything is working as expected.

Each addon should have its own config, so that could be used to add them back if/when needed.
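
To make the "keep a copy and add them back" idea concrete, here is a minimal sketch of a context manager that clears the addon configs, rebuilds the pipe, and restores everything afterwards in the original order. It assumes config.components.addons behaves like a plain list and that _recreate_pipe() rebuilds the pipe from the config, as described above. The name addons_disabled is hypothetical, and FakeCAT is only a stand-in so the snippet runs without a model pack; neither is part of MedCAT.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def addons_disabled(cat):
    """Temporarily run `cat` without addons (hypothetical helper)."""
    addons_cfg = cat.config.components.addons
    saved = list(addons_cfg)  # shallow copy of the addon configs, order kept
    addons_cfg.clear()
    cat._recreate_pipe()  # rebuild so the pipe matches the config
    try:
        yield cat
    finally:
        # Put the configs back in their original order and rebuild again.
        addons_cfg.extend(saved)
        cat._recreate_pipe()


class FakeCAT:
    """Stand-in for a loaded model pack; NOT the real medcat.cat.CAT."""

    def __init__(self, addon_names):
        self.config = SimpleNamespace(
            components=SimpleNamespace(addons=list(addon_names))
        )
        self._recreate_pipe()

    def _recreate_pipe(self):
        # Assumption: the pipe is derived from the config, core parts first.
        self.pipe = ["ner", "linking", *self.config.components.addons]


cat = FakeCAT(["meta_cat", "rel_cat"])
with addons_disabled(cat):
    pipe_during = list(cat.pipe)  # only the core components
pipe_after = list(cat.pipe)  # addons restored
```

Note that this still mutates the shared object while the block is active, so it would not be safe if the cached CAT is used concurrently.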

PS:
I'm not saying this wouldn't work. It most likely will. But it's not the intended way.
