
fix(import): RM-17159 AI Importer couchbase sidebar pages duplicated#21

Open
xavierandueza wants to merge 3 commits into
fix/rm-17170-ai-importer-warpdev-all-imported-page-slugs-contain-txt from
xavier/rm-17159-ai-importer-couchbase-sidebar-pages-duplicated-across-two

@xavierandueza

Contributor
🚥 Resolves RM-17159

🧰 Changes

This fixes 2 issues:

  1. For sites where we had to use an LLM to organize, the LLM would sometimes duplicate a page, leaving a duplicated page in the results. Fix: dedupe after the initial scrape from URLs, and dedupe again if an LLM was used to organize.
  2. The threshold for using LLMs to organize was too strict. We checked whether any section was over 50, but for large docs sites this is common. Now we check that the average section size is under a threshold of 200 and that we have at least 2 categories.
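For reviewers, here is a rough Python sketch of the two behaviors described above. It is not the code in this PR. `Page`, `normalize_url`, `dedupe_pages`, and `needs_llm_organization` are hypothetical names, and the dedupe key (host plus path, ignoring a trailing slash) is an assumption. The sketch also assumes the LLM is skipped when both threshold conditions hold.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Page:
    # Hypothetical minimal page record; the real importer likely carries more fields.
    url: str
    title: str


def normalize_url(url: str) -> str:
    # Assumed dedupe key: lowercase host plus path, without scheme,
    # query string, fragment, or trailing slash.
    parts = urlsplit(url)
    return parts.netloc.lower() + parts.path.rstrip("/")


def dedupe_pages(pages: list[Page]) -> list[Page]:
    # Keep the first occurrence of each normalized URL, preserving order.
    seen: set[str] = set()
    unique: list[Page] = []
    for page in pages:
        key = normalize_url(page.url)
        if key not in seen:
            seen.add(key)
            unique.append(page)
    return unique


def needs_llm_organization(
    sections: dict[str, list[Page]],
    max_average_size: int = 200,
    min_categories: int = 2,
) -> bool:
    # Too few categories: the scraped structure is not usable as-is.
    if len(sections) < min_categories:
        return True
    # Compare the average section size, not the largest section.
    average = sum(len(pages) for pages in sections.values()) / len(sections)
    return average >= max_average_size
```

In this sketch `dedupe_pages` would run once after the initial scrape and again on the LLM's output. For a site with 3,500 pages in 49 categories, the average section size is about 71, so the check would not call for LLM organization.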

🧬 QA & Testing

Ran on couchbase and verified it now has good categories: 49 for 3,500 pages, with no duplicates.

Unit tests also created.

Also tested on:

All look good with new handling.
