18 changes: 18 additions & 0 deletions .changeset/provider-search-controls.md
@@ -0,0 +1,18 @@
---
"@refkit/core": minor
"@refkit/mcp": minor
"@refkit/provider-unsplash": minor
"@refkit/provider-pexels": minor
"@refkit/provider-pixabay": minor
"@refkit/provider-flickr": minor
"@refkit/provider-brave": minor
"@refkit/provider-openverse": minor
"@refkit/provider-gutendex": minor
"@refkit/provider-poetrydb": minor
"@refkit/provider-wikimedia-commons": minor
"@refkit/provider-met": minor
"@refkit/provider-artic": minor
"@refkit/provider-smithsonian": minor
---

Add unified search controls, provider capability metadata, MCP controls input, search metadata/explanations, practical provider-specific `providerOptions` whitelists, and a core duplicate hook for agent-facing searches.
1 change: 1 addition & 0 deletions .gitignore
@@ -2,6 +2,7 @@ node_modules/
dist/
out/
coverage/
.superpowers/
*.log
.DS_Store
.env
77 changes: 76 additions & 1 deletion README.md
@@ -53,6 +53,61 @@ for (const r of refs) {
const safe = await refkit.search({ query: 'forest', modalities: ['image'], gateFor: 'commercial-product' })
```

## Search controls

Use provider-neutral `controls` for the main path. refkit routes each control only to providers that declare support, and `searchWithMeta()` explains which providers applied or ignored each control:

```ts
await refkit.search({
query: 'brutalist library interior',
modalities: ['image'],
controls: {
orientation: 'landscape',
color: 'blue',
language: 'en-US',
sort: 'relevance',
safety: 'strict',
license: { commercial: true, modification: true },
media: { minWidth: 1200, minHeight: 800 },
},
})
```

Use `providerOptions` for provider-specific escape hatches that do not belong in the common contract. These are **typed whitelists**, not raw passthrough maps: each provider package translates the practical official search parameters it supports and ignores unsupported values.

```ts
await refkit.search({
query: 'forest path',
modalities: ['image'],
controls: { orientation: 'landscape', safety: 'strict' },
providerOptions: {
unsplash: { collections: ['abc', 'def'], page: 2 },
flickr: { tags: ['forest', 'path'], tagMode: 'all', minTakenDate: '2020-01-01' },
brave: { country: 'US', searchLang: 'en', spellcheck: false },
met: { departmentId: 11, isOnView: true },
gutendex: { topic: 'children', sort: 'popular' },
},
})
```

The provider package owns its native options surface, e.g. `UnsplashSearchOptions`, `FlickrSearchOptions`, `OpenverseImageSearchOptions`, `MetSearchOptions`, and `PoetryDbSearchOptions`. Response-format/debug parameters and auth-only knobs are intentionally omitted when they would break refkit's normalized `Reference` contract.
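As a rough illustration of what a "typed whitelist" could look like inside a provider package, the sketch below translates only recognised, well-typed keys into upstream query parameters and drops everything else. `ExampleSearchOptions`, `toQueryParams`, and the upstream parameter names (`tags`, `tag_mode`, `page`) are hypothetical and are not taken from any refkit provider.

```typescript
// Hypothetical options surface for an imaginary provider; not a real refkit type.
interface ExampleSearchOptions {
  tags?: string[]
  tagMode?: 'any' | 'all'
  page?: number
}

// Translate only known, well-typed keys into upstream query parameters.
// Unknown keys and badly typed values are ignored instead of being passed through.
function toQueryParams(input: unknown): Record<string, string> {
  const params: Record<string, string> = {}
  if (typeof input !== 'object' || input === null) return params
  const opts = input as Partial<Record<keyof ExampleSearchOptions, unknown>>

  const tags = opts.tags
  if (Array.isArray(tags) && tags.every((t) => typeof t === 'string')) {
    params.tags = tags.join(',')
  }
  const mode = opts.tagMode
  if (mode === 'any' || mode === 'all') {
    params.tag_mode = mode
  }
  const page = opts.page
  if (typeof page === 'number' && Number.isInteger(page) && page > 0) {
    params.page = String(page)
  }
  return params
}
```

With this shape, a caller that passes an unsupported key such as `format: 'xml'` gets the same request as one that omits it, which is what keeps the normalized `Reference` contract intact.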

When an agent or UI needs to explain what happened, use `searchWithMeta`:

```ts
const { references, meta } = await refkit.searchWithMeta({
query: 'forest path',
modalities: ['image'],
controls: { orientation: 'landscape', color: 'green' },
gateFor: 'commercial-product',
})

console.log(meta.controls?.appliedByProvider)
console.log(meta.controls?.ignoredByProvider)
console.log(meta.providers)
console.log(meta.warnings)
```
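The applied/ignored split can be pictured as a simple set comparison between the requested control names and each provider's declared capabilities. The sketch below is an approximation under stated assumptions: `ProviderLike`, `ControlsReport`, and `explainControls` are illustrative names, only top-level control keys are considered (nested names such as `media.minWidth` are not handled), and refkit's actual implementation may differ.

```typescript
// Assumed shapes for illustration only; refkit's real types may differ.
interface ProviderLike {
  id: string
  capabilities?: { controls?: string[] }
}

interface ControlsReport {
  requested: string[]
  appliedByProvider: Record<string, string[]>
  ignoredByProvider: Record<string, string[]>
}

// Split the requested top-level control names into "applied" and "ignored"
// for each provider, based on what the provider declares it supports.
function explainControls(
  controls: Record<string, unknown>,
  providers: ProviderLike[],
): ControlsReport {
  const requested = Object.keys(controls).filter((name) => controls[name] !== undefined)
  const report: ControlsReport = { requested, appliedByProvider: {}, ignoredByProvider: {} }
  for (const provider of providers) {
    const supported = new Set(provider.capabilities?.controls ?? [])
    report.appliedByProvider[provider.id] = requested.filter((name) => supported.has(name))
    report.ignoredByProvider[provider.id] = requested.filter((name) => !supported.has(name))
  }
  return report
}
```

A provider that declares no controls ends up with an empty applied list and every requested control in its ignored list, so a UI can tell the user which filters had no effect on that source.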

## Ranking & rerank

By default, results are fused across sources with **Reciprocal Rank Fusion** — cross-source-orderable, but not query-aware. For sharper relevance, pass a **reranker**:
@@ -81,6 +136,19 @@ rerank: async ({ query, refs }) => myEmbeddingRerank(query, refs)

Rerank is **opt-in** — omit it for the default RRF order. It runs post-merge, before the `gateFor` license filter and the limit.
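For readers unfamiliar with Reciprocal Rank Fusion, the sketch below shows the general technique: each item scores `1 / (k + rank)` in every list it appears in, and the scores are summed. It is a generic illustration, not refkit's code. The function name, the use of plain string ids, and the constant `k = 60` (the value commonly used in the literature) are assumptions.

```typescript
// Generic Reciprocal Rank Fusion over several ranked lists of ids.
// score(id) = sum over lists of 1 / (k + rank), with rank starting at 1.
// k = 60 is a conventional default, assumed here; refkit's constant may differ.
function reciprocalRankFusion(rankedLists: string[][], k = 60): string[] {
  const scores = new Map<string, number>()
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1))
    })
  }
  // Highest fused score first.
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id)
}
```

Because the score depends only on positions, lists from sources with incomparable native scores can still be merged; the trade-off is that the fused order knows nothing about the query itself, which is the gap a reranker fills.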

URL dedupe is built in, and perceptual hashes are supported when providers or hosts supply them. For host-computed fingerprints or embeddings, add a duplicate hook without making core fetch or decode media:

```ts
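const refkit = createRefkit({
  providers,
  merge: {
    // Match only when both references carry the same defined fingerprint;
    // a bare `===` would also treat two references with no fingerprint as duplicates.
    isDuplicate: (candidate, existing) => {
      const fp = (candidate.raw as { fingerprint?: string } | undefined)?.fingerprint
      return fp !== undefined && fp === (existing.raw as { fingerprint?: string } | undefined)?.fingerprint
    },
  },
})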
```

## Providers

| Package | Source | Modality | Auth | License |
@@ -123,7 +191,14 @@ Audio/video are extra factories on existing packages: `openverseAudio()`, `pexel
- **No re-hosting** — keep `canonicalUrl` + thumbnails only; never store originals.
- **strict-deny** — when rights can't be determined, deny / needs-review (never fail-open). Unknown, NonCommercial, NoDerivatives and "no known copyright restrictions" never map to a usable license.
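The strict-deny rule is essentially a fail-closed allowlist: anything not explicitly known to be usable is rejected. The sketch below illustrates that idea only. The function name, the `Verdict` type, and the two-entry allowlist are assumptions for illustration and do not reflect refkit's actual license table or gate logic.

```typescript
type Verdict = 'allow' | 'deny'

// Illustrative allowlist only; refkit's real license mapping is not shown in this diff.
const COMMERCIAL_OK: ReadonlySet<string> = new Set(['CC0-1.0', 'CC-BY-4.0'])

// Fail closed: a missing, unknown, or unlisted license is denied, never allowed.
function gateForCommercial(license: string | undefined): Verdict {
  return license !== undefined && COMMERCIAL_OK.has(license) ? 'allow' : 'deny'
}
```

Writing the gate as an allowlist rather than a blocklist means a newly encountered license string is denied by default until someone reviews it.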

## Agent / MCP
## Agent usage

Agents can use refkit in two ways:

1. **SDK inside a host tool** — your app defines its own `search` tool, wires `createRefkit({ providers, fetch, cache })`, and controls keys, caching, retries, rerankers, filters, and provider-specific options.
2. **MCP adapter** — `@refkit/mcp` exposes the same license-normalized search over `search_references`, useful when you want a zero-glue tool that works across MCP-capable agents.

## MCP

`@refkit/mcp` exposes `search_references` over the [Model Context Protocol](https://modelcontextprotocol.io), so any MCP-capable agent can search license-normalized references with zero glue code.

84 changes: 84 additions & 0 deletions packages/core/README.md
@@ -39,6 +39,73 @@ for (const r of refs) {
const safe = await refkit.search({ query: 'forest', modalities: ['image'], gateFor: 'commercial-product' })
```

## Search controls

Portable controls are expressed once and applied only to providers that declare support:

```ts
await refkit.search({
query: 'minimal workspace',
modalities: ['image'],
controls: {
orientation: 'landscape',
color: 'white',
language: 'en-US',
},
})
```

Provider-specific escape hatches go under `providerOptions`, keyed by provider id. Core routes only the matching entry; providers own typed whitelists for the practical official search parameters they translate:

```ts
await refkit.search({
query: 'mountain trail',
modalities: ['image'],
controls: { orientation: 'landscape', safety: 'strict' },
providerOptions: {
unsplash: { collections: ['abc', 'def'], page: 2 },
flickr: { sort: 'relevance', tags: ['mountain', 'trail'], tagMode: 'all' },
openverse: { source: ['flickr'], category: 'photograph', aspectRatio: 'wide' },
smithsonian: { sort: 'newest', rows: 25 },
},
})
```

`providerOptions` is not a raw upstream passthrough. Each provider package exports its own `*SearchOptions` interface and keeps response-format/debug/auth-only parameters out when they would conflict with normalized references or provider credentials.
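The "route only the matching entry" behaviour can be sketched as a lookup by provider id, so that one provider never sees another provider's options. `SearchInput`, `ProviderQuery`, and `queryForProvider` are hypothetical names used for illustration and are not exports of `@refkit/core`.

```typescript
// Assumed shapes for illustration only.
interface SearchInput {
  query: string
  providerOptions?: Record<string, Record<string, unknown>>
}

interface ProviderQuery {
  query: string
  providerOptions?: Record<string, unknown>
}

// Build the query a single provider receives: the shared fields plus
// only the providerOptions entry keyed by that provider's own id.
function queryForProvider(providerId: string, input: SearchInput): ProviderQuery {
  const own = input.providerOptions?.[providerId]
  return own === undefined
    ? { query: input.query }
    : { query: input.query, providerOptions: own }
}
```

A provider with no entry receives no `providerOptions` at all, which keeps its whitelist logic from having to distinguish "empty" from "meant for someone else".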

Currently supported unified controls:

| Provider id | Unified controls |
|---|---|
| `unsplash` | `orientation`, `color`, `language`, `sort`, `safety` |
| `pexels` | `orientation`, `color`, `language`, `media.size`, `page` |
| `pexels-video` | `orientation`, `language`, `media.size`, `page` |
| `pixabay` | `orientation`, `color`, `language`, `sort`, `safety`, `media.kind`, `media.minWidth`, `media.minHeight` |
| `pixabay-video` | `language`, `sort`, `safety`, `media.kind`, `media.minWidth`, `media.minHeight` |
| `flickr` | `sort`, `safety`, `license.commercial`, `license.modification`, `license.allowUnknown`, `creator.id` |
| `brave` | `safety` |
| `openverse` | `license.commercial`, `license.modification`, `license.allowUnknown` |
| `openverse-audio` | `license.commercial`, `license.modification`, `license.allowUnknown` |
| `gutendex` | `language`, `text.copyright`, `page` |
| `poetrydb`, `wikimedia-commons`, `met`, `artic`, `smithsonian` | no unified controls in this release |

Use `searchWithMeta` when a host UI or agent needs the search explanation layer:

```ts
const { references, meta } = await refkit.searchWithMeta({
query: 'minimal workspace',
modalities: ['image'],
controls: { orientation: 'landscape', color: 'white' },
gateFor: 'commercial-product',
})

meta.controls?.appliedByProvider
meta.controls?.ignoredByProvider
meta.providers // provider status: fulfilled / failed / skipped
meta.gate // before/after/dropped counts when gateFor is used
meta.warnings // partial-result and gate/drop notes
```
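On the consuming side, a host could turn this metadata into human-readable lines for a UI or an agent transcript. The sketch below assumes shapes modelled on the test expectations in this change (`providerId`, `status`, `returned`, `accepted`, `error`, `reason`, and the `gate` counts). `SearchMetaLike` and `describeSearch` are hypothetical names, and the real exported types may carry more fields.

```typescript
// Shapes assumed from the test expectations in this change; not the exported types.
type ProviderStatus =
  | { providerId: string; status: 'fulfilled'; returned: number; accepted: number; rejected: number }
  | { providerId: string; status: 'failed'; error: string }
  | { providerId: string; status: 'skipped'; reason: string }

interface SearchMetaLike {
  providers: ProviderStatus[]
  gate?: { intent: string; before: number; after: number; dropped: number }
  warnings: string[]
}

// Render one line per provider, one line for the gate (if any), then the warnings.
function describeSearch(meta: SearchMetaLike): string[] {
  const lines: string[] = []
  for (const p of meta.providers) {
    if (p.status === 'fulfilled') {
      lines.push(`${p.providerId}: ${p.accepted}/${p.returned} results accepted`)
    } else if (p.status === 'failed') {
      lines.push(`${p.providerId}: failed (${p.error})`)
    } else {
      lines.push(`${p.providerId}: skipped (${p.reason})`)
    }
  }
  if (meta.gate) {
    lines.push(`gate ${meta.gate.intent}: kept ${meta.gate.after} of ${meta.gate.before}`)
  }
  return lines.concat(meta.warnings)
}
```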

## Ranking & rerank

Results are fused across sources with **Reciprocal Rank Fusion** (cross-source-orderable, not query-aware). Pass an optional `rerank`:
@@ -50,6 +117,23 @@ Rerank is opt-in and runs post-merge, before the `gateFor` license filter and th

Ranking is only as good as the candidate pool: `search` overfetches `limit × poolFactor` per provider (default 4×, capped per source) and narrows to `limit` after merge/rerank/gate — so dedup and ranking see a wide pool, not a source-truncated slice. Lower `poolFactor` when you query many providers.
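The overfetch arithmetic can be approximated as follows. This is a sketch under assumptions: the function name is made up, and the per-source cap of 100 is a placeholder because the real cap is not stated here. The only behaviours taken from this change are the default factor of 4 and the rule that a `limit` above the cap is still fetched in full.

```typescript
// Per-provider fetch size: overfetch limit × poolFactor, capped per source,
// but never fetch fewer than the requested limit itself.
// perSourceCap = 100 is a placeholder assumption, not refkit's documented value.
function perProviderFetchSize(limit: number, poolFactor = 4, perSourceCap = 100): number {
  const pooled = Math.min(limit * poolFactor, perSourceCap)
  return Math.max(limit, pooled)
}
```

Under these assumptions a `limit` of 10 asks each provider for 40 candidates, a `limit` of 50 is capped at 100, and a `limit` of 150 asks for 150, since the cap must not shrink the pool below what the caller requested.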

## Dedupe hooks

Core dedupes exact canonical URLs by default and can dedupe equal-length perceptual hashes when `merge.hashThreshold` is set. Hosts that compute their own fingerprints or embeddings can add a sync duplicate predicate:

```ts
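const refkit = createRefkit({
  providers,
  merge: {
    // Match only when both references carry the same defined fingerprint;
    // a bare `===` would also treat two references with no fingerprint as duplicates.
    isDuplicate: (candidate, existing) => {
      const fp = (candidate.raw as { fingerprint?: string } | undefined)?.fingerprint
      return fp !== undefined && fp === (existing.raw as { fingerprint?: string } | undefined)?.fingerprint
    },
  },
})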
```

The hook compares `Reference` objects only. Core still never fetches, decodes, or stores media.
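To make the predicate's role concrete, here is a minimal sketch of a greedy dedupe pass driven by such a hook. When a duplicate is found, the higher-relevance reference takes the slot of the one already kept, which matches the outcome asserted in the tests of this change. `RefLike`, `dedupeWithHook`, and `sameFingerprint` are illustrative names; the real `dedupeReferences` also handles URLs and perceptual hashes, which this sketch omits.

```typescript
// Minimal reference shape for illustration; the real Reference type has more fields.
interface RefLike {
  id: string
  relevance: number
  raw?: unknown
}

type IsDuplicate = (candidate: RefLike, existing: RefLike) => boolean

// Greedy pass: compare each candidate against what has been kept so far.
// On a duplicate, the more relevant reference occupies the existing slot.
function dedupeWithHook(refs: RefLike[], isDuplicate: IsDuplicate): RefLike[] {
  const kept: RefLike[] = []
  for (const candidate of refs) {
    const index = kept.findIndex((existing) => isDuplicate(candidate, existing))
    const match = index === -1 ? undefined : kept[index]
    if (match === undefined) kept.push(candidate)
    else if (candidate.relevance > match.relevance) kept[index] = candidate
  }
  return kept
}

// Example predicate: equal, defined host-supplied fingerprints count as duplicates.
const sameFingerprint: IsDuplicate = (candidate, existing) => {
  const fp = (candidate.raw as { fingerprint?: string } | undefined)?.fingerprint
  return fp !== undefined && fp === (existing.raw as { fingerprint?: string } | undefined)?.fingerprint
}
```

Because the predicate is synchronous and only reads fields already on the reference, the host is free to compute fingerprints or embeddings ahead of time without core ever touching the media.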

## Invariants (enforced by `src/__tests__/no-network.test.ts`)

- **Zero network** — no `fetch` call, no hard-coded endpoint in this package. Hosts inject `ProviderContext.fetch`.
93 changes: 93 additions & 0 deletions packages/core/src/__tests__/client.test.ts
@@ -171,4 +171,97 @@ describe('createRefkit', () => {
await rk.search({ query: 'x', modalities: ['image'], limit: 150 }) // > cap → fetch the limit itself, not less
expect(sink.limit).toBe(150)
})

it('forwards provider-specific search options only to the matching provider', async () => {
let seenA: unknown
let seenB: unknown
const a = defineProvider({
id: 'a',
modalities: ['image'],
queryFeatures: ['keyword'],
search: async (q) => { seenA = q.providerOptions; return [] },
})
const b = defineProvider({
id: 'b',
modalities: ['image'],
queryFeatures: ['keyword'],
search: async (q) => { seenB = q.providerOptions; return [] },
})
const rk = createRefkit({ providers: [a, b] })
await rk.search({
query: 'x',
modalities: ['image'],
providerOptions: { a: { orderBy: 'latest' }, b: { sort: 'relevance' } },
})
expect(seenA).toEqual({ orderBy: 'latest' })
expect(seenB).toEqual({ sort: 'relevance' })
})

it('searchWithMeta returns provider status, warnings, and gate summary', async () => {
const textOnly = defineProvider({
id: 'text',
modalities: ['text'],
queryFeatures: ['keyword'],
search: async () => [],
})
const rk = createRefkit({
providers: [
provider('ok', [ref('ok-1', 'https://ok/1', 'CC0-1.0'), ref('ok-2', 'https://ok/2', 'proprietary')]),
failing('bad'),
textOnly,
],
})
const out = await rk.searchWithMeta({ query: 'x', modalities: ['image'], gateFor: 'commercial-product' })

expect(out.references.map(r => r.canonicalUrl)).toEqual(['https://ok/1'])
expect(out.meta.providers).toEqual([
{ providerId: 'ok', status: 'fulfilled', returned: 2, accepted: 2, rejected: 0 },
{ providerId: 'bad', status: 'failed', error: 'boom' },
{ providerId: 'text', status: 'skipped', reason: 'unsupported-modality' },
])
expect(out.meta.gate).toEqual({ intent: 'commercial-product', before: 2, after: 1, dropped: 1 })
expect(out.meta.warnings).toContain('1 provider(s) failed; returning partial results.')
})

it('uses merge.isDuplicate to dedupe host-supplied fingerprints during search', async () => {
const a = { ...ref('a-1', 'https://a/1'), relevance: 0.2, raw: { fingerprint: 'same' } }
const b = { ...ref('a-2', 'https://a/2'), relevance: 0.9, raw: { fingerprint: 'same' } }
const rk = createRefkit({
providers: [provider('a', [b, a])],
merge: {
isDuplicate: (candidate, existing) =>
(candidate.raw as { fingerprint?: string }).fingerprint === (existing.raw as { fingerprint?: string }).fingerprint,
},
})
const out = await rk.search({ query: 'x', modalities: ['image'] })
expect(out.map(r => r.id)).toEqual(['a-2'])
})

it('searchWithMeta reports applied and ignored unified controls by provider', async () => {
const controlled = defineProvider({
id: 'controlled',
modalities: ['image'],
queryFeatures: ['keyword'],
capabilities: { controls: ['orientation', 'color'] },
search: async () => [ref('controlled-1', 'https://controlled/1')],
})
const plain = defineProvider({
id: 'plain',
modalities: ['image'],
queryFeatures: ['keyword'],
capabilities: { controls: [] },
search: async () => [ref('plain-1', 'https://plain/1')],
})
const rk = createRefkit({ providers: [controlled, plain] })
const out = await rk.searchWithMeta({
query: 'x',
modalities: ['image'],
controls: { orientation: 'landscape', color: 'blue', safety: 'strict' },
})
expect(out.meta.controls).toEqual({
requested: ['orientation', 'color', 'safety'],
appliedByProvider: { controlled: ['orientation', 'color'], plain: [] },
ignoredByProvider: { controlled: ['safety'], plain: ['orientation', 'color', 'safety'] },
})
})
})
12 changes: 12 additions & 0 deletions packages/core/src/__tests__/dedup.test.ts
@@ -53,6 +53,18 @@ describe('dedupeReferences', () => {
expect(out).toHaveLength(2)
})

it('uses a custom duplicate hook for host-supplied fingerprints', () => {
const out = dedupeReferences([
make({ id: 'a', canonicalUrl: 'https://x/1', relevance: 0.4, raw: { fingerprint: 'same' } }),
make({ id: 'b', canonicalUrl: 'https://y/2', relevance: 0.9, raw: { fingerprint: 'same' } }),
make({ id: 'c', canonicalUrl: 'https://z/3', relevance: 0.6, raw: { fingerprint: 'other' } }),
], {
isDuplicate: (candidate, existing) =>
(candidate.raw as { fingerprint?: string }).fingerprint === (existing.raw as { fingerprint?: string }).fingerprint,
})
expect(out.map(r => r.id)).toEqual(['b', 'c'])
})

it('stale byUrl fix: C(url=a) must not dedupe against hash-replaced B(url=b) via stale index', () => {
// Step 1: A(url=a, hash=ffff, rel=0.3) → pushed to kept[0]; byUrl = {url_a → 0}
// Step 2: B(url=b, hash=fffe, rel=0.9) → url_b not in byUrl; hash-distance(fffe,ffff)=1≤4 → merges.
11 changes: 11 additions & 0 deletions packages/core/src/__tests__/provider.test.ts
@@ -34,4 +34,15 @@ describe('ReferenceProvider / defineProvider', () => {
const p = defineProvider({ id: 'x', modalities: ['text'], queryFeatures: [], search: async () => [] })
expect(p.id).toBe('x')
})

it('allows providers to declare supported unified search controls', () => {
const p = defineProvider({
id: 'x',
modalities: ['image'],
queryFeatures: ['keyword'],
capabilities: { controls: ['orientation', 'color', 'safety'] },
search: async () => [],
})
expect(p.capabilities?.controls).toEqual(['orientation', 'color', 'safety'])
})
})