Any structure in. Cited answers out.
Drop in PDFs, spreadsheets, nested JSON, CSV, HTML, DOCX, scanned images — Knowledgebase detects the structure, infers schemas, extracts entities, and gives your agents cited search over the lot.
A query, answered with evidence
NVIDIA identified supply chain concentration in TSMC as a primary risk, citing dependency on a single foundry for all GPU manufacturing. The company also flagged U.S. export controls on AI accelerators as material to revenue.
Every format. Every structure.
One ingestion pipeline handles documents, spreadsheets, structured data, and images. No pre-processing, no ETL, no Python runtime — parsing is TypeScript-native on Cloudflare Workers.
Digital PDFs: text and table rows. Scanned PDFs: OCR via Cloudflare vision models. HTML: tag stripping. DOCX: paragraphs. PPTX: slide text.
XLSX rows with headers. CSV with quoted fields and embedded commas. Each row becomes a searchable record with inferred field types.
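To make the "quoted fields and embedded commas" case concrete, here is a minimal sketch of an RFC 4180-style CSV reader that turns each row into a record keyed by header. It is not Knowledgebase's parser; `parseCsv` and `toRecords` are hypothetical names used only for illustration, and field-type inference is left out.

```typescript
// Hypothetical helper: splits CSV text into rows of cells.
// Handles quoted fields, commas inside quotes, and "" as an escaped quote.
function parseCsv(text: string): string[][] {
  const rows: string[][] = [];
  let row: string[] = [];
  let field = "";
  let inQuotes = false;
  for (let i = 0; i < text.length; i++) {
    const ch = text.charAt(i);
    if (inQuotes) {
      if (ch === '"') {
        if (text.charAt(i + 1) === '"') {
          field += '"'; // doubled quote inside a quoted field
          i++;
        } else {
          inQuotes = false; // closing quote
        }
      } else {
        field += ch; // commas and newlines are literal inside quotes
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ",") {
      row.push(field);
      field = "";
    } else if (ch === "\n" || ch === "\r") {
      if (ch === "\r" && text.charAt(i + 1) === "\n") i++; // treat CRLF as one break
      row.push(field);
      field = "";
      rows.push(row);
      row = [];
    } else {
      field += ch;
    }
  }
  // Flush the last row when the text does not end with a line break.
  if (field !== "" || row.length > 0) {
    row.push(field);
    rows.push(row);
  }
  return rows;
}

// Hypothetical helper: uses the first row as headers and maps each later row to a record.
function toRecords(text: string): Record<string, string>[] {
  const rows = parseCsv(text);
  const headers = rows.length > 0 ? rows[0] : [];
  const records: Record<string, string>[] = [];
  for (let r = 1; r < rows.length; r++) {
    const record: Record<string, string> = {};
    headers.forEach((header, c) => {
      record[header] = rows[r][c] ?? ""; // missing cells become empty strings
    });
    records.push(record);
  }
  return records;
}
```

For example, `toRecords('name,note\n"NVIDIA","GPUs, ""AI"" chips"\n')` yields a single record whose `note` field keeps both the comma and the inner quotes.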
Nested JSON flattened up to 6 levels deep. NDJSON/JSONL line-by-line. Arrays of records extracted automatically. Schema inference detects field types and relationships.
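One way to picture depth-limited flattening is the sketch below. It assumes dot-separated paths and assumes that anything nested past the limit is kept as a JSON string; the page does not specify either detail, and `flattenRecord` is a hypothetical name.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Hypothetical helper: flattens nested objects and arrays into dot-separated paths.
// Containers found at maxDepth are stored as JSON strings instead of being expanded.
// Empty objects and arrays produce no keys.
function flattenRecord(value: Json, maxDepth: number = 6): Record<string, Json> {
  const out: Record<string, Json> = {};

  const walk = (node: Json, path: string, depth: number): void => {
    if (node === null || typeof node !== "object") {
      out[path] = node; // primitive leaf
      return;
    }
    if (depth >= maxDepth) {
      out[path] = JSON.stringify(node); // too deep: keep the subtree as text
      return;
    }
    const entries: [string, Json][] = Array.isArray(node)
      ? node.map((child, index): [string, Json] => [String(index), child])
      : Object.entries(node);
    for (const [key, child] of entries) {
      walk(child, path === "" ? key : `${path}.${key}`, depth + 1);
    }
  };

  walk(value, "", 0);
  return out;
}
```

Under these assumptions, `{ company: { name: "NVIDIA" }, risk_factors: ["Export controls"] }` flattens to the keys `company.name` and `risk_factors.0`.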
Vision OCR via Llama 3.2 Vision and Llama 4 Scout. Embedded PDF images extracted and transcribed. Markdown Conversion merges with vision output.
Import from URLs or pull SEC filings by ticker and form type. Auto-ingest stages jobs for queued processing.
Paste raw text or POST structured records directly via the API. Schema-free domain text ingestion for notes, transcripts, and unstructured knowledge.
Schemas inferred, not configured.
Upload a sample and Knowledgebase detects field types, entity types, and cross-record relationships — then extracts structured entities with provenance into a queryable graph.
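As a rough illustration of what inferring field types from a sample can look like, the sketch below classifies each field across a handful of records. The type names, the merge rule, and the functions `inferType` and `inferSchema` are all assumptions made for this example; it does not cover entity types or cross-record relationships, and it is not the product's inference logic.

```typescript
type FieldType = "string" | "number" | "boolean" | "date" | "array" | "object" | "null" | "mixed";

// Hypothetical classifier for a single value. Strings that look like numbers,
// ISO dates (YYYY-MM-DD), or booleans are classified by what they look like.
function inferType(value: unknown): FieldType {
  if (value === null || value === undefined) return "null";
  if (Array.isArray(value)) return "array";
  if (typeof value === "number") return "number";
  if (typeof value === "boolean") return "boolean";
  if (typeof value === "object") return "object";
  if (typeof value === "string") {
    const text = value.trim();
    if (/^-?\d+(\.\d+)?$/.test(text)) return "number";
    if (/^\d{4}-\d{2}-\d{2}$/.test(text)) return "date";
    if (text === "true" || text === "false") return "boolean";
  }
  return "string";
}

// Hypothetical schema builder: one type per field, "mixed" when records disagree.
// Nulls never override a concrete type seen elsewhere.
function inferSchema(records: Record<string, unknown>[]): Record<string, FieldType> {
  const schema: Record<string, FieldType> = {};
  for (const record of records) {
    for (const [field, value] of Object.entries(record)) {
      const seen = inferType(value);
      const previous = schema[field];
      if (previous === undefined || previous === "null") {
        schema[field] = seen;
      } else if (seen !== "null" && seen !== previous) {
        schema[field] = "mixed";
      }
    }
  }
  return schema;
}
```

A sample with `filing: "10-K"` and `risk_factors: [...]` would come out as `string` and `array` respectively under these rules.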
[
{
"company": {
"name": "NVIDIA",
"ticker": "NVDA"
},
"filing": "10-K",
"risk_factors": [
"Supply chain concentration",
"Export controls"
]
}
]
counterparty: NVIDIA
filing: 10-K
One API. Two endpoints.
Your agents call /v1/kb/search for cited retrieval or /v1/kb/query for cited answers. Service-key auth, tenant isolation, streaming support.
curl -X POST https://knowledgebase.sarthakagrawal927.workers.dev/v1/kb/query \
  -H "Authorization: Bearer $RAG_SERVICE_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "domain": "sec",
    "question": "What risk factors did NVIDIA disclose?",
    "mode": "hybrid",
    "answer_mode": "extractive"
  }'
# → { answer, citations: [{ chunk_id, score, document }], ... }
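The same request could be issued from an agent written in TypeScript. The sketch below mirrors the curl call's path, headers, and body fields. The response field types, the error handling, and the names `buildQueryCall`, `queryKnowledgebase`, and `FetchLike` are assumptions for illustration, not a published client.

```typescript
// Request fields are taken from the curl example; accepted values beyond those shown are not documented here.
interface QueryRequest {
  domain: string;
  question: string;
  mode: string; // "hybrid" in the curl example
  answer_mode: string; // "extractive" in the curl example
}

// Response shape follows the curl comment; the field types are assumptions.
interface Citation {
  chunk_id: string;
  score: number;
  document: unknown;
}
interface QueryResponse {
  answer: string;
  citations: Citation[];
}

// Minimal fetch-compatible types so the sketch compiles without DOM typings.
interface HttpInit {
  method: string;
  headers: Record<string, string>;
  body: string;
}
type FetchLike = (
  url: string,
  init: HttpInit,
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Builds the URL and request options without touching the network.
function buildQueryCall(
  baseUrl: string,
  serviceKey: string,
  request: QueryRequest,
): { url: string; init: HttpInit } {
  return {
    url: `${baseUrl.replace(/\/+$/, "")}/v1/kb/query`,
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${serviceKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(request),
    },
  };
}

// Sends the query using the supplied fetch implementation and returns the parsed body.
async function queryKnowledgebase(
  fetchFn: FetchLike,
  baseUrl: string,
  serviceKey: string,
  request: QueryRequest,
): Promise<QueryResponse> {
  const { url, init } = buildQueryCall(baseUrl, serviceKey, request);
  const response = await fetchFn(url, init);
  if (!response.ok) {
    throw new Error(`Knowledgebase query failed with HTTP ${response.status}`);
  }
  return (await response.json()) as QueryResponse;
}
```

In a runtime with a global `fetch` (Cloudflare Workers, Node 18 or later, browsers), passing `fetch` as `fetchFn` should work. Injecting it also keeps the function easy to stub in tests.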
Why not generic chat RAG?
| | Generic chat RAG | Knowledgebase |
|---|---|---|
| Formats | PDF, TXT only | 12+ — PDF, DOCX, XLSX, JSON, CSV, images |
| Citations | Optional, often missing | Every answer, chunk-level |
| Schemas | Flat text chunks | Inferred, entity graph |
| Retrieval | Vector-only | Lexical + semantic + hybrid |
| Verification | Extra LLM call | Deterministic, zero AI |
| Latency | 2-5s typical | 99ms lexical p95 |
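The table lists hybrid retrieval without saying how lexical and semantic results are combined. One common technique for merging two ranked lists is reciprocal rank fusion, sketched below purely as an illustration. It is not claimed to be what Knowledgebase does, and the chunk ids and the constant `k = 60` are placeholder assumptions.

```typescript
interface FusedHit {
  id: string;
  score: number;
}

// Reciprocal rank fusion: each list contributes 1 / (k + rank) per item,
// with rank starting at 1. Items ranked well in several lists rise to the top.
function reciprocalRankFusion(rankings: string[][], k: number = 60): FusedHit[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  const fused: FusedHit[] = [];
  scores.forEach((score, id) => fused.push({ id, score }));
  return fused.sort((a, b) => b.score - a.score); // highest fused score first
}

// Hypothetical chunk ids from a lexical ranking and a semantic ranking.
const lexicalHits = ["c1", "c2", "c3"];
const semanticHits = ["c2", "c4", "c1"];
const fusedHits = reciprocalRankFusion([lexicalHits, semanticHits]);
// fusedHits order: c2, c1, c4, c3
```

Chunks that appear in both rankings (`c2`, `c1`) outrank chunks that appear in only one, which is the behavior hybrid search is usually after.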
Give your agents a memory they can cite.
Create a domain, drop in any file — PDF, XLSX, nested JSON, CSV, scanned images — and get a cited search endpoint in under a minute. No ETL, no Python, no infrastructure.
Open the dashboard