12+ formats · nested JSON · schema inference

Any structure in. Cited answers out.

Drop in PDFs, spreadsheets, nested JSON, CSV, HTML, DOCX, scanned images — Knowledgebase detects the structure, infers schemas, extracts entities, and gives your agents cited search over the lot.

99ms lexical p95

A query, answered with evidence

What risk factors did NVIDIA disclose in 2024?

NVIDIA identified supply chain concentration in TSMC as a primary risk, citing dependency on a single foundry for all GPU manufacturing. The company also flagged U.S. export controls on AI accelerators as material to revenue.

Sources
1 NVDA_10-K_2024.pdf chunk 47 · score 0.91
2 NVDA_10-K_2024.pdf chunk 52 · score 0.87

Every format. Every structure.

One ingestion pipeline handles documents, spreadsheets, structured data, and images. No pre-processing, no ETL, no Python runtime — parsing is TypeScript-native on Cloudflare Workers.

Documents
PDF HTML DOCX PPTX TXT MD

Digital PDFs: text and table rows. Scanned PDFs: OCR via Cloudflare vision models. HTML: tag stripping. DOCX: paragraphs. PPTX: slide text.

Spreadsheets
XLSX CSV

XLSX rows with headers. CSV with quoted fields and embedded commas. Each row becomes a searchable record with inferred field types.
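
The CSV handling described above (quoted fields, embedded commas, inferred field types) can be illustrated with a minimal sketch. The helper names `parseCsvLine`, `inferType`, and `toRecord` are hypothetical and are not part of the Knowledgebase API; the type heuristic is deliberately simple, and the sketch assumes one record per line (no newlines inside quoted fields).

```typescript
// Hypothetical helpers for illustration only; not Knowledgebase's actual parser.
type FieldType = "number" | "boolean" | "string";
type CsvRecord = Record<string, string | number | boolean>;

// Split one CSV line into fields, honoring double-quoted fields,
// embedded commas, and "" as an escaped quote.
function parseCsvLine(line: string): string[] {
  const fields: string[] = [];
  let current = "";
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (inQuotes) {
      if (ch === '"' && line[i + 1] === '"') {
        current += '"'; // escaped quote inside a quoted field
        i++;
      } else if (ch === '"') {
        inQuotes = false; // closing quote
      } else {
        current += ch;
      }
    } else if (ch === '"') {
      inQuotes = true; // opening quote
    } else if (ch === ",") {
      fields.push(current);
      current = "";
    } else {
      current += ch;
    }
  }
  fields.push(current);
  return fields;
}

// Guess a field type from its raw text (assumed heuristic).
function inferType(raw: string): FieldType {
  const value = raw.trim();
  if (value !== "" && Number.isFinite(Number(value))) return "number";
  if (value === "true" || value === "false") return "boolean";
  return "string";
}

// Combine a header line and a data line into one typed record.
function toRecord(header: string, row: string): CsvRecord {
  const keys = parseCsvLine(header);
  const values = parseCsvLine(row);
  const result: CsvRecord = {};
  keys.forEach((key, i) => {
    const raw = values[i] ?? "";
    const type = inferType(raw);
    if (type === "number") result[key] = Number(raw);
    else if (type === "boolean") result[key] = raw.trim() === "true";
    else result[key] = raw;
  });
  return result;
}

console.log(toRecord("id,title,active", '7,"Risk factors, 2024",true'));
```

The quoted `title` field keeps its embedded comma, while `id` and `active` are converted to a number and a boolean.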

Structured data
JSON NDJSON JSONL

Nested JSON flattened up to 6 levels deep. NDJSON/JSONL line-by-line. Arrays of records extracted automatically. Schema inference detects field types and relationships.
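
As a rough illustration of the flattening described above, the sketch below turns nested objects into dot-path keys with a depth limit of 6 and reads NDJSON line by line. The function names, the dot-path key format, and the choice to store anything below the limit as a JSON string are assumptions made for this example; the page does not specify how Knowledgebase handles these cases.

```typescript
// Hypothetical sketch; names and depth-limit behavior are assumptions.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };
type FlatRecord = Record<string, Json>;

const MAX_DEPTH = 6; // the limit quoted in the text above

function isPlainObject(value: Json): value is { [key: string]: Json } {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Flatten nested objects into dot-path keys. Arrays are kept as values;
// objects nested deeper than maxDepth are stored as JSON strings.
function flatten(
  value: Json,
  maxDepth: number = MAX_DEPTH,
  prefix = "",
  depth = 1,
  out: FlatRecord = {}
): FlatRecord {
  if (!isPlainObject(value)) {
    out[prefix] = value;
    return out;
  }
  if (depth > maxDepth) {
    out[prefix] = JSON.stringify(value);
    return out;
  }
  for (const key of Object.keys(value)) {
    flatten(value[key], maxDepth, prefix ? `${prefix}.${key}` : key, depth + 1, out);
  }
  return out;
}

// NDJSON/JSONL: one JSON value per non-empty line.
function parseNdjson(text: string): Json[] {
  return text
    .split(/\r?\n/)
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as Json);
}

// A top-level array is treated as an array of records.
function flattenRecords(input: Json): FlatRecord[] {
  const items = Array.isArray(input) ? input : [input];
  return items.map((item) => flatten(item));
}

console.log(flattenRecords(parseNdjson('{"company":{"name":"NVIDIA"},"filing":"10-K"}')));
```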

Images & scanned PDFs
JPEG PNG WebP Scanned PDF

Vision OCR via Llama 3.2 Vision and Llama 4 Scout. Images embedded in PDFs are extracted and transcribed. Markdown Conversion output is merged with the vision output.

Live sources
URL SEC EDGAR

Import from URLs or pull SEC filings by ticker and form type. Auto-ingest stages jobs for queued processing.

Free-form text
Inline text Records

Paste raw text or POST structured records directly via the API. Schema-free domain text ingestion for notes, transcripts, and unstructured knowledge.

Schemas inferred, not configured.

Upload a sample and Knowledgebase detects field types, entity types, and cross-record relationships — then extracts structured entities with provenance into a queryable graph.

input.json nested, 3 levels
[
  {
    "company": {
      "name": "NVIDIA",
      "ticker": "NVDA"
    },
    "filing": "10-K",
    "risk_factors": [
      "Supply chain concentration",
      "Export controls"
    ]
  }
]
inferred_schema auto
Entity: Filing
  company.name: string
  company.ticker: string
  filing: string
  risk_factors: string[]
Relationships
  Filing → Company
Queryable as
  counterparty: NVIDIA
  filing: 10-K
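
The field-type part of the inference shown above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Knowledgebase's inference logic: `inferSchema`, `describeValue`, and the `mixed` and `unknown[]` labels are hypothetical, and entity and relationship detection are out of scope here.

```typescript
// Hypothetical sketch of field-type inference over sample records.
type Schema = Record<string, string>;

// Describe a leaf value: primitives by typeof, arrays by element type.
function describeValue(value: unknown): string {
  if (value === null) return "null";
  if (Array.isArray(value)) {
    if (value.length === 0) return "unknown[]";
    const first = describeValue(value[0]);
    const uniform = value.every((item) => describeValue(item) === first);
    return uniform ? `${first}[]` : "mixed[]";
  }
  return typeof value;
}

// Walk nested objects and record one type per dot-path field.
function walk(value: unknown, prefix: string, schema: Schema): void {
  if (typeof value === "object" && value !== null && !Array.isArray(value)) {
    const obj = value as Record<string, unknown>;
    for (const key of Object.keys(obj)) {
      walk(obj[key], prefix ? `${prefix}.${key}` : key, schema);
    }
    return;
  }
  const kind = describeValue(value);
  const seen = schema[prefix];
  // Conflicting types across records collapse to "mixed".
  schema[prefix] = seen === undefined || seen === kind ? kind : "mixed";
}

function inferSchema(records: unknown[]): Schema {
  const schema: Schema = {};
  for (const item of records) walk(item, "", schema);
  return schema;
}

const sample = [
  {
    company: { name: "NVIDIA", ticker: "NVDA" },
    filing: "10-K",
    risk_factors: ["Supply chain concentration", "Export controls"],
  },
];
console.log(inferSchema(sample));
```

Run against the `input.json` sample, the sketch yields the same four field types listed in the inferred schema.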

One API. Two endpoints.

Your agents call /v1/kb/search for cited retrieval or /v1/kb/query for cited answers. Service-key auth, tenant isolation, streaming support.

query.sh
curl -X POST https://knowledgebase.sarthakagrawal927.workers.dev/v1/kb/query \
  -H "Authorization: Bearer $RAG_SERVICE_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "domain": "sec",
    "question": "What risk factors did NVIDIA disclose?",
    "mode": "hybrid",
    "answer_mode": "extractive"
  }'

# → { answer, citations: [{ chunk_id, score, document }], ... }
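
The same request can be made from TypeScript, for example inside a Worker or a Node 18+ process where `fetch` is available globally. The sketch below mirrors the curl call above; the interface field types are assumptions inferred from the response comment, and `buildQueryRequest` and `queryKnowledgebase` are hypothetical helper names, not an official client.

```typescript
// Field types are assumptions based on the response comment in query.sh.
interface Citation {
  chunk_id: string | number;
  score: number;
  document: string;
}

interface QueryResponse {
  answer: string;
  citations: Citation[];
}

interface QueryParams {
  domain: string;
  question: string;
  mode: string;
  answer_mode: string;
}

interface PreparedRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

const KB_BASE_URL = "https://knowledgebase.sarthakagrawal927.workers.dev";

// Build the request separately so it can be inspected without network access.
function buildQueryRequest(serviceKey: string, params: QueryParams): PreparedRequest {
  return {
    url: `${KB_BASE_URL}/v1/kb/query`,
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${serviceKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(params),
    },
  };
}

// Send the query and return the parsed answer with its citations.
async function queryKnowledgebase(serviceKey: string, params: QueryParams): Promise<QueryResponse> {
  const { url, init } = buildQueryRequest(serviceKey, params);
  const response = await fetch(url, init);
  if (!response.ok) {
    throw new Error(`Query failed with HTTP ${response.status}`);
  }
  return (await response.json()) as QueryResponse;
}
```

A caller would pass the service key (the `RAG_SERVICE_KEY` value used in the curl example) and the same `domain`, `question`, `mode`, and `answer_mode` fields, then read `answer` and iterate over `citations`.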

Why not generic chat RAG?

|              | Generic chat RAG        | Knowledgebase                           |
|--------------|-------------------------|-----------------------------------------|
| Formats      | PDF, TXT only           | 12+: PDF, DOCX, XLSX, JSON, CSV, images |
| Citations    | Optional, often missing | Every answer, chunk-level               |
| Schemas      | Flat text chunks        | Inferred, entity graph                  |
| Retrieval    | Vector-only             | Lexical + semantic + hybrid             |
| Verification | Extra LLM call          | Deterministic, zero AI                  |
| Latency      | 2-5s typical            | 99ms lexical p95                        |
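
The page does not say how lexical and semantic results are combined in hybrid mode. One common, deterministic way to merge two ranked lists is reciprocal rank fusion (RRF), sketched below purely as an illustration; the function name, the chunk IDs, and the constant `k = 60` are assumptions for this example, and Knowledgebase may use a different method.

```typescript
// Illustrative only: reciprocal rank fusion, not necessarily what Knowledgebase does.
interface FusedResult {
  id: string;
  score: number;
}

// Each ranking is a list of chunk IDs, best match first.
// A chunk's fused score is the sum of 1 / (k + rank) over every list it appears in.
function reciprocalRankFusion(rankings: string[][], k = 60): FusedResult[] {
  const scores: Record<string, number> = {};
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores[id] = (scores[id] ?? 0) + 1 / (k + rank);
    });
  }
  return Object.keys(scores)
    .map((id) => ({ id, score: scores[id] }))
    .sort((a, b) => b.score - a.score);
}

// Hypothetical chunk IDs for the two retrieval modes.
const lexicalHits = ["chunk-47", "chunk-52", "chunk-10"];
const semanticHits = ["chunk-47", "chunk-99", "chunk-52"];

console.log(reciprocalRankFusion([lexicalHits, semanticHits]));
```

Chunks that rank well in both lists rise to the top, and because no model call is involved, the same inputs always produce the same order.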

Give your agents a memory they can cite.

Create a domain, drop in any file — PDF, XLSX, nested JSON, CSV, scanned images — and get a cited search endpoint in under a minute. No ETL, no Python, no infrastructure.

Open the dashboard