12+ formats · nested JSON · schema inference

Any structure in. Cited answers out.

Drop in PDFs, spreadsheets, nested JSON, CSV, HTML, DOCX, scanned images — Knowledgebase detects the structure, infers schemas, extracts entities, and gives your agents cited search over the lot.

99ms lexical p95

Open the dashboard · See it in action →

A query, answered with evidence

What risk factors did NVIDIA disclose in 2024?

NVIDIA identified supply chain concentration in TSMC as a primary risk, citing dependency on a single foundry for all GPU manufacturing. The company also flagged U.S. export controls on AI accelerators as material to revenue.

Sources

1. NVDA_10-K_2024.pdf · chunk 47 · score 0.91

2. NVDA_10-K_2024.pdf · chunk 52 · score 0.87

Every format. Every structure.

One ingestion pipeline handles documents, spreadsheets, structured data, and images. No pre-processing, no ETL, no Python runtime — parsing is TypeScript-native on Cloudflare Workers.

Documents

PDF HTML DOCX PPTX TXT MD

Digital PDFs: text plus table rows. Scanned PDFs: OCR via Cloudflare vision models. HTML: tag stripping. DOCX: paragraphs. PPTX: slide text.

Spreadsheets

XLSX CSV

XLSX rows with headers. CSV with quoted fields and embedded commas. Each row becomes a searchable record with inferred field types.
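As a rough illustration of what "quoted fields and embedded commas" involves, here is a minimal TypeScript sketch of an RFC 4180 style line splitter. The function names `parseCsvLine` and `toRecord` are hypothetical and are not part of the Knowledgebase API; the real parser may behave differently, for example with quoted fields that span multiple lines, which this sketch does not handle.

```typescript
// Split one CSV line into fields, honoring double-quoted fields,
// embedded commas, and "" as an escaped quote inside a quoted field.
function parseCsvLine(line: string): string[] {
  const fields: string[] = [];
  let current = "";
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line.charAt(i);
    if (inQuotes) {
      if (ch === '"' && line.charAt(i + 1) === '"') {
        current += '"'; // escaped quote
        i++;
      } else if (ch === '"') {
        inQuotes = false; // closing quote
      } else {
        current += ch;
      }
    } else if (ch === '"') {
      inQuotes = true; // opening quote
    } else if (ch === ",") {
      fields.push(current);
      current = "";
    } else {
      current += ch;
    }
  }
  fields.push(current);
  return fields;
}

// Pair a header row with a data row to form one record.
function toRecord(headers: string[], row: string[]): Record<string, string> {
  const record: Record<string, string> = {};
  headers.forEach((header, i) => {
    record[header] = row[i] ?? "";
  });
  return record;
}

const headers = parseCsvLine("ticker,name,note");
const record = toRecord(
  headers,
  parseCsvLine('NVDA,"NVIDIA, Corp.","said ""material"""'),
);
console.log(record);
```

The comma inside `"NVIDIA, Corp."` stays within a single field, so the row still maps onto three headers.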

Structured data

JSON NDJSON JSONL

Nested JSON flattened up to 6 levels deep. NDJSON/JSONL line-by-line. Arrays of records extracted automatically. Schema inference detects field types and relationships.
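To make the flattening step concrete, the sketch below turns nested objects into dot-path keys and stops descending at six levels. It is an assumption-laden illustration, not the product's implementation: the `flatten` helper is hypothetical, and keeping deeper objects as JSON strings and leaving arrays intact are choices made here for simplicity.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };
type JsonObject = { [key: string]: Json };

function isPlainObject(value: Json): value is JsonObject {
  return value !== null && typeof value === "object" && !Array.isArray(value);
}

// Flatten nested objects into dot-path keys, descending at most maxDepth levels.
// Objects below the limit are stored as JSON strings (an assumption of this sketch).
function flatten(
  obj: JsonObject,
  maxDepth = 6,
  prefix = "",
  depth = 1,
  out: Record<string, Json> = {},
): Record<string, Json> {
  for (const [key, value] of Object.entries(obj)) {
    const path = prefix ? `${prefix}.${key}` : key;
    if (isPlainObject(value)) {
      if (depth < maxDepth) {
        flatten(value, maxDepth, path, depth + 1, out);
      } else {
        out[path] = JSON.stringify(value); // too deep: keep as opaque text
      }
    } else {
      out[path] = value; // primitives and arrays are kept as-is
    }
  }
  return out;
}

const filing: JsonObject = {
  company: { name: "NVIDIA", ticker: "NVDA" },
  filing: "10-K",
  risk_factors: ["Supply chain concentration", "Export controls"],
};
const flat = flatten(filing);
console.log(flat);
```

For this record the resulting keys are `company.name`, `company.ticker`, `filing`, and `risk_factors`.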

Images & scanned PDFs

JPEG PNG WebP Scanned PDF

Vision OCR via Llama 3.2 Vision and Llama 4 Scout. Images embedded in PDFs are extracted and transcribed. Markdown Conversion output is merged with the vision output.

Live sources

URL SEC EDGAR

Import from URLs or pull SEC filings by ticker and form type. Auto-ingest stages jobs for queued processing.

Free-form text

Inline text Records

Paste raw text or POST structured records directly via the API. Schema-free text ingestion into a domain, for notes, transcripts, and other unstructured knowledge.

Schemas inferred, not configured.

Upload a sample and Knowledgebase detects field types, entity types, and cross-record relationships — then extracts structured entities with provenance into a queryable graph.

input.json · nested, 3 levels

[
  {
    "company": {
      "name": "NVIDIA",
      "ticker": "NVDA"
    },
    "filing": "10-K",
    "risk_factors": [
      "Supply chain concentration",
      "Export controls"
    ]
  }
]

inferred_schema · auto

Entity: Filing

company.name: string
company.ticker: string
filing: string
risk_factors: string[]

Relationships

Filing → Company

Queryable as counterparty: NVIDIA · filing: 10-K
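One simple way to picture field-type inference is to scan flattened records and note the type seen at each path. The TypeScript sketch below does only that; `typeOf` and `inferSchema` are hypothetical names, and entity or relationship detection is out of scope here. How Knowledgebase actually infers schemas is not documented on this page.

```typescript
// Describe a value's type, using "T[]" for uniform arrays.
function typeOf(value: unknown): string {
  if (value === null) return "null";
  if (Array.isArray(value)) {
    if (value.length === 0) return "unknown[]";
    const first = typeOf(value[0]);
    return value.every((item) => typeOf(item) === first) ? `${first}[]` : "mixed[]";
  }
  return typeof value;
}

// Infer one type per field across all records; conflicting types become "mixed".
function inferSchema(records: Record<string, unknown>[]): Record<string, string> {
  const schema: Record<string, string> = {};
  for (const record of records) {
    for (const [field, value] of Object.entries(record)) {
      const observed = typeOf(value);
      const seen = schema[field];
      if (seen === undefined) {
        schema[field] = observed;
      } else if (seen !== observed) {
        schema[field] = "mixed";
      }
    }
  }
  return schema;
}

// Flattened form of the sample record shown above.
const schema = inferSchema([
  {
    "company.name": "NVIDIA",
    "company.ticker": "NVDA",
    filing: "10-K",
    risk_factors: ["Supply chain concentration", "Export controls"],
  },
]);
console.log(schema);
```

Run against the sample record, this yields `string` for the three scalar fields and `string[]` for `risk_factors`, matching the inferred schema shown above.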

One API. Two endpoints.

Your agents call /v1/kb/search for cited retrieval or /v1/kb/query for cited answers. Service-key auth, tenant isolation, streaming support.

query.sh

curl -X POST https://knowledgebase.sarthakagrawal927.workers.dev/v1/kb/query \
  -H "Authorization: Bearer $RAG_SERVICE_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "domain": "sec",
    "question": "What risk factors did NVIDIA disclose?",
    "mode": "hybrid",
    "answer_mode": "extractive"
  }'

# → { answer, citations: [{ chunk_id, score, document }], ... }
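The same call can be made from TypeScript. The sketch below mirrors the curl request; the response interfaces are inferred from the comment above and are assumptions, as are the field types (for example, whether `chunk_id` is a string or a number). `buildQueryRequest` and `queryKnowledgebase` are hypothetical helper names, not a published SDK.

```typescript
// Assumed response shape, based only on the fields named in the curl example.
interface Citation {
  chunk_id: string | number;
  score: number;
  document: string;
}

interface QueryResponse {
  answer: string;
  citations: Citation[];
}

interface BuiltRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Build the request without sending it, so it can be inspected or tested.
function buildQueryRequest(
  baseUrl: string,
  serviceKey: string,
  domain: string,
  question: string,
): BuiltRequest {
  return {
    url: `${baseUrl}/v1/kb/query`,
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${serviceKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        domain,
        question,
        mode: "hybrid",
        answer_mode: "extractive",
      }),
    },
  };
}

// Send the query and return the parsed body; throws on non-2xx responses.
async function queryKnowledgebase(
  baseUrl: string,
  serviceKey: string,
  domain: string,
  question: string,
): Promise<QueryResponse> {
  const { url, init } = buildQueryRequest(baseUrl, serviceKey, domain, question);
  const res = await fetch(url, init);
  if (!res.ok) {
    throw new Error(`Knowledgebase query failed with status ${res.status}`);
  }
  return (await res.json()) as QueryResponse;
}

const example = buildQueryRequest(
  "https://knowledgebase.sarthakagrawal927.workers.dev",
  "YOUR_SERVICE_KEY", // placeholder; load the real key from your environment
  "sec",
  "What risk factors did NVIDIA disclose?",
);
console.log(example.url);
```

Separating request construction from the network call keeps the service key handling and payload easy to unit test.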

Why not generic chat RAG?

	Generic chat RAG	Knowledgebase
Formats	PDF, TXT only	12+ — PDF, DOCX, XLSX, JSON, CSV, images
Citations	Optional, often missing	Every answer, chunk-level
Schemas	Flat text chunks	Inferred, entity graph
Retrieval	Vector-only	Lexical + semantic + hybrid
Verification	Extra LLM call	Deterministic, zero AI
Latency	2-5s typical	99ms lexical p95
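This page does not say how hybrid mode combines lexical and semantic results. One widely used option is Reciprocal Rank Fusion (RRF), sketched below purely as an illustration; it should not be read as Knowledgebase's actual ranking method. The chunk IDs and the constant `k = 60` are example values.

```typescript
// Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per item,
// with rank starting at 1. Items ranked well in several lists rise to the top.
function reciprocalRankFusion(
  rankings: string[][],
  k = 60,
): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const contribution = 1 / (k + index + 1);
      scores.set(id, (scores.get(id) ?? 0) + contribution);
    });
  }
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score,
  );
}

// Hypothetical result lists from a lexical and a semantic retriever.
const lexical = ["chunk-47", "chunk-12", "chunk-52"];
const semantic = ["chunk-52", "chunk-47", "chunk-90"];
const fused = reciprocalRankFusion([lexical, semantic]);
console.log(fused.map((item) => item.id));
```

Because RRF uses only ranks, it needs no score normalization between a lexical scorer and a vector similarity measure.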

Give your agents a memory they can cite.

Create a domain, drop in any file — PDF, XLSX, nested JSON, CSV, scanned images — and get a cited search endpoint in under a minute. No ETL, no Python, no infrastructure.
