AI Toolkit

knowledge

Document ingestion, chunking, embedding, and semantic search

Overview

The knowledge module provides a complete pipeline for working with documents: parse PDFs, chunk text, embed with any provider, store in a vector database, and search semantically. It wraps @llamaindex/liteparse for PDF parsing and integrates with the chain module for text splitting.

Peer dependencies: @llamaindex/liteparse

npm install @llamaindex/liteparse
yarn add @llamaindex/liteparse
pnpm add @llamaindex/liteparse

Quick Start

import { parseDocument, chunk } from '@jamaalbuilds/ai-toolkit/knowledge';

const doc = await parseDocument('./report.pdf');
const chunks = await chunk(doc.content, { chunkSize: 512, chunkOverlap: 50 });

API Reference

parseDocument(source)

Parse a document (PDF or plain text) into structured text.

function parseDocument(source: string | Buffer): Promise<KnowledgeDocument>
const doc = await parseDocument('./quarterly-report.pdf');
doc.content;  // extracted text
doc.metadata; // { source, pages, ... }

chunk(text, options?)

Split text into overlapping chunks for embedding.

function chunk(text: string, options?: ChunkOptions): Promise<DocumentChunk[]>
Parameter             Type    Description
text                  string  Text to split
options.chunkSize     number  Maximum characters per chunk (default: 1000)
options.chunkOverlap  number  Character overlap between chunks (default: 200)
const chunks = await chunk(longText, { chunkSize: 500, chunkOverlap: 50 });
// [{ content: '...', metadata: {...} }, { content: '...', metadata: {...} }, ...]
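To make chunkSize and chunkOverlap concrete: each chunk starts chunkSize − chunkOverlap characters after the previous one, so consecutive chunks share chunkOverlap characters. The sliding window can be sketched at the character level like this (the module's actual splitter comes from the chain module, so treat this purely as an illustration):

```typescript
// Sliding-window chunking sketch: each chunk starts
// (chunkSize - chunkOverlap) characters after the last.
// Illustration only -- not the library's actual splitter.
function chunkText(text: string, chunkSize = 1000, chunkOverlap = 200): string[] {
  const step = chunkSize - chunkOverlap;
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last window reached the end
  }
  return chunks;
}

const pieces = chunkText('a'.repeat(1200), 500, 50);
// windows: [0,500), [450,950), [900,1200) -> 3 chunks
```

The overlap is what lets a query match text that straddles a chunk boundary: with the defaults (1000/200), any 200-character run near a boundary appears whole in at least one chunk.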

createKnowledge(config)

Create a knowledge client that combines embedding + vector store for end-to-end RAG.

function createKnowledge(config: KnowledgeConfig): KnowledgeClient
Parameter        Type           Description
config.embedder  EmbedFunction  async (texts: string[]) => number[][]
config.store     VectorStore    Vector store with upsert() and search()
const knowledge = createKnowledge({
  embedder: async (texts) => embeddings.embed(texts),
  store: myVectorStore,
});
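The store only needs to expose upsert() and search(). As an illustration of that contract, here is a minimal in-memory store using cosine similarity; the record fields (id, vector, content) and the search signature are assumptions made for this sketch, not the library's canonical interface:

```typescript
// Minimal in-memory vector store sketch. Field names and method
// signatures are assumptions for illustration -- check the package's
// VectorStore interface for the real contract.
type StoredChunk = { id: string; vector: number[]; content: string };

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

class InMemoryStore {
  private records: StoredChunk[] = [];

  // Insert new records, replacing any existing record with the same id.
  async upsert(records: StoredChunk[]): Promise<void> {
    for (const r of records) {
      const i = this.records.findIndex((x) => x.id === r.id);
      if (i >= 0) this.records[i] = r;
      else this.records.push(r);
    }
  }

  // Return the `limit` records most similar to the query vector.
  async search(vector: number[], limit = 5) {
    return this.records
      .map((r) => ({ ...r, similarity: cosine(vector, r.vector) }))
      .sort((a, b) => b.similarity - a.similarity)
      .slice(0, limit);
  }
}
```

Any real store (pgvector, Pinecone, etc.) can be adapted behind the same two methods.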

ingest(input, embedder, store, options?)

Parse, chunk, embed, and store a document in one call.

function ingest(
  input: string | Buffer | Uint8Array,
  embedder: EmbedFunction,
  store: VectorStore,
  options?: IngestOptions,
): Promise<IngestResult>
Parameter             Type                          Description
input                 string | Buffer | Uint8Array  Document source (path, raw text, or binary)
embedder              EmbedFunction                 async (texts: string[]) => number[][]
store                 VectorStore                   Vector store with upsert() and search()
options.metadata      Record<string, unknown>       Metadata to attach to chunks
options.chunkSize     number                        Override chunk size
options.chunkOverlap  number                        Override chunk overlap
const result = await ingest(
  './report.pdf',
  async (texts) => embeddings.embed(texts),
  myVectorStore,
  { metadata: { source: 'quarterly-report', year: 2024 } },
);
// result.chunks — number of chunks stored

search(query, embedder, store, options?)

Semantic search across ingested documents. Also available as knowledge.search(query, options?) on a KnowledgeClient.

function search(
  query: string,
  embedder: EmbedFunction,
  store: VectorStore,
  options?: SearchOptions,
): Promise<SearchResult[]>
Parameter          Type    Description
options.limit      number  Maximum results (default: 5)
options.threshold  number  Minimum similarity score (0-1)
const results = await knowledge.search('revenue growth Q3', { limit: 5 });
for (const r of results) {
  r.chunk.content;  // matched text
  r.similarity;     // similarity score (0-1)
}

Types

  • KnowledgeClient — client with ingest() and search() methods
  • KnowledgeDocument — parsed document with text and metadata
  • DocumentChunk — chunk with text, index, and metadata
  • EmbedFunction — (texts: string[]) => Promise<number[][]>
  • VectorStore — interface with upsert() and search()
  • SearchResult — matched chunk, similarity score, metadata
  • IngestResult — chunks count, metadata
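The shapes above can be sketched in TypeScript for orientation. Only the fields shown in the examples on this page are grounded; everything else here is a guess, and the package's own .d.ts files are authoritative:

```typescript
// Hedged sketch of the types implied by this page's examples.
// Field names beyond those shown above are assumptions.
type EmbedFunction = (texts: string[]) => Promise<number[][]>;

interface DocumentChunk {
  content: string;
  metadata?: Record<string, unknown>;
}

interface SearchResult {
  chunk: DocumentChunk;
  similarity: number; // 0-1
}

interface VectorStore {
  upsert(records: { id: string; vector: number[]; chunk: DocumentChunk }[]): Promise<void>;
  search(vector: number[], options?: { limit?: number; threshold?: number }): Promise<SearchResult[]>;
}
```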