# knowledge
Document ingestion, chunking, embedding, and semantic search
## Overview
The knowledge module provides a complete pipeline for working with documents: parse PDFs, chunk text, embed with any provider, store in a vector database, and search semantically. It wraps `@llamaindex/liteparse` for PDF parsing and integrates with the chain module for text splitting.
Peer dependencies: `@llamaindex/liteparse`

```sh
npm install @llamaindex/liteparse
# or: yarn add @llamaindex/liteparse
# or: pnpm add @llamaindex/liteparse
```

## Quick Start
```ts
import { parseDocument, chunk } from '@jamaalbuilds/ai-toolkit/knowledge';

const doc = await parseDocument('./report.pdf');
const chunks = await chunk(doc.content, { chunkSize: 512, chunkOverlap: 50 });
```
## API Reference
### parseDocument(source)

Parse a document (PDF, text) into structured text.

```ts
function parseDocument(source: string | Buffer): Promise<KnowledgeDocument>
```

```ts
const doc = await parseDocument('./quarterly-report.pdf');

doc.content;  // extracted text
doc.metadata; // { source, pages, ... }
```
### chunk(text, options?)

Split text into overlapping chunks for embedding.

```ts
function chunk(text: string, options?: ChunkOptions): Promise<DocumentChunk[]>
```

| Parameter | Type | Description |
|---|---|---|
| `text` | `string` | Text to split |
| `options.chunkSize` | `number` | Maximum characters per chunk (default: 1000) |
| `options.chunkOverlap` | `number` | Character overlap between chunks (default: 200) |

```ts
const chunks = await chunk(longText, { chunkSize: 500, chunkOverlap: 50 });
// [{ content: '...', metadata: {...} }, { content: '...', metadata: {...} }, ...]
```
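As a rough illustration of what `chunkSize` and `chunkOverlap` control, here is a simplified sliding-window splitter. This is a sketch of the general technique, not the module's actual implementation (which may split on sentence or token boundaries):

```ts
// Simplified sliding-window split: each chunk is at most `chunkSize`
// characters, and consecutive chunks share `chunkOverlap` characters.
function splitWithOverlap(
  text: string,
  chunkSize = 1000,
  chunkOverlap = 200,
): string[] {
  if (chunkOverlap >= chunkSize) {
    throw new Error('chunkOverlap must be smaller than chunkSize');
  }
  const step = chunkSize - chunkOverlap; // how far the window advances
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last window reached the end
  }
  return chunks;
}
```

The overlap keeps context that straddles a chunk boundary retrievable from both neighbors, at the cost of storing some text twice.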
### createKnowledge(config)

Create a knowledge client that combines embedding + vector store for end-to-end RAG.

```ts
function createKnowledge(config: KnowledgeConfig): KnowledgeClient
```

| Parameter | Type | Description |
|---|---|---|
| `config.embedder` | `EmbedFunction` | `async (texts: string[]) => number[][]` |
| `config.store` | `VectorStore` | Vector store with `upsert()` and `search()` |

```ts
const knowledge = createKnowledge({
  embedder: async (texts) => embeddings.embed(texts),
  store: myVectorStore,
});
```
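For tests and local development, any function matching the `EmbedFunction` shape can stand in for a real provider. The hash-based stub below is purely illustrative (it is not part of the toolkit and produces no semantically meaningful vectors), but it is deterministic and cheap:

```ts
// Deterministic stub matching the EmbedFunction shape:
// (texts: string[]) => Promise<number[][]>. Illustration only — swap in a
// real embedding provider for anything beyond wiring tests.
const DIM = 8;

async function stubEmbedder(texts: string[]): Promise<number[][]> {
  return texts.map((text) => {
    const vec = new Array<number>(DIM).fill(0);
    for (let i = 0; i < text.length; i++) {
      vec[i % DIM] += text.charCodeAt(i); // bucket character codes by position
    }
    const norm = Math.hypot(...vec) || 1;
    return vec.map((v) => v / norm); // unit-normalize the vector
  });
}
```

Passing `stubEmbedder` as `config.embedder` lets you exercise ingest/search plumbing without network calls or API keys.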
### ingest(input, embedder, store, options?)

Parse, chunk, embed, and store a document in one call.

```ts
function ingest(
  input: string | Buffer | Uint8Array,
  embedder: EmbedFunction,
  store: VectorStore,
  options?: IngestOptions,
): Promise<IngestResult>
```

| Parameter | Type | Description |
|---|---|---|
| `input` | `string \| Buffer \| Uint8Array` | Document source (path, raw text, or binary) |
| `embedder` | `EmbedFunction` | `async (texts: string[]) => number[][]` |
| `store` | `VectorStore` | Vector store with `upsert()` and `search()` |
| `options.metadata` | `Record<string, unknown>` | Metadata to attach to chunks |
| `options.chunkSize` | `number` | Override chunk size |
| `options.chunkOverlap` | `number` | Override chunk overlap |

```ts
const result = await ingest(
  './report.pdf',
  async (texts) => embeddings.embed(texts),
  myVectorStore,
  { metadata: { source: 'quarterly-report', year: 2024 } },
);

// result.chunks — number of chunks stored
```
### search(query, embedder, store, options?)

Semantic search across ingested documents. Also available as `knowledge.search(query, options?)` on a `KnowledgeClient`.

```ts
function search(
  query: string,
  embedder: EmbedFunction,
  store: VectorStore,
  options?: SearchOptions,
): Promise<SearchResult[]>
```

| Parameter | Type | Description |
|---|---|---|
| `options.limit` | `number` | Maximum results (default: 5) |
| `options.threshold` | `number` | Minimum similarity score (0-1) |

```ts
const results = await knowledge.search('revenue growth Q3', { limit: 5 });

for (const r of results) {
  r.chunk.content; // matched text
  r.similarity;    // similarity score (0-1)
}
```
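Cosine similarity is a common way scores like `r.similarity` are computed in vector search; whether this module uses cosine specifically is an assumption here, shown only to explain what a 0-1 score means (for typical non-negative embedding pairs the value stays in that range):

```ts
// Cosine similarity: dot product of the vectors divided by the product of
// their lengths. 1 means identical direction, 0 means orthogonal.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}
```

A `threshold` of, say, `0.7` would then drop any chunk whose vector points in a noticeably different direction from the query vector.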
## Types

- `KnowledgeClient` — client with `ingest()` and `search()` methods
- `KnowledgeDocument` — parsed document with text and metadata
- `DocumentChunk` — chunk with text, index, and metadata
- `EmbedFunction` — `(texts: string[]) => Promise<number[][]>`
- `VectorStore` — interface with `upsert()` and `search()`
- `SearchResult` — text, score, metadata
- `IngestResult` — chunks count, metadata
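For prototyping without a real vector database, the `VectorStore` interface can be satisfied by a small in-memory implementation. The method signatures below are assumed from the descriptions above (`upsert()` plus a vector-based `search()`), not taken from the toolkit's type definitions:

```ts
// Minimal in-memory store with an upsert()/search() shape like the one the
// docs describe. The exact signatures are an assumption for illustration.
interface StoredChunk {
  id: string;
  vector: number[];
  content: string;
  metadata?: Record<string, unknown>;
}

function createInMemoryStore() {
  const rows = new Map<string, StoredChunk>();

  const cosine = (a: number[], b: number[]): number => {
    let dot = 0, na = 0, nb = 0;
    for (let i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      na += a[i] * a[i];
      nb += b[i] * b[i];
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
  };

  return {
    async upsert(chunks: StoredChunk[]): Promise<void> {
      for (const c of chunks) rows.set(c.id, c); // insert or overwrite by id
    },
    async search(vector: number[], limit = 5) {
      return Array.from(rows.values())
        .map((row) => ({ chunk: row, similarity: cosine(vector, row.vector) }))
        .sort((x, y) => y.similarity - x.similarity)
        .slice(0, limit);
    },
  };
}
```

Keying rows by `id` makes `upsert()` idempotent, so re-ingesting the same document replaces its chunks instead of duplicating them.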