AI Toolkit

knowledge

Document ingestion, chunking, embedding, and semantic search

Overview

The knowledge module provides a complete pipeline for working with documents: parse PDFs, chunk text, embed with any provider, store in a vector database, and search semantically. It wraps @llamaindex/liteparse for PDF parsing and integrates with the chain module for text splitting.

Peer dependencies: @llamaindex/liteparse

npm install @llamaindex/liteparse
yarn add @llamaindex/liteparse
pnpm add @llamaindex/liteparse

Quick Start

import { parseDocument, chunk } from '@jamaalbuilds/ai-toolkit/knowledge';

const doc = await parseDocument('./report.pdf');
const chunks = await chunk(doc.content, { chunkSize: 512, chunkOverlap: 50 });

API Reference

parseDocument(source)

Parse a document (PDF or plain text) into structured text.

function parseDocument(source: string | Buffer): Promise<KnowledgeDocument>
const doc = await parseDocument('./quarterly-report.pdf');
doc.content;  // extracted text
doc.metadata; // { source, pages, ... }

chunk(text, options?)

Split text into overlapping chunks for embedding.

function chunk(text: string, options?: ChunkOptions): Promise<DocumentChunk[]>
Parameter             Type    Description
text                  string  Text to split
options.chunkSize     number  Maximum characters per chunk (default: 1000)
options.chunkOverlap  number  Character overlap between chunks (default: 200)
const chunks = await chunk(longText, { chunkSize: 500, chunkOverlap: 50 });
// [{ content: '...', metadata: {...} }, { content: '...', metadata: {...} }, ...]
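To make chunkSize and chunkOverlap concrete: each chunk starts chunkSize − chunkOverlap characters after the previous one, so consecutive chunks share chunkOverlap characters. The sliding window can be sketched at the character level like this (the module's actual splitter comes from the chain module, so treat this purely as an illustration):

```typescript
// Sliding-window chunking sketch: each chunk starts
// (chunkSize - chunkOverlap) characters after the last.
// Illustration only -- not the library's actual splitter.
function chunkText(text: string, chunkSize = 1000, chunkOverlap = 200): string[] {
  const step = chunkSize - chunkOverlap;
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last window reached the end
  }
  return chunks;
}

const pieces = chunkText('a'.repeat(1200), 500, 50);
// windows: [0,500), [450,950), [900,1200) -> 3 chunks
```

The overlap is what lets a query match text that straddles a chunk boundary: with the defaults (1000/200), any 200-character run near a boundary appears whole in at least one chunk.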

createKnowledge(config)

Create a knowledge client that combines embedding + vector store for end-to-end RAG.

function createKnowledge(config: KnowledgeConfig): KnowledgeClient
Parameter        Type           Description
config.embedder  EmbedFunction  async (texts: string[]) => number[][]
config.store     VectorStore    Vector store with upsert() and search()
const knowledge = createKnowledge({
  embedder: async (texts) => embeddings.embed(texts),
  store: myVectorStore,
});
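The store only needs to expose upsert() and search(). As an illustration of that contract, here is a minimal in-memory store using cosine similarity; the record fields (id, vector, content) and the search signature are assumptions made for this sketch, not the library's canonical interface:

```typescript
// Minimal in-memory vector store sketch. Field names and method
// signatures are assumptions for illustration -- check the package's
// VectorStore interface for the real contract.
type StoredChunk = { id: string; vector: number[]; content: string };

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

class InMemoryStore {
  private records: StoredChunk[] = [];

  // Insert new records, replacing any existing record with the same id.
  async upsert(records: StoredChunk[]): Promise<void> {
    for (const r of records) {
      const i = this.records.findIndex((x) => x.id === r.id);
      if (i >= 0) this.records[i] = r;
      else this.records.push(r);
    }
  }

  // Return the `limit` records most similar to the query vector.
  async search(vector: number[], limit = 5) {
    return this.records
      .map((r) => ({ ...r, similarity: cosine(vector, r.vector) }))
      .sort((a, b) => b.similarity - a.similarity)
      .slice(0, limit);
  }
}
```

Any real store (pgvector, Pinecone, etc.) can be adapted behind the same two methods.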

ingest(input, embedder, store, options?)

Parse, chunk, embed, and store a document in one call.

function ingest(
  input: string | Buffer | Uint8Array,
  embedder: EmbedFunction,
  store: VectorStore,
  options?: IngestOptions,
): Promise<IngestResult>
Parameter             Type                          Description
input                 string | Buffer | Uint8Array  Document source (path, raw text, or binary)
embedder              EmbedFunction                 async (texts: string[]) => number[][]
store                 VectorStore                   Vector store with upsert() and search()
options.metadata      Record<string, unknown>       Metadata to attach to chunks
options.chunkSize     number                        Override chunk size
options.chunkOverlap  number                        Override chunk overlap
const result = await ingest(
  './report.pdf',
  async (texts) => embeddings.embed(texts),
  myVectorStore,
  { metadata: { source: 'quarterly-report', year: 2024 } },
);
// result.chunks — number of chunks stored

search(query, embedder, store, options?)

Semantic search across ingested documents. Also available as knowledge.search(query, options?) on a KnowledgeClient.

function search(
  query: string,
  embedder: EmbedFunction,
  store: VectorStore,
  options?: SearchOptions,
): Promise<SearchResult[]>
Parameter          Type    Description
options.limit      number  Maximum results (default: 5)
options.threshold  number  Minimum similarity score (0-1)
const results = await knowledge.search('revenue growth Q3', { limit: 5 });
for (const r of results) {
  r.chunk.content;  // matched text
  r.similarity;     // similarity score (0-1)
}

Types

  • KnowledgeClient — client with ingest() and search() methods
  • KnowledgeDocument — parsed document with text and metadata
  • DocumentChunk — chunk with text, index, and metadata
  • EmbedFunction — (texts: string[]) => Promise<number[][]>
  • VectorStore — interface with upsert() and search()
  • SearchResult — matched chunk, similarity score, metadata
  • IngestResult — chunks count, metadata
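The shapes above can be sketched in TypeScript for orientation. Only the fields shown in the examples on this page are grounded; everything else here is a guess, and the package's own .d.ts files are authoritative:

```typescript
// Hedged sketch of the types implied by this page's examples.
// Field names beyond those shown above are assumptions.
type EmbedFunction = (texts: string[]) => Promise<number[][]>;

interface DocumentChunk {
  content: string;
  metadata?: Record<string, unknown>;
}

interface SearchResult {
  chunk: DocumentChunk;
  similarity: number; // 0-1
}

interface VectorStore {
  upsert(records: { id: string; vector: number[]; chunk: DocumentChunk }[]): Promise<void>;
  search(vector: number[], options?: { limit?: number; threshold?: number }): Promise<SearchResult[]>;
}
```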