LiteParse is a standalone OSS PDF parsing tool focused exclusively on fast and light parsing. It provides high-quality spatial text parsing with bounding boxes, without proprietary LLM features or cloud dependencies. Everything runs locally on your machine.
Hitting the limits of local parsing?
For complex documents (dense tables, multi-column layouts, charts, handwritten text, or
scanned PDFs), you'll get significantly better results with LlamaParse,
our cloud-based document parser built for production document pipelines. LlamaParse handles the
hard stuff so your models see clean, structured data and markdown.
Install globally via npm to use the lit command anywhere:
npm i -g @llamaindex/liteparse
Then use it:
lit parse document.pdf
lit screenshot document.pdf
On macOS and Linux, liteparse can also be installed via Homebrew:
brew tap run-llama/liteparse
brew install llamaindex-liteparse
You can clone the repo and install the CLI globally from source:
git clone https://github.com/run-llama/liteparse.git
cd liteparse
npm install
npm run build
npm pack
npm install -g ./liteparse-*.tgz
You can use liteparse as an agent skill, downloading it with the skills CLI tool:
npx skills add run-llama/llamaparse-agent-skills --skill liteparse
Alternatively, copy the SKILL.md file into your own skills setup.
# Basic parsing
lit parse document.pdf
# Parse to a specific output format
lit parse document.pdf --format json -o output.json
# Parse specific pages
lit parse document.pdf --target-pages "1-5,10,15-20"
# Parse without OCR
lit parse document.pdf --no-ocr
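The --target-pages value combines single pages and ranges separated by commas. As a rough illustration of how such a spec expands, here is a minimal TypeScript sketch. expandPageSpec is a hypothetical helper, not part of LiteParse, and it assumes pages are 1-based and ranges are inclusive.

```typescript
// Hypothetical helper (not a LiteParse API): expands a page spec such as
// "1-5,10,15-20" into a sorted list of unique page numbers.
// Assumes 1-based pages and inclusive ranges.
function expandPageSpec(spec: string): number[] {
  const pages = new Set<number>();
  for (const part of spec.split(",")) {
    const token = part.trim();
    if (token === "") continue; // tolerate stray commas
    const match = /^(\d+)(?:-(\d+))?$/.exec(token);
    if (!match) {
      throw new Error(`Invalid page spec segment: "${token}"`);
    }
    const start = Number(match[1]);
    const end = match[2] !== undefined ? Number(match[2]) : start;
    if (start < 1 || end < start) {
      throw new Error(`Invalid page range: "${token}"`);
    }
    for (let page = start; page <= end; page++) {
      pages.add(page);
    }
  }
  return Array.from(pages).sort((a, b) => a - b);
}

console.log(expandPageSpec("1-3,10,15-16")); // [ 1, 2, 3, 10, 15, 16 ]
```

A check like this can be useful for validating user input before invoking the CLI. The parser inside lit may accept or reject different edge cases.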
You can also parse an entire directory of documents:
lit batch-parse ./input-directory ./output-directory
Screenshots are essential for LLM agents to extract visual information that text alone cannot capture.
# Screenshot all pages
lit screenshot document.pdf -o ./screenshots
# Screenshot specific pages
lit screenshot document.pdf --target-pages "1,3,5" -o ./screenshots
# Custom DPI
lit screenshot document.pdf --dpi 300 -o ./screenshots
# Screenshot page range
lit screenshot document.pdf --target-pages "1-10" -o ./screenshots
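To get a feel for what a given --dpi value means for image size, recall that PDF page dimensions are expressed in points, with 72 points per inch by default. The sketch below estimates pixel dimensions from that relationship. pageSizeInPixels is a hypothetical helper, and the rounding is an assumption, so the images LiteParse produces may differ by a pixel.

```typescript
// PDF user space defaults to 72 units (points) per inch.
const POINTS_PER_INCH = 72;

// Hypothetical helper (not a LiteParse API): estimates the rendered
// pixel size of a page given its size in points and a target DPI.
function pageSizeInPixels(
  widthPt: number,
  heightPt: number,
  dpi: number
): { width: number; height: number } {
  return {
    width: Math.round((widthPt * dpi) / POINTS_PER_INCH),
    height: Math.round((heightPt * dpi) / POINTS_PER_INCH),
  };
}

// US Letter is 612 x 792 points.
console.log(pageSizeInPixels(612, 792, 150)); // { width: 1275, height: 1650 }
console.log(pageSizeInPixels(612, 792, 300)); // { width: 2550, height: 3300 }
```

Doubling the DPI quadruples the pixel count, which matters when screenshots are sent to a model with image size limits.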
Install as a dependency in your project:
npm install @llamaindex/liteparse
# or
pnpm add @llamaindex/liteparse
import { LiteParse } from '@llamaindex/liteparse';
const parser = new LiteParse({ ocrEnabled: true });
const result = await parser.parse('document.pdf');
console.log(result.text);
You can pass raw bytes directly instead of a file path. PDF buffers are parsed with zero disk I/O — no temp files are written:
import { LiteParse } from '@llamaindex/liteparse';
import { readFile } from 'fs/promises';
const parser = new LiteParse();
// From a file read
const pdfBytes = await readFile('document.pdf');
const result = await parser.parse(pdfBytes);
// From an HTTP response
const response = await fetch('https://example.com/document.pdf');
const buffer = Buffer.from(await response.arrayBuffer());
const result2 = await parser.parse(buffer);
Non-PDF buffers (images, Office documents) are written to a temp directory for format conversion. Screenshots also work with buffer input:
const screenshots = await parser.screenshot(pdfBytes, [1, 2, 3]);
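Since PDF and non-PDF buffers take different paths, it can help to know which one you are about to pass. PDF files normally begin with the bytes "%PDF-", so a quick check on the leading bytes is possible. The sketch below is only an illustration: looksLikePdf is a hypothetical helper, and how LiteParse itself detects the input format is not described here.

```typescript
// Hypothetical helper (not a LiteParse API): returns true when the buffer
// starts with the PDF header signature "%PDF-".
function looksLikePdf(bytes: Uint8Array): boolean {
  const signature = [0x25, 0x50, 0x44, 0x46, 0x2d]; // "%PDF-"
  if (bytes.length < signature.length) return false;
  return signature.every((byte, i) => bytes[i] === byte);
}

// "%PDF-1.7" as raw bytes
const pdfHeader = new Uint8Array([0x25, 0x50, 0x44, 0x46, 0x2d, 0x31, 0x2e, 0x37]);
// First bytes of a PNG file
const pngHeader = new Uint8Array([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);

console.log(looksLikePdf(pdfHeader)); // true
console.log(looksLikePdf(pngHeader)); // false
```

A Node.js Buffer is a Uint8Array, so the result of readFile can be passed to this function directly.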
$ lit parse --help
Usage: lit parse [options] <file>
Parse a document file (PDF, DOCX, XLSX, PPTX, images, etc.)
Options:
-o, --output <file> Output file path
--format <format> Output format: json|text (default: "text")
--ocr-server-url <url> HTTP OCR server URL (uses Tesseract if not provided)
--no-ocr Disable OCR
--ocr-language <lang> OCR language(s) (default: "en")
--num-workers <n> Number of pages to OCR in parallel (default: CPU cores - 1)
--max-pages <n> Max pages to parse (default: "10000")
--target-pages <pages> Target pages (e.g., "1-5,10,15-20")
--dpi <dpi> DPI for rendering (default: "150")
--no-precise-bbox Disable precise bounding boxes
--preserve-small-text Preserve very small text
--config <file> Config file (JSON)
-q, --quiet Suppress progress output
-h, --help display help for command
$ lit batch-parse --help
Usage: lit batch-parse [options] <input-dir> <output-dir>
Parse multiple documents in batch mode (reuses PDF engine for efficiency)
Options:
--format <format> Output format: json|text (default: "text")
--ocr-server-url <url> HTTP OCR server URL (uses Tesseract if not provided)
--no-ocr Disable OCR
--ocr-language <lang> OCR language(s) (default: "en")
--num-workers <n> Number of pages to OCR in parallel (default: CPU cores - 1)
--max-pages <n> Max pages to parse per file (default: "10000")
--dpi <dpi> DPI for rendering (default: "150")
--no-precise-bbox Disable precise bounding boxes
--recursive Recursively search input directory
--extension <ext> Only process files with this extension (e.g., ".pdf")
--config <file> Config file (JSON)
-q, --quiet Suppress progress output
-h, --help display help for command
$ lit screenshot --help
Usage: lit screenshot [options] <file>
Generate screenshots of PDF pages
Options:
-o, --output-dir <dir> Output directory for screenshots (default: "./screenshots")
--target-pages <pages> Page numbers to screenshot (e.g., "1,3,5" or "1-5")
--dpi <dpi> DPI for rendering (default: "150")
--format <format> Image format: png|jpg (default: "png")
--config <file> Config file (JSON)
-q, --quiet Suppress progress output
-h, --help display help for command
# Tesseract is enabled by default
lit parse document.pdf
# Specify language
lit parse document.pdf --ocr-language fra
# Disable OCR
lit parse document.pdf --no-ocr
By default, Tesseract.js downloads language data from the internet on first use. For offline or air-gapped environments, set the TESSDATA_PREFIX environment variable to a directory containing pre-downloaded .traineddata files:
export TESSDATA_PREFIX=/path/to/tessdata
lit parse document.pdf --ocr-language eng
You can also pass tessdataPath in the library config:
const parser = new LiteParse({ tessdataPath: '/path/to/tessdata' });
For higher accuracy or better performance, you can use an HTTP OCR server. We provide ready-to-use example wrappers for EasyOCR and PaddleOCR in ocr/easyocr/ and ocr/paddleocr/.
You can integrate any OCR service by implementing the simple LiteParse OCR API specification (see OCR_API_SPEC.md).
The API requires:
- An /ocr endpoint
- file and language parameters
- A response of the form { results: [{ text, bbox: [x1,y1,x2,y2], confidence }] }
See the example servers in ocr/easyocr/ and ocr/paddleocr/ as templates.
For the complete OCR API specification, see OCR_API_SPEC.md.
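When writing your own OCR server, it can help to check responses against the expected shape before pointing LiteParse at it. The following TypeScript sketch models the response format quoted above and validates an arbitrary value against it. The type names and the field types (text as a string, bbox and confidence as numbers) are assumptions inferred from that format, so treat OCR_API_SPEC.md as the authority.

```typescript
// Assumed shape, inferred from { results: [{ text, bbox: [x1,y1,x2,y2], confidence }] }.
interface OcrResult {
  text: string;
  bbox: [number, number, number, number]; // [x1, y1, x2, y2]
  confidence: number;
}

interface OcrResponse {
  results: OcrResult[];
}

// Hypothetical validator: checks a single result entry.
function isOcrResult(value: unknown): value is OcrResult {
  if (typeof value !== "object" || value === null) return false;
  const r = value as Record<string, unknown>;
  return (
    typeof r.text === "string" &&
    typeof r.confidence === "number" &&
    Array.isArray(r.bbox) &&
    r.bbox.length === 4 &&
    r.bbox.every((n) => typeof n === "number")
  );
}

// Hypothetical validator: checks the whole response body.
function isOcrResponse(value: unknown): value is OcrResponse {
  if (typeof value !== "object" || value === null) return false;
  const results = (value as Record<string, unknown>).results;
  return Array.isArray(results) && results.every((item) => isOcrResult(item));
}

const sample: unknown = {
  results: [{ text: "Invoice", bbox: [10, 20, 110, 40], confidence: 0.97 }],
};
console.log(isOcrResponse(sample)); // true
```

Running a check like this against the parsed JSON body from your server catches missing fields or a malformed bbox early.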
LiteParse automatically converts various document formats to PDF before parsing, so it can handle inputs that PDF-only parsing tools cannot.
Office documents (require LibreOffice):
- Word processing: .doc, .docx, .docm, .odt, .rtf
- Presentations: .ppt, .pptx, .pptm, .odp
- Spreadsheets: .xls, .xlsx, .xlsm, .ods, .csv, .tsv
Install LibreOffice and LiteParse will automatically convert these formats to PDF for parsing:
# macOS
brew install --cask libreoffice
# Ubuntu/Debian
apt-get install libreoffice
# Windows
choco install libreoffice-fresh # might require admin permissions
On Windows, you may need to add the directory containing the LibreOffice CLI executable (typically C:\Program Files\LibreOffice\program) to your PATH environment variable and restart the machine.
Images (require ImageMagick): .jpg, .jpeg, .png, .gif, .bmp, .tiff, .webp, .svg
Install ImageMagick and LiteParse will convert images to PDF for parsing (with OCR):
# macOS
brew install imagemagick
# Ubuntu/Debian
apt-get install imagemagick
# Windows
choco install imagemagick.app # might require admin permissions
| Variable | Description |
|---|---|
| TESSDATA_PREFIX | Path to a directory containing Tesseract .traineddata files. Used for offline/air-gapped environments where Tesseract.js cannot download language data from the internet. |
| LITEPARSE_TMPDIR | Override the temp directory used for format conversion and intermediate files. Defaults to the OS temp directory (os.tmpdir()). Useful in containerized or read-only filesystem environments. |
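The LITEPARSE_TMPDIR fallback described in the table can be pictured with a small TypeScript sketch. resolveTmpDir is a hypothetical helper, not LiteParse's actual code, and treating an empty value as unset is an assumption.

```typescript
// Hypothetical helper (not a LiteParse API): picks the temp directory,
// preferring LITEPARSE_TMPDIR and falling back to the OS default.
// Assumption: an empty or whitespace-only value counts as unset.
function resolveTmpDir(
  env: Record<string, string | undefined>,
  osTmpDir: string
): string {
  const override = env["LITEPARSE_TMPDIR"];
  if (override !== undefined && override.trim() !== "") {
    return override;
  }
  return osTmpDir;
}

// In Node.js you would call resolveTmpDir(process.env, os.tmpdir()).
console.log(resolveTmpDir({ LITEPARSE_TMPDIR: "/data/tmp" }, "/tmp")); // /data/tmp
console.log(resolveTmpDir({}, "/tmp")); // /tmp
```

In a container with a read-only root filesystem, pointing LITEPARSE_TMPDIR at a writable mounted volume is the typical use.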
You can configure parsing options via CLI flags or a JSON config file. The config file allows you to set sensible defaults and override as needed.
Create a liteparse.config.json file:
{
"ocrLanguage": "en",
"ocrEnabled": true,
"maxPages": 1000,
"dpi": 150,
"outputFormat": "json",
"preciseBoundingBox": true,
"preserveVerySmallText": false
}
For HTTP OCR servers, just add ocrServerUrl:
{
"ocrServerUrl": "http://localhost:8828/ocr",
"ocrLanguage": "en",
"outputFormat": "json"
}
Use with:
lit parse document.pdf --config liteparse.config.json
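One way to think about combining a config file with CLI flags is as a layered merge. The sketch below assumes that flags given on the command line take precedence over values from the config file, which the text implies but does not spell out. mergeOptions is a hypothetical helper, and the option keys are taken from the example config above.

```typescript
// Loose option bag for illustration; keys mirror the example config
// (ocrLanguage, dpi, outputFormat, ...).
type Options = Record<string, string | number | boolean | undefined>;

// Hypothetical helper (not a LiteParse API): layers CLI flags over
// config-file values. Flags left undefined do not override the file.
function mergeOptions(fileConfig: Options, cliFlags: Options): Options {
  const merged: Options = { ...fileConfig };
  for (const key of Object.keys(cliFlags)) {
    const value = cliFlags[key];
    if (value !== undefined) {
      merged[key] = value;
    }
  }
  return merged;
}

const fromFile: Options = { ocrLanguage: "en", dpi: 150, outputFormat: "json" };
const fromCli: Options = { dpi: 300, outputFormat: undefined };

console.log(mergeOptions(fromFile, fromCli));
// { ocrLanguage: 'en', dpi: 300, outputFormat: 'json' }
```

Skipping undefined values keeps an unspecified flag from wiping out a setting that the config file provides.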
We provide a detailed AGENTS.md/CLAUDE.md, which we recommend using when developing with coding agents.
# Install dependencies
npm install
# Build TypeScript (Linux/macOS)
npm run build
# Build TypeScript (Windows)
npm run build:windows
# Watch mode
npm run dev
# Test parsing
npm test
Apache 2.0
Built on top of: