Fast Office document extraction for LLMs and agents. Converts DOCX, XLSX, CSV, PPTX, and PDF into clean markdown, structured JSON IR, and Docling output.
No install needed - run directly:
uvx officemd markdown report.docx
npx office-md markdown report.docx
bunx office-md markdown report.docx
brew tap thomaub/officemd git@github.com:ThomAub/officemd.git
brew install thomaub/officemd/officemd_cli
Installers generated by cargo-dist:
# macOS / Linux
curl --proto '=https' --tlsv1.2 -LsSf https://github.com/ThomAub/officemd/releases/latest/download/officemd_cli-installer.sh | sh
# Windows (PowerShell)
powershell -ExecutionPolicy ByPass -c "irm https://github.com/ThomAub/officemd/releases/latest/download/officemd_cli-installer.ps1 | iex"
uv tool install officemd
Or add as a dependency:
uv add officemd
npm install office-md
# or
bun add office-md
cargo install officemd_cli
All three surfaces expose a CLI named officemd (Python, Rust) or office-md (Node/Bun).
officemd markdown report.docx
officemd markdown budget.xlsx --sheets "Summary,Q1"
officemd markdown deck.pptx --pages 1-3
officemd render report.docx
officemd diff old.docx new.docx
The Rust CLI has additional subcommands:
officemd stream report.docx # stream to stdout (supports stdin via -)
officemd convert report.docx --output out.md # write to file
officemd inspect report.pdf --output-format json --pretty
| Flag | Description |
|---|---|
| --format | Force document format (docx, xlsx, csv, pptx, pdf) |
| --pages | Select pages/slides/sheets by index (e.g. "1,3-5") |
| --sheets | Select sheets by name or index (e.g. "Sales,1-2") |
| --include-document-properties | Include document metadata in output |
| --markdown-style | Output style: compact (default) or human |
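When calling the CLI from a script or agent, it can help to assemble the arguments as a list rather than a shell string, so that values such as "Summary,Q1" need no quoting. The sketch below is illustrative only: `build_markdown_command` is a hypothetical helper, not part of officemd, and it assumes `--include-document-properties` is a boolean switch that takes no value.

```python
from typing import List, Optional


def build_markdown_command(
    path: str,
    *,
    executable: str = "officemd",  # use "office-md" for the Node/Bun CLI
    doc_format: Optional[str] = None,
    pages: Optional[str] = None,
    sheets: Optional[str] = None,
    include_properties: bool = False,
    markdown_style: Optional[str] = None,
) -> List[str]:
    """Assemble an argv list for the `markdown` subcommand from the flags above."""
    cmd = [executable, "markdown", path]
    if doc_format is not None:
        cmd += ["--format", doc_format]
    if pages is not None:
        cmd += ["--pages", pages]
    if sheets is not None:
        cmd += ["--sheets", sheets]
    if include_properties:
        # Assumption: this flag is a switch with no argument.
        cmd.append("--include-document-properties")
    if markdown_style is not None:
        cmd += ["--markdown-style", markdown_style]
    return cmd


print(build_markdown_command("budget.xlsx", sheets="Summary,Q1", markdown_style="human"))
```

The resulting list can be passed to `subprocess.run(cmd, capture_output=True, text=True, check=True)` to capture the markdown from stdout.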
from pathlib import Path
from officemd import extract_ir_json, markdown_from_bytes, docling_from_bytes
content = Path("report.docx").read_bytes()
print(markdown_from_bytes(content, format="docx"))
print(extract_ir_json(content, format="docx"))
print(docling_from_bytes(content, format="docx"))
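Because the byte-based functions take an explicit `format` argument, a caller handling mixed inputs has to pick it per file. The following is a minimal sketch of one way to do that from the file extension. `infer_format` and `to_markdown` are hypothetical helpers, and the sketch assumes `markdown_from_bytes` accepts the same format strings as the CLI's `--format` flag (only "docx" is shown above).

```python
from pathlib import Path
from typing import Callable, Optional, Union

# Extension -> format string. Assumed to match the values accepted by --format.
FORMATS = {
    ".docx": "docx",
    ".xlsx": "xlsx",
    ".csv": "csv",
    ".pptx": "pptx",
    ".pdf": "pdf",
}


def infer_format(path: Union[str, Path]) -> str:
    """Map a file extension to a format string, ignoring case."""
    suffix = Path(path).suffix.lower()
    if suffix not in FORMATS:
        raise ValueError(f"unsupported extension: {suffix or '(none)'}")
    return FORMATS[suffix]


def to_markdown(
    path: Union[str, Path],
    converter: Optional[Callable[..., str]] = None,
) -> str:
    """Read a file and convert it, choosing the format from its extension."""
    path = Path(path)
    if converter is None:
        # Imported lazily so the helper can be tested with a stand-in converter.
        from officemd import markdown_from_bytes as converter
    return converter(path.read_bytes(), format=infer_format(path))
```

With officemd installed, `to_markdown("report.docx")` would be equivalent to the `markdown_from_bytes` call above; passing a different `converter` lets the same helper be exercised without the library.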
import { readFileSync } from "node:fs";
import { markdownFromBytes, extractIrJson, doclingFromBytes } from "office-md";
const content = readFileSync("report.docx");
console.log(markdownFromBytes(content, "docx"));
console.log(extractIrJson(content, "docx"));
console.log(doclingFromBytes(content, "docx"));
| Format | Extension | Markdown | JSON IR | Docling |
|---|---|---|---|---|
| Word | .docx | yes | yes | yes |
| Excel | .xlsx | yes | yes | yes |
| CSV | .csv | yes | yes | - |
| PowerPoint | .pptx | yes | yes | yes |
| PDF | .pdf | yes | yes | - |
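Since Docling output is not available for every format, a pipeline may want to check support before requesting it. The sketch below simply mirrors the table above as data; `SUPPORT`, `supports`, and the output names "markdown", "json_ir", and "docling" are illustrative labels chosen here, not identifiers from officemd.

```python
# Output kinds per format, transcribed from the support table above.
SUPPORT = {
    "docx": {"markdown", "json_ir", "docling"},
    "xlsx": {"markdown", "json_ir", "docling"},
    "csv": {"markdown", "json_ir"},
    "pptx": {"markdown", "json_ir", "docling"},
    "pdf": {"markdown", "json_ir"},
}


def supports(doc_format: str, output: str) -> bool:
    """Return True if the table lists `output` as available for `doc_format`."""
    return output in SUPPORT.get(doc_format.lower(), set())


def available_outputs(doc_format: str) -> list:
    """List the output kinds for a format in a stable (sorted) order."""
    return sorted(SUPPORT.get(doc_format.lower(), set()))


for fmt in SUPPORT:
    print(fmt, available_outputs(fmt))
```

A caller could use `supports(fmt, "docling")` to fall back to the JSON IR for CSV and PDF inputs.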
cargo nextest run --workspace
cargo clippy --workspace --all-targets --exclude officemd_pdf -- -D warnings
For JS and Python tests, see examples/README.md.
PDF extraction vendors pdf-inspector by Firecrawl (MIT).
PDF primitives come from lopdf by J-F-Liu (MIT).
Apache 2.0 - see LICENSE.