Fast Office document extraction for LLMs and agents. Converts DOCX, XLSX, CSV, PPTX, and PDF into clean markdown, structured JSON IR, and Docling output.
No install needed - run directly:
uvx officemd markdown report.docx
npx office-md markdown report.docx
bunx office-md markdown report.docx
brew tap thomaub/officemd git@github.com:ThomAub/officemd.git
brew install thomaub/officemd/officemd_cli
Installers generated by cargo-dist:
# macOS / Linux
curl --proto '=https' --tlsv1.2 -LsSf https://github.com/ThomAub/officemd/releases/latest/download/officemd_cli-installer.sh | sh
# Windows (PowerShell)
powershell -ExecutionPolicy ByPass -c "irm https://github.com/ThomAub/officemd/releases/latest/download/officemd_cli-installer.ps1 | iex"
uv tool install officemd
Or add as a dependency:
uv add officemd
npm install office-md
# or
bun add office-md
cargo install officemd_cli
All three surfaces expose a CLI named officemd (Python, Rust) or office-md (Node/Bun).
officemd markdown report.docx
officemd markdown budget.xlsx --sheets "Summary,Q1"
officemd markdown deck.pptx --pages 1-3
officemd render report.docx
officemd diff old.docx new.docx
The Rust CLI has additional subcommands:
officemd stream report.docx # stream to stdout (supports stdin via -)
officemd convert report.docx --output out.md # write to file
officemd inspect report.pdf --output-format json --pretty
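When the CLI is driven from a script, it can help to build the argument list in one place. The following is a minimal sketch, not part of officemd itself: `build_markdown_command` and `run_markdown` are hypothetical helper names, only the `markdown` subcommand and the `--sheets` / `--pages` flags shown in this README are used, and it assumes the command prints its markdown to stdout and exits non-zero on failure.

```python
import subprocess
from typing import Optional, Sequence


def build_markdown_command(
    path: str,
    sheets: Optional[Sequence[str]] = None,
    pages: Optional[str] = None,
    executable: str = "officemd",
) -> list[str]:
    """Build the argv for `officemd markdown` from flags shown in this README."""
    argv = [executable, "markdown", path]
    if sheets:
        # --sheets takes a comma-separated list of sheet names or indices.
        argv += ["--sheets", ",".join(sheets)]
    if pages:
        # --pages takes a selector such as "1-3" or "1,3-5".
        argv += ["--pages", pages]
    return argv


def run_markdown(path: str, **options) -> str:
    """Run the CLI and return stdout (assumed to be the markdown output)."""
    result = subprocess.run(
        build_markdown_command(path, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


print(build_markdown_command("budget.xlsx", sheets=["Summary", "Q1"]))
```

Passing the arguments as a list rather than a shell string avoids quoting problems with sheet names that contain spaces. For the Node/Bun package, `executable="office-md"` would be the matching name.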
| Flag | Description |
|---|---|
| `--format` | Force document format (docx, xlsx, csv, pptx, pdf) |
| `--pages` | Select pages/slides/sheets by index (e.g. "1,3-5") |
| `--sheets` | Select sheets by name or index (e.g. "Sales,1-2") |
| `--include-document-properties` | Include document metadata in output |
| `--markdown-style` | Output style: compact (default) or human |
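To make the index selector syntax concrete, here is a small sketch of how a string like "1,3-5" could expand into individual indices. This is not officemd's parser: `expand_selector` is a hypothetical name, the inclusive treatment of ranges is an assumption, and sheet names (which `--sheets` also accepts) are not handled.

```python
def expand_selector(selector: str) -> list[int]:
    """Expand an index selector such as "1,3-5" into [1, 3, 4, 5]."""
    indices: list[int] = []
    for part in selector.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            # Assumption: both ends of a range are included.
            indices.extend(range(start, end + 1))
        else:
            indices.append(int(part))
    # Drop duplicates while keeping first-seen order.
    return list(dict.fromkeys(indices))


print(expand_selector("1,3-5"))  # [1, 3, 4, 5]
```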
from pathlib import Path
from officemd import extract_ir_json, markdown_from_bytes, docling_from_bytes
content = Path("report.docx").read_bytes()
print(markdown_from_bytes(content, format="docx"))
print(extract_ir_json(content, format="docx"))
print(docling_from_bytes(content, format="docx"))
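Because the byte-based functions take an explicit `format`, a thin wrapper can derive it from the file name. The sketch below is one possible approach: `infer_format` and `file_to_markdown` are hypothetical helpers, the format names are taken from the CLI's `--format` flag, and it is an assumption that the Python API accepts all of them and returns a string (only `format="docx"` appears in this README).

```python
from pathlib import Path

# Format names listed for the CLI's --format flag. Passing each of them to the
# Python API is an assumption; only "docx" is shown in the README.
KNOWN_FORMATS = {"docx", "xlsx", "csv", "pptx", "pdf"}


def infer_format(path: str) -> str:
    """Map a file name to a format string based on its extension."""
    suffix = Path(path).suffix.lower().lstrip(".")
    if suffix not in KNOWN_FORMATS:
        raise ValueError(f"unsupported extension: {path!r}")
    return suffix


def file_to_markdown(path: str) -> str:
    """Read a file and convert it with markdown_from_bytes."""
    # Imported here so infer_format stays usable without officemd installed.
    from officemd import markdown_from_bytes

    return markdown_from_bytes(Path(path).read_bytes(), format=infer_format(path))


print(infer_format("Budget.XLSX"))  # xlsx
```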
from pathlib import Path
import officemd
content = Path("report.docx").read_bytes()
patch = officemd.DocxPatch(
    scoped_replacements=[
        officemd.ScopedDocxReplace(
            officemd.DocxTextScope.ALL_TEXT,
            officemd.TextReplace("word", "term"),
        )
    ]
)
# ALL_TEXT includes document content plus free-text metadata/app/custom fields.
single = officemd.patch_docx_with_report(content, patch)
print(single.report.replacements_applied)
batch = officemd.patch_docx_batch_with_report([content, content], patch, workers=4)
for item in batch:
    print(item.report.parts_scanned, item.report.parts_modified, item.report.replacements_applied)
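For large batches, totals are often more useful than one line per document. A minimal sketch of summing the three report fields follows. `summarize_reports` is a hypothetical helper, the fields are assumed to be integers, and the `SimpleNamespace` objects are stand-ins for demonstration only; real items would come from `patch_docx_batch_with_report`.

```python
from types import SimpleNamespace


def summarize_reports(items) -> dict[str, int]:
    """Sum the per-document report counters across a batch."""
    totals = {"parts_scanned": 0, "parts_modified": 0, "replacements_applied": 0}
    for item in items:
        for field in totals:
            # Assumption: each counter on item.report is an int.
            totals[field] += getattr(item.report, field)
    return totals


# Stand-in results, not real officemd objects.
fake_batch = [
    SimpleNamespace(report=SimpleNamespace(parts_scanned=5, parts_modified=2, replacements_applied=7)),
    SimpleNamespace(report=SimpleNamespace(parts_scanned=4, parts_modified=0, replacements_applied=0)),
]
print(summarize_reports(fake_batch))
```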
use officemd_core::{
    patch_docx_batch_with_report, DocxPatch, DocxTextScope, ScopedDocxReplace, TextReplace,
};
let patch = DocxPatch {
    set_core_title: None,
    replace_body_title: None,
    scoped_replacements: vec![ScopedDocxReplace {
        scope: DocxTextScope::AllText,
        replace: TextReplace::all("word", "term"),
    }],
};
// AllText includes document content plus free-text metadata/app/custom fields.
let results = patch_docx_batch_with_report(vec![doc1_bytes, doc2_bytes], &patch, Some(4))?;
for item in results {
    println!(
        "parts_scanned={} parts_modified={} replacements_applied={}",
        item.report.parts_scanned,
        item.report.parts_modified,
        item.report.replacements_applied
    );
}
# Ok::<(), officemd_core::PatchError>(())
Patch scopes also support free-text metadata/comment fields:
MetadataCore, MetadataApp, MetadataCustom, MetadataAll
CommentAuthors, MetadataCore, MetadataApp, MetadataCustom, MetadataAll
Comments, CommentAuthors, MetadataCore, MetadataApp, MetadataCustom, MetadataAll
AllText now means all free-text fields: content plus metadata and comment-author text.
Formatting-preserving replacement is available for OOXML content text:
use officemd_core::{DocxPatch, DocxTextScope, ScopedDocxReplace, TextReplace};
let patch = DocxPatch {
    set_core_title: None,
    replace_body_title: None,
    scoped_replacements: vec![ScopedDocxReplace {
        scope: DocxTextScope::Body,
        replace: TextReplace::all("Confidential", "")
            .with_preserve_formatting(true),
    }],
};
Semantics:
import { readFileSync } from "node:fs";
import { markdownFromBytes, extractIrJson, doclingFromBytes } from "office-md";
const content = readFileSync("report.docx");
console.log(markdownFromBytes(content, "docx"));
console.log(extractIrJson(content, "docx"));
console.log(doclingFromBytes(content, "docx"));
There is also a browser demo for the WASM bindings; see crates/officemd_wasm/README.md. It serves a small page at http://localhost:8080/crates/officemd_wasm/www/ for drag-and-drop and sample-fixture testing.
| Format | Extension | Markdown | JSON IR | Docling |
|---|---|---|---|---|
| Word | .docx | yes | yes | yes |
| Excel | .xlsx | yes | yes | yes |
| CSV | .csv | yes | yes | - |
| PowerPoint | .pptx | yes | yes | yes |
| PDF | .pdf | yes | yes | - |
cargo nextest run --workspace
cargo clippy --workspace --all-targets --exclude officemd_pdf -- -D warnings
For JS and Python tests, see examples/README.md.
PDF extraction vendors pdf-inspector by Firecrawl (MIT).
PDF primitives come from lopdf by J-F-Liu (MIT).
Apache 2.0 - see LICENSE.