Turn any Office-style document into Markdown

OfficeMD

Fast Office document extraction for LLMs and agents. Converts DOCX, XLSX, CSV, PPTX, and PDF into clean markdown, structured JSON IR, and Docling output.

Native Rust core - fast, no runtime dependencies
Three output modes: markdown, structured JSON IR, Docling JSON
CLI and SDK for Python, Node/Bun, and Rust
Sheet, slide, and page selection
Document property extraction

Quick Start

No install needed - run directly:

uvx officemd markdown report.docx
npx office-md markdown report.docx
bunx office-md markdown report.docx

Install

Homebrew (macOS / Linux)

brew tap thomaub/officemd git@github.com:ThomAub/officemd.git
brew install thomaub/officemd/officemd_cli

Shell / PowerShell (from GitHub Release)

Installers generated by cargo-dist:

# macOS / Linux
curl --proto '=https' --tlsv1.2 -LsSf https://github.com/ThomAub/officemd/releases/latest/download/officemd_cli-installer.sh | sh

# Windows (PowerShell)
powershell -ExecutionPolicy ByPass -c "irm https://github.com/ThomAub/officemd/releases/latest/download/officemd_cli-installer.ps1 | iex"

Python

uv tool install officemd

Or add as a dependency:

uv add officemd

Node / Bun

npm install office-md
# or
bun add office-md

Rust

cargo install officemd_cli

CLI

Each of the three packages installs a CLI: officemd for Python and Rust, office-md for Node/Bun.

officemd markdown report.docx
officemd markdown budget.xlsx --sheets "Summary,Q1"
officemd markdown deck.pptx --pages 1-3
officemd render report.docx
officemd diff old.docx new.docx

The Rust CLI has additional subcommands:

officemd stream report.docx                    # stream to stdout (supports stdin via -)
officemd convert report.docx --output out.md   # write to file
officemd inspect report.pdf --output-format json --pretty

Common options

| Flag | Description |
| --- | --- |
| `--format` | Force document format (docx, xlsx, csv, pptx, pdf) |
| `--pages` | Select pages/slides/sheets by index (e.g. "1,3-5") |
| `--sheets` | Select sheets by name or index (e.g. "Sales,1-2") |
| `--include-document-properties` | Include document metadata in output |
| `--markdown-style` | Output style: `compact` (default) or `human` |
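
The flags above can be combined on one command line. As a minimal sketch, the Python helper below assembles an argument list for the markdown subcommand using only the flags from the table. The helper name `build_markdown_command` and its parameters are hypothetical, and it assumes that every common option is accepted by `markdown` and that `--include-document-properties` is a plain on/off switch.

```python
def build_markdown_command(path, *, fmt=None, pages=None, sheets=None,
                           style=None, include_properties=False,
                           executable="officemd"):
    """Assemble an argv list for `officemd markdown` (hypothetical helper)."""
    cmd = [executable, "markdown", str(path)]
    if fmt is not None:
        cmd += ["--format", fmt]            # e.g. "xlsx"
    if pages is not None:
        cmd += ["--pages", pages]           # e.g. "1,3-5"
    if sheets is not None:
        cmd += ["--sheets", sheets]         # e.g. "Sales,1-2"
    if style is not None:
        cmd += ["--markdown-style", style]  # "compact" or "human"
    if include_properties:
        # Assumption: this flag takes no value.
        cmd.append("--include-document-properties")
    return cmd


cmd = build_markdown_command("budget.xlsx", sheets="Summary,Q1", style="human")
print(" ".join(cmd))
```

The returned list can be handed to `subprocess.run` once the CLI is installed; use `executable="office-md"` for the Node/Bun package.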

SDK

Python

from pathlib import Path
from officemd import extract_ir_json, markdown_from_bytes, docling_from_bytes

content = Path("report.docx").read_bytes()

print(markdown_from_bytes(content, format="docx"))
print(extract_ir_json(content, format="docx"))
print(docling_from_bytes(content, format="docx"))
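
The calls above pass `format` explicitly. When handling mixed inputs it can help to derive that value from the file extension. The sketch below is one way to do so; `detect_format` and `to_markdown` are hypothetical helpers, not part of the officemd API, and the extension map simply mirrors the formats this README lists.

```python
from pathlib import Path

# Extensions taken from the formats this README lists as supported.
FORMAT_BY_SUFFIX = {
    ".docx": "docx",
    ".xlsx": "xlsx",
    ".csv": "csv",
    ".pptx": "pptx",
    ".pdf": "pdf",
}


def detect_format(path):
    """Map a file name to the `format` string used by the SDK calls."""
    suffix = Path(path).suffix.lower()
    if suffix not in FORMAT_BY_SUFFIX:
        raise ValueError(f"unsupported extension: {suffix!r}")
    return FORMAT_BY_SUFFIX[suffix]


def to_markdown(path, converter=None):
    """Read a file and convert it, inferring `format` from the extension."""
    if converter is None:
        # Imported lazily so the helper can be exercised without officemd installed.
        from officemd import markdown_from_bytes as converter
    return converter(Path(path).read_bytes(), format=detect_format(path))


print(detect_format("Report.DOCX"))
```

Passing a different `converter` (for example `extract_ir_json`) reuses the same extension lookup for the other output modes.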

Python patch reports and batch patching

from pathlib import Path
import officemd

content = Path("report.docx").read_bytes()
patch = officemd.DocxPatch(
    scoped_replacements=[
        officemd.ScopedDocxReplace(
            officemd.DocxTextScope.ALL_TEXT,
            officemd.TextReplace("word", "term"),
        )
    ]
)
# ALL_TEXT includes document content plus free-text metadata/app/custom fields.

single = officemd.patch_docx_with_report(content, patch)
print(single.report.replacements_applied)

batch = officemd.patch_docx_batch_with_report([content, content], patch, workers=4)
for item in batch:
    print(item.report.parts_scanned, item.report.parts_modified, item.report.replacements_applied)
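
For large batches, per-document lines are easier to act on once aggregated. The sketch below totals the three report fields used above; `summarize_reports` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins with made-up numbers so the example runs without officemd installed.

```python
from types import SimpleNamespace


def summarize_reports(reports):
    """Total the counters of several patch reports (hypothetical helper)."""
    reports = list(reports)  # allow generators as input
    return {
        "documents": len(reports),
        "documents_changed": sum(1 for r in reports if r.replacements_applied > 0),
        "parts_scanned": sum(r.parts_scanned for r in reports),
        "parts_modified": sum(r.parts_modified for r in reports),
        "replacements_applied": sum(r.replacements_applied for r in reports),
    }


# Stand-in reports with made-up numbers, used only to exercise the helper.
sample = [
    SimpleNamespace(parts_scanned=3, parts_modified=1, replacements_applied=2),
    SimpleNamespace(parts_scanned=3, parts_modified=0, replacements_applied=0),
]
summary = summarize_reports(sample)
print(summary)
```

With real results, the call would be `summarize_reports(item.report for item in batch)`.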

Rust batch patching with reports

use officemd_core::{
    patch_docx_batch_with_report, DocxPatch, DocxTextScope, ScopedDocxReplace, TextReplace,
};

let patch = DocxPatch {
    set_core_title: None,
    replace_body_title: None,
    scoped_replacements: vec![ScopedDocxReplace {
        scope: DocxTextScope::AllText,
        replace: TextReplace::all("word", "term"),
    }],
};
// AllText includes document content plus free-text metadata/app/custom fields.

let results = patch_docx_batch_with_report(vec![doc1_bytes, doc2_bytes], &patch, Some(4))?;
for item in results {
    println!(
        "parts_scanned={} parts_modified={} replacements_applied={}",
        item.report.parts_scanned,
        item.report.parts_modified,
        item.report.replacements_applied
    );
}
# Ok::<(), officemd_core::PatchError>(())

Patch scopes also support free-text metadata/comment fields:

DOCX: MetadataCore, MetadataApp, MetadataCustom, MetadataAll
PPTX: CommentAuthors, MetadataCore, MetadataApp, MetadataCustom, MetadataAll
XLSX: Comments, CommentAuthors, MetadataCore, MetadataApp, MetadataCustom, MetadataAll

AllText now covers every free-text field: document content plus metadata and comment-author text.

Formatting-preserving replacement is available for OOXML content text:

use officemd_core::{DocxPatch, DocxTextScope, ScopedDocxReplace, TextReplace};

let patch = DocxPatch {
    set_core_title: None,
    replace_body_title: None,
    scoped_replacements: vec![ScopedDocxReplace {
        scope: DocxTextScope::Body,
        replace: TextReplace::all("Confidential", "")
            .with_preserve_formatting(true),
    }],
};

Semantics:

replacement can span multiple runs
the first matched run keeps the replacement text and therefore its formatting wins
later consumed runs are left empty in v1
metadata/comment-author fields still use simple text replacement
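
The run-level semantics above can be illustrated with a small model. The sketch below is not the library's implementation: it treats a paragraph as a list of run strings, replaces only the first match, and assumes that a run the match covers only partly keeps its unmatched text. `replace_across_runs` is a hypothetical name.

```python
def replace_across_runs(runs, needle, replacement):
    """Replace the first match of `needle` in the concatenated run text.

    Returns a new list with the same number of runs, so every run keeps
    its own (unmodelled) formatting slot.
    """
    text = "".join(runs)
    start = text.find(needle) if needle else -1
    if start == -1:
        return list(runs)
    end = start + len(needle)

    result = []
    offset = 0
    placed = False
    for run in runs:
        run_start, run_end = offset, offset + len(run)
        offset = run_end
        if run_end <= start or run_start >= end:
            result.append(run)  # run lies entirely outside the match
            continue
        before = run[: max(start - run_start, 0)]  # text ahead of the match
        after = run[end - run_start:] if end < run_end else ""  # text past it
        if not placed:
            # The first matched run receives the replacement text,
            # so its formatting applies to the replacement.
            result.append(before + replacement + after)
            placed = True
        else:
            # Later runs only lose the characters the match consumed.
            result.append(before + after)
    return result


runs = ["Strictly Confi", "dential", " draft"]
print(replace_across_runs(runs, "Confidential", "Internal"))
# prints ['Strictly Internal', '', ' draft']
```

The middle run is fully consumed and stays in place as an empty run, which matches the "left empty in v1" behaviour described above.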

JavaScript

import { readFileSync } from "node:fs";
import { markdownFromBytes, extractIrJson, doclingFromBytes } from "office-md";

const content = readFileSync("report.docx");

console.log(markdownFromBytes(content, "docx"));
console.log(extractIrJson(content, "docx"));
console.log(doclingFromBytes(content, "docx"));

WebAssembly demo

There is also a browser demo for the WASM bindings in crates/officemd_wasm/README.md. It serves a small page at http://localhost:8080/crates/officemd_wasm/www/ for drag-and-drop and sample-fixture testing.

Supported Formats

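| Format | Extension | Markdown | JSON IR | Docling |
| --- | --- | --- | --- | --- |
| Word | .docx | yes | yes | yes |
| Excel | .xlsx | yes | yes | yes |
| CSV | .csv | yes | yes | - |
| PowerPoint | .pptx | yes | yes | yes |
| PDF | .pdf | yes | yes | - |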
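
Because Docling output is not available for every format, callers may want to check the matrix before choosing an output mode. The sketch below encodes the table as data; the `OUTPUT_SUPPORT` mapping, the mode labels (`markdown`, `json_ir`, `docling`), and the `supports` helper are illustrative names, not part of the officemd API.

```python
# Support matrix transcribed from the table above.
OUTPUT_SUPPORT = {
    "docx": {"markdown", "json_ir", "docling"},
    "xlsx": {"markdown", "json_ir", "docling"},
    "csv": {"markdown", "json_ir"},
    "pptx": {"markdown", "json_ir", "docling"},
    "pdf": {"markdown", "json_ir"},
}


def supports(fmt, output):
    """Return True if `output` is listed as available for format `fmt`."""
    return output in OUTPUT_SUPPORT.get(fmt.lower(), set())


for fmt in ("docx", "pdf"):
    mode = "docling" if supports(fmt, "docling") else "json_ir"
    print(f"{fmt}: use {mode}")
```

Falling back to the JSON IR when Docling is unavailable is only one possible policy; raising an error may suit stricter pipelines better.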

Development

cargo nextest run --workspace
cargo clippy --workspace --all-targets --exclude officemd_pdf -- -D warnings

For JS and Python tests, see examples/README.md.

Acknowledgements

PDF extraction vendors pdf-inspector by Firecrawl (MIT).

PDF primitives come from lopdf by J-F-Liu (MIT).

License

Apache 2.0 - see LICENSE.
