tito-pdf
Convert .pdf / .docx documents to Markdown (and optionally extract tables).
Quickstart (recommended)
Write primary Markdown to an explicit path:
tito-pdf input.pdf --md-out out/input.md
Tables + audit JSON + assets JSON (typical integration run):
tito-pdf input.pdf \
--mode best \
--md-out out/input.md \
--tables-out out/input.tables.md \
--tables-audit-out out/input.tables.audit.json \
--assets-json out/input.assets.json
Convenience mode (no explicit outputs)
If you do not provide any explicit output paths, tito-pdf writes next to the input file by default:
tito-pdf /path/to/input.pdf
# => /path/to/input.md
Write into a directory:
tito-pdf input.pdf --out-dir out
Tables:
tito-pdf input.pdf --tables --out-dir out
# => out/input.tables.md
Text + tables:
tito-pdf input.pdf --all --out-dir out
# => out/input.md + out/input.tables.md
Two output styles (contract)
There are exactly two output styles:
1) Explicit output mode (recommended / integration)
If you set any of:
- --md-out PATH
- --raw-text-out PATH
- --tables-out PATH
- --tables-audit-out PATH (requires --tables-out)
- --assets-json PATH
…then tito-pdf writes only to the paths you requested (creating parent directories and using atomic writes). It does not create extra output folders.
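Integration code can assemble an explicit-mode invocation programmatically. The following is a minimal Python sketch, not part of tito-pdf itself: the helper name build_explicit_args and its keyword arguments are hypothetical. It uses only the flags listed above and enforces the stated rule that --tables-audit-out requires --tables-out.

```python
from typing import List, Optional


def build_explicit_args(
    input_path: str,
    md_out: Optional[str] = None,
    raw_text_out: Optional[str] = None,
    tables_out: Optional[str] = None,
    tables_audit_out: Optional[str] = None,
    assets_json: Optional[str] = None,
) -> List[str]:
    """Build an argv list for an explicit-output run (hypothetical helper)."""
    # Documented constraint: the audit JSON is only valid alongside --tables-out.
    if tables_audit_out is not None and tables_out is None:
        raise ValueError("--tables-audit-out requires --tables-out")

    flags = [
        ("--md-out", md_out),
        ("--raw-text-out", raw_text_out),
        ("--tables-out", tables_out),
        ("--tables-audit-out", tables_audit_out),
        ("--assets-json", assets_json),
    ]

    argv = ["tito-pdf", input_path]
    for flag, value in flags:
        if value is not None:
            argv += [flag, value]

    # With no explicit path at all, the run would fall into convenience mode.
    if len(argv) == 2:
        raise ValueError("explicit mode needs at least one output path")
    return argv


if __name__ == "__main__":
    print(build_explicit_args("input.pdf", md_out="out/input.md"))
```

The resulting list can be handed to subprocess.run(argv, check=True), assuming the CLI signals failure with a non-zero exit code.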
2) Convenience mode (human)
If no explicit paths are set:
- Default: writes <stem>.md next to the input.
- --out-dir DIR writes into DIR.
- --tables / --all also writes <stem>.tables.md.
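The convenience-mode naming rules can be expressed as a small path calculation. This is a minimal Python sketch, not the tool's own logic: the function name predict_convenience_outputs is hypothetical, and it assumes, following the Quickstart examples, that --tables alone yields only <stem>.tables.md while --all yields both files.

```python
from pathlib import Path
from typing import List, Optional


def predict_convenience_outputs(
    input_path: str,
    out_dir: Optional[str] = None,
    tables: bool = False,
    all_outputs: bool = False,
) -> List[Path]:
    """Predict convenience-mode output paths (hypothetical helper)."""
    src = Path(input_path)
    # Default location is next to the input; --out-dir DIR overrides it.
    base = Path(out_dir) if out_dir is not None else src.parent

    md = base / f"{src.stem}.md"
    tables_md = base / f"{src.stem}.tables.md"

    if all_outputs:
        # --all: text and tables.
        return [md, tables_md]
    if tables:
        # --tables: assumed to produce only the tables file, per the Quickstart.
        return [tables_md]
    return [md]


if __name__ == "__main__":
    for path in predict_convenience_outputs("input.pdf", out_dir="out", all_outputs=True):
        print(path)
```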
Documentation
Start here: Docs index.
Core references:
- Install: Install
- Usage: Usage
- CLI flags (by parameter): CLI
- Output contract: Output contract
- Design rationale (why multiple tools): Rationale
- Implementation details (thresholds + heuristics): Implementation
- Pipeline (how it works): Pipeline
- OCR: OCR
- Tables: Tables
- Assets JSON: Assets JSON
- Troubleshooting: Troubleshooting
- Development/testing: Development
- FAQ: FAQ
- Español: Guía rápida
Sanity check (installed CLI)
command -v tito-pdf
tito-pdf --help