Tables
tito-pdf can extract tables from PDF and DOCX files and write them as Markdown.
Tables are intentionally treated as optional output:
- table extraction is harder and noisier than text extraction
- strict heuristics reduce false positives
Outputs
To request tables, either:
- use convenience mode:
tito-pdf input.pdf --tables --out-dir out
# => out/input.tables.md
- or use explicit output paths:
tito-pdf input.pdf \
--tables-out out/input.tables.md \
--tables-audit-out out/input.tables.audit.json
Notes:
- --tables-audit-out requires --tables-out.
- If no tables are detected, the tables Markdown output is:
(No tables detected.)\n
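The output contract above can be sketched as follows. This is a minimal illustration, not tito-pdf's actual renderer: the names NO_TABLES_TEXT, table_to_markdown, and render_tables are hypothetical, and treating the first row as the header is an assumption.

```python
from typing import List, Optional, Sequence

# Placeholder text written when nothing is detected (taken from the note above).
NO_TABLES_TEXT = "(No tables detected.)\n"


def _escape(cell: Optional[object]) -> str:
    """Make a cell safe for a Markdown table row."""
    text = "" if cell is None else str(cell)
    return text.replace("|", "\\|").replace("\n", " ").strip()


def table_to_markdown(rows: Sequence[Sequence[object]]) -> str:
    """Render one table; the first row is assumed to be the header."""
    width = max(len(r) for r in rows)
    # Pad ragged rows so every row has the same number of columns.
    padded = [[_escape(c) for c in r] + [""] * (width - len(r)) for r in rows]
    lines = ["| " + " | ".join(padded[0]) + " |"]
    lines.append("| " + " | ".join(["---"] * width) + " |")
    lines += ["| " + " | ".join(r) + " |" for r in padded[1:]]
    return "\n".join(lines) + "\n"


def render_tables(tables: List[Sequence[Sequence[object]]]) -> str:
    """Return the Markdown document, or the placeholder if there are no tables."""
    if not tables:
        return NO_TABLES_TEXT
    return "\n".join(table_to_markdown(t) for t in tables)
```

Returning a fixed placeholder instead of an empty file means downstream tooling can tell "ran, found nothing" apart from "did not run".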
Strategy order (PDF)
For PDFs, table extraction uses multiple deterministic strategies.
Order:
- PyMuPDF table finder (primary)
- Camelot (optional; only if installed)
- pdfplumber (fallback)
The implementation stops early if an earlier strategy produces at least one accepted table.
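The early-stop behaviour can be sketched as a loop over named strategies. This is an illustration under assumptions: first_successful, the (name, callable) pairing, and the choice to skip a strategy that raises ImportError (standing in for an optional dependency that is not installed) are hypothetical and not taken from tito-pdf's code.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Table = List[List[str]]
Strategy = Tuple[str, Callable[[], List[Table]]]


def first_successful(
    strategies: Sequence[Strategy],
    accept: Callable[[Table], bool],
) -> Tuple[Optional[str], List[Table]]:
    """Run strategies in order; stop at the first with an accepted table."""
    for name, extract in strategies:
        try:
            candidates = extract()
        except ImportError:
            # Assumed handling for an optional extractor that is not installed.
            continue
        accepted = [t for t in candidates if accept(t)]
        if accepted:
            # Later strategies are never invoked once one succeeds.
            return name, accepted
    return None, []
```

In this sketch the strategy list would be ordered PyMuPDF, Camelot, pdfplumber, and accept would be the acceptance filter described later in this document.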
1) PyMuPDF (primary)
PyMuPDF is already required for layout-aware text extraction, so it is the primary table detector.
Strict mode:
- uses Page.find_tables(vertical_strategy="lines", horizontal_strategy="lines")
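The strict call amounts to a single (vertical, horizontal) strategy pair, and the lenient mode described next widens that set with pairs involving "text". The sketch below only enumerates such pairs; the helper name strategy_pairs, the exact set of pairs, and their order are assumptions, since this document does not list the combinations tito-pdf tries.

```python
from itertools import product
from typing import List, Tuple

# The single pair used in strict mode, as stated above.
STRICT_PAIRS: List[Tuple[str, str]] = [("lines", "lines")]


def strategy_pairs(lenient: bool) -> List[Tuple[str, str]]:
    """Return (vertical_strategy, horizontal_strategy) pairs to try, in order."""
    pairs = list(STRICT_PAIRS)
    if lenient:
        # Assumed: every pair over {"lines", "text"} that uses "text" at least once.
        pairs += [p for p in product(("lines", "text"), repeat=2) if "text" in p]
    return pairs


# Each pair would then be passed to PyMuPDF roughly like this (not executed here):
#   page.find_tables(vertical_strategy=v, horizontal_strategy=h)
```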
Lenient mode:
- also tries combinations involving "text" strategies.
2) Camelot (optional)
If camelot is installed in the runtime environment, tito-pdf will try it.
Notes:
- Camelot is not installed by default in requirements.txt.
- It can work well for vector tables, but adds heavier dependencies.
3) pdfplumber (fallback)
If strict PyMuPDF fails to produce tables, tito-pdf may fall back to pdfplumber.
Strict vs lenient
Strict is the default because it avoids many false positives.
Ways to get lenient behavior:
1) Explicitly:
tito-pdf input.pdf --tables --out-dir out --tables-lenient
2) Automatically (best mode fallback):
tito-pdf input.pdf --tables --out-dir out --mode best
In --mode best, if strict detection yields zero accepted tables, tito-pdf retries in lenient mode.
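The retry logic can be sketched in a few lines. The function extract_with_fallback and its parameters are hypothetical: extract stands for any callable that returns the accepted tables for a given lenient setting, and best_mode stands for "--mode best was passed".

```python
from typing import Callable, List

Table = List[List[str]]


def extract_with_fallback(
    extract: Callable[[bool], List[Table]],
    best_mode: bool = False,
    lenient: bool = False,
) -> List[Table]:
    """Run strict first; in best mode, retry leniently only if nothing was accepted."""
    tables = extract(lenient)
    if not tables and best_mode and not lenient:
        # Strict detection accepted nothing, so retry once in lenient mode.
        tables = extract(True)
    return tables
```

Because the lenient pass only runs when strict detection accepts nothing, documents with clean ruled tables never pay the false-positive cost of lenient mode.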
Why false positives happen
Common false positives:
- multi-column academic PDFs, where a “table finder” can interpret prose columns as a grid
- title blocks and page furniture (small sparse grids)
- header/footer regions
tito-pdf combats these with acceptance filters.
Acceptance filters (what gets rejected)
The goal is to accept “table-like” structures and reject page furniture.
Examples of hard filters:
- must be at least 2 rows × 2 columns
- reject very sparse grids (mostly empty)
- reject extremely large tables (too many rows/cols)
- reject tiny header/footer blocks
- reject near-full-page detections (almost always a false positive)
There are also extra guards for:
- narrow single-column detections from text-based strategies
These filters are intentionally conservative.
Implementation details (current heuristics)
The exact thresholds live in the should_accept(...) helper.
Some key rules (as of the current implementation):
- Size limits:
  - reject rows < 2 or cols < 2
  - reject cols > 30
  - reject rows > 500
- Sparsity:
  - reject tables with empty_ratio > 0.85
  - special case: for 2-row grids with 3+ columns, reject unless ~fully populated
- Bounding box guards (when bbox is available):
  - reject tiny header/footer blocks (short height, near top/bottom, low numeric density)
  - reject tiny sparse blocks (common in page furniture)
  - reject near-full-page detections (area_ratio and height_ratio both very high)
- Multi-column PDF guard:
  - PyMuPDF detections confined to a single text column (width_ratio < ~0.75) are treated as suspicious unless strongly table-like
- Text-strategy guard:
  - for text-based strategies, require a reasonable bbox (wide enough; not page-like unless strongly numeric)
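The size and sparsity rules above can be sketched as follows, assuming a table is a list of rows of cell values. This is not the real should_accept(...) helper: the name accept_by_size_and_sparsity and the definition of empty_ratio (blank cells divided by all cells) are illustrative assumptions, and the 2-row special case and the bounding-box guards are left out.

```python
from typing import Optional, Sequence

# Thresholds quoted from the rules above.
MAX_COLS = 30
MAX_ROWS = 500
MAX_EMPTY_RATIO = 0.85

Rows = Sequence[Sequence[Optional[object]]]


def empty_ratio(rows: Rows) -> float:
    """Fraction of cells that are None or whitespace-only (assumed definition)."""
    cells = [c for r in rows for c in r]
    if not cells:
        return 1.0
    empty = sum(1 for c in cells if c is None or not str(c).strip())
    return empty / len(cells)


def accept_by_size_and_sparsity(rows: Rows) -> bool:
    """Apply only the size-limit and sparsity rules."""
    n_rows = len(rows)
    n_cols = max((len(r) for r in rows), default=0)
    if n_rows < 2 or n_cols < 2:
        return False  # smaller than 2 x 2
    if n_cols > MAX_COLS or n_rows > MAX_ROWS:
        return False  # implausibly large
    if empty_ratio(rows) > MAX_EMPTY_RATIO:
        return False  # mostly empty grid
    return True
```

Ordering the cheap shape checks before the cell scan keeps the common rejections fast.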
These rules are designed to minimize “tables that are actually prose”.
Audit JSON (how to read it)
With --tables-audit-out, you get a JSON payload describing accepted tables.
Fields you’ll commonly see:
- index: table number in the output
- page: PDF page number (PDF only)
- tool: pymupdf, camelot, or pdfplumber
- mode: extractor mode string (e.g. pymupdf/lines/lines)
- rows, cols
- empty_ratio: how sparse the table is
- digit_ratio: how numeric the table looks
- sha1: content signature used for deduplication
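To make digit_ratio and sha1 concrete, here is one plausible way to compute them. The definitions are assumptions, not tito-pdf's: digit_ratio is taken as digits divided by non-whitespace characters, and the signature as a SHA-1 over whitespace-normalized cell text. The function names are hypothetical.

```python
import hashlib
from typing import List, Optional, Sequence

Rows = Sequence[Sequence[Optional[object]]]


def _cell_text(cell: Optional[object]) -> str:
    """Cell as text with runs of whitespace collapsed."""
    return "" if cell is None else " ".join(str(cell).split())


def digit_ratio(rows: Rows) -> float:
    """Digits divided by non-whitespace characters (assumed definition)."""
    chars = [ch for r in rows for c in r for ch in _cell_text(c) if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(ch.isdigit() for ch in chars) / len(chars)


def table_signature(rows: Rows) -> str:
    """SHA-1 hex digest of the normalized table content."""
    normalized = "\n".join("\t".join(_cell_text(c) for c in r) for r in rows)
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def dedupe(tables: Sequence[Rows]) -> List[Rows]:
    """Keep the first table for each distinct signature."""
    seen = set()
    unique = []
    for rows in tables:
        sig = table_signature(rows)
        if sig not in seen:
            seen.add(sig)
            unique.append(rows)
    return unique
```

Normalizing whitespace before hashing means the same table found twice with slightly different spacing still collapses to one entry.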
Some extractors also include bounding-box ratios (width/height/area relative to page):
- width_ratio: table width / page width
- height_ratio: table height / page height
- area_ratio: table area / page area
- top_ratio, bottom_ratio: vertical position (useful for header/footer filtering)
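The ratios can be derived from a table bounding box and the page rectangle. In this sketch the Box type and bbox_ratios function are hypothetical, coordinates are assumed to grow downward from the top of the page (as in PyMuPDF), and top_ratio and bottom_ratio are assumed to be the positions of the table's top and bottom edges as fractions of page height.

```python
from typing import Dict, NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle; y grows downward from the top of the page."""
    x0: float
    y0: float
    x1: float
    y1: float


def bbox_ratios(table: Box, page: Box) -> Dict[str, float]:
    """Express the table's size and position relative to the page."""
    page_w = page.x1 - page.x0
    page_h = page.y1 - page.y0
    if page_w <= 0 or page_h <= 0:
        raise ValueError("page rectangle must have positive width and height")
    width = table.x1 - table.x0
    height = table.y1 - table.y0
    return {
        "width_ratio": width / page_w,
        "height_ratio": height / page_h,
        "area_ratio": (width * height) / (page_w * page_h),
        # Assumed: edge positions measured from the top of the page.
        "top_ratio": (table.y0 - page.y0) / page_h,
        "bottom_ratio": (table.y1 - page.y0) / page_h,
    }
```

Under these definitions, a detection with height_ratio and area_ratio both close to 1 matches the near-full-page case, and a short block with top_ratio close to 0 or bottom_ratio close to 1 matches the header/footer case.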
Debugging tips
- Limit pages while iterating:
tito-pdf input.pdf --tables --out-dir out --max-pages 10
- Use --tables-audit-out to see what the extractor accepted.
- If you get no tables, try --mode best or --tables-lenient.
- If you get too many false positives, go back to strict mode (disable lenient).
DOCX tables
DOCX tables are extracted via python-docx.
- The output is deterministic.
- The audit is simple (table index, rows/cols).
See: Pipeline.