tito-PDF Documentation

Tables

tito-pdf can extract tables from PDFs and DOCX and write them as Markdown.

Tables are intentionally treated as optional output:

Outputs

To request tables, either pass --tables together with an output directory:

tito-pdf input.pdf --tables --out-dir out
# => out/input.tables.md

or name the output files explicitly:

tito-pdf input.pdf \
  --tables-out out/input.tables.md \
  --tables-audit-out out/input.tables.audit.json

Notes:

Strategy order (PDF)

For PDFs, table extraction uses multiple deterministic strategies.

Order:

  1. PyMuPDF table finder (primary)
  2. Camelot (optional; only if installed)
  3. pdfplumber (fallback)

Strategies run in the order listed, and the implementation stops as soon as one of them produces at least one accepted table; later strategies are not tried.
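The ordered, stop-early behavior can be pictured as a simple fallback chain. The sketch below is illustrative only: `extract_with_fallback`, the `Table` alias, and the `accept` callback are hypothetical names, not tito-pdf's actual internals. A strategy of `None` stands in for an optional backend (such as Camelot) that is not installed.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Table = List[List[str]]                  # rows of cell text (assumed shape)
Strategy = Callable[[str], List[Table]]  # takes a PDF path, returns candidate tables


def extract_with_fallback(
    pdf_path: str,
    strategies: Iterable[Tuple[str, Optional[Strategy]]],
    accept: Callable[[Table], bool],
) -> Tuple[Optional[str], List[Table]]:
    """Try each strategy in order and return the first non-empty accepted result."""
    for name, strategy in strategies:
        if strategy is None:
            # Optional backend is unavailable; skip it.
            continue
        accepted = [table for table in strategy(pdf_path) if accept(table)]
        if accepted:
            # Early stop: later strategies are never called.
            return name, accepted
    return None, []
```

Returning the strategy name alongside the tables makes it easy to record which extractor produced the result.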

1) PyMuPDF (primary)

PyMuPDF is already required for layout-aware text extraction, so it is the primary table detector.

Strict mode:

Lenient mode:

2) Camelot (optional)

If camelot is installed in the runtime environment, tito-pdf will try it.
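One common way to detect an optional dependency without importing it is `importlib.util.find_spec`. This is a minimal sketch of that pattern; whether tito-pdf uses this exact check is an assumption, and `is_available` is a hypothetical helper name.

```python
import importlib.util


def is_available(module_name: str) -> bool:
    """Return True if a top-level module can be found in this environment."""
    # find_spec returns None for a top-level module that is not installed.
    return importlib.util.find_spec(module_name) is not None


# Decide at runtime whether the optional strategy should be attempted.
USE_CAMELOT = is_available("camelot")
```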

Notes:

3) pdfplumber (fallback)

If strict PyMuPDF fails to produce tables, tito-pdf may fall back to pdfplumber.

Strict vs lenient

Strict is the default because it avoids many false positives.

Ways to get lenient behavior:

1) Explicitly:

tito-pdf input.pdf --tables --out-dir out --tables-lenient

2) Automatically (best mode fallback):

tito-pdf input.pdf --tables --out-dir out --mode best

In --mode best, if strict detection yields zero accepted tables, tito-pdf retries in lenient mode.
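The two routes to lenient behavior can be summarized in a few lines. The sketch below is an assumption-laden illustration: `detect_tables` and its `detect` callback are hypothetical, and `"default"` is a placeholder for any mode other than `best`.

```python
from typing import Callable, List, Tuple

Table = List[List[str]]  # rows of cell text (assumed shape)


def detect_tables(
    detect: Callable[[bool], List[Table]],
    mode: str = "default",
    lenient: bool = False,
) -> Tuple[List[Table], bool]:
    """Return (accepted_tables, used_lenient).

    `detect(lenient)` runs one detection pass and returns accepted tables.
    """
    if lenient:
        # Explicit request, as with --tables-lenient.
        return detect(True), True
    tables = detect(False)
    if not tables and mode == "best":
        # Strict found nothing, so retry once in lenient mode.
        return detect(True), True
    return tables, False
```

Outside `best` mode an empty strict result is returned as-is, which keeps the default behavior conservative.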

Why false positives happen

Common false positives:

tito-pdf combats these with acceptance filters.

Acceptance filters (what gets rejected)

The goal is to accept “table-like” structures and reject page furniture.

Examples of hard filters:

There are also extra guards for:

These filters are intentionally conservative.

Implementation details (current heuristics)

The exact thresholds live in the should_accept(...) helper.

Some key rules (as of the current implementation):

These rules are designed to minimize “tables that are actually prose”.

Audit JSON (how to read it)

With --tables-audit-out, you get a JSON payload describing accepted tables.
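When the exact field set is unclear, a schema-agnostic summary is a safe way to explore an audit file. The helper below makes no assumption about field names; it only walks the JSON and reports keys and value types. `describe` and `describe_file` are hypothetical helpers, not part of tito-pdf.

```python
import json
from typing import Any, List


def describe(value: Any, indent: int = 0, max_depth: int = 3) -> List[str]:
    """Return lines describing the keys and value types of a JSON payload."""
    pad = "  " * indent
    if indent >= max_depth:
        return [f"{pad}..."]
    if isinstance(value, dict):
        lines: List[str] = []
        for key, item in value.items():
            lines.append(f"{pad}{key}: {type(item).__name__}")
            if isinstance(item, (dict, list)):
                lines.extend(describe(item, indent + 1, max_depth))
        return lines
    if isinstance(value, list):
        lines = [f"{pad}[{len(value)} items]"]
        if value:
            # Only the first element is inspected, as a representative.
            lines.extend(describe(value[0], indent + 1, max_depth))
        return lines
    return [f"{pad}{type(value).__name__}"]


def describe_file(path: str) -> str:
    """Load a JSON file and return its structural summary as text."""
    with open(path, encoding="utf-8") as handle:
        return "\n".join(describe(json.load(handle)))
```

For example, `print(describe_file("out/input.tables.audit.json"))` would list whatever keys the audit file from the command above actually contains.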

Fields you’ll commonly see:

Some extractors also include bounding-box ratios: the table's width, height, and area, each expressed relative to the page.
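As a worked illustration of what such ratios mean, the sketch below computes them from a bounding box and page size. The `(x0, y0, x1, y1)` box layout and the key names `width_ratio`, `height_ratio`, and `area_ratio` are assumptions for this example; the keys in the real audit payload may differ.

```python
from typing import Dict, Tuple


def bbox_ratios(
    bbox: Tuple[float, float, float, float],
    page_width: float,
    page_height: float,
) -> Dict[str, float]:
    """Express a table bounding box as fractions of the page dimensions."""
    x0, y0, x1, y1 = bbox
    width_ratio = (x1 - x0) / page_width
    height_ratio = (y1 - y0) / page_height
    return {
        "width_ratio": width_ratio,
        "height_ratio": height_ratio,
        # Area ratio is the product of the two linear ratios.
        "area_ratio": width_ratio * height_ratio,
    }
```

A box covering half the width and half the height of a page yields an area ratio of 0.25. A ratio close to 1.0 in every dimension would suggest the detected region is essentially the whole page rather than a table.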

Debugging tips

tito-pdf input.pdf --tables --out-dir out --max-pages 10

DOCX tables

DOCX tables are extracted via python-docx.
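A minimal sketch of the idea, assuming python-docx is installed: iterate `Document(path).tables`, read each cell's `.text`, and render the rows as a Markdown pipe table. The helper names and the exact Markdown layout are illustrative and may not match tito-pdf's output.

```python
from typing import List


def rows_to_markdown(rows: List[List[str]]) -> str:
    """Render rows as a pipe table; the first row is treated as the header."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def fmt(row: List[str]) -> str:
        # Escape pipes and flatten newlines so each row stays on one line.
        cells = [c.replace("|", "\\|").replace("\n", " ").strip() for c in row]
        cells += [""] * (width - len(cells))  # pad short rows
        return "| " + " | ".join(cells) + " |"

    lines = [fmt(rows[0]), "| " + " | ".join(["---"] * width) + " |"]
    lines += [fmt(row) for row in rows[1:]]
    return "\n".join(lines)


def docx_tables_to_markdown(path: str) -> List[str]:
    """Return one Markdown table per table found in the DOCX body."""
    from docx import Document  # python-docx, imported lazily

    document = Document(path)
    return [
        rows_to_markdown([[cell.text for cell in row.cells] for row in table.rows])
        for table in document.tables
    ]
```

Keeping the rendering separate from the DOCX reading means the Markdown step can be exercised without a document on disk.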

See: Pipeline.