Implementation details
This page documents code-level behavior in tito-pdf (the Python script), including thresholds and heuristics.
- The CLI contract is still tito-pdf --help.
- The stable behavior contract is described in Output contract.
- This page is intentionally “low level”: it exists so maintainers can reason about why outputs look the way they do.
Where logic lives
Key functions in the tito-pdf script:
- CLI + orchestration: main()
- Output mode resolution: explicit_output_mode block in main()
- PDF prep: prepare_pdf()
- OCR: ocr_pdf()
- PDF text extraction: extract_lines_layout() → drop_repeated_headers_footers() → lines_to_markdown() / lines_to_text()
- PDF tables: extract_tables() (plus should_accept() inside it)
- DOCX: extract_docx_markdown() / extract_docx_text() / extract_docx_tables()
Output mode resolution (exact behavior)
main() computes:
- md_out, raw_text_out, tables_out, tables_audit_out, assets_json_out
- explicit_output_mode = any([md_out, raw_text_out, tables_out, tables_audit_out, assets_json_out])
Explicit output mode
If explicit_output_mode is true:
- What is generated is determined only by which explicit paths are set:
  - do_md = (md_out is not None)
  - do_raw_text = (raw_text_out is not None)
  - do_tables = (tables_out is not None) or (tables_audit_out is not None)
- Validation rules:
  - If tables_audit_out is set but tables_out is missing → error (--tables-audit-out requires --tables-out).
  - If no content outputs are requested (md/raw/tables) → error.
- Convenience flags (--text, --tables, --all) are ignored.
- --out-dir is ignored.
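The explicit-mode rules above can be sketched as a small pure function. This is an illustration only: the names resolve_explicit_outputs and OutputPlan are hypothetical and do not come from the script, and raising ValueError stands in for whatever error reporting main() actually uses.

```python
from typing import NamedTuple, Optional


class OutputPlan(NamedTuple):
    """Which content outputs an explicit-mode run would generate."""
    do_md: bool
    do_raw_text: bool
    do_tables: bool


def resolve_explicit_outputs(
    md_out: Optional[str],
    raw_text_out: Optional[str],
    tables_out: Optional[str],
    tables_audit_out: Optional[str],
) -> OutputPlan:
    """Hypothetical helper mirroring the explicit-mode rules listed above."""
    # The audit file only makes sense alongside the tables file.
    if tables_audit_out is not None and tables_out is None:
        raise ValueError("--tables-audit-out requires --tables-out")

    plan = OutputPlan(
        do_md=md_out is not None,
        do_raw_text=raw_text_out is not None,
        do_tables=(tables_out is not None) or (tables_audit_out is not None),
    )
    # An assets-JSON-only request reaches this point with nothing to produce.
    if not any(plan):
        raise ValueError("no content outputs requested (md/raw/tables)")
    return plan
```

Note that a run where only the assets JSON path is set still counts as explicit mode, so it ends in the "no content outputs" error rather than falling back to convenience mode.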
Convenience mode
If no explicit paths are set:
- out_dir defaults to input_path.parent, or uses --out-dir if provided.
- do_md and do_tables come from convenience toggles:
  - --all enables both
  - otherwise --tables enables tables
  - otherwise --text enables markdown
  - if none are provided, default is Markdown only
- Convenience output filenames are derived from Path.stem:
  - text: <stem>.md
  - tables: <stem>.tables.md
There is no convenience-mode plaintext output file; --raw-text-out is explicit-only.
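A rough sketch of convenience-mode resolution follows. The function name and the returned dict are hypothetical. The toggle handling reads the "otherwise" chain literally; how the real script treats --text and --tables given together is not stated on this page, so that case is an assumption here.

```python
from pathlib import Path
from typing import Dict, Optional


def resolve_convenience_outputs(
    input_path: Path,
    out_dir: Optional[Path] = None,
    text: bool = False,
    tables: bool = False,
    all_outputs: bool = False,
) -> Dict[str, Optional[Path]]:
    """Hypothetical helper: map convenience toggles to output file paths."""
    base = out_dir if out_dir is not None else input_path.parent

    if all_outputs:
        do_md, do_tables = True, True
    elif tables:
        # Assumption: literal reading of the chain, so --tables alone wins here.
        do_md, do_tables = False, True
    else:
        # Covers both --text and the "no toggle given" default.
        do_md, do_tables = True, False

    stem = input_path.stem
    return {
        "md": base / f"{stem}.md" if do_md else None,
        "tables": base / f"{stem}.tables.md" if do_tables else None,
    }
```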
Mode + override resolution (exact mapping)
main() treats --mode as a high-level knob that sets defaults.
Inputs:
- mode ∈ {fast, robust, best}
- explicit override flags:
  - --no-ocr
  - --force-ocr
  - --tables-lenient
Resolution logic (as implemented):
- Start with no_ocr = args.no_ocr, force_ocr = args.force_ocr, tables_lenient = args.tables_lenient.
- tables_auto_fallback = False.
- If mode == "fast" and the user did not explicitly set --no-ocr or --force-ocr:
  - set no_ocr = True.
- If mode == "best" and the user did not explicitly set --no-ocr or --force-ocr:
  - set force_ocr = True.
- If mode == "best" and the user did not explicitly set --tables-lenient:
  - set tables_auto_fallback = True.
- If both no_ocr and force_ocr are true:
  - print a warning
  - force force_ocr = False (so --no-ocr wins).
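The resolution steps above might look roughly like this in isolation. The Resolved dataclass and resolve_mode name are hypothetical, and the sketch assumes the three flags are plain booleans, so "explicitly set" simply means the flag is True.

```python
import sys
from dataclasses import dataclass


@dataclass
class Resolved:
    no_ocr: bool
    force_ocr: bool
    tables_lenient: bool
    tables_auto_fallback: bool


def resolve_mode(mode: str, no_ocr: bool = False, force_ocr: bool = False,
                 tables_lenient: bool = False) -> Resolved:
    """Hypothetical helper applying the --mode defaults and explicit overrides."""
    # Assumption: boolean flags, so "user set it" is the same as "it is True".
    user_set_ocr = no_ocr or force_ocr
    tables_auto_fallback = False

    if mode == "fast" and not user_set_ocr:
        no_ocr = True
    if mode == "best" and not user_set_ocr:
        force_ocr = True
    if mode == "best" and not tables_lenient:
        tables_auto_fallback = True

    if no_ocr and force_ocr:
        # Conflicting flags: warn and let --no-ocr win.
        print("warning: --no-ocr and --force-ocr both set; ignoring --force-ocr",
              file=sys.stderr)
        force_ocr = False

    return Resolved(no_ocr, force_ocr, tables_lenient, tables_auto_fallback)
```

For example, resolve_mode("best", no_ocr=True) keeps OCR disabled but still enables the tables auto-fallback, because the two overrides are resolved independently.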
PDF preparation (prepare_pdf)
prepare_pdf(input_pdf, output_pdf) produces a normalized working copy for downstream parsing.
qpdf step
- Detect qpdf via shutil.which("qpdf").
- If missing: the run stops with an error (PDF conversion requires qpdf).
- Runs: qpdf --decrypt input.pdf output.pdf.
- If qpdf fails: the run stops with an error.
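A minimal sketch of the qpdf step, assuming failure is signalled through the exit status. The helper names are hypothetical, and treating every nonzero status as a failure is an assumption (qpdf uses exit status 3 for "succeeded with warnings", which the real script may or may not tolerate).

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_qpdf_command(qpdf_path: str, input_pdf: Path, output_pdf: Path) -> List[str]:
    """Build the argv for the normalization pass described above."""
    return [qpdf_path, "--decrypt", str(input_pdf), str(output_pdf)]


def prepare_pdf_sketch(input_pdf: Path, output_pdf: Path) -> None:
    """Hypothetical stand-in for prepare_pdf(): locate qpdf, run it, fail loudly."""
    qpdf_path = shutil.which("qpdf")
    if qpdf_path is None:
        raise RuntimeError("PDF conversion requires qpdf")

    result = subprocess.run(
        build_qpdf_command(qpdf_path, input_pdf, output_pdf),
        capture_output=True,
        text=True,
    )
    # Assumption: any nonzero exit status counts as a failed run.
    if result.returncode != 0:
        raise RuntimeError(f"qpdf failed: {result.stderr.strip()}")
```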
OCR stage (ocr_pdf)
ocr_pdf(input_pdf, output_pdf, force):
- Prefers ocrmypdf CLI entrypoint.
- If the entrypoint is missing but the Python package is installed:
  - falls back to python -m ocrmypdf.
Flags used:
- --quiet
- --output-type pdf
- either:
  - --skip-text (default)
  - or --force-ocr (when forced)
Failure behavior:
- If OCR fails, tito-pdf prints a warning and continues using the non-OCR PDF.
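The launcher preference and flag selection above could be assembled along these lines. Both function names are hypothetical, and the argument order (flags first, then input and output paths) is an assumption about how the script builds the command.

```python
import importlib.util
import shutil
import sys
from typing import List, Optional


def find_ocrmypdf_launcher() -> Optional[List[str]]:
    """Prefer the ocrmypdf entrypoint; otherwise fall back to `python -m ocrmypdf`."""
    exe = shutil.which("ocrmypdf")
    if exe:
        return [exe]
    if importlib.util.find_spec("ocrmypdf") is not None:
        return [sys.executable, "-m", "ocrmypdf"]
    return None  # OCR is unavailable


def build_ocr_command(launcher: List[str], input_pdf: str, output_pdf: str,
                      force: bool) -> List[str]:
    """Build the OCR argv with the flags listed above."""
    cmd = list(launcher) + ["--quiet", "--output-type", "pdf"]
    # --skip-text leaves pages that already have a text layer untouched;
    # --force-ocr rasterizes and re-OCRs every page.
    cmd.append("--force-ocr" if force else "--skip-text")
    cmd += [input_pdf, output_pdf]
    return cmd
```

A caller following the failure behavior above would run this command, and on a nonzero exit status print a warning and keep working with the PDF it already had.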
PDF line extraction (extract_lines_layout)
PyMuPDF provides layout metadata.
Implementation notes:
- The extractor calls page.get_text("dict") and walks:
  - blocks → lines → spans
- It creates a PdfLine with:
  - text: normalized span text joined together
  - size: median of span sizes (if available)
  - bbox: the line bounding box from PyMuPDF
  - bold: True if any span font name contains “bold”
--max-pages N limits how many pages are processed.
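To make the walk concrete, here is a sketch that operates on the dictionary returned by page.get_text("dict"), so it can be exercised without opening a PDF. The PdfLine fields follow the list above, with a page field added for illustration; the whitespace normalization and the case-insensitive "bold" check are assumptions.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Tuple


@dataclass
class PdfLine:
    page: int  # hypothetical extra field, handy for reading-order sorting
    text: str
    size: float
    bbox: Tuple[float, ...]
    bold: bool


def lines_from_page_dict(page_dict: dict, page_no: int) -> List[PdfLine]:
    """Walk blocks → lines → spans of one page's get_text("dict") output."""
    out: List[PdfLine] = []
    for block in page_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so they are skipped naturally.
        for line in block.get("lines", []):
            spans = line.get("spans", [])
            raw = "".join(span.get("text", "") for span in spans)
            text = " ".join(raw.split())  # assumption: collapse whitespace
            if not text:
                continue
            sizes = [span["size"] for span in spans if span.get("size")]
            out.append(PdfLine(
                page=page_no,
                text=text,
                size=float(median(sizes)) if sizes else 0.0,
                bbox=tuple(line.get("bbox", (0.0, 0.0, 0.0, 0.0))),
                bold=any("bold" in span.get("font", "").lower() for span in spans),
            ))
    return out
```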
Header/footer dropping (drop_repeated_headers_footers)
This is a heuristic filter intended to remove repeated page furniture.
Rules:
- If the document has fewer than 3 pages:
  - only page numbers are removed.
- Otherwise:
  - candidate strings are counted when:
    - normalized length ≤ 80 characters
    - and they are near the top or bottom of the page:
      - top: y1 <= 0.12 * page_h
      - bottom: y0 >= 0.88 * page_h
- A string is dropped if it appears on at least:
  - thresh = max(2, int(num_pages * 0.6))
Page numbers:
- removed always when they match:
  - ^\d{1,4}$
  - or ^page\s+\d{1,4}(\s+of\s+\d{1,4})?$ (case-insensitive)
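The filter can be sketched as follows, using plain (page, text, y0, y1) tuples and a single page height for simplicity. Several details are assumptions rather than documented behavior: counting a string once per page, normalizing by collapsing whitespace only, and dropping a repeated string only where it sits near the page edge.

```python
import re
from collections import defaultdict

# Both page-number patterns from the rules above, combined into one regex.
PAGE_NUMBER_RE = re.compile(
    r"^(?:\d{1,4}|page\s+\d{1,4}(\s+of\s+\d{1,4})?)$", re.IGNORECASE
)


def drop_furniture(lines, num_pages, page_h):
    """Hypothetical sketch: remove page numbers and repeated headers/footers."""
    def norm(text):
        return " ".join(text.split())

    def near_edge(y0, y1):
        return y1 <= 0.12 * page_h or y0 >= 0.88 * page_h

    repeated = set()
    if num_pages >= 3:
        pages_seen = defaultdict(set)
        for page, text, y0, y1 in lines:
            key = norm(text)
            if len(key) <= 80 and near_edge(y0, y1):
                pages_seen[key].add(page)  # assumption: count once per page
        thresh = max(2, int(num_pages * 0.6))
        repeated = {k for k, pages in pages_seen.items() if len(pages) >= thresh}

    kept = []
    for page, text, y0, y1 in lines:
        key = norm(text)
        if PAGE_NUMBER_RE.match(key):
            continue  # page numbers are always removed
        if key in repeated and near_edge(y0, y1):
            continue
        kept.append((page, text, y0, y1))
    return kept
```

With a 10-page document the threshold is max(2, int(6.0)) = 6, so a running header must appear on at least six pages before it is dropped.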
Markdown reconstruction (lines_to_markdown)
Lines are sorted in reading order:
(page, y0, x0)
Body font size inference (infer_body_font_size)
The “body size” is the mode of rounded font sizes from “normal” lines:
- ignores sizes <= 0 or > 72
- only considers lines where len(text) >= 30
- rounds sizes to the nearest 0.5
If nothing qualifies, body defaults to 12.0.
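A compact sketch of the inference, taking (text, size) pairs rather than PdfLine objects. The function name matches the section title, but the body is an illustration; the tie-breaking between equally common sizes and the exact rounding at .25/.75 boundaries are assumptions.

```python
from collections import Counter
from typing import Iterable, Tuple


def infer_body_font_size(lines: Iterable[Tuple[str, float]]) -> float:
    """Return the most common rounded font size among 'normal' lines."""
    counts = Counter()
    for text, size in lines:
        if size <= 0 or size > 72:
            continue  # implausible sizes
        if len(text) < 30:
            continue  # short lines are often headings or captions
        counts[round(size * 2) / 2] += 1  # round to the nearest 0.5
    if not counts:
        return 12.0
    return counts.most_common(1)[0][0]
```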
Heading detection (_is_heading)
A line is treated as a heading if:
- Length is between 2 and 140, and
- It is not a long sentence ending in a period:
  - reject if text.endswith('.') and len(text) > 40
Primary rule:
line.size >= body_size * 1.35
Fallbacks (for OCR/uniform-font PDFs):
- bold line and len(text) <= 80
- centered line and len(text) <= 80 and:
  - text is uppercase, or
  - matches a numbered heading like 1.2 Title
- uppercase line where 5 <= len(text) <= 60
Centered is defined as:
- the line center is within 0.12 * page_w of the page center.
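Put together, the heading test might read as below. The function names and signatures are hypothetical. Two details are assumptions: "uppercase" is taken to mean str.isupper(), and the numbered-heading check uses only the numeric prefix of the pattern, since the rest of that pattern is elided on this page.

```python
import re

# Known prefix of the numbered-heading pattern; the remainder is not documented here.
NUMBERED_PREFIX_RE = re.compile(r"^\d+(?:\.\d+){0,2}\s+")


def is_centered(x0: float, x1: float, page_w: float) -> bool:
    """True when the line's midpoint is within 0.12 * page_w of the page's midpoint."""
    return abs((x0 + x1) / 2 - page_w / 2) <= 0.12 * page_w


def is_heading(text: str, size: float, body_size: float,
               bold: bool = False, centered: bool = False) -> bool:
    n = len(text)
    if not (2 <= n <= 140):
        return False
    if text.endswith(".") and n > 40:
        return False  # long sentence, not a heading

    # Primary rule: clearly larger than body text.
    if size >= body_size * 1.35:
        return True

    # Fallbacks for OCR/uniform-font PDFs.
    if bold and n <= 80:
        return True
    if centered and n <= 80 and (text.isupper() or NUMBERED_PREFIX_RE.match(text)):
        return True
    if text.isupper() and 5 <= n <= 60:
        return True
    return False
```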
Heading level (_heading_level)
Size-based levels:
- >= 1.7 * body → H1
- >= 1.5 * body → H2
- >= 1.35 * body → H3
Fallback numbering rule:
- if the text matches ^\d+(?:\.\d+){0,2}\s+...:
  - count the dots (up to two)
  - return min(4 + dots, 6)
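A sketch of the level selection follows. The name heading_level is illustrative, only the numeric prefix of the pattern is matched (the remainder is elided above), and returning None when neither rule applies is an assumption, since the default level is not stated on this page.

```python
import re
from typing import Optional

NUMBERED_PREFIX_RE = re.compile(r"^(\d+(?:\.\d+){0,2})\s+")


def heading_level(text: str, size: float, body_size: float) -> Optional[int]:
    """Pick a Markdown heading level from font size, else from the numbering depth."""
    if size >= 1.7 * body_size:
        return 1
    if size >= 1.5 * body_size:
        return 2
    if size >= 1.35 * body_size:
        return 3

    match = NUMBERED_PREFIX_RE.match(text)
    if match:
        dots = min(match.group(1).count("."), 2)
        return min(4 + dots, 6)
    return None  # assumption: the caller decides the default level
```

Under the fallback rule, "1 Intro" maps to H4, "1.2 Scope" to H5, and "1.2.3 Details" to H6.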
List detection (_is_list_item)
A list item is detected if the line matches:
- bullet: ^[-*•]\s+\S+
- numbered: ^\d+[\).]\s+\S+
Normalization:
- numbered items are rewritten to 1. ... (Markdown auto-numbering)
- bullets are rewritten to - ...
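The two patterns and the normalization step can be tried out with a small helper. The name normalize_list_item is hypothetical; the detection regexes are the ones listed above, and the marker-stripping substitutions are an assumption about how the rewrite is done.

```python
import re
from typing import Optional

BULLET_RE = re.compile(r"^[-*•]\s+\S+")
NUMBERED_RE = re.compile(r"^\d+[\).]\s+\S+")


def normalize_list_item(text: str) -> Optional[str]:
    """Return the Markdown form of a list item, or None if the line is not one."""
    if BULLET_RE.match(text):
        return "- " + re.sub(r"^[-*•]\s+", "", text)
    if NUMBERED_RE.match(text):
        # Every numbered item becomes "1."; Markdown renderers renumber the list.
        return "1. " + re.sub(r"^\d+[\).]\s+", "", text)
    return None
```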
Paragraph joining
Paragraphs are joined until a vertical gap indicates a new paragraph.
- New paragraph if the gap is greater than:
  - max(6.0, body_size * 0.9)
Hyphenation repair:
- if the current paragraph ends with - and the next line starts with a lowercase letter:
  - the hyphen is removed and the word is joined.
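A minimal sketch of the joining loop for lines on a single page, given as (text, y0, y1) tuples in reading order. The function name is illustrative, and measuring the gap as the next line's y0 minus the previous line's y1 is an assumption.

```python
from typing import List, Tuple


def join_paragraphs(lines: List[Tuple[str, float, float]], body_size: float) -> List[str]:
    """Merge consecutive lines into paragraphs, repairing end-of-line hyphenation."""
    gap_limit = max(6.0, body_size * 0.9)
    paragraphs: List[str] = []
    current = ""
    prev_y1 = None

    for text, y0, y1 in lines:
        # Assumption: gap = top of this line minus bottom of the previous one.
        if current and prev_y1 is not None and (y0 - prev_y1) > gap_limit:
            paragraphs.append(current)
            current = ""

        if not current:
            current = text
        elif current.endswith("-") and text[:1].islower():
            current = current[:-1] + text  # "docu-" + "ment" -> "document"
        else:
            current += " " + text
        prev_y1 = y1

    if current:
        paragraphs.append(current)
    return paragraphs
```

With body_size = 10.0 the limit is 9.0 points, so ordinary line spacing of 2 to 3 points keeps lines together while a blank-line gap starts a new paragraph.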
Plaintext reconstruction (lines_to_text)
Plaintext export keeps paragraph joining and hyphenation repair but does not produce headings or lists.
Differences from Markdown:
- It flushes paragraphs on page breaks (page change always starts a new paragraph).
Tables extraction (extract_tables)
This stage produces:
- tables Markdown
- tables audit JSON payload (in memory; written if requested)
Strategy order (PDF)
1) PyMuPDF table finder
- always tries lines/lines
- in lenient mode, also tries lines/text, text/lines, text/text
2) Camelot (optional)
- tries lattice and stream
3) pdfplumber fallback
- strategy list mirrors the PyMuPDF strategy list
The function returns early when a strategy yields at least one accepted table.
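The early-return behavior amounts to a "first strategy that yields something wins" loop. The sketch below is generic: the function name and the (name, callable) pairing are hypothetical, and each callable is assumed to return only tables that already passed the acceptance filters.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def first_accepting_strategy(
    strategies: Sequence[Tuple[str, Callable[[], List]]],
) -> Tuple[Optional[str], List]:
    """Run strategies in order and stop at the first one with accepted tables."""
    for name, run in strategies:
        accepted = run()
        if accepted:
            return name, accepted  # later strategies are never executed
    return None, []
```

One consequence of this design is that a weaker result from an earlier strategy hides a possibly better result from a later one, which is why the acceptance filters need to be strict.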
Table normalization + dedup
Deduplication is done by a content signature:
- normalize each cell via _norm_text
- hash the table content (sha1)
Tables with duplicate signatures are not emitted twice.
Markdown conversion rules:
- empty rows are dropped
- tables with < 2 columns are rejected
- if the header row is empty, it is replaced with Col1..ColN
- cells escape | and newlines are flattened
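The dedup signature and the Markdown conversion rules can be illustrated together. Everything here is a sketch: the _norm_text body is a stand-in (whitespace collapsing), the way cells are joined before hashing is an assumption, and so is the order in which the header and empty-row rules are applied.

```python
import hashlib
from typing import List, Optional, Sequence


def _norm_text(cell) -> str:
    # Stand-in for the script's _norm_text: collapse whitespace, including newlines.
    return " ".join(str(cell if cell is not None else "").split())


def table_signature(rows: Sequence[Sequence]) -> str:
    """sha1 over normalized cell text; equal signatures mean duplicate tables."""
    joined = "\n".join("\t".join(_norm_text(c) for c in row) for row in rows)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def table_to_markdown(rows: Sequence[Sequence]) -> Optional[str]:
    """Convert rows to a Markdown table, or return None if the table is rejected."""
    cleaned: List[List[str]] = [
        [_norm_text(c).replace("|", "\\|") for c in row] for row in rows
    ]
    if not cleaned:
        return None
    ncols = max(len(r) for r in cleaned)
    if ncols < 2:
        return None
    cleaned = [r + [""] * (ncols - len(r)) for r in cleaned]  # pad ragged rows

    header = cleaned[0]
    body = [r for r in cleaned[1:] if any(r)]  # drop empty rows
    if not any(header):
        header = [f"Col{i + 1}" for i in range(ncols)]

    out = ["| " + " | ".join(header) + " |",
           "| " + " | ".join(["---"] * ncols) + " |"]
    out += ["| " + " | ".join(r) + " |" for r in body]
    return "\n".join(out)
```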
Acceptance filters (exact thresholds)
The core filters live in should_accept(...).
Hard size limits:
- reject rows < 2 or cols < 2
- reject cols > 30
- reject rows > 500
Two-row sparse grids:
- if rows == 2 and cols >= 3:
  - reject unless cell fill ratio is at least 0.90
Sparsity:
- reject if empty_ratio > 0.85
Bounding box guards (when bbox is available):
- reject “tiny header/footer” blocks:
  - height_ratio < 0.05 and rows <= 6 and near top/bottom (top_ratio < 0.12 or bottom_ratio > 0.88) and digit_ratio < 0.35
- reject “tiny sparse block” furniture:
  - height_ratio < 0.05 and area_ratio < 0.05 and empty_ratio > 0.55 and digit_ratio < 0.50
PyMuPDF multi-column guard:
- for tool == pymupdf and mode == pymupdf/lines/lines:
  - if width_ratio < 0.75, reject unless:
    - digit_ratio >= 0.60 and empty_ratio <= 0.60 and rows >= 4 and cols >= 3
Text-strategy guard:
- for modes containing text:
  - bbox must exist
  - reject page-like detections unless strongly numeric:
    - if (area_ratio > 0.60 or height_ratio > 0.60) and digit_ratio < 0.25 → reject
  - reject huge sparse page-like grids:
    - if height_ratio > 0.85 and empty_ratio > 0.55 and digit_ratio < 0.35 → reject
  - reject narrow tables:
    - if width_ratio < 0.75 → reject
Near-full-page hard stop:
- if area_ratio > 0.92 and height_ratio > 0.85 → reject
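For orientation, the filters in this section can be condensed into one function over precomputed ratios. This is a sketch, not the script's should_accept: the TableStats container is hypothetical, the checks run in the order they are listed here, fill ratio is taken to be 1 - empty_ratio, the page-like check is read as "(area or height) and low digits", and the multi-column guard is skipped when no bbox is available. All of those are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TableStats:
    rows: int
    cols: int
    empty_ratio: float
    digit_ratio: float
    tool: str = "pymupdf"
    mode: str = "pymupdf/lines/lines"
    # bbox-derived ratios; all None when the detector gave no bounding box
    width_ratio: Optional[float] = None
    height_ratio: Optional[float] = None
    area_ratio: Optional[float] = None
    top_ratio: Optional[float] = None
    bottom_ratio: Optional[float] = None


def should_accept_sketch(t: TableStats) -> bool:
    # Hard size limits.
    if t.rows < 2 or t.cols < 2 or t.cols > 30 or t.rows > 500:
        return False
    # Two-row sparse grids (assumption: fill ratio = 1 - empty_ratio).
    if t.rows == 2 and t.cols >= 3 and (1.0 - t.empty_ratio) < 0.90:
        return False
    # Sparsity.
    if t.empty_ratio > 0.85:
        return False

    has_bbox = t.width_ratio is not None
    if has_bbox:
        near_edge = t.top_ratio < 0.12 or t.bottom_ratio > 0.88
        if (t.height_ratio < 0.05 and t.rows <= 6 and near_edge
                and t.digit_ratio < 0.35):
            return False  # tiny header/footer block
        if (t.height_ratio < 0.05 and t.area_ratio < 0.05
                and t.empty_ratio > 0.55 and t.digit_ratio < 0.50):
            return False  # tiny sparse block

    # PyMuPDF multi-column guard (assumption: needs a bbox to apply).
    if t.tool == "pymupdf" and t.mode == "pymupdf/lines/lines" and has_bbox:
        if t.width_ratio < 0.75:
            strong = (t.digit_ratio >= 0.60 and t.empty_ratio <= 0.60
                      and t.rows >= 4 and t.cols >= 3)
            if not strong:
                return False

    # Text-strategy guard.
    if "text" in t.mode:
        if not has_bbox:
            return False
        if (t.area_ratio > 0.60 or t.height_ratio > 0.60) and t.digit_ratio < 0.25:
            return False
        if t.height_ratio > 0.85 and t.empty_ratio > 0.55 and t.digit_ratio < 0.35:
            return False
        if t.width_ratio < 0.75:
            return False

    # Near-full-page hard stop.
    if has_bbox and t.area_ratio > 0.92 and t.height_ratio > 0.85:
        return False
    return True
```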
Audit JSON fields
The tables audit JSON includes:
- rows, cols, cells_total, cells_nonempty
- empty_ratio, digit_ratio
- sha1 (dedup signature)
- bbox ratios when available: width_ratio, height_ratio, area_ratio, top_ratio, bottom_ratio
- tool and mode
DOCX extraction details
DOCX uses python-docx.
Headings
DOCX headings are detected by paragraph style name matching:
^Heading\s+(\d+)\b
The level is clamped to 1..6 and mapped to Markdown headings.
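The style-name match and clamping can be shown in a few lines. The helper name is hypothetical, and emitting ATX-style "#" prefixes is an assumption about how the Markdown headings are written.

```python
import re
from typing import Optional

HEADING_STYLE_RE = re.compile(r"^Heading\s+(\d+)\b")


def docx_heading_prefix(style_name: Optional[str]) -> Optional[str]:
    """Return the Markdown heading prefix for a paragraph style, or None."""
    match = HEADING_STYLE_RE.match(style_name or "")
    if not match:
        return None
    level = min(max(int(match.group(1)), 1), 6)  # clamp to 1..6
    return "#" * level
```

So a paragraph styled "Heading 2" becomes a "##" heading, while "Heading 9" is clamped to "######".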
Tables
Tables are converted to Markdown using the first row as the header and a --- separator row.
Assets JSON (toolchain capture)
When --assets-json is requested, tito-pdf captures:
- system tool paths + versions (best-effort)
- Python package versions (best-effort)
- stage timings and basic metrics
System tool versions are read by executing --version and capturing the first line (with a short timeout).
See: Assets JSON.
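The best-effort version capture described above could look like this. The function name and the 5-second timeout are assumptions; the page only says the timeout is short.

```python
import subprocess
import sys
from typing import List, Optional


def tool_version_first_line(cmd: List[str], timeout: float = 5.0) -> Optional[str]:
    """Run `<cmd> --version` and return the first output line, or None on any failure."""
    try:
        result = subprocess.run(
            list(cmd) + ["--version"],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None  # best-effort: a missing or hanging tool is not an error
    # Some tools print their version to stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    print(tool_version_first_line([sys.executable]))
```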