pdfs are hostile input
Most document pipelines fail for boring reasons, not exotic model failures. The input is garbage. A procurement contract exported from SAP has invisible text layered over a raster scan, the footer repeats a confidentiality notice on every page, page 7 switches to two columns, page 11 is rotated 90 degrees, and one appendix was embedded with a font map so broken that pdfminer.six happily returns '(cid:31)(cid:52)(cid:92)' instead of words. If your ingestion path is extract text -> chunk -> embed -> pray, you won't notice the damage until search quality drops, answers get vague, or an invoice parser starts swapping totals and tax IDs.
I've seen teams blame the retriever, then the model, then the prompt, while the actual problem sat two layers lower, where the text extraction step had merged left and right columns into one sentence salad and the chunker had dutifully preserved the corruption. Embeddings are very good at making broken text look plausibly searchable. That's why these failures are expensive. They degrade gradually, they don't throw exceptions, and they produce enough decent output to keep everybody calm while recall gets worse every week.
A sane pipeline treats PDFs as hostile input and makes every transformation observable. At Steezr we've built document processing systems for customer portals, internal ERP workflows, and OCR-heavy backoffice tools, and the pattern stays the same, the teams that survive production are the ones that stop treating ingestion as a one-off preprocessing script. You need a reproducible stack, explicit layout handling, canonical chunking rules that don't drift across reruns, and verification hooks that fail closed. Otherwise a dependency bump from sentence-transformers==2.2.2 to 3.0.1, or a Tesseract language pack update, quietly changes your chunks and poisons every downstream index.
extract with layout first
The first decision is whether the PDF already contains usable text. Don't OCR everything by default, that's lazy and it destroys signal. Start with pdfminer.six==20231228 and inspect per-page character density, font maps, bounding boxes, and extraction confidence proxies. If a page yields sane text with stable coordinates, keep it. If it yields gibberish, sparse output, or suspicious repeated coordinates, fall back to OCR for that page only.
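The page-level routing decision can be captured in a couple of small heuristics. This is a sketch with illustrative thresholds (`min_chars`, the 10 percent garbage ratio) that you'd tune against your own corpus; `needs_ocr` is a hypothetical name, not a pdfminer.six API:

```python
# Heuristics for deciding whether a pdfminer.six page needs an OCR
# fallback. Thresholds are illustrative; tune them on your corpus.

def looks_garbled(text: str) -> bool:
    """Flag pages where cid escapes or replacement chars dominate."""
    if not text.strip():
        return True
    bad = text.count("(cid:") + text.count("\ufffd")
    return bad / max(len(text.split()), 1) > 0.1

def needs_ocr(page_text: str, min_chars: int = 200) -> bool:
    """A page falls back to OCR if it is too sparse or garbled."""
    return len(page_text.strip()) < min_chars or looks_garbled(page_text)
```

The point is that the decision is per page and mechanical, so you can log it and count fallback rates per run.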
A practical stack is pdfminer.six for native text, layoutparser==0.3.4 for structure inference on page images, and either tesseract==5.3.x or Google Vision for OCR. Tesseract is cheaper and easier to keep on-prem, Vision is better on ugly scans and mixed layouts, especially receipts and skewed invoices. I'd wire both behind the same page-level interface and route by document class plus failure heuristics. Something like this in Python:
page = load_pdf_page(pdf_path, page_num)
text_blocks = extract_pdfminer_blocks(page)
if low_text_density(text_blocks) or broken_encoding(text_blocks):
    image = render_page(page, dpi=300)
    layout = detect_layout(image)  # columns, tables, headers
    ocr_blocks = run_tesseract(image, psm=1, lang="eng+deu")
    blocks = align_blocks_with_layout(ocr_blocks, layout)
else:
    blocks = normalize_pdfminer_blocks(text_blocks)

The layout pass matters more than people admit. Two-column text without reading order reconstruction will ruin retrieval. A bad join operation can turn a sentence from the left column and a bullet from the right column into one chunk, then your embedder encodes nonsense with total confidence. We usually sort blocks by detected region, then y-coordinate, then x-coordinate within region, with explicit handling for tables and sidebars. layoutparser with a PubLayNet-backed detector works decently for generic business docs, though for invoices and forms I'd add custom region heuristics because generic document detectors miss boxes, stamps, and line-item grids.
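The region-then-y-then-x sort can be sketched like this. `Block` is a simplified stand-in for whatever your layout detector emits, and the y-tolerance is an assumption: quantizing y keeps blocks on roughly the same baseline in left-to-right order instead of flipping on one-pixel jitter:

```python
# Reading-order reconstruction: sort blocks by detected region, then
# top-to-bottom, then left-to-right within a region.

from dataclasses import dataclass

@dataclass
class Block:
    region: int   # index of the layout region (column, table, sidebar)
    x: float      # left edge of the bounding box
    y: float      # top edge of the bounding box
    text: str

def reading_order(blocks: list[Block], y_tol: float = 5.0) -> list[Block]:
    """Quantize y so blocks on roughly the same line sort left-to-right."""
    return sorted(blocks, key=lambda b: (b.region, round(b.y / y_tol), b.x))
```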
Also, render pages at a fixed DPI. Pick 300 and freeze it. OCR output changes with rasterization, and once you care about reproducibility, you stop letting ImageMagick defaults decide your corpus.
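Pinning the DPI means passing it explicitly on every render. With Poppler's pdftoppm, that's the `-r` flag; a small builder like this (paths and prefix are illustrative) keeps the invocation in one audited place instead of scattered defaults:

```python
# Build a pdftoppm invocation with DPI pinned explicitly, so no tool
# default can change rasterization between runs.

def render_cmd(pdf_path: str, page: int, out_prefix: str, dpi: int = 300) -> list[str]:
    return [
        "pdftoppm",
        "-r", str(dpi),     # fixed resolution: never rely on defaults
        "-png",
        "-f", str(page),    # first page to render
        "-l", str(page),    # last page: render exactly this one page
        pdf_path,
        out_prefix,
    ]
```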
canonical chunks or chaos
Chunking is where many pipelines become non-deterministic without anyone noticing. People talk about chunk size as if the only question is 512 tokens or 1024. That matters, sure, yet the harder problem is making the same source document produce the same chunk boundaries tomorrow, after a rerun, after a parser upgrade, after one page got re-OCRed because Vision timed out last week.
The fix is canonicalization before embedding. Strip headers and footers using repetition analysis across pages, normalize whitespace, collapse soft hyphenation, preserve table row boundaries when possible, and attach page metadata directly into the chunk identity, not into the text body fed to the model. I like a two-pass approach. First build page-level canonical text, then derive token-bounded chunks from that canonical form with stable overlap rules.
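The normalization half of that first pass is small and boring, which is the point. A minimal sketch, covering soft hyphens, line-break hyphenation, and whitespace; real pipelines layer table handling and Unicode normalization on top:

```python
# Page-level canonicalization: collapse hyphenation and normalize
# whitespace before chunking.

import re

def canonicalize(text: str) -> str:
    # join words broken across lines: "imple-\nment" -> "implement"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # drop Unicode soft hyphens entirely
    text = text.replace("\u00ad", "")
    # collapse runs of spaces/tabs, but keep paragraph breaks
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```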
A simple header/footer stripper can hash the top N and bottom N lines on each page, then remove lines whose normalized form appears on more than, say, 60 percent of pages. Normalized means lowercase, digits masked, whitespace collapsed. That catches Confidential, Page 7 of 42, timestamp noise, and those dreadful ERP export banners. Then compute a page hash from the cleaned canonical text plus block coordinates rounded to a tolerance, maybe 5 px if you're OCRing. After that, chunk with a deterministic tokenizer, for example sentence-transformers/all-MiniLM-L6-v2 using a fixed Hugging Face tokenizer version, fixed max length, and overlap expressed in tokens, not characters.
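The repetition stripper described above is maybe twenty lines. This sketch uses a fixed edge window and the 60 percent threshold from the text; one caveat the digit masking creates is that repeated body phrasing with numbers can collide, so check what gets stripped on a sample before trusting it:

```python
# Repetition-based header/footer stripping: normalize the top and bottom
# lines of every page, drop lines whose normalized form appears on more
# than `threshold` of pages.

import re
from collections import Counter

def normalize_line(line: str) -> str:
    line = line.lower().strip()
    line = re.sub(r"\d", "#", line)   # mask digits: "page 7 of 42" == "page 9 of 42"
    return re.sub(r"\s+", " ", line)

def strip_repeated_lines(pages: list[list[str]], edge: int = 3,
                         threshold: float = 0.6) -> list[list[str]]:
    counts = Counter()
    for lines in pages:
        counts.update({normalize_line(l) for l in lines[:edge] + lines[-edge:]})
    boilerplate = {k for k, c in counts.items() if c / len(pages) > threshold}
    return [[l for l in lines if normalize_line(l) not in boilerplate]
            for lines in pages]
```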
canon = canonicalize_page(blocks)
page_sha = sha256(canon.encode("utf-8")).hexdigest()
chunks = chunk_by_tokens(
    canon,
    tokenizer_name="sentence-transformers/all-MiniLM-L6-v2",
    max_tokens=350,
    overlap_tokens=60,
    break_on=["\n\n", "\n", ". "],
)
for idx, chunk in enumerate(chunks):
    chunk_id = sha256(f"{doc_id}:{page_num}:{page_sha}:{idx}:{chunk}".encode()).hexdigest()

This looks obsessive until you need to reindex 4 million chunks after discovering that one OCR worker had tessdata mounted from a different image. Then deterministic chunk IDs save your week. You can diff old and new ingestion runs, page by page, chunk by chunk, and decide whether the changes are expected or a rollback event.
verification has to be boring
Verification should be deterministic, cheap, and rude. If it waits for a human to notice search got worse, you don't have verification, you have hope. Every ingestion run should emit artifacts you can compare mechanically, text extraction stats, page hashes, chunk counts, OCR fallback rates, mean token length, language distribution, and a small set of document-specific assertions.
For example, on invoices, you usually know there must be exactly one invoice number candidate, at least one currency amount, and a total near the bottom half of the document. On policy manuals, you know section numbering should be monotonic and headers should mostly disappear after stripping. These aren't AI checks, they're deterministic sanity tests. Write them like tests.
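"Write them like tests" can be taken literally. A sketch of the invoice rules, returning errors instead of raising so a batch job can collect them; the `INV-` numbering scheme and the currency pattern are illustrative, not a universal invoice format:

```python
# Deterministic document-class assertions for invoices. Patterns are
# illustrative; real corpora need per-supplier rules.

import re

def check_invoice(text: str) -> list[str]:
    errors = []
    invoice_nos = re.findall(r"\bINV-\d{4,}\b", text)  # hypothetical scheme
    if len(invoice_nos) != 1:
        errors.append(f"expected exactly 1 invoice number, found {len(invoice_nos)}")
    if not re.search(r"\d+[.,]\d{2}\s?(EUR|USD|€|\$)", text):
        errors.append("no currency amount found")
    return errors
```

An empty list means the document passed; anything else fails the page before it ever reaches the index.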
We store a manifest per document version in PostgreSQL and object storage. The manifest contains parser versions, OCR engine, language packs, rasterization DPI, tokenizer version, chunking config, plus hashes for raw file, per-page canonical text, and final chunk list. During reingestion we compare the new manifest to the previous one. If changed_chunk_ratio > 0.15 and the raw file hash didn't change, fail the job. If OCR fallback jumped from 8 percent to 73 percent after a deploy, fail the batch. If a page that used to produce 1,200 characters now produces 74, fail immediately.
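The fail-closed comparison is a short function, not infrastructure. A sketch with manifests as plain dicts (ours live in PostgreSQL, but the comparison has the same shape); the 0.15 ratio comes from the text, the tripling rule for OCR fallback is an illustrative stand-in for whatever alert threshold you pick:

```python
# Fail-closed comparison of two ingestion manifests for the same file.

def verify_reingest(old: dict, new: dict, max_changed_ratio: float = 0.15) -> None:
    if old["raw_sha256"] == new["raw_sha256"]:
        # source file unchanged, so chunk drift must stay within tolerance
        old_shas = set(old["chunk_shas"])
        changed = sum(1 for sha in new["chunk_shas"] if sha not in old_shas)
        ratio = changed / max(len(new["chunk_shas"]), 1)
        if ratio > max_changed_ratio:
            raise RuntimeError(f"chunk drift {ratio:.0%} with unchanged source file")
    if new["ocr_fallback_rate"] > old["ocr_fallback_rate"] * 3:
        raise RuntimeError("OCR fallback rate tripled after deploy")
```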
A lot of teams skip this because they think they can inspect outputs manually. They can't. Once you have tens of thousands of documents, silent drift wins every time. The failure mode is always the same, no alarms, no exception, just slightly worse retrieval and a Slack thread full of vague complaints.
One more thing, version every dependency that touches text. Pin pdfminer.six, pin pytesseract, pin your Docker image, pin tokenizer files, pin language data. If you call Google Vision, record the API version and response schema assumptions. Reproducibility disappears the second one worker runs a different image.
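Recording those pins into the manifest at runtime catches the worker that drifted. A sketch using `importlib.metadata`; recording "missing" instead of crashing is a deliberate choice, so the manifest still documents what a broken worker did not have:

```python
# Resolve installed package versions for the ingestion manifest.

from importlib import metadata

def package_versions(packages: list[str]) -> dict[str, str]:
    out = {}
    for name in packages:
        try:
            out[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # record the gap rather than crash: drift shows up in the diff
            out[name] = "missing"
    return out
```

You'd extend the same manifest with the Docker image digest, tokenizer file hashes, and tessdata checksums, which `importlib.metadata` can't see.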
retrieval quality starts upstream
People love to benchmark retrievers and rerankers, then feed them malformed chunks and wonder why MRR looks unstable. Retrieval quality is mostly decided before the embedder runs. A clean chunk with preserved reading order, stripped boilerplate, and stable semantic boundaries will make even a modest model perform well. A dirty chunk forces expensive models to compensate for extraction mistakes they were never meant to fix.
For general searchable corpora, sentence-transformers/all-mpnet-base-v2 is still a solid baseline if latency isn't brutal. For lighter deployments, all-MiniLM-L6-v2 is fine, and easier to serve. Freeze the model revision and write it into the manifest. If you later migrate to bge-small-en-v1.5 or a multilingual model, treat that as a corpus migration, not a quiet config tweak. Re-embed intentionally, compare retrieval on a fixed evaluation set, and keep old indexes long enough to roll back.
The same upstream discipline helps document QA and invoice automation. If your invoice line items are chunked without table awareness, your extractor will start pairing quantities with the wrong descriptions. If your knowledge base chunks include repeated footers every 300 tokens, semantic search will over-index legal boilerplate. If page numbers drift into the body text, citations become noisy. None of this is glamorous work. It is the work.
Our default pattern for messy business docs is pretty boring by design, Django workers, PostgreSQL manifests, S3-compatible object storage, page images cached with content hashes, deterministic chunking in Python, embeddings generated in a separate versioned job, and a Next.js admin view where ops can diff two ingestion runs for the same file. That last bit matters. Engineers trust systems they can inspect, and PDF pipelines need inspection more than cleverness.
a stack i'd ship
If I had to build this next week for a document QA system or a searchable customer portal, I'd ship a pipeline that looks like this. Ingest raw PDFs into immutable storage, hash them on arrival with SHA-256, enqueue page extraction jobs, attempt native text extraction first with pdfminer.six==20231228, render failed or suspicious pages at 300 DPI using pdftoppm from Poppler 24.x, run layout detection with layoutparser==0.3.4, OCR with Tesseract 5.3 and eng, deu, or whatever languages your corpus actually contains, then canonicalize, chunk, embed, verify, and only then publish into the search index.
The data model matters. Keep documents, document_versions, pages, chunks, and ingestion_manifests as separate tables. Store raw extracted text, canonical text, and final chunk text separately. Yes, that duplicates data. Disk is cheap, debugging isn't. A chunks row should include document_version_id, page_start, page_end, chunk_index, chunk_sha, embedding_model, and canonical_config_version. That gives you enough lineage to answer the only question that matters during incidents, why did this chunk exist, and why did it change.
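The chunk lineage row from that paragraph, sketched as a frozen dataclass; in our stack this is a Django model over the chunks table, and the field values here are illustrative:

```python
# Lineage-bearing chunk record matching the columns described above.

from dataclasses import dataclass

@dataclass(frozen=True)
class ChunkRow:
    document_version_id: int
    page_start: int
    page_end: int
    chunk_index: int
    chunk_sha: str
    embedding_model: str           # model name plus pinned revision
    canonical_config_version: str  # which canonicalization config produced it
```

Frozen on purpose: a chunk row is an immutable fact about one ingestion run, never updated in place.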
You also want a dead simple replay path. Given document_version_id=8421, rerun extraction with the exact historical manifest, compare outputs, and publish a diff report. No hidden defaults, no current-config magic. If the report says:
page 14 canonical hash changed
old chars: 1832
new chars: 611
ocr_fallback: false -> true
header_strip_lines_removed: 2 -> 19

you know where to look. Maybe a parser regression, maybe a bad page render, maybe the header stripper got too aggressive. The point is that you know.
Messy PDFs never become clean. You can still build a pipeline that behaves predictably, and predictable beats clever every single time.
