Document Extraction Pipeline Architecture: Why the Document Is the Hard Part
A frequency assignment. Three agencies. Three filing formats. In the first, it is a structured field in an HTML table, labeled, parseable, deterministic. In the second, it is embedded in the third paragraph of a free-text technical narrative, expressed as a range with non-standard notation. In the third, it does not exist at all. The filing is metadata-only, a few structured fields with no document body.
The same data point. Three extraction strategies.
We have written about how to measure extraction quality and about what extraction looks like in production. This piece is about what comes before both. The documents themselves, and why they are the hardest part of the problem. Not the model. Not the infrastructure. The input.
Why Are Regulatory Documents Harder to Extract Than Other Content?
The conventional framing of document extraction focuses on model selection: which LLM, how many parameters, what temperature, how much context. That framing misses the bottleneck. The limiting factor in regulatory extraction is format variation. The sheer diversity of document structures a system must handle to produce reliable output.
Here is a partial taxonomy of what we encounter across FCC and ITU filings.
STA modifications are free-text narratives with quantitative data embedded in prose. A frequency assignment is not in a labeled field. It is in a sentence that reads something like “the applicant requests temporary authority to operate on 14.0 to 14.5 GHz for a period not to exceed 60 days.” The numbers you need are embedded in language. Extracting them requires reasoning about sentence structure, not pattern matching.
ITU metadata-only entries present the opposite problem. A few structured fields (network name, administration, date) with no document body at all. You cannot run the same extraction logic on a filing that has no content to extract from. These entries need an entirely different pipeline path.
Debris mitigation data hides in narrative sections with no consistent field locations. One filing puts disposal orbit parameters in a technical appendix. Another buries them in a legal argument about compliance with the FCC’s deorbit rule. A third references an attached engineering statement that may or may not be available as a separate document.
Bond values (financial data in grant conditions) appear in varying formats across filing types and years. Dollar amounts are given with different precision, expressed as conditions or requirements, and sometimes reference other proceedings entirely.
Format changes across time compound everything. A 2018 FCC space station application looks nothing like a 2024 one. Field names change. Layouts shift. Sections are added, removed, or reorganized. A system trained on current formats degrades on historical filings, and historical filings are exactly what a regulatory intelligence platform needs to cover.
No public APIs exist for most agency filing systems. No established evaluation frameworks exist for regulatory document extraction. This is greenfield territory. Every team operating here is building their own solutions from scratch.
When Should You Use LLMs vs. Deterministic Methods?
LLMs are good at handling variation. That is their core strength for extraction. They can reason about document structure, interpret context, and map unfamiliar layouts to a known schema. A filing that puts the licensee name in a different field or uses non-standard terminology for a frequency band does not break an LLM the way it breaks a rule-based parser.
But LLMs are probabilistic. Run the same input through the same model twice and you can get different outputs. For fields where correctness is binary (a filing number is right or it is not), this is a fundamental problem. Deterministic methods do not have this failure mode.
Deterministic methods win where the task is well-defined. Filing number extraction is regex. FRN validation is a lookup. Filing type classification, in many cases, is a straightforward categorical problem that does not require a language model at all. These approaches are fast, cheap, auditable, and incapable of hallucination. When the task can be solved deterministically, it should be.
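A minimal sketch of the deterministic side, for illustration only. The file-number shape (`PREFIX-TYPE-YYYYMMDD-NNNNN`) is an assumption to check against the filing system you target, the sample identifiers below are made up, and the function names are hypothetical. The FRN check covers format only (ten digits); the lookup against a registry is left out.

```python
import re

# Assumed file-number shape: PREFIX-TYPE-YYYYMMDD-NNNNN.
# Illustrative only; verify against the filing system you actually parse.
FILE_NUMBER_RE = re.compile(r"\b[A-Z]{3}-[A-Z]{3}-\d{8}-\d{5}\b")


def extract_file_numbers(text: str) -> list[str]:
    """Return every substring that matches the assumed file-number shape."""
    return [m.group(0) for m in FILE_NUMBER_RE.finditer(text)]


def is_well_formed_frn(value: str) -> bool:
    """Format check only: an FRN is ten digits. A registry lookup would follow."""
    return re.fullmatch(r"\d{10}", value.strip()) is not None


# Made-up sample text, not real filings.
sample = "See SAT-MOD-20200417-00037 and the earlier SAT-LOA-20161115-00118."
file_numbers = extract_file_numbers(sample)
```

The same input always yields the same matches, and a value that does not fit the pattern is simply not returned. That is what makes this kind of step cheap to audit.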
LLMs win where the task requires reasoning. Mapping a free-text description of orbital parameters to structured schema fields. Interpreting a frequency assignment expressed in narrative prose. Handling a filing format the system has never seen before. These tasks require understanding, not pattern matching. That is where language models earn their cost.
The production reality is hybrid.
| Approach | Best For | Failure Mode | Cost Profile |
|---|---|---|---|
| Deterministic (regex, lookups, rules) | Identifiers, dates, categorical fields, known formats | Brittle on unseen formats; cannot reason | Cheap, fast, auditable |
| LLM-only | Free-text narratives, novel layouts, semantic mapping | Probabilistic; can fabricate plausible values | Expensive per call; non-deterministic |
| Hybrid (OCR plus LLM plus validation) | Heterogeneous regulatory filings at scale | Pipeline complexity; multiple failure surfaces | Tuned per stage; cost follows confidence |
Schema-compliant output does not mean correct output. We have covered that point in depth in our piece on verification engineering. The engineering challenge is designing a pipeline that uses each tool where it is strongest: deterministic methods for what is predictable, LLMs for what is not, and verification layers that catch what both miss.
What Does an Extraction Pipeline Architecture Look Like?
Here is the pipeline at a conceptual level:
Raw Filing → Type Classification → Chunking → Extraction → Normalization → Validation → Structured Data
Each stage solves a different problem. Each stage introduces its own failure modes.
Type classification determines how every subsequent stage operates. An STA modification, a new license application, and a metadata-only ITU entry all require different extraction strategies. Different prompts, different field expectations, different validation rules. Get classification wrong and the rest of the pipeline is running the wrong playbook. This stage is often the best candidate for deterministic methods. Classification is a bounded categorical problem with known labels.
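As a rough sketch of what deterministic classification can look like, the rules below are ordered, readable, and fall through to `UNKNOWN` instead of guessing. The labels, the `source` values, and the keyword rules are all hypothetical. A production rule set would be derived from the filing systems themselves.

```python
from enum import Enum


class FilingType(Enum):
    STA_MODIFICATION = "sta_modification"
    LICENSE_APPLICATION = "license_application"
    ITU_METADATA_ONLY = "itu_metadata_only"
    UNKNOWN = "unknown"


def classify_filing(source: str, title: str, body: str) -> FilingType:
    """Apply ordered, auditable rules; return UNKNOWN when none match."""
    title_lc = title.lower()
    # A filing with no document body takes the metadata-only path.
    if source == "itu" and not body.strip():
        return FilingType.ITU_METADATA_ONLY
    # Hypothetical keyword rules, checked in priority order.
    if "special temporary authority" in title_lc or "sta" in title_lc.split():
        return FilingType.STA_MODIFICATION
    if "application" in title_lc:
        return FilingType.LICENSE_APPLICATION
    return FilingType.UNKNOWN
```

Every downstream stage can then key its prompts, field expectations, and validation rules off the returned label, and `UNKNOWN` becomes an explicit signal rather than a silent misroute.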
Chunking is where document length meets model context. Short filings process in a single pass. Long filings need to be split, and regulatory filings regularly exceed 50 pages. The question is where to split them.
Page-level chunking is simple but naive. A table that spans pages gets split in half, and the model sees two fragments that make no sense independently. Semantic chunking, splitting on topic boundaries, preserves meaning better but requires understanding the document before you have processed it. Document-type-aware chunking uses classification output to apply chunking strategies tuned to specific filing structures: split technical appendices from legal narratives, keep frequency tables intact, preserve cross-referenced sections. The chunking strategy matters as much as the model choice.
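One way to keep tables intact is to chunk over layout blocks rather than pages. The sketch below assumes an upstream step has already labeled each block as `"text"` or `"table"`. The `Block` type and `chunk_blocks` function are hypothetical names, and the character budget stands in for a token budget.

```python
from dataclasses import dataclass


@dataclass
class Block:
    kind: str  # "text" or "table" (hypothetical labels from a layout step)
    text: str


def chunk_blocks(blocks: list[Block], max_chars: int) -> list[list[Block]]:
    """Greedily pack blocks into chunks without ever splitting a block.

    A block larger than max_chars gets a chunk of its own, so a long
    table stays whole instead of being cut at a size boundary.
    """
    chunks: list[list[Block]] = []
    current: list[Block] = []
    size = 0
    for block in blocks:
        # Close the current chunk if adding this block would exceed the budget.
        if current and size + len(block.text) > max_chars:
            chunks.append(current)
            current, size = [], 0
        current.append(block)
        size += len(block.text)
    if current:
        chunks.append(current)
    return chunks
```

The design choice is that the unit of splitting is the block, never the character offset. A document-type-aware variant would add rules on top, for example never separating a table from the heading block that precedes it.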
Extraction is the LLM reasoning step. The model reads document content and maps it to schema fields. This is where the document variation taxonomy hits hardest. Clean structured filings extract well. Free-text narratives with embedded quantitative data require more reasoning and introduce more error. The model needs to distinguish between data and boilerplate, identify relevant fields in unfamiliar layouts, and recognize when a field is absent rather than filling the gap with a plausible guess.
Normalization maps LLM output variants to canonical values. When a model extracts a frequency band, it might return “Ka”, “Ka-band”, “ka band”, or “K-above”, all meaning the same thing.
Detailed numbers are in the benchmarking write-up.
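A minimal sketch of the normalization step, using the band variants mentioned above. The alias table is hypothetical and would in practice be curated from observed model output. Unrecognized labels return `None` rather than a best guess.

```python
import re
from typing import Optional

# Hypothetical alias table keyed by the label with case and punctuation removed.
BAND_ALIASES = {
    "ka": "Ka",
    "kaband": "Ka",
    "kabove": "Ka",
    "ku": "Ku",
    "kuband": "Ku",
}


def normalize_band(raw: str) -> Optional[str]:
    """Map a raw band label to its canonical form, or None if unrecognized."""
    key = re.sub(r"[^a-z]", "", raw.lower())
    return BAND_ALIASES.get(key)


variants = ["Ka", "Ka-band", "ka band", "K-above"]
canonical = {normalize_band(v) for v in variants}
```

Keeping this as a lookup instead of asking the model to emit canonical values means the mapping is deterministic and reviewable, and a new variant shows up as an explicit miss.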
Validation checks extracted values against source systems. An FCC filing number should resolve to a real record in IBFS. An FRN should correspond to a registered entity. A frequency assignment should fall within an allocated band. Extractions that do not validate get rejected. Not silently corrected. Not soft-labeled. Rejected. The verification architecture that catches what schema validation misses is a separate engineering discipline.
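The reject-do-not-correct rule can be sketched as a validator that either returns the value unchanged or raises. The reference range below is placeholder data, not a real allocation table, and the type and function names are hypothetical.

```python
from dataclasses import dataclass


class ValidationError(Exception):
    """Raised when an extracted value fails a check against reference data."""


@dataclass(frozen=True)
class FrequencyRange:
    low_ghz: float
    high_ghz: float


# Placeholder reference data for illustration, not a real allocation table.
ALLOCATED_RANGES = [FrequencyRange(14.0, 14.5)]


def validate_frequency(extracted: FrequencyRange) -> FrequencyRange:
    """Return the value unchanged if it fits an allocated range; otherwise reject."""
    if extracted.low_ghz > extracted.high_ghz:
        raise ValidationError(f"inverted range: {extracted}")
    for allowed in ALLOCATED_RANGES:
        if allowed.low_ghz <= extracted.low_ghz and extracted.high_ghz <= allowed.high_ghz:
            return extracted
    raise ValidationError(f"outside allocated ranges: {extracted}")
```

There is deliberately no branch that clamps or repairs a value. A caller has to handle `ValidationError` explicitly, which keeps rejected extractions visible.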
Confidence-based routing is the pipeline’s response to document complexity. When the system’s confidence in an extraction is low (because the source document is poorly formatted, a field is ambiguous, or the filing type is uncommon), it flags the extraction for human review rather than filling gaps. This is not a verification pattern; verification is a separate concern. It is a response to document complexity: some documents are harder than others, and the pipeline should say so.
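A sketch of the routing decision, assuming each extracted field arrives with a confidence score in [0, 1] from an upstream step. How that score is produced is out of scope here, and the `0.8` threshold and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FieldExtraction:
    name: str
    value: Optional[str]
    confidence: float  # assumed score in [0, 1] from an upstream step


def route(
    fields: list[FieldExtraction], threshold: float = 0.8
) -> tuple[list[FieldExtraction], list[FieldExtraction]]:
    """Split fields into accepted values and a human-review queue."""
    accepted: list[FieldExtraction] = []
    review: list[FieldExtraction] = []
    for field in fields:
        # A missing value or a low score goes to review; nothing is filled in.
        if field.value is None or field.confidence < threshold:
            review.append(field)
        else:
            accepted.append(field)
    return accepted, review
```

In practice the threshold would likely vary by filing type and field, since a low score on an identifier matters differently from a low score on a narrative summary.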
Why Are PDF Tables the Hardest Extraction Subproblem?
Tables in PDFs are the hardest extraction subproblem. And regulatory filings are full of them. Frequency assignment tables, orbital parameter summaries, station technical data, financial schedules.
The difficulty is structural. PDF is a presentation format, not a data format. A table that looks perfectly organized on screen may have no underlying grid structure in the file. Cells that appear aligned are just text blocks positioned at specific coordinates. Merged cells, multi-row headers, tables that span page breaks. All of these are trivial for a human reader and genuinely hard for automated extraction.
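To make the "text blocks at coordinates" point concrete, here is a toy sketch that rebuilds rows from positioned words by clustering on the vertical coordinate. It assumes a PDF text-extraction step has already produced words with `x` and `y` values, with `y` increasing down the page. The `Word` type, the tolerance, and the sample values are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge
    y: float  # vertical position, assumed to increase down the page


def group_into_rows(words: list[Word], y_tolerance: float = 2.0) -> list[list[str]]:
    """Cluster words with nearby y values into rows, then order each row by x."""
    rows: list[list[Word]] = []
    for word in sorted(words, key=lambda w: w.y):
        # Join the current row if this word sits close to the previous one.
        if rows and abs(rows[-1][-1].y - word.y) <= y_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    return [[w.text for w in sorted(row, key=lambda w: w.x)] for row in rows]


# Made-up sample: two visual rows, words arriving in arbitrary order.
words = [
    Word("14.5", 120, 100.4), Word("Uplink", 10, 100.0), Word("14.0", 60, 99.8),
    Word("Downlink", 10, 115.0), Word("11.7", 60, 115.3), Word("12.2", 120, 114.9),
]
rows = group_into_rows(words)
```

This heuristic works on a clean two-row grid and breaks on exactly the cases listed above: merged cells, multi-row headers, and tables that continue across a page break.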
Open-source tools like IBM’s Docling have pushed table extraction accuracy above 97% on structured documents (corporate reports, academic papers, standardized forms). But regulatory filings are not structured documents. They are semi-structured at best, with table formatting that varies by agency, by year, and by the individual who prepared the filing. A frequency table in one FCC application may use gridlines, headers, and consistent column alignment. The same conceptual table in another filing may be a series of tab-separated values with no visual structure at all.
Table extraction from regulatory PDFs is its own engineering challenge. One that sits at the intersection of document understanding, layout analysis, and domain-specific heuristics. We will cover it in detail in a future piece.
What We Learned Building Extraction at Scale
Four takeaways from building extraction at Orbit Sentinel.
Document variation is the bottleneck, not model capability. We spent more engineering time on chunking strategies, format handling, and document-type-aware pipeline routing than on model selection or prompt engineering. The model is a component. The document landscape is the problem.
Extraction that flags uncertainty is more valuable than extraction that fills gaps. A missing field flagged for human review is a research task. A fabricated field that passes review is a liability. This connects to what we wrote in The Trust Problem. In regulatory intelligence, false precision is more dangerous than acknowledged uncertainty. The pipeline should be designed to produce gaps, not guesses.
The pipeline before the LLM matters as much as the model. Classification, chunking, and pre-processing determine what the model sees. A well-chunked document with intact tables and correctly identified section boundaries extracts better with a mediocre model than a poorly chunked document extracts with a frontier model. Garbage in, garbage out applies to context windows too.
Schema design determines the extraction ceiling. A schema that conflates distinct concepts (combining frequency band and transmission direction into a single field, or treating debris mitigation as one blob of text instead of discrete parameters) makes even a capable model produce ambiguous output. The schema is a contract between the pipeline and every system downstream. A bad contract produces bad data regardless of what sits in between.
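The two schema problems named above can be illustrated with a small sketch: band and direction as separate fields, and debris mitigation as discrete, individually nullable parameters. All field names and types here are hypothetical, not the production schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Direction(Enum):
    UPLINK = "uplink"
    DOWNLINK = "downlink"


@dataclass(frozen=True)
class FrequencyAssignment:
    band: str             # canonical band label, e.g. "Ka"
    direction: Direction  # kept separate from band, not folded into one string
    low_ghz: float
    high_ghz: float


@dataclass(frozen=True)
class DebrisMitigation:
    # Discrete parameters instead of one text blob. Each defaults to None,
    # so an absent value is recorded as a gap rather than guessed.
    disposal_method: Optional[str] = None
    disposal_altitude_km: Optional[float] = None
    deorbit_years: Optional[float] = None


assignment = FrequencyAssignment("Ku", Direction.UPLINK, 14.0, 14.5)
mitigation = DebrisMitigation(disposal_method="atmospheric reentry")
```

Because every debris parameter is optional, a filing that states only a disposal method produces a record with two explicit gaps, which downstream systems can distinguish from a value of zero or an empty string.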
These lessons point in the same direction: extraction quality is a systems problem, not a model problem. The model matters. But the documents, the pipeline, and the schema matter more.
Further Reading
- 10,000 Regulatory Filings Through Rented GPUs. What extraction looks like in production.
- Benchmarking LLMs for Domain-Specific Extraction. How we measure extraction quality.
- Building Trustworthy AI for Regulatory Intelligence. How verification layers catch what extraction misses.
- Vector Search and Embeddings for Regulatory Filings. The downstream semantic search and RAG retrieval that sits on top of structured extraction output.
- The Trust Problem. Why regulatory AI needs architectural safeguards.
- Why Space Regulatory Intelligence Is an Engineering Problem. The architectural case for purpose-built infrastructure.
- Space Regulatory Glossary. FCC, ITU, IBFS, FRN, and 100-plus more regulatory terms.
Frequently Asked Questions
- What makes regulatory documents harder to extract than other content?
- Format variation across agencies, years, and filing types. No standardized templates. Mixed content types including structured tables, free-text narratives, and metadata-only entries with no document body. A single data point like a frequency assignment can appear in three different places across three different formats from three different agencies. There are no public APIs for most systems, and no established evaluation frameworks for regulatory document extraction.
- Can LLMs replace OCR for document extraction?
- Not entirely. LLMs excel at flexible layouts and reasoning about context but are probabilistic. Identical inputs can produce different outputs. Production systems combine OCR for deterministic text extraction with LLMs for semantic understanding. The OCR layer handles what should never vary. The LLM layer handles what always does.
- Why does AI extraction accuracy vary across document types?
- Document structure determines extraction difficulty more than model capability. Clean structured filings with consistent field locations extract well. Free-text narratives with embedded quantitative data require more reasoning and introduce more error. Metadata-only entries with no document body need entirely different handling. The same model can perform well on one filing type and poorly on another.
- What is hybrid OCR plus LLM extraction?
- A production architecture that combines deterministic OCR for text extraction with LLM-based reasoning for semantic understanding and field mapping. The OCR layer provides reliable text from documents and cannot hallucinate. The LLM layer handles variation in layout and terminology. Together they balance reliability with flexibility across heterogeneous document formats.