A model scores 87% on MMLU. It writes clean Python. It passes the bar exam. You point it at an FCC satellite license application and ask it to extract the orbital parameters, frequency assignments, and licensee name.
It returns valid JSON. Every field is populated. The schema is perfect.
Half the values are wrong.
This is the gap between general capability and domain-specific extraction - and no leaderboard will tell you where your model falls. The only way to know whether an LLM can extract structured data from your documents is to build your own evaluation. Here’s what we’ve learned doing it.
Why General Benchmarks Fail
General benchmarks measure the wrong thing. MMLU tests knowledge breadth. HumanEval tests code generation. HellaSwag tests commonsense reasoning. None of them test whether a model can look at a 47-page FCC filing and correctly identify that the applicant is requesting Ka-band downlink frequencies between 17.8 and 18.6 GHz with an EIRP of 54.2 dBW - and not hallucinate a C-band uplink that appears nowhere in the document.
Domain-specific extraction requires a different set of capabilities: parsing complex document structures, distinguishing data from boilerplate, handling absent fields without fabrication, and maintaining consistency across long documents. These capabilities don’t correlate well with general benchmark scores.
The academic community has recognized this. FinBen, a financial LLM benchmark spanning 36 datasets across 24 tasks, found that GPT-4 excels at information extraction and stock trading but struggles with forecasting and complex generation tasks, despite dominating general benchmarks. LegalBench, a collaborative effort by roughly 40 legal professionals across multiple institutions, demonstrated that legal reasoning tasks require domain practitioners to design evaluations, not NLP researchers alone.
The pattern is consistent across domains: general capability is necessary but not sufficient. You need domain-specific evaluation to know if a model actually works.
Building a Golden Dataset
The foundation of any extraction evaluation is a set of documents with known-correct outputs - what the industry calls a golden dataset. For regulatory filing extraction, this means hand-labeling documents with the exact field values you expect your system to extract.
Here’s what we’ve found matters in practice:
Coverage over volume. A hundred carefully labeled filings that span your document types, edge cases, and failure modes are worth more than a thousand easy examples. Include clean filings where every field is present and parseable. Include messy filings with unusual formatting, multi-page tables, and inconsistent terminology. Include filings where key fields are genuinely absent - because a model’s behavior on missing data is one of the strongest signals of extraction quality.
Label the absence. When a field doesn’t exist in the source document, your golden dataset must explicitly mark it as absent - not leave it blank. The distinction matters because the most dangerous extraction failure is fabrication: the model invents a plausible value for a field that isn’t there. If your evaluation doesn’t test for this, you won’t catch it until production.
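Concretely, a golden label record might look like the following - a hypothetical shape with illustrative field names, using an explicit sentinel so "absent from the document" can never be confused with "not yet labeled":

```python
ABSENT = "__ABSENT__"  # sentinel: the field does not exist in the source document

# Hypothetical golden record for one filing (field names illustrative).
gold = {
    "licensee_name": "Example Satellite Corp.",
    "downlink_band": "Ka-band",
    "uplink_band": ABSENT,  # explicitly labeled absent - not blank, not omitted
}

SCHEMA_FIELDS = {"licensee_name", "downlink_band", "uplink_band"}

# Every schema field must carry a label: either a value or an explicit ABSENT.
# A missing key means the annotator never decided, and the record is invalid.
assert set(gold) == SCHEMA_FIELDS
```

The payoff comes at scoring time: a model that returns any value for "uplink_band" on this filing is fabricating, and your evaluation can count that directly.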
Use domain experts. Cleanlab’s research found that existing structured output benchmarks contain substantial annotation errors in their ground truth. When your ground truth is wrong, your evaluation is measuring noise. For regulatory filings, the people labeling your golden dataset need to understand what a frequency assignment looks like, what an orbital parameter means, and where in a filing to find them. This is not a task for general-purpose annotation services.
Version your labels. As your extraction schema evolves - you add fields, change normalizations, refine categories - your golden dataset must evolve with it. Treat it like code: versioned, reviewed, and tested.
What to Measure
Not all metrics are equally useful for extraction. Here’s the hierarchy we’ve found most informative.
Field-Level Precision and Recall
The core metric. For each field in your schema, measure how often the extracted value matches the ground truth (precision) and how often the ground truth value was successfully extracted (recall). F1 combines both.
Use exact match for structured fields: dates, identifiers, frequencies, categorical values like orbit type. These have canonical forms - either the value is right or it isn’t.
Use fuzzy matching for free-text fields: entity names, debris mitigation summaries, station descriptions. A model that extracts “SpaceX Services Inc.” when the ground truth says “SpaceX Services, Inc.” shouldn’t be penalized the same way as one that extracts the wrong entity entirely.
The critical insight: report per-field metrics, not just aggregate accuracy. A model with 92% overall accuracy might be perfect on dates and disastrous on frequency assignments. If you only look at the aggregate, you’ll ship a system that silently fails on the fields that matter most.
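A per-field scorer is short enough to sketch in full. This version uses None to mean "no value" on both sides, exact match for structured fields, and a difflib ratio for free-text fields; the field names and the 0.9 threshold are assumptions to tune against your own golden dataset:

```python
from collections import defaultdict
from difflib import SequenceMatcher

FUZZY_FIELDS = {"licensee_name"}  # free-text fields get fuzzy matching

def matches(field, pred, gold):
    if field in FUZZY_FIELDS:
        return SequenceMatcher(None, str(pred), str(gold)).ratio() >= 0.9
    return pred == gold  # exact match for dates, frequencies, categories

def per_field_metrics(examples):
    """examples: list of (gold, pred) dicts; None means absent / not extracted."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for gold, pred in examples:
        for field in gold:
            g, p = gold[field], pred.get(field)
            if g is not None and p is not None and matches(field, p, g):
                tp[field] += 1
            else:
                if p is not None:
                    fp[field] += 1  # wrong or fabricated extraction
                if g is not None:
                    fn[field] += 1  # true value not recovered
    out = {}
    for field in set(tp) | set(fp) | set(fn):
        prec = tp[field] / (tp[field] + fp[field]) if tp[field] + fp[field] else 0.0
        rec = tp[field] / (tp[field] + fn[field]) if tp[field] + fn[field] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        out[field] = {"precision": prec, "recall": rec, "f1": f1}
    return out
```

Note that the fuzzy path scores "SpaceX Services Inc." against "SpaceX Services, Inc." as a match, while a fabricated value for an absent field lands as a false positive and drags down precision for that field only.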
Document-Level Accuracy
What percentage of documents had every field extracted correctly? The gap between document-level and field-level accuracy reveals whether errors are concentrated or distributed. If field-level F1 is 95% but document-level accuracy is 60%, your model is making small errors on many documents - not catastrophic errors on a few.
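Computed over the same evaluation records, document-level accuracy is a one-liner - this sketch uses exact match for brevity, with None meaning a correctly absent field:

```python
def document_accuracy(examples):
    """examples: list of (gold, pred) dicts. A document counts only if every
    field matches, including fields correctly left absent (None on both sides)."""
    perfect = sum(all(pred.get(f) == g for f, g in gold.items())
                  for gold, pred in examples)
    return perfect / len(examples)
```

Reporting this number next to field-level F1 is what exposes the concentrated-versus-distributed error pattern described above.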
Fabrication Rate
The percentage of extractions where the model returned a value for a field that doesn’t exist in the source document. This is extraction-specific and arguably the most important safety metric. A high fabrication rate means the model is inventing data - and in regulatory intelligence, invented data is worse than missing data.
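Given golden labels that mark absence explicitly, the fabrication rate falls out directly - a sketch where None in the gold dict means "this field does not exist in the source document":

```python
def fabrication_rate(examples):
    """Share of absent-in-gold fields for which the model returned a value.
    examples: list of (gold, pred) dicts; None in gold means the field is
    genuinely absent from the source document."""
    absent = fabricated = 0
    for gold, pred in examples:
        for field, g in gold.items():
            if g is None:
                absent += 1
                if pred.get(field) is not None:
                    fabricated += 1
    return fabricated / absent if absent else 0.0
```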
Confidence Calibration
If your model produces confidence scores, measure whether they’re actually predictive. Plot accuracy against reported confidence. A well-calibrated model should be right 90% of the time when it reports 90% confidence.
In practice, LLM self-reported confidence is poorly calibrated. We saw this directly in our benchmark: GPT-4o-mini reports 16% average confidence on filings it successfully extracts every time. Grok-4.20-beta reports 96% average confidence on the same filings, with the same success rate. The numbers are model-internal calibration artifacts, not meaningful measures of extraction quality. Cross-model confidence comparison is meaningless.
The more dangerous variant: high confidence paired with missing data. Grok-4.20-beta reports 96% confidence while failing to extract spectrum allocations that are explicitly present in the filing. A model that says “I found nothing” with 96% certainty is harder to catch than one that says “I found nothing” with 16% certainty. The low-confidence model at least signals that its output deserves review. The high-confidence model discourages it.
Self-reported confidence is a starting point, not a solution - which is why external confidence scoring layers like Cleanlab’s TLM exist. Their trust scores detect extraction errors with roughly 25% greater precision than alternatives like LLM-as-a-judge or raw log probabilities.
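Measuring calibration for a single model is straightforward: bucket extractions by reported confidence and compare each bucket's mean confidence to its observed accuracy. A minimal reliability-binning sketch:

```python
def reliability_bins(records, n_bins=10):
    """records: (reported_confidence in [0, 1], was_correct) pairs.
    Returns (mean confidence, observed accuracy, count) per populated bin.
    A well-calibrated model has mean confidence ~= accuracy in every bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, correct in records:
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into the top bin
        bins[idx].append((conf, correct))
    out = []
    for b in bins:
        if b:
            mean_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(1.0 if ok else 0.0 for _, ok in b) / len(b)
            out.append((mean_conf, accuracy, len(b)))
    return out
```

Run per model, never across models: a bucket of 95%-confidence extractions that is only 60% accurate tells you exactly how much to discount that model's self-reports.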
The Normalization Problem
Here’s something the benchmarks don’t tell you: a significant portion of extraction quality lives in the normalization layer, not the model.
When you ask an LLM to extract a frequency band designation, it might return “C”, “c band”, “cband”, “C-band”, “C Band”, or “c-band”. All of these mean the same thing. Your evaluation needs to decide whether that variation counts as correct - and your production system needs to handle it regardless.
In Orbit Sentinel’s extraction pipeline, our normalization layer handles 44 band designation variants that models produce for 11 canonical values (C-band through EHF). For transmission direction - a field with exactly three valid values (uplink, downlink, inter-satellite) - we’ve mapped 13 explicit phrasings plus fuzzy fallbacks: “space-to-earth”, “earth to space”, “earth-to-space (uplink)”, “inter-satellite link”, “ISL”, “bidirectional”, and others. Every one of these appeared in real extraction output from production filings.
This isn’t a model failure. It’s a schema enforcement problem. The model understood the document correctly - it just expressed the answer differently than your schema expects. The engineering response is a normalization layer that maps variants to canonical values. The evaluation response is to score normalized outputs, not raw outputs - otherwise you’re penalizing comprehension for formatting.
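The normalization layer itself is unglamorous code: a variant table plus a fuzzy fallback. This sketch is an illustrative excerpt-style mapping, not Orbit Sentinel's actual 44-entry table:

```python
# Illustrative variant table (a small excerpt-style example, not the real one).
BAND_VARIANTS = {
    "c": "C-band", "c band": "C-band", "cband": "C-band", "c-band": "C-band",
    "ka": "Ka-band", "ka band": "Ka-band", "ka-band": "Ka-band",
}

def normalize_band(raw):
    """Map a model-produced band designation to its canonical form.
    Returns None for unmapped input, which should be flagged for review
    rather than silently scored as wrong."""
    key = raw.strip().lower()
    if key in BAND_VARIANTS:
        return BAND_VARIANTS[key]
    # fuzzy fallback: drop separators entirely ("C Band" and "c-band" -> "cband")
    key = key.replace("-", "").replace(" ", "")
    return BAND_VARIANTS.get(key)
```

Scoring normalize_band(output) rather than raw output is what keeps formatting variation out of your accuracy numbers.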
Valid JSON Is Not Correct JSON
Structured output is effectively solved for JSON. Grammar-constrained decoding achieves 100% schema compliance in most frameworks. Anthropic’s structured outputs compile your JSON schema into a grammar and restrict token generation at inference time - the model literally cannot produce schema-violating tokens.
This creates a dangerous false confidence. The JSON is always valid. The schema is always followed. But the values inside it can still be fabricated, misattributed, or nonsensical. In our benchmark, an 8B parameter model extracted “FortiGate migration” as a frequency band designation and “Link Up” as a spectrum allocation band from an ECFS filing about rural broadband. Both were program names mentioned in the filing text, not frequency data. The JSON was valid. The schema was followed. The values were fabricated from contextual noise.
A well-formed JSON object containing a hallucinated frequency assignment is more dangerous than a malformed response that triggers an error - because the malformed response gets caught automatically while the hallucinated value passes through. For more on why this matters in regulatory intelligence specifically, see The Space Industry’s Trust Problem.
When we evaluate models, we separate structural compliance from factual accuracy. A model that produces valid JSON 100% of the time but gets field values right only 80% of the time is not a 100% accurate model. It’s an 80% accurate model with good formatting.
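In evaluation code, that separation is a single function boundary - a minimal sketch that reports structural validity and value accuracy as distinct numbers:

```python
import json

def score_response(raw_response, gold):
    """Separate 'did it parse' from 'are the values right'.
    Returns (structurally_valid, field_accuracy). Accuracy is None when the
    response never parsed, so the two failure modes can't be conflated."""
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError:
        return False, None
    correct = sum(data.get(f) == g for f, g in gold.items())
    return True, correct / len(gold)
```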
Multi-Model Reality
No single model wins every extraction task. This aligns with findings across the industry - SpotDraft, a legal tech company, benchmarked GPT-4, Claude, and Gemini on contract analysis and found that GPT-4 led contract review, Gemini led summarization, and GPT-4o mini led party extraction. Different tasks favor different architectures.
We see the same pattern at Orbit Sentinel. We started with 13 models across three providers: xAI (Grok-3, Grok-4 variants), OpenRouter (Llama 3.3 70B, Gemma 27B, Mistral Nemo, Liquid LFM2, and others), and Concentrate (GPT-4o Mini, Claude Haiku, DeepSeek). We ran all of them against the same 20 medium-length ECFS filings (10-100K characters each), using identical extraction prompts, the same 4,000-character recursive chunking with 250-character overlap, and the same JSON schema.
Most of them couldn’t do the job. Three free-tier models hit rate limits before completing a single filing. A 1.2B parameter model couldn’t produce valid JSON at all. An 8B model timed out on anything above 25K characters. Two models understood the extraction task perfectly but prefixed their JSON with conversational text (“Here is the extracted data:”) that broke our parser until we added a four-line fix to strip prefixes and find the first {.
After culling the non-contenders and fixing our response parser, five models survived:
| Model | Success Rate | Avg Latency | Confidence | Spectra Found | Cost (20 filings) |
|---|---|---|---|---|---|
| GPT-4o-mini | 20/20 | 17s | 16% | 48 | $0.10 |
| Claude Haiku | 18/20 | 22s | 78% | 26 | $0.24 |
| Grok-3-mini | 20/20 | 70s | 90% | 2 | $0.10 |
| Grok-4-1-fast | 20/20 | 39s | 94% | 2 | $0.57 |
| Grok-4.20-beta | 20/20 | 35s | 96% | 1 | $1.69 |
Every number in that table came from running our production extraction pipeline against real FCC filings from our database. Raw benchmark results are available for download.
The results surprised us. No single model leads across all dimensions, and the most expensive model is not the best extractor.
For regulatory filing extraction, the relevant variables are:
Context window. Regulatory filings can exceed 100 pages. Models with larger context windows handle these without chunking - which eliminates an entire class of errors where information split across chunks gets misattributed or lost. But long-context performance degrades in the middle of documents, a well-documented phenomenon. Orbit Sentinel uses recursive character splitting (trying double-newline, newline, period-space, space, and finally character-by-character boundaries) with 250-character overlap between chunks to preserve context at split points. We then merge chunk results - union of spectrum allocations deduplicated by frequency range and direction, highest-confidence orbital parameters, first non-null debris mitigation plan. The merging strategy matters as much as the chunking strategy.
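The merge step reads naturally as a fold over per-chunk results. This sketch uses hypothetical field names ("spectrum", "orbital", "debris_plan") standing in for our actual schema:

```python
def merge_chunk_results(chunks):
    """Merge per-chunk extraction dicts: union spectrum allocations deduplicated
    by (frequency range, direction), keep the highest-confidence orbital
    parameters, take the first non-null debris mitigation plan."""
    merged = {"spectrum": [], "orbital": None, "debris_plan": None}
    seen = set()
    best_conf = -1.0
    for c in chunks:
        for alloc in c.get("spectrum", []):
            key = (alloc["range_ghz"], alloc["direction"])
            if key not in seen:
                seen.add(key)
                merged["spectrum"].append(alloc)
        orb = c.get("orbital")
        if orb and orb.get("confidence", 0.0) > best_conf:
            best_conf = orb["confidence"]
            merged["orbital"] = orb
        if merged["debris_plan"] is None and c.get("debris_plan"):
            merged["debris_plan"] = c["debris_plan"]
    return merged
```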
Structured output consistency. Some models maintain near-perfect schema consistency across runs; others produce significant variation even at low temperatures. Amazon’s STED framework for comparing JSON outputs found that Claude maintains structural consistency even at temperature 0.9, while other models degrade substantially. For production extraction where you need deterministic schemas, this matters more than raw accuracy.
Cost at scale. GPT-4o-mini processed 20 filings for $0.10. Grok-4.20-beta processed the same 20 for $1.69. Both achieved 100% success. At our current queue of 37,000 pending filings, that’s the difference between $185 and $3,120 for identical success rates. The 17x cost gap buys higher self-reported confidence (96% vs 16%) but not more successful extractions. When you’re processing tens of thousands of filings, the model that’s 17x cheaper with the same success rate is the right default, with a more capable model reserved for filings where the cheap model flags uncertainty.
The pattern that works at scale is tiered: a fast, cheap model handles the 70 to 80 percent of filings that are straightforward, and a more capable model handles the rest. The evaluation framework is what tells you where that boundary falls.
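The tiered pattern is a small routing function around two extractors. The confidence threshold and the required-field check below are assumptions to tune against your own benchmark, not our production values:

```python
def route(filing, cheap_extract, strong_extract, conf_threshold=0.5):
    """Tiered extraction: run the cheap model first and escalate only when it
    flags uncertainty or misses required fields. Extractors are any callables
    returning {"confidence": float, "fields": dict}."""
    result = cheap_extract(filing)
    uncertain = result["confidence"] < conf_threshold
    incomplete = any(result["fields"].get(f) is None
                     for f in ("licensee_name", "spectrum"))  # illustrative fields
    if uncertain or incomplete:
        return strong_extract(filing), "escalated"
    return result, "cheap"
```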
The Parser Tax
One finding we didn’t expect: the benchmark measures your pipeline, not just the model. Claude Haiku and DeepSeek both understood the extraction task and produced correct JSON, but they wrapped it in conversational text. “Here is the extracted structured data from the filing:” followed by perfectly valid JSON. Our parser rejected it because it expected raw JSON, not a response that started with “Here.”
A four-line fix to find the first { in the response and parse from there turned Haiku from 0% success to 90%. DeepSeek failed for an unrelated reason (incorrect model identifier at the API layer), but the parser fix would have helped it too.
This matters because it changes how you interpret benchmark results. A model that “fails” your extraction pipeline might be producing correct output in a format your parser doesn’t expect. Before you discard a model based on success rate, check the raw responses. The gap between “model can’t extract” and “pipeline can’t parse the model’s output” is the parser tax, and it’s easy to mistake one for the other.
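The fix is small enough to show in full. This sketch mirrors the approach described above (find the first brace and parse from there), using json.JSONDecoder.raw_decode so trailing conversational text is tolerated too:

```python
import json

def parse_model_json(response: str):
    """Tolerate conversational wrappers like 'Here is the extracted data:'
    by parsing from the first '{' and ignoring anything after the object."""
    start = response.find("{")
    if start == -1:
        raise ValueError("no JSON object in model response")
    obj, _end = json.JSONDecoder().raw_decode(response[start:])
    return obj
```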
When Smaller Models Win
The conventional assumption - bigger model, better extraction - breaks down faster than you’d expect.
Industry experience consistently shows that fine-tuned smaller models can reach parity with frontier LLMs on specialized tasks with relatively modest training data - on the order of 100 to 200 labeled examples. The economics are stark: self-hosting an open-source model cuts inference costs 80 to 90 percent compared to API calls, and latency drops proportionally.
For regulatory extraction specifically, the advantage of smaller models compounds. A fine-tuned 14B parameter model that has seen thousands of FCC filings develops an implicit understanding of document structure, boilerplate patterns, and field locations that a general-purpose frontier model lacks. It doesn’t need to reason about what a frequency assignment looks like - it has seen enough of them to extract by pattern.
The tradeoff is inflexibility. A fine-tuned model that excels at FCC satellite applications may fail on ITU spectrum coordination notices because the document structure is fundamentally different. The general-purpose model handles both - just not as well as a specialist handles one.
The evaluation framework resolves this by measuring performance per document type, not in aggregate. If your fine-tuned model scores 97% on satellite applications and 45% on ITU notices, that’s useful information. Run the specialist where it excels and fall back to the generalist where it doesn’t.
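In code, that policy is just per-type scores from your evaluation run driving a dispatch. The numbers here are illustrative (the generalist scores are invented for the sketch; only the specialist's 97%/45% split echoes the example above):

```python
# Per-document-type F1 measured on your golden dataset (illustrative values).
SPECIALIST_F1 = {"fcc_satellite_application": 0.97, "itu_coordination_notice": 0.45}
GENERALIST_F1 = {"fcc_satellite_application": 0.88, "itu_coordination_notice": 0.81}

def pick_model(doc_type):
    """Route each document type to whichever model measured better on it;
    unknown types fall back to the generalist."""
    if SPECIALIST_F1.get(doc_type, 0.0) > GENERALIST_F1.get(doc_type, 0.0):
        return "specialist"
    return "generalist"
```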
Building Your Evaluation Pipeline
Here’s the practical framework:
1. Assemble Your Golden Dataset
Start with 100 to 200 documents. Cover every document type your system processes. Include edge cases and documents where fields are absent. Have domain experts label them. Budget more time for this than you think - it’s the foundation everything else depends on.
2. Define Your Metrics
Choose exact match or fuzzy match per field. Set your fabrication detection rules. Decide whether you’re measuring raw or normalized output. Document these decisions - they’re as important as the results.
3. Run the Benchmark
Test every model you’re considering under production-like conditions. Same prompts, same chunking strategy, same temperature settings. Capture accuracy, latency, token usage, and cost per document. Save the raw outputs - you’ll want to analyze failure patterns, not just scores.
At Orbit Sentinel, we built a CLI benchmark tool that pulls real filings from our database (filterable by size: small under 10K characters, medium 10-100K, large over 100K), runs each model’s extraction against the same documents, and outputs a comparison table with success/failure counts, average latency, confidence percentages, spectrum allocations found, parse warnings, and estimated cost. Every run saves raw JSON results for deeper analysis. The tool supports any provider behind an OpenAI-compatible API, so adding a new model to the comparison takes one line of configuration.
4. Analyze Failure Patterns
The aggregate score tells you which model is best. The failure patterns tell you why and how to improve. Cluster errors by type: fabrication, misattribution, formatting, missing extraction. Look for systematic patterns - if a model consistently fails on table data, that’s a prompt engineering or chunking problem, not a model problem.
5. Set Your Threshold
Decide what “good enough” means for your use case. In regulatory intelligence, fabrication tolerance is near zero - invented data creates compliance risk. But a 5% miss rate on a non-critical field might be acceptable if it’s flagged. Your threshold should vary by field and by consequence.
6. Monitor in Production
Your evaluation doesn’t end at deployment. Models drift, document formats change, and edge cases you didn’t anticipate appear. Sample production extractions for human review on an ongoing basis. Track your metrics over time. When accuracy drops, update your golden dataset and re-evaluate.
What We’ve Learned
Four takeaways from running extraction benchmarks at Orbit Sentinel.
The cheapest model was the best extractor. GPT-4o-mini matched the most expensive model’s success rate at 1/17th the cost. But the real gap wasn’t cost - it was extraction quality. GPT-4o-mini found 48 spectrum allocations that three Grok models missed entirely. The expensive models passed the benchmark while failing the actual task.
The normalization layer matters as much as the model. Roughly half of what looks like model error is actually schema enforcement. An 8B model that hallucinated “FortiGate migration” as a band designation was misreading context, but a model that returns “c band” instead of “C-band” is doing fine. Our normalization layer handles 44 of those variants. Before you switch models, check whether better normalization would fix the problem.
Your pipeline is part of the benchmark. A four-line parser fix turned Claude Haiku from 0% to 90%. If we’d stopped at run 1, we would have concluded Haiku can’t do extraction. It can. Our parser couldn’t read its output. Test the pipeline, not just the model.
Confidence scores are model-specific noise. Same filings, same extractions, same success rate: 16% confidence from one model, 96% from another. Don’t use self-reported confidence to compare models. Use it only within a single model to rank its own uncertain extractions for human review.
Orbit Sentinel’s extraction pipeline processes filings from four federal agencies in real time. Request early access to see verified regulatory intelligence in action.
Further Reading
- AI Hallucination in Regulatory Data: The Space Industry’s Trust Problem - Why regulatory AI needs architectural safeguards, not just better prompts
- The FCC’s 5-Year Deorbit Rule - The kind of regulatory complexity that extraction systems must handle accurately
- Space Regulatory Glossary - 118 terms across agencies, regulations, spectrum, orbital mechanics, and AI
- How Satellite Licensing Works - A guide to the multi-agency filing process extraction systems navigate
- U.S. Space Regulatory Compliance: A Complete Guide - The four-agency regulatory framework these extraction systems must cover
- FAA Part 450: What Operators Need to Know - Performance-based launch licensing and the regulatory data it generates
- How Ground Station Licensing Works - Earth station licensing, frequency coordination, and the ground segment regulatory layer
Frequently Asked Questions
- Why don't general LLM benchmarks predict extraction performance?
- General benchmarks like MMLU and HumanEval measure broad reasoning and coding ability, not whether a model can reliably extract a filing date, frequency assignment, or licensee name from an unstructured regulatory document. A model that scores well on general knowledge can still hallucinate field values, misparse tables, or fabricate plausible-looking data when applied to domain-specific extraction tasks. The only way to know if a model works for your use case is to test it on your data.
- How many labeled examples do you need for a domain-specific evaluation set?
- Research from NVIDIA and others shows that 100 to 200 hand-labeled examples are sufficient to meaningfully differentiate model performance on specialized extraction tasks. The key is coverage - your evaluation set should include clean filings, messy filings, edge cases, and documents where fields are absent. Quality of labels matters more than quantity.
- Can a smaller open-source model outperform GPT-4 on extraction?
- Yes. Multiple studies have shown that fine-tuned models in the 7B to 14B parameter range can match or exceed frontier models on domain-specific extraction when the task is well-defined and training data is representative. Smaller models are also significantly cheaper and faster to run at scale - often 80 to 90 percent less expensive than API calls to frontier models.
- What metrics should you use for extraction evaluation?
- Field-level precision, recall, and F1 are the standard metrics. Use exact match for structured fields like dates, identifiers, and categorical values. Use fuzzy or partial matching for free-text fields like entity names where minor variations are acceptable. Always report both field-level and document-level accuracy - the gap between them reveals whether errors are concentrated in specific field types or distributed across documents.
- What is the most common extraction failure mode?
- Confident wrong answers - the model extracts a value that looks plausible but is factually incorrect, without indicating any uncertainty. This is more dangerous than outright failures or malformed output because it passes casual review. Common examples include fabricating values for fields that don't exist in the source document, misattributing data from one section to another, and normalizing values incorrectly.