Viventine Space Systems
The Downlink

Mission Log: Running 10,000 FCC and ITU Filings Through Rented Consumer GPUs

What Is Batch LLM Extraction on Rented GPUs?

Batch LLM extraction on rented GPUs is the practice of renting consumer-grade GPU hardware from a marketplace (such as the dozens of operators tracked by Hugging Face’s inference cost surveys), deploying an open-source language model inside a container, converting a fixed batch of documents into structured output, and tearing down the infrastructure when the batch completes. It is the opposite of reserved cloud capacity. There is no idle compute, no managed inference endpoint, and no proprietary API in the loop. For bursty workloads (thousands of documents at a time, then nothing for hours or days), it is the cheapest way to run extraction at scale that we have measured.

We needed structured data from thousands of FCC and ITU filings: orbital parameters, frequency assignments, entity names, dates. Cloud GPU pricing made batch processing prohibitive. So we built the pipeline that this Mission Log describes. The rest of this entry covers the problem, the architecture, the throughput math, the accuracy ceiling, and the confidence-scoring layer that keeps the output trustworthy. The infrastructure described here sits on top of the pipeline architecture that processes these filings and turns variable regulatory documents into structured data.

The Problem: Regulatory Filings Are Semi-Structured at Best

Regulatory filings are semi-structured at best. FCC IBFS entries (the International Bureau Filing System under 47 CFR Part 25) range from clean tabular forms to free-text narratives with no consistent field structure. ITU filings, including SNS database records under the Radio Regulations, are often metadata-only, containing a few fields with no document body to parse. The data we need is buried in inconsistent formats across agencies and years.

We weren’t dealing with ten filings. We needed extraction at scale: thousands of documents, each one potentially structured differently from the last. Manual review doesn’t work at that volume. Neither does regex. The variation is too high and the formats too inconsistent for rule-based approaches. We needed a model that could reason about document structure and extract fields even when the layout changed between filings. That is the same conclusion the legal and financial NLP communities reached in projects like LegalBench and FinBen.

The Architecture: Containerized, Ephemeral, Marketplace-Sourced

The economics of cloud GPUs don’t work for batch workloads that run for hours and then go idle. Reserved instances mean paying for capacity you aren’t using. On-demand pricing at the major providers adds up fast: high-VRAM GPU instances at hyperscalers typically run several dollars per hour, and that is before egress.

We went a different direction: a GPU marketplace where you rent consumer-grade hardware by the hour. Sub-dollar hourly rates. A full batch of thousands of filings runs on a single-digit budget. The pipeline is containerized:

  1. Spin up a GPU instance on the marketplace, selecting by VRAM and price.
  2. Pull the container, which packages the model weights, an inference server, and the extraction logic. Batch inference benefits from continuous batching as implemented in vLLM (Kwon et al., “Efficient Memory Management for Large Language Model Serving with PagedAttention”, SOSP 2023).
  3. Stream the batch of filings through the model, validating each output against a JSON schema.
  4. Persist structured records plus per-field confidence scores to durable storage.
  5. Tear down the instance the moment the batch completes.
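Steps 3 and 4 can be sketched as a validate-then-persist loop. This is a minimal sketch, not our production code: the field names in `REQUIRED_FIELDS` are illustrative, `extract_fields` is a hypothetical callable standing in for the inference call, and the hand-rolled type check stands in for a full JSON Schema validator.

```python
import json
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical schema: field name -> accepted Python type(s) after JSON decoding.
REQUIRED_FIELDS = {
    "entity_name": str,
    "frequency_mhz": (int, float),
    "filing_date": str,
}


def validate_record(raw: str) -> Optional[dict]:
    """Return the parsed record if it matches REQUIRED_FIELDS, else None."""
    try:
        record = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(record, dict):
        return None
    for name, expected in REQUIRED_FIELDS.items():
        if name not in record:
            return None
        value = record[name]
        if value is None:
            # An explicit null is allowed: the model may decline to fill a field.
            continue
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            return None
    return record


def run_batch(
    filings: Iterable[Tuple[str, str]],
    extract_fields: Callable[[str], str],
) -> Tuple[List[dict], List[str]]:
    """Run every (filing_id, text) pair through the model and validate the output."""
    accepted, rejected = [], []
    for filing_id, text in filings:
        record = validate_record(extract_fields(text))
        if record is None:
            rejected.append(filing_id)  # malformed output never reaches storage
        else:
            accepted.append({"filing_id": filing_id, **record})
    return accepted, rejected
```

Rejected filing identifiers can be retried or sent to review; only records in `accepted` would be persisted in step 4.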

No cloud lock-in. No reserved instances. No idle capacity. Moving to a different marketplace, a different cloud, or on-prem hardware requires changing one environment variable.
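One way to get single-variable portability is to resolve every host-specific setting from one environment variable at startup. The sketch below assumes a hypothetical variable name (`EXTRACTION_HOST`) and made-up price and VRAM limits; none of these values come from our deployment.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class HostConfig:
    name: str
    max_hourly_usd: float  # price ceiling when selecting an instance
    min_vram_gb: int       # smallest GPU the model is assumed to fit on


# Hypothetical profiles; all numbers are placeholders for illustration.
HOSTS = {
    "marketplace": HostConfig("marketplace", max_hourly_usd=0.60, min_vram_gb=24),
    "cloud": HostConfig("cloud", max_hourly_usd=4.00, min_vram_gb=24),
    "onprem": HostConfig("onprem", max_hourly_usd=0.00, min_vram_gb=24),
}


def load_host_config(env: Optional[Mapping[str, str]] = None) -> HostConfig:
    """Pick the host profile named by EXTRACTION_HOST (default: marketplace)."""
    env = os.environ if env is None else env
    target = env.get("EXTRACTION_HOST", "marketplace")
    if target not in HOSTS:
        raise ValueError(f"unknown EXTRACTION_HOST: {target!r}")
    return HOSTS[target]
```

Because the container image is identical across hosts, the rest of the pipeline only ever sees a `HostConfig`.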

Throughput: Hundreds of Filings Per Hour Per GPU

Processing throughput lands in the hundreds of filings per hour per GPU. An open-source reasoning model handles the extraction: it reads each filing, identifies the relevant fields, and produces structured output. The entire run is ephemeral. When the batch is done, the infrastructure disappears.
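The throughput math reduces to two divisions and a multiplication. The helper below is a back-of-the-envelope sketch; the function name and the example inputs are illustrative assumptions, not measured figures from our runs.

```python
from typing import Tuple


def estimate_batch(
    filings: int,
    filings_per_hour_per_gpu: float,
    usd_per_gpu_hour: float,
    gpus: int = 1,
) -> Tuple[float, float, float]:
    """Return (gpu_hours, wall_clock_hours, cost_usd) for one batch."""
    if filings_per_hour_per_gpu <= 0 or gpus <= 0:
        raise ValueError("throughput and GPU count must be positive")
    gpu_hours = filings / filings_per_hour_per_gpu
    wall_clock_hours = gpu_hours / gpus     # more GPUs shorten the run...
    cost_usd = gpu_hours * usd_per_gpu_hour  # ...but not the GPU-hour bill
    return gpu_hours, wall_clock_hours, cost_usd


# Illustrative inputs only: 500 filings/hour/GPU at $0.40 per GPU-hour.
gpu_hours, hours, cost = estimate_batch(10_000, 500, 0.40)
print(f"{gpu_hours:.0f} GPU-hours, {hours:.0f} h wall clock, ${cost:.2f}")
```

With those assumed inputs, 10,000 filings come to 20 GPU-hours and roughly $8. Adding GPUs shortens the wall-clock time but leaves the GPU-hour total unchanged.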

Three variables move throughput more than anything else:

  • Document length. Narrative filings from ECFS (the FCC’s Electronic Comment Filing System) run 50,000 to 100,000 characters and take longer per document than IBFS form data. We chunk long filings with overlap (4,000-character chunks with a 250-character overlap is what we landed on in our benchmarking work) so the model sees coherent context windows without truncation losses.
  • Schema complexity. Extracting twelve fields per filing is faster than extracting forty. Schema design is a throughput lever, not just an accuracy lever.
  • Batch size and continuous batching. The vLLM-style continuous batching approach (Kwon et al., 2023) packs requests dynamically and is the single largest throughput multiplier we have measured on consumer GPUs.
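The chunking described in the first bullet can be sketched as a sliding window over characters. The defaults match the 4,000/250 figures above; the function name and the character-based (rather than token-based) split are assumptions for illustration.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 4000, overlap: int = 250) -> List[str]:
    """Split text into chunks where each chunk repeats the tail of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the document
        start += step
    return chunks
```

A 10,000-character filing yields three chunks starting at offsets 0, 3,750, and 7,500, so a field value that straddles a boundary appears whole in at least one chunk as long as it is shorter than the overlap.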

What Surprised Us: Document Variation Was the Hard Part

The GPU was the easy part. Document variation was the hard part.

Filing formats vary by agency, by year, by filing type. A 2018 FCC space station application looks nothing like a 2024 one. Field names change. Layouts shift. Some filings have structured tables; others bury the same data in paragraph text. ITU metadata-only entries need completely different handling than FCC narrative filings because you can’t run the same extraction logic on a document that has no document body.
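A minimal sketch of that split, assuming each filing arrives as a dictionary with an optional `body` key (a hypothetical shape, not our actual record format): metadata-only entries bypass the text extraction path entirely.

```python
def choose_handler(filing: dict) -> str:
    """Route a filing by whether it has any document body to parse."""
    body = (filing.get("body") or "").strip()
    if not body:
        # Nothing to read: map the metadata fields directly instead.
        return "metadata_only"
    return "narrative"
```

Treating whitespace-only bodies as empty keeps blank documents out of the model queue.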

Extraction accuracy varies accordingly. Some filing types extract cleanly, with field-level accuracy above 90 percent. Others, particularly older filings with inconsistent formatting or filings that combine narrative and tabular data, require human review.

Confidence Scoring: Refusal Beats Fabrication

The system needs to know when it is unsure. More importantly, it needs to say so.

This connects directly to what we wrote about in The Trust Problem: confidence boundaries matter. When extraction confidence is low, the system should flag it explicitly rather than filling gaps with plausible guesses. A regulatory intelligence platform that silently invents a frequency assignment or misattributes an orbital parameter is worse than one that says “I’m not sure.”

The failure mode has names in the standards literature:

  • OWASP LLM09: Overreliance, from the 2023 edition of the OWASP Top 10 for LLM Applications, names overreliance on unverified model output as a top risk category.
  • The Measure function of the NIST AI Risk Management Framework (AI 100-1) calls for measurable accuracy and uncertainty metrics on AI system outputs before deployment.
  • Fabrication rate as a domain-specific metric, distinct from generic factuality, is the metric we track most carefully. It measures how often the model returns a value for a field that does not exist in the source document. In regulatory intelligence, invented data is worse than missing data.
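Fabrication rate can be computed directly from a golden dataset. The sketch below assumes a data shape we chose for illustration: each filing maps field names to a value, with `None` meaning the field does not exist in the source document. The denominator (fields that are truly absent) is also an assumption; other normalizations are possible.

```python
from typing import Dict, Optional

Fields = Dict[str, Optional[str]]


def fabrication_rate(golden: Dict[str, Fields], predicted: Dict[str, Fields]) -> float:
    """Share of truly absent fields for which the model still returned a value."""
    absent = 0
    fabricated = 0
    for filing_id, gold_fields in golden.items():
        pred_fields = predicted.get(filing_id, {})
        for name, gold_value in gold_fields.items():
            if gold_value is not None:
                continue  # field exists in the source; not a fabrication case
            absent += 1
            if pred_fields.get(name) is not None:
                fabricated += 1  # model invented a value for a missing field
    return fabricated / absent if absent else 0.0
```

Wrong values for fields that do exist count against accuracy, not fabrication, which keeps the two metrics separate.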

We built confidence scoring into the extraction pipeline from the start. Every field that ships downstream carries a confidence tier. Low-confidence extractions route to human review before they enter the database that customers query.
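The routing rule can be sketched as a threshold check per field. The tier names, the 0.90 and 0.60 cutoffs, and the `(value, score)` input shape are all hypothetical; how the score itself is produced is outside this sketch.

```python
from enum import Enum
from typing import Any, Dict, Tuple


class Tier(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


def tier_for(score: float, high: float = 0.90, medium: float = 0.60) -> Tier:
    """Map a confidence score in [0, 1] to a tier using assumed cutoffs."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score >= high:
        return Tier.HIGH
    if score >= medium:
        return Tier.MEDIUM
    return Tier.LOW


def route(
    fields: Dict[str, Tuple[Any, float]],
) -> Tuple[Dict[str, Tuple[Any, Tier]], Dict[str, Tuple[Any, Tier]]]:
    """Split extracted fields into (ship downstream, send to human review)."""
    ship, review = {}, {}
    for name, (value, score) in fields.items():
        tier = tier_for(score)
        target = review if tier is Tier.LOW else ship
        target[name] = (value, tier)  # every field keeps its tier either way
    return ship, review
```

Only fields in the first dictionary would reach the customer-facing database; the second dictionary is the human review queue.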

What We Deliberately Did Not Do

A few choices in the negative space are worth naming, because they push back on common defaults:

  • No managed inference API. We did not route the workload through a frontier-model API. Cost, throughput control, and the ability to keep filing content inside our own perimeter all favored open-source models on rented hardware. Our benchmarking work showed that smaller open-source models in the 7B to 14B parameter range can match or exceed frontier models on well-defined extraction tasks.
  • No fine-tuning, yet. The first version of the pipeline runs off-the-shelf weights with prompt engineering and schema validation. Fine-tuning is a future optimization, not a prerequisite. Coverage of edge cases in the golden dataset moved accuracy more than model swaps did.
  • No silent ingestion. Every field that enters the production database has provenance: source filing identifier, extraction timestamp, model version, confidence tier. This is a hard requirement for any AI-assisted compliance workflow and aligns with NIST AI RMF’s Map function.
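The provenance requirement in the last bullet maps naturally onto an immutable record type. This is a sketch under assumptions: the class and function names are ours for illustration, and the four provenance attributes mirror the list above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class FieldRecord:
    field_name: str
    value: str
    source_filing_id: str  # which filing the value came from
    extracted_at: str      # ISO 8601 timestamp, UTC
    model_version: str     # weights and prompt version that produced the value
    confidence_tier: str


def make_record(
    field_name: str,
    value: str,
    source_filing_id: str,
    model_version: str,
    confidence_tier: str,
    now: Optional[datetime] = None,
) -> FieldRecord:
    """Build a record, refusing any field that lacks provenance."""
    if not source_filing_id or not model_version or not confidence_tier:
        raise ValueError("provenance is mandatory: no silent ingestion")
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    return FieldRecord(
        field_name, value, source_filing_id, timestamp, model_version, confidence_tier
    )
```

Making the record frozen means provenance cannot be edited after ingestion; a correction becomes a new record with its own timestamp.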

What’s Next in the Mission Log

Extraction gives us structured fields. The next question is what you can do with them: semantic search across filings, entity resolution to connect related records, trend detection across regulatory cycles. Structured data is the foundation. What you build on top of it is where it gets interesting.

How we evaluated which model to use, including the benchmarking methodology, the tradeoffs between speed and accuracy, and what worked and what didn’t, is covered in Benchmarking LLMs for Domain-Specific Extraction. Future Mission Log entries will cover entity resolution across FCC, ITU, NOAA, and FAA records, the eight anti-hallucination rules in production, and the cost curve as we scale beyond 100,000 filings.

Further Reading

If you’re working on regulatory filing extraction, batch LLM inference economics, or any high-variance document workload at scale, we’d like to hear from you.

Frequently Asked Questions

Why rent consumer GPUs instead of using cloud GPU instances?
Reserved cloud GPU instances mean paying for capacity even when the batch isn’t running. On-demand pricing at major hyperscalers for high-VRAM GPUs typically runs several dollars per hour. Consumer GPU marketplaces offer sub-dollar hourly rates for the same hardware class. For ephemeral batch workloads that run for a few hours and then go idle, the marketplace model cuts infrastructure cost by an order of magnitude while keeping the workload containerized and portable.

What is the cost to process 10,000 regulatory filings on a rented GPU?
A single batch of thousands of filings runs on a single-digit dollar budget when using a consumer GPU marketplace at sub-dollar hourly rates. Throughput lands in the hundreds of filings per hour per GPU, so a 10,000-filing batch completes in tens of GPU-hours. Total cost is dominated by GPU rental, not storage or egress, because the entire run is ephemeral and torn down after completion.

What accuracy can you expect from open-source LLM extraction on FCC filings?
Field-level accuracy above 90 percent is achievable on clean, structured filings such as recent FCC space station applications. Accuracy degrades on older filings with inconsistent formatting, on filings that mix narrative and tabular data, and on ITU SNS records that are metadata-only with no document body. Production systems must classify each filing into a confidence tier and route low-confidence extractions to human review rather than treating the model output as ground truth.

How do you prevent hallucination in batch extraction?
We apply three controls at every stage. First, structured output constraints (JSON schema validation) reject malformed responses before they enter the database. Second, confidence scoring at the field level flags low-confidence extractions for review rather than treating model output as ground truth. Third, fabrication-rate testing on a golden dataset measures how often the model invents values for fields that don’t exist in the source document. OWASP’s LLM09 Overreliance and NIST’s AI RMF Measure function both call out this failure mode explicitly.

Is this pipeline reproducible without vendor lock-in?
Yes. The pipeline is containerized: a Docker image with the model weights, an inference server, and the extraction logic. Any GPU host that can run a container can run the pipeline. No managed-service APIs, no proprietary inference endpoints, no reserved instances. Moving to a different marketplace, a different cloud, or local hardware requires changing one environment variable.

Anthony Caracappa

Founder, Viventine Space Systems. Building Orbit Sentinel.