What we walked into.
Export inspection certificates under the U.S. Grain Standards Act arrive as unstructured PDFs with varying layouts, inconsistent unit formats ("LBS/BU" vs "lb/bu"), and moisture-basis details buried in the text. The client needed them as structured data, processed in batch without stalling on API rate limits.
The solution.
A Python pipeline: PyPDF2 extracts the text from each PDF, and GPT-4o parses that text into structured fields.
A structured prompt with strict extraction rules: identification numbers, dates, 50+ canonical quality parameters, normalized units, and moisture basis.
Retry logic with exponential backoff for rate limits; whole folders processed into a single CSV.
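A minimal sketch of the retry step, to show what exponential backoff looks like in practice. This is not the project's actual code: RateLimitError, call_with_backoff, and the delay values are hypothetical stand-ins, and a real pipeline would catch whatever rate-limit exception its API client raises.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the rate-limit exception an API client raises."""


def call_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call fn(); on RateLimitError wait base_delay * 2**attempt, then retry.

    The delay doubles on each failed attempt, is capped at max_delay, and gets
    up to 10% random jitter so parallel workers do not retry in lockstep.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
```

Each certificate's parsing call would be wrapped in call_with_backoff, so a rate-limit response delays that one file instead of aborting the whole folder.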
What changed.
Unstructured certificates become structured CSV automatically, across soybean, grain, oil-content, and protein analysis certificate types.
50+ quality parameter names standardized to one canonical format for downstream analysis.
Edge cases handled, including multiple moisture-basis readings for the same parameter.
Manual data entry reduced from hours to minutes per batch, with output ready for database import.
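The normalization and moisture-basis handling above can be sketched roughly as follows. The alias tables, field names, and sample readings are illustrative assumptions, not the project's real mapping of 50+ parameters; the point is that each (parameter, moisture basis) pair becomes its own CSV row, so two readings of the same parameter never overwrite each other.

```python
import csv
import io

# Illustrative alias tables (assumed); the real mapping covers 50+ parameters.
UNIT_ALIASES = {"lbs/bu": "lb/bu", "lb/bu": "lb/bu", "pct": "%", "%": "%"}
PARAM_ALIASES = {
    "test weight": "test_weight",
    "tw": "test_weight",
    "protein": "protein",
    "oil": "oil_content",
    "oil content": "oil_content",
}
FIELDS = ["certificate_id", "parameter", "value", "unit", "moisture_basis"]


def normalize_unit(raw):
    """Lowercase, strip spaces, then map known variants to one spelling."""
    key = raw.strip().lower().replace(" ", "")
    return UNIT_ALIASES.get(key, key)


def normalize_param(raw):
    """Map a parameter label to its canonical snake_case name."""
    key = " ".join(raw.strip().lower().split())
    return PARAM_ALIASES.get(key, key.replace(" ", "_"))


def to_rows(cert_id, readings):
    """One row per reading; the moisture basis stays a separate column."""
    return [
        {
            "certificate_id": cert_id,
            "parameter": normalize_param(r["name"]),
            "value": r["value"],
            "unit": normalize_unit(r["unit"]),
            "moisture_basis": r.get("moisture_basis") or "",
        }
        for r in readings
    ]


def rows_to_csv(rows):
    """Serialize rows to CSV text with a fixed column order."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

For example, a certificate that reports protein on both a 13% moisture basis and a dry basis yields two rows with the same canonical parameter name and different moisture_basis values, which keeps the output flat enough for direct database import.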