Commit Graph

3 Commits

Author SHA1 Message Date
Peter Foster
73dcf9367b Fix text_id to use date+name instead of PDF filename
Early OFAC years use batch PDFs where one document covers many penalty
cases (e.g. 56 rows sharing the same PDF in 2003). Deriving text_id from
the PDF filename caused all rows sharing a document to overwrite each other
in the DB, reducing 1061 rows to 348.

Fix: text_id = yyyyMMdd_{slugified_name}, which is unique per table row.
Also add an ofac-scrape-only command for fast table-only scraping without PDF downloads.
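
The collision and the fix described above can be illustrated with a minimal Python sketch. It is not the scraper's actual code: the helper names (slugify, make_text_id), the slug rules (lowercase, hyphen-separated), the sample rows, and the PDF filename are all hypothetical, chosen only to show why a key derived from a shared PDF collapses rows while a date+name key keeps them apart.

```python
import re
import unicodedata
from datetime import date


def slugify(name: str) -> str:
    """Hypothetical slug rule: ASCII-fold, lowercase, hyphenate non-alphanumerics."""
    folded = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", folded.lower()).strip("-")


def make_text_id(penalty_date: date, name: str) -> str:
    """Build a text_id of the form yyyyMMdd_{slugified_name}."""
    return f"{penalty_date:%Y%m%d}_{slugify(name)}"


# Made-up rows: two penalty cases that share one batch PDF.
rows = [
    {"date": date(2003, 4, 4), "name": "Example Bank, N.A.", "pdf": "batch_2003_04.pdf"},
    {"date": date(2003, 4, 4), "name": "Sample Trading Co.", "pdf": "batch_2003_04.pdf"},
]

# Keying by PDF filename: the second row overwrites the first.
by_pdf = {row["pdf"]: row for row in rows}

# Keying by date + slugified name: each row keeps its own entry.
by_text_id = {make_text_id(row["date"], row["name"]): row for row in rows}

print(len(by_pdf), len(by_text_id))
```

The sketch assumes no two table rows share both the same date and the same name; if that could happen, a further disambiguator would be needed.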

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 16:14:52 +01:00
Peter Foster
11b1e79348 Fix .gitignore: exclude bin/, obj/, logs/, data/
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 15:29:17 +01:00
Peter Foster
ad7c5d55eb Initial OFAC Civil Penalties scraper
Scrapes https://ofac.treasury.gov/civil-penalties-and-enforcement-information
for all years 2003-present. Downloads PDF documents and exports metadata.json
per CGSH Publication spec (v3) to S3 experimental bucket under ofac/ prefix.

Commands: ofac-full (all years), ofac-daily (current year incremental).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 15:29:00 +01:00