cgsh-ofac

3 Commits 1 Branch 0 Tags

Author	SHA1	Message	Date
Peter Foster	73dcf9367b	Fix text_id to use date+name instead of PDF filename. Early OFAC years use batch PDFs where one document covers many penalty cases (e.g. 56 rows sharing the same PDF in 2003). Deriving text_id from the PDF filename caused all rows sharing a document to overwrite each other in the DB, reducing 1061 rows to 348. Fix: text_id = yyyyMMdd_{slugified_name}, which is unique per table row. Also add ofac-scrape-only command for fast table-only scraping without PDF downloads. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 16:14:52 +01:00
Peter Foster	11b1e79348	Fix .gitignore: exclude bin/, obj/, logs/, data/. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 15:29:17 +01:00
Peter Foster	ad7c5d55eb	Initial OFAC Civil Penalties scraper. Scrapes https://ofac.treasury.gov/civil-penalties-and-enforcement-information for all years 2003-present. Downloads PDF documents and exports metadata.json per CGSH Publication spec (v3) to S3 experimental bucket under ofac/ prefix. Commands: ofac-full (all years), ofac-daily (current year incremental). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 15:29:00 +01:00
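
The text_id change described in commit 73dcf9367b can be illustrated with a minimal Python sketch. This is not the repository's code: the slug rules (lowercase, ASCII-only, hyphen separators), the function names, and the sample rows are all assumptions made for illustration. Only the yyyyMMdd_{slugified_name} shape and the batch-PDF collision come from the commit message.

```python
import re
import unicodedata
from datetime import date


def slugify(name: str) -> str:
    """Lowercase, drop accents, and collapse non-alphanumeric runs to '-'.

    The exact slug rules are an assumption; the commit only says "slugified".
    """
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")


def make_text_id(penalty_date: date, name: str) -> str:
    """Build a text_id of the form yyyyMMdd_{slugified_name}."""
    return f"{penalty_date:%Y%m%d}_{slugify(name)}"


# Hypothetical rows: two penalty cases published in the same batch PDF.
rows = [
    {"date": date(2003, 4, 1), "name": "Example Bank, N.A.", "pdf": "batch_2003.pdf"},
    {"date": date(2003, 4, 1), "name": "Sample Airlines Inc.", "pdf": "batch_2003.pdf"},
]

# Keying by PDF filename: the second row overwrites the first.
by_pdf = {row["pdf"]: row for row in rows}

# Keying by date + slugified name: each table row keeps its own key.
by_text_id = {make_text_id(row["date"], row["name"]): row for row in rows}

print(len(by_pdf), len(by_text_id))
```

Under these assumed slug rules, two rows with the same date and the same name would still share a key, so the uniqueness the commit reports depends on the actual table data.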
