# TenderRadar Scrapers

This directory contains scrapers for UK public procurement tender sources.

## Scrapers

### 1. Contracts Finder (`contracts-finder.js`)

- **Source**: https://www.contractsfinder.service.gov.uk
- **Coverage**: England and non-devolved territories
- **Method**: JSON API
- **Frequency**: Every 4 hours (00:00, 04:00, 08:00, 12:00, 16:00, 20:00)
- **Data Range**: Last 30 days
- **Status**: ✅ Working

### 2. Find a Tender (`find-tender.js`)

- **Source**: https://www.find-tender.service.gov.uk
- **Coverage**: UK-wide above-threshold procurement notices
- **Method**: HTML scraping with pagination (5 pages)
- **Frequency**: Every 4 hours (00:10, 04:10, 08:10, 12:10, 16:10, 20:10)
- **Status**: ✅ Working

### 3. Public Contracts Scotland (`pcs-scotland.js`)

- **Source**: https://www.publiccontractsscotland.gov.uk
- **Coverage**: Scottish public sector tenders
- **Method**: HTML scraping
- **Frequency**: Every 4 hours (00:20, 04:20, 08:20, 12:20, 16:20, 20:20)
- **Status**: ✅ Working

### 4. Sell2Wales (`sell2wales.js`)

- **Source**: https://www.sell2wales.gov.wales
- **Coverage**: Welsh public sector tenders
- **Method**: HTML scraping
- **Frequency**: Every 4 hours (00:30, 04:30, 08:30, 12:30, 16:30, 20:30)
- **Status**: ✅ Working

## Database Schema

All scrapers insert into the `tenders` table with the following key fields:

- `source`: Identifier for the data source (`contracts_finder`, `find_tender`, `pcs_scotland`, `sell2wales`)
- `source_id`: Unique identifier from the source (used for deduplication via a UNIQUE constraint)
- `title`: Tender title
- `description`: Full description
- `summary`: Shortened description
- `authority_name`: Publishing authority
- `location`: Geographic location
- `published_date`: When the tender was published
- `deadline`: Application deadline
- `notice_url`: Link to the full notice
- `status`: `open`/`closed`, derived from the deadline

## Running Scrapers

### Individual Scraper

```bash
cd /home/peter/tenderpilot
node scrapers/contracts-finder.js
node scrapers/find-tender.js
node scrapers/pcs-scotland.js
node scrapers/sell2wales.js
```

### All Scrapers

```bash
cd /home/peter/tenderpilot
./run-all-scrapers.sh
```

## Cron Schedule

The scrapers run automatically every 4 hours, staggered by 10 minutes:

```cron
0 */4 * * * cd /home/peter/tenderpilot && node scrapers/contracts-finder.js >> /home/peter/tenderpilot/scraper.log 2>&1
10 */4 * * * cd /home/peter/tenderpilot && node scrapers/find-tender.js >> /home/peter/tenderpilot/scraper.log 2>&1
20 */4 * * * cd /home/peter/tenderpilot && node scrapers/pcs-scotland.js >> /home/peter/tenderpilot/scraper.log 2>&1
30 */4 * * * cd /home/peter/tenderpilot && node scrapers/sell2wales.js >> /home/peter/tenderpilot/scraper.log 2>&1
```

## Monitoring

Check logs:

```bash
tail -f /home/peter/tenderpilot/scraper.log
```

Check database:

```bash
PGPASSWORD=tenderpilot123 psql -h localhost -U tenderpilot -d tenderpilot -c "SELECT source, COUNT(*) FROM tenders GROUP BY source;"
```

## Rate Limiting & Ethical Scraping

All scrapers implement:

- Proper User-Agent headers identifying the service
- Rate limiting (2-5 second delays between requests)
- Pagination limits where applicable
- Respectful request patterns

## Dependencies

- `axios`: HTTP client
- `cheerio`: HTML parsing (for the HTML scrapers)
- `pg`: PostgreSQL client
- `dotenv`: Environment variables

## Maintenance

- Scrapers use `ON CONFLICT (source_id) DO NOTHING` to avoid duplicates
- If existing records need refreshing, the conflict handling can be changed to update rather than skip
- Monitor for HTML structure changes on the scraped sites
- The JSON API (Contracts Finder) is more stable than HTML scraping

## Last Updated

2026-02-14 - Initial deployment with all four scrapers
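## Example: Deduplicated Insert

The schema and maintenance notes above describe deduplication via a UNIQUE constraint on `source_id` and `ON CONFLICT (source_id) DO NOTHING`. A minimal sketch of what each scraper's insert might look like follows; `buildTenderInsert` is an illustrative helper, not the actual scraper code:

```javascript
// Build a parameterised INSERT for the `tenders` table that silently
// skips rows already present (deduplicated on the UNIQUE `source_id`
// column). Illustrative sketch only — not the actual scraper code.
function buildTenderInsert(tender) {
  const columns = [
    'source', 'source_id', 'title', 'description', 'summary',
    'authority_name', 'location', 'published_date', 'deadline',
    'notice_url', 'status',
  ];
  // $1, $2, ... placeholders, one per column.
  const placeholders = columns.map((_, i) => `$${i + 1}`).join(', ');
  return {
    text:
      `INSERT INTO tenders (${columns.join(', ')}) ` +
      `VALUES (${placeholders}) ` +
      `ON CONFLICT (source_id) DO NOTHING`,
    values: columns.map((c) => tender[c] ?? null),
  };
}
```

With the `pg` client, the returned object can be passed directly to `pool.query(buildTenderInsert(tender))`, since `pg` accepts a `{ text, values }` query config.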
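## Example: Rate Limiting

The rate-limiting behaviour described above (identifying User-Agent plus a 2-5 second pause between requests) can be sketched as a small helper. The User-Agent string and function names here are placeholders, not the values the scrapers actually use:

```javascript
// Illustrative polite-scraping helpers: an identifying User-Agent and a
// randomised 2-5 second pause between requests. Placeholder values only.
const REQUEST_HEADERS = {
  'User-Agent': 'TenderRadarBot/1.0 (+https://example.invalid/tenderradar)',
};

// Pick a delay in milliseconds in [minMs, maxMs).
function randomDelayMs(minMs = 2000, maxMs = 5000) {
  return minMs + Math.floor(Math.random() * (maxMs - minMs));
}

// Resolve after `ms` milliseconds; await between page fetches.
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Usage between axios requests:
//   await axios.get(url, { headers: REQUEST_HEADERS });
//   await sleep(randomDelayMs());
```

Randomising the delay (rather than a fixed interval) spreads requests out slightly; the important part is that every fetch waits at least the minimum delay.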