# TenderRadar Scrapers
This directory contains scrapers for UK public procurement tender sources.
## Scrapers
### 1. Contracts Finder (`contracts-finder.js`)
- **Source**: https://www.contractsfinder.service.gov.uk
- **Coverage**: England and non-devolved territories
- **Method**: JSON API
- **Frequency**: Every 4 hours (00:00, 04:00, 08:00, 12:00, 16:00, 20:00)
- **Data Range**: Last 30 days
- **Status**: ✅ Working
### 2. Find a Tender (`find-tender.js`)
- **Source**: https://www.find-tender.service.gov.uk
- **Coverage**: UK-wide above-threshold procurement notices
- **Method**: HTML scraping with pagination (5 pages)
- **Frequency**: Every 4 hours (00:10, 04:10, 08:10, 12:10, 16:10, 20:10)
- **Status**: ✅ Working
### 3. Public Contracts Scotland (`pcs-scotland.js`)
- **Source**: https://www.publiccontractsscotland.gov.uk
- **Coverage**: Scottish public sector tenders
- **Method**: HTML scraping
- **Frequency**: Every 4 hours (00:20, 04:20, 08:20, 12:20, 16:20, 20:20)
- **Status**: ✅ Working
### 4. Sell2Wales (`sell2wales.js`)
- **Source**: https://www.sell2wales.gov.wales
- **Coverage**: Welsh public sector tenders
- **Method**: HTML scraping
- **Frequency**: Every 4 hours (00:30, 04:30, 08:30, 12:30, 16:30, 20:30)
- **Status**: ✅ Working
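The HTML scrapers that paginate (e.g. Find a Tender) cap each run at a fixed number of result pages. A minimal sketch of that pagination loop, assuming a hypothetical `?page=N` query parameter (the real site's URL structure may differ):

```javascript
// Build the list of search-result URLs to fetch, one per page.
// The base URL path and `page` query parameter are assumptions for
// illustration; check the actual site for the real format.
function buildPageUrls(baseUrl, maxPages) {
  const urls = [];
  for (let page = 1; page <= maxPages; page++) {
    urls.push(`${baseUrl}?page=${page}`);
  }
  return urls;
}

// Find a Tender is limited to 5 pages per run:
const urls = buildPageUrls('https://www.find-tender.service.gov.uk/Search/Results', 5);
```

Keeping the URL list separate from the fetch loop makes the page cap easy to adjust if a source's result volume changes.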
## Database Schema
All scrapers insert into the `tenders` table with the following key fields:
- `source`: Identifier for the data source (`contracts_finder`, `find_tender`, `pcs_scotland`, `sell2wales`)
- `source_id`: Unique identifier from the source (used for deduplication via UNIQUE constraint)
- `title`: Tender title
- `description`: Full description
- `summary`: Shortened description
- `authority_name`: Publishing authority
- `location`: Geographic location
- `published_date`: When the tender was published
- `deadline`: Application deadline
- `notice_url`: Link to full notice
- `status`: `open` or `closed`, derived from the deadline
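Two of these fields are derived rather than scraped: `summary` is a shortened form of `description`, and `status` follows from the deadline. A sketch of how they might be computed (the truncation length and the handling of missing deadlines are assumptions, not the scrapers' exact logic):

```javascript
// Shorten a full description into the `summary` field.
// The 200-character cap is illustrative; the real scrapers may differ.
function summarize(description, maxLen = 200) {
  if (!description) return '';
  return description.length <= maxLen
    ? description
    : description.slice(0, maxLen - 1).trimEnd() + '…';
}

// Derive `status` from the application deadline.
function tenderStatus(deadline, now = new Date()) {
  if (!deadline) return 'open'; // assumption: no deadline means still open
  return new Date(deadline) > now ? 'open' : 'closed';
}
```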
## Running Scrapers
### Individual Scraper
```bash
cd /home/peter/tenderpilot
node scrapers/contracts-finder.js
node scrapers/find-tender.js
node scrapers/pcs-scotland.js
node scrapers/sell2wales.js
```
### All Scrapers
```bash
cd /home/peter/tenderpilot
./run-all-scrapers.sh
```
## Cron Schedule
The scrapers run automatically every 4 hours, staggered by 10 minutes:
```cron
0 */4 * * * cd /home/peter/tenderpilot && node scrapers/contracts-finder.js >> /home/peter/tenderpilot/scraper.log 2>&1
10 */4 * * * cd /home/peter/tenderpilot && node scrapers/find-tender.js >> /home/peter/tenderpilot/scraper.log 2>&1
20 */4 * * * cd /home/peter/tenderpilot && node scrapers/pcs-scotland.js >> /home/peter/tenderpilot/scraper.log 2>&1
30 */4 * * * cd /home/peter/tenderpilot && node scrapers/sell2wales.js >> /home/peter/tenderpilot/scraper.log 2>&1
```
## Monitoring
Check logs:
```bash
tail -f /home/peter/tenderpilot/scraper.log
```
Check database:
```bash
PGPASSWORD=tenderpilot123 psql -h localhost -U tenderpilot -d tenderpilot -c "SELECT source, COUNT(*) FROM tenders GROUP BY source;"
```
## Rate Limiting & Ethical Scraping
All scrapers implement:
- Proper User-Agent headers identifying the service
- Rate limiting (2-5 second delays between requests)
- Pagination limits where applicable
- Respectful request patterns
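The 2–5 second delay can be sketched as a promise-based pause between requests; the helpers below are illustrative, not the scrapers' actual code:

```javascript
// Pick a random delay in [minMs, maxMs] to avoid a fixed request cadence.
function randomDelayMs(minMs = 2000, maxMs = 5000) {
  return minMs + Math.floor(Math.random() * (maxMs - minMs + 1));
}

// Promise wrapper around setTimeout so the delay can be awaited.
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Usage inside a scrape loop:
// for (const url of urls) {
//   const html = await fetchPage(url); // hypothetical fetch helper
//   await sleep(randomDelayMs());      // be polite between requests
// }
```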
## Dependencies
- axios: HTTP client
- cheerio: HTML parsing (for web scrapers)
- pg: PostgreSQL client
- dotenv: Environment variables
## Maintenance
- Scrapers use `ON CONFLICT (source_id) DO NOTHING` to avoid duplicates
- Switch to `ON CONFLICT (source_id) DO UPDATE` if existing records should be refreshed on re-scrape
- Monitor for HTML structure changes on scraped sites
- API endpoints (Contracts Finder) are more stable than HTML scraping
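The deduplicating insert can be sketched as a parameterised query whose column list mirrors the schema section; the `pg` call is shown commented out since it needs a live database:

```javascript
// Deduplicating INSERT shared by all scrapers. Relies on the UNIQUE
// constraint on source_id described in the schema section.
const INSERT_TENDER = `
  INSERT INTO tenders
    (source, source_id, title, description, summary,
     authority_name, location, published_date, deadline, notice_url, status)
  VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11)
  ON CONFLICT (source_id) DO NOTHING
`;

// With a pg Pool (assumed to be configured via dotenv):
// await pool.query(INSERT_TENDER, [
//   'contracts_finder', tender.id, tender.title, /* ...remaining fields */
// ]);
```

Because conflicts are silently skipped, re-running a scraper over the same 30-day window is safe and idempotent.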
## Last Updated
2026-02-14 - Initial deployment with all four scrapers