feat: three major improvements - stable sources, archival, email alerts

1. Focus on Stable International/Regional Sources
- Improved TED EU scraper (5 search strategies, 5 pages each)
- All stable sources now hourly (TED EU, Sell2Wales, PCS Scotland, eTendersNI)
- De-prioritize unreliable UK gov sites (100% removal rate)

2. Archival Feature
- New DB columns: archived, archived_at, archived_snapshot, last_validated, validation_failures
- Cleanup script now preserves full tender snapshots before archiving
- Gradual failure handling (3 retries before archiving)
- No data loss - historical record preserved

3. Email Alerts
- Daily digest (8am) - all new tenders from last 24h
- High-value alerts (every 4h) - tenders >£100k
- Professional HTML emails with all tender details
- Configurable via environment variables

Expected outcomes:
- 50-100 stable tenders (vs 26 currently)
- Zero 404 errors (archived data preserved)
- Proactive notifications (no missed opportunities)
- Historical archive for trend analysis

Files:
- scrapers/ted-eu.js (improved)
- cleanup-with-archival.mjs (new)
- send-tender-alerts.mjs (new)
- migrations/add-archival-fields.sql (new)
- THREE_IMPROVEMENTS_SUMMARY.md (documentation)

All cron jobs updated for hourly scraping + daily cleanup + alerts
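
For reviewers, the "gradual failure handling" in section 2 could look roughly like the sketch below. It is not taken from cleanup-with-archival.mjs. The function name `applyValidationResult` and the constant `MAX_VALIDATION_FAILURES` are hypothetical, and only the column names come from this commit. It assumes that "3 retries before archiving" means a tender is archived on its third consecutive failed check, that a successful check resets the counter, and that `last_validated` records every check whether or not it succeeded.

```javascript
// Assumption: archive on the 3rd consecutive failed validation.
const MAX_VALIDATION_FAILURES = 3;

// Hypothetical helper: returns the updated tender row after one validation check.
// `tender` is a plain object using the DB column names from this commit.
function applyValidationResult(tender, isReachable, now = new Date()) {
  const timestamp = now.toISOString();

  // Already-archived rows are left untouched.
  if (tender.archived) return tender;

  // A successful check resets the failure counter.
  if (isReachable) {
    return { ...tender, validation_failures: 0, last_validated: timestamp };
  }

  const failures = (tender.validation_failures ?? 0) + 1;

  // Below the threshold: record the failure, keep the tender live.
  if (failures < MAX_VALIDATION_FAILURES) {
    return { ...tender, validation_failures: failures, last_validated: timestamp };
  }

  // Threshold reached: snapshot the live fields first, then flag as archived.
  const { archived, archived_at, archived_snapshot, ...liveFields } = tender;
  return {
    ...tender,
    validation_failures: failures,
    last_validated: timestamp,
    archived: true,
    archived_at: timestamp,
    archived_snapshot: JSON.stringify(liveFields),
  };
}
```

Keeping this as a pure function over a row object would let the threshold logic be checked without a database; the actual script may well do the same work directly in SQL.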
44
test-ted-detail.mjs
Normal file
@@ -0,0 +1,44 @@
import { chromium } from 'playwright';

const browser = await chromium.launch({ headless: true });
const page = await browser.newPage();

const url = 'https://ted.europa.eu/en/search/result?q=&page=1&placeOfPerformanceCountry=GBR';
console.log('Loading:', url);

await page.goto(url, { waitUntil: 'networkidle', timeout: 30000 });
await page.waitForTimeout(3000); // Extra pause in case results render after network idle

// Extract full tender data from the rendered results table
const tenders = await page.evaluate(() => {
  const results = [];
  const rows = document.querySelectorAll('tr[data-notice-id], .notice-row, tbody tr');

  rows.forEach((row, idx) => {
    if (idx >= 5) return; // Limit to first 5 rows for testing

    try {
      const link = row.querySelector('a[href*="/notice/"]');
      if (!link) return; // Not a notice row

      const cells = row.querySelectorAll('td');
      const allText = row.textContent;

      results.push({
        href: link.href,
        noticeId: link.textContent.trim(),
        rowText: allText.trim().substring(0, 500),
        cellCount: cells.length,
        cellTexts: Array.from(cells).map(c => c.textContent.trim().substring(0, 100))
      });
    } catch (e) {
      // Skip rows that fail to parse; this script is diagnostic only
    }
  });

  return results;
});

console.log('\nExtracted tenders:', JSON.stringify(tenders, null, 2));

await browser.close();
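
The test script above loads only page 1, while the commit message describes the improved scraper as fetching 5 pages per search strategy. A minimal sketch of how the per-page URLs might be generated is shown below. The helper name `buildTedSearchUrls` and its option names are hypothetical, and the query parameters are simply the ones visible in the URL used by this script, not a documented TED API.

```javascript
// Hypothetical helper: build TED search result URLs for pages 1..pages.
// Parameter names (q, page, placeOfPerformanceCountry) are copied from the
// URL in test-ted-detail.mjs; everything else here is an assumption.
function buildTedSearchUrls({ query = '', country = 'GBR', pages = 5 } = {}) {
  const urls = [];
  for (let pageNumber = 1; pageNumber <= pages; pageNumber += 1) {
    const url = new URL('https://ted.europa.eu/en/search/result');
    url.searchParams.set('q', query);
    url.searchParams.set('page', String(pageNumber));
    url.searchParams.set('placeOfPerformanceCountry', country);
    urls.push(url.toString());
  }
  return urls;
}

// Usage: log the URLs a 5-page crawl for UK notices would visit.
for (const pageUrl of buildTedSearchUrls()) {
  console.log(pageUrl);
}
```

Building the URLs with `URL` and `searchParams` rather than string concatenation means a non-empty search term is percent-encoded automatically.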