News Scraper
A production-grade paginated news archive scraper with AI-powered semantic keyword filtering, Google Sheets export, and an interactive live dashboard.
01 The Problem
Media monitoring and investigative journalism teams track hundreds of articles published daily across dozens of news archives. Each archive has its own pagination structure, HTML layout, and access patterns. The manual workflow is unsustainable: open each site, scan headlines, copy-paste what's relevant into a spreadsheet, repeat.
Off-the-shelf scraping tools exist, but they share a fundamental flaw: keyword matching is literal. A filter for "climate policy" misses articles titled "EPA announces new emissions rules" because the words don't match. Journalists either accept noisy results (80% irrelevant) or spend hours curating manually. What they need is a scraper that understands meaning, not just strings.
The core challenge: build an automated system that scrapes paginated news archives at scale, filters articles by what they actually mean (not just what they say), and delivers clean, structured data directly into Google Sheets — with no manual cleaning required.
02 Architecture
The system follows a three-stage pipeline: Scrape → Filter → Export. Each stage is an independent module with a clear interface, making the system adaptable to any news archive by simply swapping the site-specific CSS selectors in the configuration.
News Archive (paginated)
│
▼
Stage 1: Scraper Engine
├── Navigates paginated archives via "next page" links
├── Extracts headline, URL, publication date, metadata
├── Configurable CSS selectors per site
└── Rate-limited requests with error isolation
│
▼
Stage 2: AI/NLP Semantic Filter
├── Sentence-transformers (all-MiniLM-L6-v2)
├── Embeds both keywords and articles into 384-dim vectors
├── Cosine similarity matching: catches meaning, not just strings
├── Configurable threshold (0.0 - 1.0)
└── Each matched article gets a category + relevance score
│
▼
Stage 3: Export Layer
├── CSV export (immediate, zero setup)
├── Google Sheets API (service account auth)
└── Optional: FastAPI web dashboard with search/filter UI
The architecture is deliberately site-agnostic. The config.yaml holds all site-specific selectors, so re-targeting a different news archive is a config change, not a code change.
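To make "a config change, not a code change" concrete, here is a minimal sketch of a per-site config object. The three selector names and max_pages come from the pagination section of this write-up; the `SiteConfig` class, the `start_url` and `delay_seconds` fields, and all example values are assumptions rather than the project's actual schema. The plain dict stands in for whatever PyYAML's `yaml.safe_load` would return from config.yaml.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SiteConfig:
    """Per-site settings (hypothetical schema, not the project's real one)."""
    start_url: str
    article_container: str   # CSS selector wrapping one article entry
    title_selector: str      # CSS selector for the headline link inside it
    next_page_selector: str  # CSS selector for the "next page" link
    max_pages: int = 2
    delay_seconds: float = 1.5

    @classmethod
    def from_dict(cls, raw: dict) -> "SiteConfig":
        # Reject unknown keys early so a typo in the config fails loudly.
        known = {f.name for f in fields(cls)}
        unknown = set(raw) - known
        if unknown:
            raise ValueError(f"unknown config keys: {sorted(unknown)}")
        cfg = cls(**raw)
        if cfg.max_pages < 1:
            raise ValueError("max_pages must be at least 1")
        return cfg


# Stand-in for the parsed contents of config.yaml (illustrative values).
raw_config = {
    "start_url": "https://news.example.com/archive",
    "article_container": "div.article",
    "title_selector": "a.headline",
    "next_page_selector": "a.next",
    "max_pages": 5,
}

site = SiteConfig.from_dict(raw_config)
```

Re-targeting another archive would then mean editing only the values in `raw_config`; the scraper code reads the same attribute names either way.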
03 Tech Stack
| Technology | Role |
|---|---|
| Python 3.14 | Core runtime — scraping, filtering, API |
| BeautifulSoup 4 + lxml | HTML parsing — article extraction from archive pages |
| Requests + Session | HTTP client with rate limiting and retry logic |
| sentence-transformers | all-MiniLM-L6-v2 — 384-dim embeddings for semantic keyword matching |
| numpy | Cosine similarity computation between keyword and article vectors |
| FastAPI + Uvicorn | REST API for the interactive dashboard backend |
| gspread + google-auth | Google Sheets API — auto-export filtered results |
| PyYAML | Externalized configuration (target sites, keywords, thresholds) |
| Netlify | Static dashboard hosting — live demo accessible without setup |
⚡ Code Highlight
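The semantic filter engine pre-computes keyword embeddings once and encodes all articles in a single batched call, so the model never runs per article. Each article vector is then scored against every keyword vector with a dot product, which equals cosine similarity when the vectors are normalized.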
import numpy as np


class SemanticFilter:
    def _semantic_filter(self, articles: list) -> list:
        self._load_model()
        # Combine each headline with up to 500 characters of summary.
        texts = []
        for a in articles:
            text = a.headline
            if a.summary:
                text += ". " + a.summary[:500]
            texts.append(text)
        # Encode all articles in one batch; normalized vectors let a dot
        # product act as cosine similarity (assuming keywords are normalized too).
        article_embeddings = self._model.encode(
            texts, normalize_embeddings=True
        )
        results = []
        for i, article in enumerate(articles):
            art_emb = article_embeddings[i]
            # Similarity of this article to every pre-computed keyword embedding.
            similarities = np.dot(self._keyword_embeddings, art_emb)
            best_idx = int(np.argmax(similarities))
            best_score = float(similarities[best_idx])
            if best_score >= self.threshold:
                article.category = self.keywords[best_idx]
                article.relevance_score = round(best_score, 4)
                results.append(article)
        return results
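The per-article loop above can also be written as one matrix product over all articles and keywords at once. The sketch below shows that variant with toy 3-dimensional vectors standing in for the real 384-dimensional embeddings; `match_keywords` and `normalize` are hypothetical helper names, not part of the project's code, and the approach assumes both sets of vectors are L2-normalized.

```python
import numpy as np


def normalize(rows: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products equal cosine similarity."""
    return rows / np.linalg.norm(rows, axis=1, keepdims=True)


def match_keywords(article_emb: np.ndarray, keyword_emb: np.ndarray, threshold: float):
    """Return (article_index, keyword_index, score) for articles at or above threshold."""
    # (n_articles, dim) @ (dim, n_keywords) -> (n_articles, n_keywords)
    sims = article_emb @ keyword_emb.T
    best_idx = sims.argmax(axis=1)                     # best keyword per article
    best_score = sims[np.arange(len(sims)), best_idx]  # similarity to that keyword
    keep = np.flatnonzero(best_score >= threshold)
    return [(int(i), int(best_idx[i]), round(float(best_score[i]), 4)) for i in keep]


# Toy vectors: two keywords, three articles (made-up numbers).
keywords = normalize(np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]))
articles = normalize(np.array([[0.9, 0.1, 0.0], [0.0, 0.0, 1.0], [0.2, 0.8, 0.0]]))

matches = match_keywords(articles, keywords, threshold=0.5)
```

Here the first article lands on keyword 0, the third on keyword 1, and the second is dropped because it is orthogonal to both keywords. For a few dozen articles the loop and the matrix form should perform about the same; the matrix form mainly pays off on larger batches.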
04 Key Challenges
📦 1. Generic Pagination Handling
Every news archive paginates differently. Some use "Next →" links, others use numbered page buttons (1, 2, 3...), and a few use infinite scroll with dynamic URL parameters. The scraper needed to be generic enough to handle all patterns while remaining simple to configure.
The solution is a CSS-selector-driven architecture: each site's config specifies an article_container, title_selector, and next_page_selector as standard CSS selectors. To target a new site, you update these three values in config.yaml — no code changes. The scraper follows the "next page" selector recursively until the config's max_pages limit is reached or no more pages exist.
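The selector-driven crawl described above can be sketched roughly as follows. This is an illustration under assumptions, not the project's scraper: `parse_page`, `crawl`, and the injected `fetch` callable are hypothetical names, the loop is iterative rather than recursive, and the demo pages are in-memory HTML strings so the example runs without network access. It uses Beautiful Soup's `select` and `select_one` with the built-in `html.parser`.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def parse_page(html: str, page_url: str, selectors: dict):
    """Extract (headline, url) pairs and the next-page URL from one archive page."""
    soup = BeautifulSoup(html, "html.parser")
    articles = []
    for container in soup.select(selectors["article_container"]):
        link = container.select_one(selectors["title_selector"])
        if link is None or not link.get("href"):
            continue  # skip a malformed entry instead of failing the whole page
        articles.append((link.get_text(strip=True), urljoin(page_url, link["href"])))
    next_link = soup.select_one(selectors["next_page_selector"])
    has_next = next_link is not None and next_link.get("href")
    next_url = urljoin(page_url, next_link["href"]) if has_next else None
    return articles, next_url


def crawl(start_url: str, selectors: dict, fetch, max_pages: int = 2):
    """Follow next-page links until max_pages is reached or no link remains."""
    results, url, seen = [], start_url, set()
    for _ in range(max_pages):
        if url is None or url in seen:  # also guards against pagination loops
            break
        seen.add(url)
        articles, url = parse_page(fetch(url), url, selectors)
        results.extend(articles)
    return results


# In-memory stand-ins for two archive pages (illustrative markup).
PAGES = {
    "https://news.example.com/archive": (
        '<div class="article"><a class="headline" href="/a/1">First</a></div>'
        '<div class="article"><span>no link here</span></div>'
        '<a class="next" href="/archive?page=2">Next</a>'
    ),
    "https://news.example.com/archive?page=2": (
        '<div class="article"><a class="headline" href="/a/2">Second</a></div>'
    ),
}
SELECTORS = {
    "article_container": "div.article",
    "title_selector": "a.headline",
    "next_page_selector": "a.next",
}

found = crawl("https://news.example.com/archive", SELECTORS, PAGES.__getitem__, max_pages=5)
```

In a real run, `fetch` would wrap an HTTP client call instead of a dictionary lookup; everything else stays driven by the three selectors.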
The selector-based approach was chosen over a regex or URL-pattern heuristic because news sites are structurally inconsistent. A selector-based config means the system works on any site with a predictable HTML structure — which includes essentially all major news archives. Per-article error isolation ensures one broken link doesn't stop the entire crawl.
⚡ 2. Semantic Threshold Tuning
Setting the cosine similarity threshold is a precision-recall tradeoff. A high threshold (0.45) catches only very close matches but misses conceptually related articles. A low threshold (0.10) catches everything but reintroduces noise. The optimal value depends on the keyword breadth and article corpus size.
The demo uses a threshold of 0.20 with 8 diverse keywords against 60 HN articles. This produced 34 matches, enough to demonstrate the concept while still filtering out the remaining 26 articles (43%) as noise. The web dashboard lets users adjust the threshold in real time and immediately see the impact on results.
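One way to explore this tradeoff offline is to sweep a few thresholds over each article's best similarity score and see how many survive. The sketch below is illustrative only: `sweep_thresholds` is a hypothetical helper and the ten scores are made-up values, not the demo's data. With the demo's own numbers, the same arithmetic gives 1 - 34/60 ≈ 0.43 filtered out.

```python
def sweep_thresholds(best_scores, thresholds):
    """Map each threshold to (articles kept, share of articles filtered out)."""
    total = len(best_scores)
    if total == 0:
        return {}
    report = {}
    for t in thresholds:
        kept = sum(score >= t for score in best_scores)
        report[t] = (kept, round(1 - kept / total, 2))
    return report


# Made-up best-keyword scores for ten articles, for illustration only.
scores = [0.45, 0.34, 0.32, 0.25, 0.21, 0.18, 0.15, 0.12, 0.08, 0.05]

report = sweep_thresholds(scores, [0.10, 0.20, 0.45])
for threshold, (kept, filtered) in report.items():
    print(f"threshold={threshold:.2f} kept={kept} filtered_out={filtered:.0%}")
```

Raising the threshold monotonically shrinks the kept set, so a sweep like this shows where the match count starts dropping faster than the noise does.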
📈 3. Rate Limiting Without Hard Blocking
Production news archives actively block aggressive scrapers. HN starts returning HTTP 429 (Too Many Requests) once traffic exceeds roughly 1 request per second. The system uses configurable delays between page requests and sends browser-like User-Agent headers so it looks like a standard visitor. Error handling is per-request: a single page timeout doesn't abort the entire crawl; the scraper logs the failure and continues with the next page.
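A minimal sketch of the delay-plus-isolation idea is shown below. `PoliteFetcher` is a hypothetical wrapper, not the project's class: it enforces a minimum interval between requests and records failures instead of raising. The clock and sleep functions are injected so the demo can run instantly with a fake clock; the 1.5 second interval is an assumed value.

```python
import time


class PoliteFetcher:
    """Wrap a fetch callable with a minimum delay and per-request error isolation."""

    def __init__(self, fetch, min_interval=1.5, clock=time.monotonic, sleep=time.sleep):
        self._fetch = fetch
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_request = None
        self.failures = []  # (url, error message) for pages that were skipped

    def get(self, url):
        # Wait until min_interval seconds have passed since the previous request.
        if self._last_request is not None:
            wait = self._min_interval - (self._clock() - self._last_request)
            if wait > 0:
                self._sleep(wait)
        self._last_request = self._clock()
        try:
            return self._fetch(url)
        except Exception as exc:  # one bad page must not abort the crawl
            self.failures.append((url, str(exc)))
            return None


# Demo with a fake clock so no real waiting happens.
now = [0.0]
slept = []


def fake_sleep(seconds):
    slept.append(seconds)
    now[0] += seconds


def flaky_fetch(url):
    if url.endswith("/2"):
        raise TimeoutError("page timed out")
    return f"<html>{url}</html>"


fetcher = PoliteFetcher(flaky_fetch, min_interval=1.5, clock=lambda: now[0], sleep=fake_sleep)
pages = [fetcher.get(f"https://news.example.com/page/{n}") for n in (1, 2, 3)]
```

The second page fails and is logged in `fetcher.failures`, while the first and third still come back. In practice the `fetch` argument could wrap a `requests.Session` call that sets a User-Agent header and a timeout.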
🌐 4. Client-Facing Demo Without Backend
The full system requires a Python backend to run. For the client demo, I needed something they could click immediately without installing anything. The solution: a static dashboard with pre-computed results deployed on Netlify. The full FastAPI backend (for live scraping) is available in the repo for self-hosting, but the static demo gives instant visual proof.
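Pre-computing results for a static page can be as simple as serializing the filtered articles to JSON that the dashboard loads at startup. The sketch below assumes that approach; the `Article` dataclass, the `to_dashboard_json` helper, and the placeholder URLs are illustrative, with the headline, category, and relevance_score fields mirroring the attributes used in the code highlight.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Article:
    headline: str
    url: str            # placeholder URLs below, not real article links
    category: str
    relevance_score: float


def to_dashboard_json(articles):
    """Serialize filtered articles, highest relevance first, for a static page."""
    ordered = sorted(articles, key=lambda a: a.relevance_score, reverse=True)
    payload = {"count": len(ordered), "articles": [asdict(a) for a in ordered]}
    return json.dumps(payload, indent=2)


filtered = [
    Article(
        "OpenCV 5 Is Here: The Biggest Leap in Years for Computer Vision",
        "https://news.example.com/item/1",
        "open source",
        0.25,
    ),
    Article(
        "Are you expected to run five Python type-checkers now?",
        "https://news.example.com/item/2",
        "python",
        0.45,
    ),
]

dashboard_json = to_dashboard_json(filtered)
```

Writing `dashboard_json` to a file next to the dashboard's HTML is enough for a static host to serve it; no Python process has to run at request time.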
05 Results
The demo run against Hacker News (2 pages, 8 keywords, 0.20 threshold) matched 34 of the 60 scraped articles.
Key semantic matching examples that demonstrate the AI filtering in action:
- "Apple reveals new AI architecture built around Google Gemini models" → artificial intelligence (0.34): the headline only uses the abbreviation "AI", never the keyword phrase "artificial intelligence", but the embedding catches the meaning
- "OpenCV 5 Is Here: The Biggest Leap in Years for Computer Vision" → open source (0.25): the title never says "open source", but the model associates OpenCV with it
- "Are you expected to run five Python type-checkers now?" → python (0.45): strong direct match, high confidence
- "Surveillance is not safety: A statement on the UK's latest threat to privacy" → security (0.45): privacy/security semantic cluster
- "xAI is looking more like a datacentre REIT than a frontier lab" → data (0.32): "datacentre" maps to data infrastructure
The full interactive dashboard is live at cozy-clafoutis-6fdea8.netlify.app — searchable, filterable by category, with CSV export.
06 What I Learned
Semantic filtering catches relevant news that literal keyword matching misses. News articles deliberately use varied vocabulary to avoid repetition: a single story about AI might use "machine learning," "neural networks," "deep learning," and "artificial intelligence" across different paragraphs. Semantic embeddings collapse these into the same vector neighborhood, so you catch the story regardless of which synonym the writer chose. A keyword approach would require an ever-growing list of synonyms; embeddings handle it implicitly.
A working demo is worth a thousand words in a proposal. When clients evaluate freelancers on Upwork, they read 15-20 proposals. Most are text promises. Sending a live URL where they can interact with your work before hiring you changes the dynamic from "can this person do it?" to "when can they start?" The 2-hour investment in building the static dashboard and deploying to Netlify returned more credibility than any proposal paragraph could.
Design matters even for backend tools. The web dashboard wasn't strictly required by the project scope, but presenting the output in a dark-themed, searchable, filterable interface immediately signals production quality. It tells the client: "I don't just write code that works — I build systems that people actually want to use." For a data extraction tool that will be used daily by journalists, the UX of the output matters as much as the accuracy of the scrape.