I just shipped something I’ve been quietly building for months.

A RAG pipeline sitting on top of 1.1 million news articles — live, in production, for one of Bangladesh’s most recognized news publishers.

News is not a normal retrieval problem.

Most RAG systems are built for documents that don’t expire. News expires every hour. This is what building for that actually looks like.


The core problem with standard semantic search on news

In standard RAG, you embed your documents, store them in a vector DB, run cosine similarity at query time, and call it a day. That works fine for documentation or knowledge bases. It completely breaks for news.

Imagine someone searches: “Dhaka flood 2024”

A pure vector search might surface a beautifully written analysis piece about climate change from 18 months ago. Semantically? Very relevant. Practically? Useless. The reporter needed last Tuesday’s breaking coverage.

Recency is not a nice-to-have in news. It is the product.


Chunking strategy — the decision most teams get wrong

How you chunk your documents before embedding determines retrieval precision more than almost any other variable. It’s also the decision most teams make in five minutes and never revisit.

The naive approach is fixed-size chunking — split every article at 512 tokens, overlap by 50–100 tokens, embed each chunk. It’s easy to implement. It fails on news for a specific structural reason.

News articles have a defined internal structure. The lede — the first one to three sentences — contains the who, what, when, where, and why. The body contains context, background, and development. The tail contains quotes, secondary sources, and related information. Fixed-size chunking destroys this structure constantly, splitting a single fact across two chunks that will never be retrieved together.

The chunking approach I landed on after testing:

1. Semantic chunking over fixed-size chunking. Split at semantic boundaries — paragraph breaks, topic shifts detected via sentence embedding similarity drops — rather than token counts. This keeps related information together and significantly improves the coherence of retrieved chunks.

2. Universal header prefix. Every chunk gets the article headline and publication timestamp prepended before embedding. This bakes temporal and topical context into the vector representation itself, not just into metadata. Two chunks on the same topic from different dates embed differently because the header is part of the text. This improves recency-aware retrieval at the vector level before any scoring logic runs.

3. First-paragraph priority chunk. The opening paragraph of every article is always its own dedicated chunk, regardless of length. The lede is the highest information-density section of any news article. Making it independently retrievable dramatically improves recall on short factual queries where a journalist needs the core facts, not the full context.
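The three rules above can be sketched in a few lines. This is a minimal illustration, not the production chunker: it splits only at paragraph breaks (the embedding-similarity topic-shift detection is omitted), and the header format and grouping size are assumptions.

```python
from datetime import date

def chunk_article(headline: str, published: date, body: str,
                  max_paras_per_chunk: int = 3) -> list[str]:
    """Semantic-boundary chunking with a universal header prefix
    and a dedicated first-paragraph (lede) chunk."""
    header = f"{headline} | {published.isoformat()}\n"
    paras = [p.strip() for p in body.split("\n\n") if p.strip()]
    if not paras:
        return []
    # Rule 3: the lede is always its own independently retrievable chunk.
    chunks = [header + paras[0]]
    # Rules 1 + 2: group remaining paragraphs at paragraph boundaries,
    # prepending the headline + timestamp header to every chunk so the
    # temporal context is baked into the embedded text itself.
    for i in range(1, len(paras), max_paras_per_chunk):
        chunks.append(header + "\n".join(paras[i:i + max_paras_per_chunk]))
    return chunks
```

Because the header is part of each chunk's text, two chunks on the same topic from different dates produce different vectors before any scoring logic runs.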


Query intent recognition — the layer that makes everything else work

Before any retrieval runs, every query passes through a lightweight intent recognizer. This is the component that makes adaptive scoring possible.

The recognizer classifies queries along several dimensions:

1. Length and structure. Short queries are almost always entity lookups or named searches. Long queries (eight or more tokens with verb structure) are almost always natural language questions. This single signal already tells you a great deal about what kind of results the user wants.

2. Entity density. Queries heavy with proper nouns, organization names, or geographic references benefit from keyword anchoring. Queries without recognizable entities benefit from semantic expansion.

3. Temporal markers. Queries containing words like “yesterday,” “last week,” “latest,” or specific date references signal that recency should dominate the ranking.

4. Question type. “What happened” queries want recent event coverage. “Why did” and “how does” queries often want analytical background pieces where a slightly older in-depth article is more useful than yesterday’s brief update.

The recognizer is intentionally lightweight — a small classifier running on CPU, completing in under 5 milliseconds. The goal is to add a trivial amount of latency before retrieval to save significant latency and improve quality throughout the rest of the pipeline.
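To make the four dimensions concrete, here is a rule-based stand-in for that classifier. The production component is a small trained model; the signal names below mirror the dimensions above, but the specific rules and marker lists are illustrative.

```python
import re

# Illustrative marker list, not the production vocabulary.
TEMPORAL_MARKERS = {"yesterday", "today", "latest", "breaking",
                    "week", "month", "recent", "now"}

def recognize_intent(query: str) -> dict:
    """Classify a query along the four dimensions described above."""
    tokens = query.split()
    lower = query.lower()
    return {
        # 1. Length/structure: long queries read as natural-language questions.
        "is_question": len(tokens) >= 8 or lower.endswith("?"),
        # 2. Entity density: crude proxy via capitalized tokens.
        "entity_density": sum(t[:1].isupper() for t in tokens) / max(len(tokens), 1),
        # 3. Temporal markers: keyword list plus explicit year references.
        "has_temporal": any(t.strip(".,?!").lower() in TEMPORAL_MARKERS
                            for t in tokens)
                        or bool(re.search(r"\b(19|20)\d{2}\b", query)),
        # 4. Question type: "why"/"how" queries want analytical background.
        "wants_background": lower.startswith(("why ", "how does ", "how did ")),
    }
```

Even this crude version separates "Dhaka flood 2024" (entity-heavy, temporal) from "Why did the central bank raise interest rates" (long, analytical) cleanly.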


Hybrid scoring with adaptive weights

The core scoring formula:

final_score = w1 × semantic_score
            + w2 × recency_score
            + w3 × keyword_score
            + ... other parameters based on query intent

The weights w1, w2, and w3 are determined per query by the intent recognizer output:

  • A short entity query like “Yunus government” → high w3, moderate w1, low w2
  • A natural language question about economic trends → high w1, moderate w2, low w3
  • A breaking news query with temporal markers → high w2, moderate w1, low w3

This means the same scoring infrastructure handles fundamentally different query types correctly without separate retrieval pipelines per query category.
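The weight selection can be sketched as a simple lookup on the intent output. The weight values here are illustrative, not the production calibration, and the intent keys are the ones from the recognizer dimensions above:

```python
def score_weights(intent: dict) -> tuple[float, float, float]:
    """Map intent signals to (w1 semantic, w2 recency, w3 keyword)."""
    if intent.get("has_temporal"):
        return 0.30, 0.50, 0.20   # breaking news: recency dominates
    if intent.get("wants_background"):
        return 0.60, 0.25, 0.15   # analytical question: semantics lead
    if not intent.get("is_question"):
        return 0.30, 0.10, 0.60   # short entity query: keyword anchoring
    return 0.55, 0.30, 0.15       # default natural-language question

def hybrid_score(semantic: float, recency: float, keyword: float,
                 intent: dict) -> float:
    w1, w2, w3 = score_weights(intent)
    return w1 * semantic + w2 * recency + w3 * keyword
```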


Recency scoring with exponential decay

recency = baseline + e^(-age_days / half_life) × (1 - baseline)

The half-life is a tunable parameter calibrated on actual query logs from the newsroom. Breaking news queries have a short effective half-life — content older than 72 hours loses value rapidly. Background and explainer queries have a longer half-life — a six-month-old investigative analysis remains highly relevant.

Articles published within 48 hours receive an explicit recency boost on top of the decay curve. Articles beyond two years receive a floor penalty that prevents them from surfacing on recency-sensitive queries regardless of how strong their semantic match is.

The baseline parameter sets a minimum recency score so that very old articles on highly specific topics aren’t completely suppressed when they’re genuinely the only relevant content in the index.
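Putting the decay formula, the 48-hour boost, the two-year floor, and the baseline together, a minimal sketch (the baseline, boost factor, and floor behavior are illustrative values, not the newsroom calibration):

```python
import math

def recency_score(age_days: float, half_life_days: float,
                  baseline: float = 0.05) -> float:
    """recency = baseline + e^(-age_days / half_life) * (1 - baseline)."""
    if age_days > 730:                       # two-year floor penalty
        return baseline
    score = baseline + math.exp(-age_days / half_life_days) * (1 - baseline)
    if age_days <= 2:                        # explicit 48-hour boost
        score = min(1.0, score * 1.2)
    return score
```

A short half-life (say, 3 days) models breaking-news queries; stretching it to several months models background and explainer queries without changing the code path.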


Score dampening

When semantic similarity crosses a high confidence threshold — calibrated at approximately 0.92 cosine similarity in production — recency and keyword scores are compressed before being added to the final score.

The reasoning is straightforward. If a result is already a near-perfect semantic match, applying full-weight secondary signals introduces noise rather than improving ranking. The strongest signal should win cleanly.

This insight came from debugging journalist feedback over several weeks. The symptom was specific — top results were correct, but the ordering within the top five was frequently wrong. A journalist would report that the second result was consistently better than the first. Score dampening on high-confidence semantic matches fixed the ordering problem without affecting overall recall.
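The dampening itself is a few lines. The threshold matches the ~0.92 calibration mentioned above; the compression factor is an illustrative value:

```python
def dampen_secondary(semantic: float, recency: float, keyword: float,
                     threshold: float = 0.92, factor: float = 0.25):
    """Compress recency/keyword signals on near-perfect semantic
    matches so the strongest signal wins cleanly."""
    if semantic >= threshold:
        return semantic, recency * factor, keyword * factor
    return semantic, recency, keyword
```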


Neural reranking — and when not to use it

The optional reranking stage uses a cross-encoder model that reads the query and each candidate document together in a single forward pass, producing a relevance score significantly more accurate than bi-encoder cosine similarity.

Why not run it on every query? Latency. A cross-encoder cannot use pre-computed embeddings — it runs fresh inference on every query-document pair. Reranking twenty candidates adds meaningful latency on every request.

The solution is a flag-gated reranking path:

  • Precision-critical paths — analyst research, editorial deep search, report generation — enable reranking
  • Latency-sensitive paths — autocomplete, live feed ranking, mobile quick search — skip it entirely
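The gate reduces to a set membership check before the expensive call. A minimal sketch, where the path names and the cross-encoder scoring callable are illustrative placeholders:

```python
RERANK_ENABLED_PATHS = {"analyst_research", "editorial_deep_search",
                        "report_generation"}

def maybe_rerank(query: str, candidates: list[dict], path: str,
                 cross_encoder_score) -> list[dict]:
    """Run the cross-encoder only on precision-critical paths;
    latency-sensitive paths keep the hybrid-score ordering."""
    if path not in RERANK_ENABLED_PATHS:
        return candidates
    return sorted(candidates,
                  key=lambda c: cross_encoder_score(query, c["text"]),
                  reverse=True)
```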


Deduplication

The same article exists in the index as multiple chunks. Without deduplication, a single highly relevant article floods the top results with four to six chunks from the same source, pushing all other relevant content out of the visible result set.

Before returning results, the system keeps only the highest-scoring chunk per article URL. Everything else from the same article is discarded. Simple to implement, easy to skip in early development, and noticeably important in production.
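A minimal sketch of that step, assuming each result carries a `url` and a `score`:

```python
def deduplicate(results: list[dict]) -> list[dict]:
    """Keep only the highest-scoring chunk per article URL."""
    best: dict[str, dict] = {}
    for r in results:
        if r["url"] not in best or r["score"] > best[r["url"]]["score"]:
            best[r["url"]] = r
    return sorted(best.values(), key=lambda r: r["score"], reverse=True)
```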


What the retrieval layer is part of

The search system is one component inside a larger newsroom intelligence platform. Beyond retrieval, the platform tracks competitor publication patterns, monitors social media signals from news channels and journalists, and surfaces analytics about story momentum, topic velocity, coverage gaps, and source activity.

An editor can see which topics are accelerating in competitor coverage before their own newsroom has assigned a reporter. A journalist can pull all contextual background on a developing story in seconds rather than spending twenty minutes on manual search.

Search finds an article. A knowledge graph finds the story behind the story — every person, place, and event connected across 1.1 million articles.

The RDF knowledge graph layer is an integral part of the broader platform, but it deserves its own writeup. That’s coming next.


What I would do differently

Implement semantic chunking from day one. Fixed-size chunking was fast to ship and created retrieval quality problems that took weeks to properly diagnose because the failure mode was subtle — results were relevant but not quite right.

Build the intent recognizer before the scoring layer, not after. The adaptive weight system was added post-launch based on observed failure patterns in production. Building it first would have caught edge cases before journalists encountered them.

Instrument everything earlier. Query-level logging of which signals dominated the final score, which chunks were retrieved, and where reranking changed the order would have cut debugging time significantly. Observability in retrieval systems is as important as observability in any other production service.


Building something in a similar space? I’m happy to go deeper on any of these components — the intent classifier, half-life calibration, reranker gating logic, or chunking strategy. Reach out or connect on LinkedIn.