Methodology

How OSFeed produces intelligence

OSFeed is not a black box. Every decision in the pipeline — from source selection to event merging — is logged, explainable, and auditable. This page documents exactly how raw Telegram messages become structured geopolitical events.

Design Principles

Full Transparency

Every event links back to its original sources. Every channel has a published reliability profile. Every merge decision is logged. No black boxes.

Multi-Perspective by Design

When sources contradict each other on facts — not just framing — we flag the event as contested and present all perspectives. We don't pick sides.

Source Provenance

Each event carries a complete audit trail: which channel reported first, when corroboration arrived, and exactly what each source contributed.

The 7-Layer Pipeline

Each message passes through seven processing layers before reaching the analyst. Here is exactly what happens at each stage.

Layer 1

Real-Time Collection

OSFeed monitors 44+ public Telegram channels using the Telegram User Account API (Telethon). Unlike bot APIs, this provides full access to channel content — media, documents, edit history, and reactions. Messages are captured the moment they are published.

  • Channels grouped by topic: Ukraine-Russia, Middle East, Sahel & West Africa, USA & Global Power Shifts
  • 10–20 curated channels per topic, selected for coverage breadth and source independence
  • Automatic rate-limit handling and persistent session management
  • Media capture (images, videos, documents) with size controls

Layer 2

Contextual Translation

Every message is translated to canonical English through a geopolitically aware pipeline. This is not Google Translate: the system injects a domain-specific glossary covering military terminology, regional acronyms, and context-sensitive vocabulary.

  • Geopolitical glossary: IDF, FSB, GRU, PMC Wagner, JNIM — plus regional military jargon
  • Context-aware translations: 'обстрел' → 'shelling' (not 'shot'), 'منطقة عازلة' → 'buffer zone' (not 'isolated zone')
  • Multi-stage language detection: character-set heuristics, lingua library, Telegram metadata fallback
  • Display translations cached per-language (7-day TTL) for 13 supported languages
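
The multi-stage language detection listed above can be pictured as a fallback chain. The following is a minimal sketch, not OSFeed's actual code: the function names, the letter sets, and the `library_detector` callable (a stand-in for the lingua stage) are assumptions made for illustration. The character-set stage only answers when the text contains letters that exist in the Ukrainian alphabet but not the Russian one, or the reverse; anything ambiguous falls through to the next stage.

```python
from typing import Callable, Optional

# Letters found in the Ukrainian alphabet but not the Russian one, and vice versa.
UKRAINIAN_ONLY = set("іїєґ")
RUSSIAN_ONLY = set("ыэъё")


def charset_guess(text: str) -> Optional[str]:
    """Stage 1: cheap character-set heuristic; returns None when ambiguous."""
    chars = set(text.lower())
    has_uk = bool(chars & UKRAINIAN_ONLY)
    has_ru = bool(chars & RUSSIAN_ONLY)
    if has_uk and not has_ru:
        return "uk"
    if has_ru and not has_uk:
        return "ru"
    return None


def detect_language(
    text: str,
    library_detector: Optional[Callable[[str], Optional[str]]] = None,
    telegram_lang: Optional[str] = None,
) -> Optional[str]:
    """Try each stage in order and return the first confident answer."""
    guess = charset_guess(text)
    if guess is not None:
        return guess
    # Stage 2: statistical detector (hypothetical callable standing in for lingua).
    if library_detector is not None:
        guess = library_detector(text)
        if guess is not None:
            return guess
    # Stage 3: fall back to whatever language hint the Telegram metadata carries.
    return telegram_lang
```

For example, `detect_language("Повітряна тривога в Україні")` resolves at the first stage, while a short string such as "Москва" carries no distinguishing letters and is passed on to the later stages.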

Layer 3

Event Detection

Not every Telegram message is an event. Most are commentary, reposts, or opinions. OSFeed classifies each message using a multi-signal heuristic scoring system, with LLM fallback for borderline cases.

  • Six weighted criteria: action verbs, specific locations/entities, temporal keywords, named entity density, absence of hedging, non-trivial content
  • Adaptive threshold: shorter text requires lower scores to prevent noise
  • Borderline cases (score 0.5–0.7) escalated to LLM for contextual classification
  • Only messages classified as event-bearing proceed to deduplication
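
The scoring and escalation steps above can be sketched as follows. This is an illustration under stated assumptions: the six criteria come from the list, but the individual weight values are hypothetical (the page does not publish them), and the sketch assumes that scores above 0.7 are accepted and scores below 0.5 rejected without an LLM call. The adaptive length-based threshold is not modeled here.

```python
from typing import Mapping

# Hypothetical weights: the six criteria are named above, their values are not.
CRITERIA_WEIGHTS = {
    "action_verbs": 0.25,
    "specific_locations_entities": 0.20,
    "temporal_keywords": 0.15,
    "named_entity_density": 0.15,
    "absence_of_hedging": 0.10,
    "non_trivial_content": 0.15,
}

# Borderline band taken from the text: scores in 0.5-0.7 go to the LLM.
LLM_BAND_LOW = 0.5
LLM_BAND_HIGH = 0.7


def heuristic_score(signals: Mapping[str, float]) -> float:
    """Weighted sum of per-criterion signals, each clamped to [0, 1]."""
    total = 0.0
    for name, weight in CRITERIA_WEIGHTS.items():
        value = min(max(signals.get(name, 0.0), 0.0), 1.0)
        total += weight * value
    return total


def route(score: float) -> str:
    """Map a score to 'event', 'llm_review', or 'not_event'."""
    if score > LLM_BAND_HIGH:
        return "event"
    if score < LLM_BAND_LOW:
        return "not_event"
    return "llm_review"
```

Only messages routed to "event", or confirmed by the LLM after "llm_review", would continue to deduplication.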

Layer 4

Semantic Deduplication

Each event-bearing message is embedded into a 1536-dimensional vector space using OpenAI's text-embedding-3-small model. The embedding is compared against existing event centroids to determine whether this message reports a known event or something new.

  • Embeddings are always computed from the English translation — ensuring cross-language consistency
  • Strong match (similarity > 0.82): message is linked to the existing event as an additional source
  • Grey zone (0.65–0.82): LLM reviews both events' entity fingerprints and content to decide
  • No match (< 0.65): message creates a new event with LLM-generated title and summary
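
The three-way decision above reduces to a cosine-similarity comparison against each event centroid followed by a threshold check. A minimal sketch, using the thresholds from the list; the function names and the dictionary of centroids are assumptions for illustration, and the toy vectors in the usage note are 2-dimensional rather than 1536-dimensional.

```python
import math
from typing import Mapping, Optional, Sequence, Tuple

STRONG_MATCH = 0.82   # above this: link to the existing event
GREY_ZONE_MIN = 0.65  # below this: create a new event


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def route_message(
    embedding: Sequence[float],
    centroids: Mapping[str, Sequence[float]],
) -> Tuple[str, Optional[str], float]:
    """Return (decision, best_event_id, best_similarity) for one message."""
    best_id: Optional[str] = None
    best_sim = -1.0
    for event_id, centroid in centroids.items():
        sim = cosine_similarity(embedding, centroid)
        if sim > best_sim:
            best_id, best_sim = event_id, sim
    if best_id is None or best_sim < GREY_ZONE_MIN:
        return ("new_event", None, best_sim)
    if best_sim > STRONG_MATCH:
        return ("link", best_id, best_sim)
    return ("llm_review", best_id, best_sim)
```

With centroids `{"evt-1": [1.0, 0.0]}`, an embedding of `[1.0, 0.1]` is linked to `evt-1`, `[1.0, 1.0]` (similarity about 0.71) lands in the grey zone, and `[0.5, 1.0]` (about 0.45) would open a new event.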

Layer 5

Multi-Signal Merge Scoring

Events that may refer to the same real-world development are evaluated using a 7-signal scoring system. This prevents both false merges (combining unrelated events) and fragmentation (the same event split across multiple entries).

  • Centroid similarity (25% weight): cosine similarity of averaged source embeddings
  • Title/summary similarity (20%): cosine similarity of event summaries
  • Temporal proximity (15%): time-decay scoring over a 72-hour window
  • Geographic similarity (12%): semantic embedding of location entities
  • Entity overlap (10%): Jaccard overlap of extracted persons, organizations, locations, weapons
  • Actor similarity (8%): semantic embedding of persons and organizations
  • Category bonus (5%): same event type (military, diplomatic, humanitarian) receives a boost
  • Adaptive weight redistribution when specific signal embeddings are unavailable
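
A composite score built from the weights above might look like the sketch below. Two points are assumptions rather than documented behavior. First, the listed weights sum to 0.95, so the sketch divides by the total weight of the signals actually present; the same division is one plausible reading of "adaptive weight redistribution" when a signal is unavailable. Second, the shape of the time decay is not specified, so a linear decay over the 72-hour window is used for illustration.

```python
from typing import Mapping, Optional

# Weights as listed above (they sum to 0.95, hence the normalization below).
SIGNAL_WEIGHTS = {
    "centroid_similarity": 0.25,
    "title_summary_similarity": 0.20,
    "temporal_proximity": 0.15,
    "geographic_similarity": 0.12,
    "entity_overlap": 0.10,
    "actor_similarity": 0.08,
    "category_bonus": 0.05,
}


def temporal_proximity(hours_apart: float, window_hours: float = 72.0) -> float:
    """Assumed linear decay: 1.0 at zero hours apart, 0.0 at the window edge."""
    return max(0.0, 1.0 - abs(hours_apart) / window_hours)


def composite_score(signals: Mapping[str, Optional[float]]) -> float:
    """Weighted average over the signals that are present (not None)."""
    available = {
        name: value
        for name, value in signals.items()
        if name in SIGNAL_WEIGHTS and value is not None
    }
    total_weight = sum(SIGNAL_WEIGHTS[name] for name in available)
    if total_weight == 0:
        return 0.0
    weighted = sum(SIGNAL_WEIGHTS[name] * value for name, value in available.items())
    return weighted / total_weight
```

Because the result is a weighted average, a pair of events with no location entities is scored on the remaining signals alone instead of being penalized for the missing one.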

Layer 6

Entity Extraction & Canonicalization

Every event is analyzed to extract structured entities: persons, locations, organizations, weapons, and event type. An alias resolution system maps variant names to canonical forms.

  • LLM-extracted entity fingerprint: persons, locations, organizations, weapons, event_type
  • Alias resolution: 'ISIS' → 'Islamic State', 'Zelensky' → 'Volodymyr Zelenskyy'
  • Entity fingerprints drive merge scoring (entity_overlap signal) and search
  • Entities cached and refreshed every 5 minutes for consistency
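
Alias resolution and the entity_overlap signal fit together naturally: names are mapped to canonical forms first, then compared as sets. The sketch below uses the two aliases given above; the lookup table structure, the function names, and the choice to pool all entity types into one set before computing Jaccard overlap are assumptions for illustration.

```python
from typing import Iterable, Set

# The two alias examples from the list above; a real table would be far larger.
ALIASES = {
    "isis": "Islamic State",
    "zelensky": "Volodymyr Zelenskyy",
}


def canonicalize(name: str) -> str:
    """Map a variant name to its canonical form; unknown names pass through."""
    cleaned = name.strip()
    return ALIASES.get(cleaned.lower(), cleaned)


def canonical_set(names: Iterable[str]) -> Set[str]:
    """Canonicalize every name and drop duplicates."""
    return {canonicalize(name) for name in names}


def jaccard_overlap(a: Iterable[str], b: Iterable[str]) -> float:
    """Size of the intersection over size of the union, after canonicalization."""
    set_a, set_b = canonical_set(a), canonical_set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)
```

Canonicalizing before comparison is what lets a source that writes "ISIS" overlap with one that writes "Islamic State"; without it the two would count as different entities.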

Layer 7

Contradiction Detection

When sources disagree on facts — not just framing or perspective — OSFeed flags the event as contested. The summary presents all factual claims attributed to their sources, without editorializing.

  • Factual contradiction: Ukrainian channels report 31 of 47 drones intercepted; Russian channels claim all targets hit
  • Perspective difference (not flagged): one source calls it a 'military operation,' another calls it an 'attack'
  • Contested events carry a contradiction note explaining the specific disagreement
  • Multi-perspective summaries preserve each source's factual claims without synthesis

Summary Integrity Rules

Event summaries follow strict editorial constraints enforced at the LLM prompt level. These rules ensure summaries report facts, not interpretations.

  • No analysis or commentary ('This indicates...', 'This suggests...')
  • No meta-observations ('Multiple sources confirm...', 'This represents...')
  • No hedging beyond what sources state ('possibly', 'seemingly')
  • No explaining relationships between sources
  • No concluding sentences or editorial framing
  • Geographic context preserved: city + country minimum

Continuous Improvement

The pipeline improves over time through structured feedback loops.

Merge Feedback

Analysts can manually merge or split events. Every correction is logged in a feedback table used to retrain merge scoring weights.

Channel Profiling

Source reliability profiles are refreshed periodically as channels evolve. Profiles are LLM-generated from recent message samples and validated against historical accuracy.

Score Logging

Every merge scoring decision — the 7 individual signal scores and the composite result — is logged for audit and analysis.

Coverage Gaps

When events are detected significantly later than expected (e.g., the Burkina Faso defense pact reaching OSFeed 4 hours after Russian state channels reported it), new channels are added to close the gap.
