How this archive is built

This project connects grassroots documentation to a durable public collection. Media and text move through scraping, ingestion, AI-assisted enrichment, and heuristic, rules-based curation. Human review follows, including side-by-side visual inspection of near-duplicate items and deliberate decisions on how people are named, before anything is published. The aim is traceable provenance, bilingual consistency, and one coherent catalog that stays open to readers everywhere, so that witness to repression and violence is harder to lose to the next news cycle or to quietly forget.

Overview

Every item is meant to be findable by people, place, event, and theme, across English and Persian. Machine learning accelerates labeling, but sensitive merges and conflict resolution stay with reviewers so the catalog does not silently absorb mistakes. Items from link intake and similar sources do not enter the public export until a curator marks them verified.

Ingestion is context-aware: each link or post is normalized, then matched to domain-specific prompt packs and to retrieval hits from the published catalog before vision tagging runs. That keeps labeling consistent with source type (Telegram channel, memorial roster, X post, open-web image) and with what the archive already documents.

End-to-end flow from sources to the live site

%%{init: {"flowchart": {"padding": 20, "nodeSpacing": 50, "rankSpacing": 58}}}%%
flowchart TB
  subgraph sources [Sources]
    TG["Grassroots<br/>channels"]
    MR["Memorial<br/>rosters"]
    SOC["Hashtag<br/>harvest"]
    LI["Link<br/>intake"]
  end
  subgraph corePipe [Core pipeline]
    IN["Ingest"]
    ST["Catalog"]
    RAG["RAG<br/>index"]
    ROUTE["Prompt<br/>routing"]
    TAG["Vision<br/>tagger"]
    FACE["Face<br/>match"]
  end
  subgraph qa [Quality assurance]
    CUR["Curator<br/>review"]
    DUP["Dedup<br/>review"]
    NAMES["Name<br/>review"]
    EDIT["Overrides"]
  end
  subgraph out [Publication]
    PKG["Export"]
    WEBP["Static<br/>site"]
    SRCH["Search<br/>index"]
    GRAPH["Graph<br/>beta"]
  end
  TG --> IN
  MR --> IN
  SOC --> IN
  LI --> IN
  IN --> ST
  ST --> RAG
  ST --> ROUTE
  RAG --> ROUTE
  ROUTE --> TAG
  TAG --> FACE
  FACE --> CUR
  CUR --> DUP
  CUR --> NAMES
  DUP --> EDIT
  NAMES --> EDIT
  EDIT --> PKG --> WEBP
  PKG --> SRCH
  PKG --> GRAPH

Sources and connectors

The archive draws heavily on Telegram, where much firsthand Iranian protest media is posted with captions and threads. Long-running public channels remain a backbone; a separate discovery intake path (monitored bot and backfill) queues URLs and attachments submitted from the field into the same catalog as bulk scrapes.

Additional structured memorial or roster sources supply identity fields and portrait imagery in a more uniform way than typical chats. Those feeds can be enqueued like grassroots links and then follow the same ingest, enrich, and review steps.

Cross-platform harvest is part of the same architecture: tracked hashtags drive searches that pull in public posts and media from networks such as X (Twitter), alongside pages from the open web. That material is merged into memorial and item context so profiles are grounded in more than a single channel, but it always sits alongside the primary Telegram or roster record and its caption, never in place of it.

Where material enters the pipeline

%%{init: {"flowchart": {"padding": 20, "nodeSpacing": 50, "rankSpacing": 58}}}%%
flowchart TB
  subgraph tel [Telegram]
    CH["Public<br/>channels"]
    CAP["Captions"]
    BOT["Discovery<br/>intake"]
  end
  subgraph mem [Memorial feeds]
    RO["Roster data"]
    PT["Portraits"]
  end
  subgraph cross [Social and web]
    HASH["Hashtag<br/>search"]
    XPM["Posts<br/>from X"]
    WEB["Web pages"]
  end
  CH --> CAP
  RO --> PT
  BOT --> urlDec["URL<br/>normalize"]
  urlDec --> linkQ["Link<br/>inbox"]
  CAP --> catalog["Catalog"]
  PT --> catalog
  HASH --> XPM
  HASH --> WEB
  XPM --> ctx["Cross-channel<br/>context"]
  WEB --> ctx
  ctx --> catalog
  linkQ --> catalog
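
The "URL normalize" step in the diagram above could look roughly like the following. This is a minimal sketch, not the archive's actual code: the `DecodedLink` type, the `decode_link` function, and the connector names are hypothetical. It relies only on the public `t.me/<channel>/<message id>` and `x.com/<user>/status/<id>` URL shapes.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse


@dataclass(frozen=True)
class DecodedLink:
    """Hypothetical provenance record for one submitted URL."""
    connector: str              # "telegram", "x", "image", or "web"
    channel: Optional[str]      # channel or account handle, if the URL carries one
    message_id: Optional[str]   # post or message id, if present
    canonical_url: str


IMAGE_SUFFIXES = (".jpg", ".jpeg", ".png", ".webp", ".gif")


def decode_link(raw: str) -> DecodedLink:
    url = raw.strip()
    parsed = urlparse(url)
    host = parsed.netloc.lower().removeprefix("www.")
    parts = [p for p in parsed.path.split("/") if p]

    if host in ("t.me", "telegram.me"):
        if parts and parts[0] == "s":       # t.me/s/<channel> is the web preview form
            parts = parts[1:]
        # Private t.me/c/... links are deliberately left to the generic branch.
        if len(parts) >= 2 and parts[0] != "c" and parts[1].isdigit():
            channel, msg = parts[0], parts[1]
            return DecodedLink("telegram", channel, msg, f"https://t.me/{channel}/{msg}")

    if host in ("x.com", "twitter.com"):
        if len(parts) >= 3 and parts[1] == "status" and parts[2].isdigit():
            user, post = parts[0], parts[2]
            return DecodedLink("x", user, post, f"https://x.com/{user}/status/{post}")

    if parsed.path.lower().endswith(IMAGE_SUFFIXES):
        # Drop query strings so the same image is not queued twice.
        return DecodedLink("image", None, None, f"{parsed.scheme}://{parsed.netloc}{parsed.path}")

    return DecodedLink("web", None, None, url)
```

Reducing every link to a canonical form before it reaches the link inbox means two submissions of the same post (for example a preview link and a direct link) resolve to the same connector, channel, and message id.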

Storage

A working catalog on the operator side holds every media item, captions, AI fields, review flags, and provenance (connector, channel, message id). That database is the system of record while material is being curated; the public site never queries it directly.

Originals and thumbnails live in CDN-backed object storage. Stable public URLs in the exported JSON point readers at media, not at a single machine.

Export snapshots turn the verified catalog into bilingual JSON for items, tags, events, people, places, and جاویدنامان profiles. Versioned snapshots support reproducible static builds and incremental rebuilds when only part of the collection changed.
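
As an illustration of the export step, the sketch below filters catalog rows down to publishable items and derives a snapshot version from the content. It assumes a simplified row shape; the field names (`connector`, `curator_verified`, `description_en`, `description_fa`), the `link_inbox` connector name, and the hash-based version are all hypothetical, chosen only to show how a verification gate and reproducible versioning might fit together.

```python
import hashlib
import json
from typing import Any

# Hypothetical connector names for sources that require curator sign-off.
GATED_CONNECTORS = {"link_inbox"}


def is_publishable(row: dict[str, Any]) -> bool:
    """Gated sources need an explicit curator flag; other sources pass through."""
    if row["connector"] in GATED_CONNECTORS:
        return bool(row.get("curator_verified", False))
    return True


def build_snapshot(rows: list[dict[str, Any]]) -> dict[str, Any]:
    items = [
        {
            "id": row["id"],
            "media_url": row["media_url"],
            "description": {
                "en": row.get("description_en", ""),
                "fa": row.get("description_fa", ""),
            },
        }
        for row in sorted(rows, key=lambda r: r["id"])   # stable order across runs
        if is_publishable(row)
    ]
    # The same verified content always yields the same version string,
    # so a rebuild can be skipped when nothing changed.
    payload = json.dumps(items, ensure_ascii=False, sort_keys=True)
    version = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
    return {"version": version, "items": items}
```

Sorting rows and serializing with `sort_keys=True` keeps the output deterministic, which is what makes a content-derived version usable for deciding whether a static rebuild is needed.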

Retrieval chunks (embeddings over export text) live beside the catalog so ingestion and verification can query similar memorial and item context without shipping the full archive into every prompt.

AI enrichment

Google Gemini powers vision-first passes over photos and sampled video frames: full descriptions in English and Persian, semantic tags, people and places, and event hints, including detection of جاویدنام (Javidnam) memorial content where appropriate. Separate text calls condense many captions into structured memorial profiles (age, death context, hometown, and narrative bio) so the جاویدنامان index can show more than a single photo. Evidence from the social and web harvest layer is folded into that same enrichment pass as supporting context.

Context routing works like selecting specialized editorial skills for each item. After a URL is decoded (Telegram, X, or direct image), the pipeline records connector, channel, and caption on the catalog row. A router then layers the right instruction packs: a shared bilingual tagging base, plus add-ons for link-inbox Telegram or X, memorial rosters, era-specific Twitter rules, or named channels with their own naming and event conventions. The result is one composed prompt per item, not a single generic template.
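
The layering described above can be pictured as a small rule-based router. The sketch below is an assumption-laden illustration: the pack names, the `ItemMeta` fields, the example channel, and the cutoff year for "era-specific" rules are all hypothetical and stand in for whatever the real pipeline uses.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ItemMeta:
    """Hypothetical subset of the catalog row used for routing."""
    connector: str                     # e.g. "telegram", "x", "memorial_roster"
    channel: Optional[str] = None
    via_link_inbox: bool = False
    post_year: Optional[int] = None


BASE_PACK = "base_bilingual_tagging"
# Hypothetical per-channel packs with their own naming and event conventions.
CHANNEL_PACKS = {"example_channel": "channel_example_conventions"}
# Assumed cutoff: posts from before this year get the older Twitter-era rules.
TWITTER_ERA_BEFORE = 2023


def select_packs(meta: ItemMeta) -> list[str]:
    packs = [BASE_PACK]                                  # always start from the shared base
    if meta.via_link_inbox and meta.connector in ("telegram", "x"):
        packs.append(f"link_inbox_{meta.connector}")
    if meta.connector == "memorial_roster":
        packs.append("memorial_roster")
    if meta.connector == "x" and meta.post_year is not None and meta.post_year < TWITTER_ERA_BEFORE:
        packs.append("twitter_era_rules")
    if meta.channel in CHANNEL_PACKS:
        packs.append(CHANNEL_PACKS[meta.channel])
    return packs


def compose_prompt(meta: ItemMeta, pack_texts: dict[str, str]) -> str:
    """Join the selected packs, in order, into one prompt for the item."""
    return "\n\n".join(pack_texts[name] for name in select_packs(meta))
```

Keeping selection (`select_packs`) separate from composition (`compose_prompt`) makes the routing decision easy to log next to the item, so a reviewer can see which rules shaped a given set of labels.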

In parallel, a RAG-style index built from published export text is queried on ingest. Top matching chunks (existing items, memorial rows, related captions) support duplicate checks, link-inbox verification, and reviewer context. Retrieval and prompt packs serve the same goal of grounding new material in what the archive already documents: retrieval does so through vector search, prompt packs through explicit source-specific rules.
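
To make the retrieval step concrete, here is a minimal top-k lookup by cosine similarity. It is a sketch under stated assumptions: the vectors are toy values, the chunk ids are invented, and a brute-force scan stands in for whatever embedding model and vector store the pipeline actually uses.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_chunks(
    query_vec: Sequence[float],
    index: list[tuple[str, Sequence[float]]],
    k: int = 3,
) -> list[tuple[str, float]]:
    """Return the k chunk ids most similar to the query, best first."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

For example, with a hypothetical three-chunk index:

```python
index = [("item-1", [1.0, 0.0]), ("memorial-7", [0.9, 0.1]), ("caption-3", [0.0, 1.0])]
hits = top_k_chunks([1.0, 0.0], index, k=2)
# hits lists "item-1" first, then "memorial-7"
```

Only the top few chunks are passed on, which is how ingestion can consult prior catalog context without placing the full archive in every prompt.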

The vision tagger receives that composed prompt, the media (photo or sampled video frames), and optional curator hints. It returns structured English and Persian descriptions, tags, people, places, and event fields that are merged with caption-derived facts and post-processing rules before anything is stored.

How URL metadata, domain prompt packs, and RAG retrieval feed vision tagging

%%{init: {"flowchart": {"padding": 20, "nodeSpacing": 50, "rankSpacing": 58}}}%%
flowchart TB
  subgraph intake [Intake]
    URL["URL or<br/>caption"]
    DEC["Normalize<br/>and fetch"]
    META["Item<br/>metadata"]
  end
  subgraph context [Context assembly]
    RAG["RAG top-k<br/>chunks"]
    PACK["Prompts<br/>Skills"]
    MERGE["Composed<br/>prompt"]
  end
  subgraph model [Vision tagging]
    GEM["Gemini<br/>vision pass"]
    OUT["Labels<br/>and text"]
  end
  URL --> DEC --> META
  META --> PACK
  META --> RAG
  PACK --> MERGE
  RAG --> MERGE
  MERGE --> GEM --> OUT

AI output is treated as assistive: it speeds curation but does not bypass editorial judgment on politically or emotionally charged material, and it does not replace curator sign-off on sensitive intake.

Quality, deduplication, and naming

Face detection and matching run on incoming photos and sampled video frames: faces are embedded and compared to a catalog gallery built from verified portraits and memorial imagery. Matches surface in verification reports and link-inbox review as hints (likely same person, roster overlap, or duplicate portrait), alongside text retrieval from the RAG index. They assist reviewers; they never auto-publish sensitive material.

Link inbox review surfaces unverified items across the archive with verification reports, duplicate hints, and face or name panels. Reviewers can verify, edit, remove, or hide items for later without dropping them from the underlying queue.

Near-duplicate media is flagged with perceptual hashing (pHash, robust to recompression and minor crops) and with OpenCLIP-style image embeddings compared in a FAISS-style vector index for semantic “same scene” neighbors. Candidate groups and tag disagreements surface in a unified browser review workspace. Only approved decisions are written back into the catalog; nothing is merged silently.
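
The perceptual-hash side of this check reduces to comparing fingerprints by Hamming distance. The sketch below assumes 64-bit hashes have already been computed elsewhere and stored as integers; the `max_distance` threshold of 8 bits and the function names are hypothetical. It only proposes candidate groups for human review and makes no merge decision itself.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hash values."""
    return bin(a ^ b).count("1")


def candidate_groups(hashes: dict[str, int], max_distance: int = 8) -> list[list[str]]:
    """Group item ids whose hashes are within max_distance bits of each other.

    Uses a small union-find so that chains of near matches land in one group.
    The pairwise scan is O(n^2), which is fine for a sketch but not for a
    large catalog.
    """
    parent = {item_id: item_id for item_id in hashes}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    ids = sorted(hashes)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if hamming(hashes[a], hashes[b]) <= max_distance:
                parent[find(b)] = find(a)

    groups: dict[str, list[str]] = {}
    for item_id in ids:
        groups.setdefault(find(item_id), []).append(item_id)
    # Singletons are not duplicates of anything, so they are dropped.
    return [members for members in groups.values() if len(members) > 1]
```

A low threshold keeps false positives down at the cost of missing heavier edits, which is one reason to pair hashing with embedding search rather than rely on either alone.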

Person names in English often fragment across small spelling variants. Fuzzy matching (RapidFuzz-class scores) proposes merge groups; reviewers pick a canonical spelling and aligned Persian tags so people, tag slugs, and memorial pages stay consistent.
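
A rough picture of how merge groups might be proposed is given below. To stay dependency-free, this sketch swaps in `difflib.SequenceMatcher` from the Python standard library in place of RapidFuzz, so the scores will differ from what the pipeline produces. The 0.85 threshold, the function names, and the sample names are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def propose_merge_groups(names: list[str], threshold: float = 0.85) -> list[list[str]]:
    """Greedily cluster spellings that resemble the first name in a group.

    The output is a proposal only: a reviewer still picks the canonical
    spelling and the aligned Persian tag.
    """
    groups: list[list[str]] = []
    for name in sorted(set(names)):
        for group in groups:
            if similarity(name, group[0]) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return [group for group in groups if len(group) > 1]
```

For instance, `propose_merge_groups(["Sara Ahmadi", "Sarah Ahmadi", "Reza Karimi"])` groups the first two spellings and leaves the third alone. Comparing against the first member of each group keeps the logic simple but makes results depend on sort order, which is acceptable when every group goes to a human anyway.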

Hand corrections to the exported catalog can be preserved across rebuilds so one-off editorial fixes are not lost the next time data is regenerated.
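
One way to keep hand corrections across rebuilds is to store them as a separate overrides layer that is reapplied after every regeneration. The sketch below assumes both the export and the overrides are dictionaries keyed by item id; that shape and the `apply_overrides` name are hypothetical.

```python
import copy
from typing import Any


def apply_overrides(
    export: dict[str, dict[str, Any]],
    overrides: dict[str, dict[str, Any]],
) -> tuple[dict[str, dict[str, Any]], list[str]]:
    """Overlay hand edits on a freshly generated export.

    Returns the patched export plus the ids of overrides that no longer
    match any item, so stale corrections can be reviewed instead of
    silently dropped.
    """
    patched = copy.deepcopy(export)     # never mutate the generated data
    stale: list[str] = []
    for item_id, fields in overrides.items():
        if item_id in patched:
            patched[item_id].update(fields)
        else:
            stale.append(item_id)
    return patched, sorted(stale)
```

Because the overrides live outside the generated data, regenerating the export cannot overwrite them, and the list of stale ids gives editors a prompt to check whether an item was removed or renamed.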

Parallel QA tracks converging on one canonical catalog

%%{init: {"flowchart": {"padding": 20, "nodeSpacing": 50, "rankSpacing": 58}}}%%
flowchart TB
  subgraph inboxTrack [Link inbox]
    VER["Auto<br/>checks"]
    FD["Face<br/>detect"]
    FM["Gallery<br/>match"]
    RV0["Human<br/>review"]
    VER --> RV0
    FD --> FM --> RV0
  end
  subgraph mediaTrack [Near duplicates]
    PH["pHash"]
    EM["CLIP<br/>embed"]
    VX["FAISS<br/>search"]
    RV1["Human<br/>review"]
    PH --> cand["Candidates"]
    EM --> cand
    VX --> cand
    cand --> RV1
  end
  subgraph nameTrack [Names]
    FZ["Fuzzy<br/>match"]
    RV2["Canonical<br/>pick"]
    FZ --> RV2
  end
  subgraph editTrack [Editorial]
    MAN["Hand<br/>edits"]
  end
  RV0 --> canonCat["Catalog"]
  RV1 --> canonCat
  RV2 --> canonCat
  MAN --> canonCat
  canonCat --> pubSite["Public site"]

Publication

The public site is a static build: pre-rendered HTML, no server-side database in front of visitors. Browsing runs through the archive grid, tag index, event index, item detail pages, and the جاویدنامان memorial roll with profile pages. Bilingual fields in the export drive English and Persian labels where both exist.

Full-text search uses an index generated from exported content and hosted on the CDN alongside media. It is deployed separately from the HTML, so search can be refreshed without rebuilding every page.

A graph network (beta) is exported from co-occurrence and shared-item links between people, places, and archived media. Explore views on جاویدنامان profiles show related names and connections; the feature is experimental and may change as the catalog grows.
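
As a sketch of how co-occurrence links could be derived, the snippet below counts how many archived items each pair of people shares. The item shape (`id`, `people`) and the function name are hypothetical, and the real export may weight or filter edges differently.

```python
from collections import Counter
from itertools import combinations
from typing import Any


def cooccurrence_edges(items: list[dict[str, Any]]) -> Counter:
    """Count shared items for every unordered pair of people.

    Keys are (name_a, name_b) tuples with name_a < name_b, so each pair
    appears once regardless of the order names were tagged in.
    """
    edges: Counter = Counter()
    for item in items:
        people = sorted(set(item.get("people", [])))   # dedupe within one item
        for a, b in combinations(people, 2):
            edges[(a, b)] += 1
    return edges
```

The edge weight is simply the number of shared items, which is enough to rank "related names" on a profile while the feature is still experimental.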

Deployments flow from the project repository to edge-hosted static pages, with media served from object storage. Curated JSON and HTML are committed after human review passes complete, keeping the live site aligned with verified catalog state.

Glossary

Javidnam / جاویدنام: Eternal-name memorial framing for people killed in connection with the protest movement; items and profiles can carry this designation.
Link inbox: Intake queue for new URLs and media awaiting curator review before they are included in the public export.
Curator verified: Human approval flag; required before items from the link inbox and similar intake sources appear in the published build.
Domain prompt pack: Source-specific instruction text (channel, connector, era) composed with the shared tagging base before a vision model run.
Context routing: Rule-based selection of prompt packs from catalog metadata after URL normalization and fetch, analogous to choosing an editorial skill for the item.
RAG index (ingestion): Embedded text chunks from the published catalog, retrieved during ingest to supply neighborhood context for verification and review.
Perceptual hash (pHash): A fingerprint resilient to minor editing; useful to cluster visually similar images without byte-identical files.
Embedding (CLIP-class): A numeric vector summarizing image content so “same scene” pairs can be found even when the pixels differ substantially.
Vector index (FAISS-style): Approximate nearest-neighbor search over millions of vectors to retrieve similar images quickly.
Face gallery match: Detected faces compared to embedded portraits in the catalog for verification hints and memorial linkage.
Graph network (beta): Exported links between people and shared archive items, shown on Explore memorial views; still evolving.
