Note: This archive may include graphic documentation of violence and harm.

How this archive is built

This project connects grassroots documentation to a durable public collection. Media and text move through scraping, ingestion, AI-assisted enrichment, and heuristic, rules-based curation. Human review follows, including side-by-side visual inspection of near-duplicate items and deliberate decisions on how people are named, before anything is published. The aim is traceable provenance, bilingual consistency, and one coherent catalog that stays open to readers everywhere, so that witness to repression and violence is harder to lose to the next news cycle or to quiet forgetting.

Overview

Every item is meant to be findable by people, place, event, and theme, across English and Persian. Machine learning accelerates labeling, but sensitive merges and conflict resolution stay with reviewers so the catalog does not silently absorb mistakes.

End-to-end flow from sources to the live site
flowchart TB
  subgraph sources [Sources]
    TG["Grassroots channels"]
    MR["Memorial and rosters"]
    SOC["Hashtag social and web harvest"]
  end
  subgraph corePipe [Core pipeline]
    IN["Ingest and normalize"]
    ST["Catalog and metadata"]
    AIN["AI enrichment"]
  end
  subgraph qa [Quality assurance]
    DUP["Media dedup and review"]
    NAMES["Name normalization review"]
    EDIT["Editorial overrides"]
  end
  subgraph out [Publication]
    PKG["Curated export"]
    WEBP["Static site and CDN"]
  end
  TG --> IN
  MR --> IN
  SOC --> IN
  IN --> ST --> AIN
  AIN --> DUP
  AIN --> NAMES
  DUP --> EDIT
  NAMES --> EDIT
  EDIT --> PKG --> WEBP

Sources and connectors

The archive draws heavily on Telegram, where much firsthand Iranian protest media is posted with captions and threads. Additional structured memorial or roster sources supply identity fields and portrait imagery in a more uniform way than typical chats.

Cross-platform harvest is part of the same architecture: tracked hashtags drive searches that pull in public posts and media from networks such as X (Twitter), alongside pages from the open web. That material is merged into memorial and item context so profiles are grounded in more than a single channel, but it always sits alongside the primary Telegram or roster record and its caption, never in place of it.
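The "alongside, never in place of" rule can be pictured as a record shape. The sketch below is illustrative only: the class names, field names, and source labels are hypothetical and are not taken from the project's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class SupportingEvidence:
    """Harvested context from social or web sources (hypothetical shape)."""
    source: str  # e.g. "x" or "web"; labels are assumptions
    url: str
    text: str


@dataclass
class ArchiveItem:
    """A catalog item whose primary record is kept separate from harvested context."""
    item_id: str
    primary_source: str   # "telegram" or "roster" in this sketch
    primary_caption: str
    supporting: list[SupportingEvidence] = field(default_factory=list)

    def add_supporting(self, evidence: SupportingEvidence) -> None:
        # Harvested material is only appended; the primary fields are never overwritten.
        self.supporting.append(evidence)


item = ArchiveItem("item-1", "telegram", "Original channel caption")
item.add_supporting(SupportingEvidence("web", "https://example.org/page", "Mirror of the same report"))
```

Keeping the harvested material in its own list means a reader or reviewer can always tell which text came from the primary record and which was added as context.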

Where material enters the pipeline
flowchart LR
  subgraph tel [Telegram ecosystem]
    CH["Public and activist channels"]
    CAP["Captions and thread context"]
  end
  subgraph mem [Structured memorial feeds]
    RO["Curated roster metadata"]
    PT["Portrait and identity hints"]
  end
  subgraph cross [Social and open web harvest]
    HASH["Hashtag-led searches"]
    XPM["Posts and media from X"]
    WEBSRC["Web pages and mirrors"]
  end
  CH --> CAP
  RO --> PT
  CAP --> primaryIngest["Primary ingest"]
  PT --> primaryIngest
  HASH --> XPM
  HASH --> WEBSRC
  XPM --> socialCtx["Cross channel context"]
  WEBSRC --> socialCtx
  socialCtx --> primaryIngest

AI enrichment

Google Gemini powers vision-first passes over photos and sampled video frames: full descriptions in English and Persian, semantic tags, people and places, and event hints, including detection of جاویدنام (Javidnam) memorial content where appropriate. Separate text calls condense many captions into structured memorial profiles (age, death context, hometown, and narrative bio) so the جاویدنامان index can show more than a single photo. Evidence from the social and web harvest layer is folded into that same enrichment pass as supporting context.

AI output is treated as assistive: it speeds curation but does not bypass editorial judgment on politically or emotionally charged material.

Quality, deduplication, and naming

Near-duplicate media is flagged in two ways: with perceptual hashing (pHash, which is robust to recompression and minor crops) and with OpenCLIP-style image embeddings compared in a FAISS-style vector index to find semantic “same scene” neighbors. Candidate groups and tag disagreements surface in a dedicated browser review tool; decisions are written back into the catalog only after a reviewer approves them, never silently.
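As a rough illustration of the hashing side, the sketch below groups images whose 64-bit perceptual hashes differ in only a few bits. It assumes the hashes have already been computed elsewhere and are available as integers; the function names, file names, and the distance threshold of 6 are assumptions for the example, not the project's actual settings. The output is a list of candidate groups for a reviewer, not a merge decision.

```python
from itertools import combinations


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def candidate_groups(hashes: dict[str, int], max_distance: int = 6) -> list[set[str]]:
    """Group file names whose hashes are within max_distance bits (union-find)."""
    parent = {name: name for name in hashes}

    def find(x: str) -> str:
        # Walk to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(hashes, 2):
        if hamming(hashes[a], hashes[b]) <= max_distance:
            parent[find(a)] = find(b)

    groups: dict[str, set[str]] = {}
    for name in hashes:
        groups.setdefault(find(name), set()).add(name)
    # Only groups with more than one member need human review.
    return [g for g in groups.values() if len(g) > 1]


example_hashes = {
    "a.jpg": 0xFFFF0000FFFF0000,
    "b.jpg": 0xFFFF0000FFFF0001,  # one bit away from a.jpg
    "c.jpg": 0x0123456789ABCDEF,  # unrelated image
}
groups = candidate_groups(example_hashes)
```

Comparing every pair is quadratic in the number of images, which is fine for a sketch but would need an index or bucketing step at archive scale.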

Person names in English often fragment across small spelling variants. Fuzzy matching (RapidFuzz-class scores) proposes merge groups; reviewers pick a canonical spelling and aligned Persian tags so people, tag slugs, and memorial pages stay consistent.
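The proposal step might look something like the sketch below. To stay dependency-free it uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for a RapidFuzz-style scorer, so the scores will not match RapidFuzz exactly. The names are fictional, and the threshold of 85 and the function names are assumptions. The function only proposes groups; picking the canonical spelling stays with a reviewer.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity on a 0-100 scale (difflib stand-in)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def propose_merge_groups(names: list[str], threshold: float = 85.0) -> list[list[str]]:
    """Greedily group spellings that score at or above the threshold."""
    groups: list[list[str]] = []
    for name in sorted(set(names)):
        for group in groups:
            if any(similarity(name, member) >= threshold for member in group):
                group.append(name)
                break
        else:
            # No existing group is close enough, so start a new one.
            groups.append([name])
    # Singletons need no merge decision.
    return [g for g in groups if len(g) > 1]


# Fictional names, for illustration only.
spellings = ["Sara Ahmadi", "Sarah Ahmadi", "Sara Ahmady", "Reza Karimi", "Sara Ahmadi"]
proposals = propose_merge_groups(spellings)
```

Greedy grouping is order-dependent, which is acceptable here because every proposed group is confirmed or split by a person before tags and memorial pages are updated.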

Hand corrections to the exported catalog can be preserved across rebuilds so one-off editorial fixes are not lost the next time data is regenerated.
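One simple way to get this behavior is to keep corrections in a separate overrides mapping and re-apply it after each rebuild. The sketch below assumes the catalog is a dictionary keyed by item ID; the function name, field names, and IDs are hypothetical.

```python
import copy


def apply_overrides(regenerated: dict[str, dict], overrides: dict[str, dict]) -> dict[str, dict]:
    """Return a copy of the regenerated catalog with hand corrections layered on top."""
    catalog = copy.deepcopy(regenerated)
    for item_id, fields in overrides.items():
        if item_id in catalog:
            # Only the corrected fields are replaced; everything else comes from the rebuild.
            catalog[item_id].update(fields)
        # Overrides for items that no longer exist are skipped in this sketch.
    return catalog


rebuilt = {
    "item-1": {"title": "Protest in sqare", "tags": ["protest"]},
    "item-2": {"title": "Memorial gathering", "tags": ["memorial"]},
}
corrections = {
    "item-1": {"title": "Protest in square"},
    "item-9": {"title": "No longer in the catalog"},
}
final = apply_overrides(rebuilt, corrections)
```

Because the overrides live outside the generated data, regenerating the catalog never touches them, and each fix remains visible as an explicit editorial decision.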

Parallel QA tracks converging on one canonical catalog
flowchart TB
  subgraph mediaTrack [Near duplicate media]
    PH["Perceptual hashes"]
    EM["Image embeddings CLIP class"]
    VX["Vector similarity search"]
    RV1["Human review UI"]
    PH --> candidates["Candidate groups"]
    EM --> VX
    VX --> candidates
    candidates --> RV1
  end
  subgraph nameTrack [Person name consistency]
    FZ["Fuzzy English name match"]
    RV2["Human merge and canonical pick"]
    FZ --> RV2
  end
  subgraph editTrack [Editorial]
    MAN["Hand corrections to catalog"]
  end
  RV1 --> canonCat["Canonical catalog"]
  RV2 --> canonCat
  MAN --> canonCat
  canonCat --> pubSite["Public archive"]

Publication

The public site is a static build: pre-rendered HTML, no server-side database in front of visitors. Browsing runs through the archive grid, tag index, event index, and the جاویدنامان memorial roll with profile pages.

Glossary

Javidnam / جاویدنام
Eternal-name memorial framing for people killed in connection with the protest movement; items and profiles can carry this designation.
Perceptual hash (pHash)
A fingerprint resilient to minor editing; useful for clustering visually similar images even when the files are not byte-identical.
Embedding (CLIP-class)
A numeric vector summarizing image content so “same scene” pairs can be found even when the pixels differ substantially.
Vector index (FAISS-style)
Approximate nearest-neighbor search over millions of vectors to retrieve similar images quickly.
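To make the last two entries concrete, the sketch below finds the closest embeddings to a query by cosine similarity. It is an exact brute-force search in plain Python, used here as a stand-in for the approximate search a FAISS-style index provides. The three-dimensional vectors, the keys, and the function names are made up for illustration; real image embeddings have hundreds of dimensions.

```python
import math


def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def nearest(query: list[float], vectors: dict[str, list[float]], k: int = 2) -> list[tuple[str, float]]:
    """Return the k keys most similar to the query, best match first."""
    q = normalize(query)
    scored = []
    for key, vec in vectors.items():
        v = normalize(vec)
        scored.append((sum(a * b for a, b in zip(q, v)), key))
    scored.sort(reverse=True)
    return [(key, score) for score, key in scored[:k]]


embeddings = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.9, 0.1, 0.0],  # nearly the same direction as "a"
    "c": [0.0, 0.0, 1.0],  # unrelated direction
}
neighbors = nearest([1.0, 0.0, 0.0], embeddings, k=2)
```

Brute force scans every vector for each query; an approximate index trades a small amount of accuracy for much faster lookups on large collections.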
