How this archive is built
This project connects grassroots documentation to a durable public collection. Media and text move through scraping, ingestion, AI-assisted enrichment, and heuristic, rules-based curation. Nothing is published before human review, which includes side-by-side visual inspection of near-duplicate items and deliberate decisions on how people are named. The aim is traceable provenance, bilingual consistency, and one coherent catalog that stays open to readers everywhere, so that witness to repression and violence is harder to lose to the next news cycle or to quiet forgetting.
Overview
Every item is meant to be findable by people, place, event, and theme, across English and Persian. Machine learning accelerates labeling, but sensitive merges and conflict resolution stay with reviewers so the catalog does not silently absorb mistakes.
flowchart TB
subgraph sources [Sources]
TG["Grassroots channels"]
MR["Memorial and rosters"]
SOC["Hashtag social and web harvest"]
end
subgraph corePipe [Core pipeline]
IN["Ingest and normalize"]
ST["Catalog and metadata"]
AIN["AI enrichment"]
end
subgraph qa [Quality assurance]
DUP["Media dedup and review"]
NAMES["Name normalization review"]
EDIT["Editorial overrides"]
end
subgraph out [Publication]
PKG["Curated export"]
WEBP["Static site and CDN"]
end
TG --> IN
MR --> IN
SOC --> IN
IN --> ST --> AIN
AIN --> DUP
AIN --> NAMES
DUP --> EDIT
NAMES --> EDIT
EDIT --> PKG --> WEBP
Sources and connectors
The archive draws heavily on Telegram, where much firsthand Iranian protest media is posted with captions and threads. Additional structured memorial or roster sources supply identity fields and portrait imagery in a more uniform way than typical chats.
Cross-platform harvest is part of the same architecture: tracked hashtags drive searches that pull in public posts and media from networks such as X (Twitter), alongside pages from the open web. That material is merged into memorial and item context so profiles are grounded in more than a single channel, but it always sits alongside the primary Telegram or roster record and its caption, never in place of it.
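The "alongside, never in place of" rule can be pictured as a record shape in which the primary caption is immutable and harvested material is only ever appended. This is a minimal sketch under that assumption; the class names, field names, and the example URL are hypothetical and not taken from the project's code.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PrimaryRecord:
    """The Telegram or roster record; frozen so later stages cannot rewrite it."""
    source: str   # e.g. "telegram" or "roster"
    caption: str


@dataclass
class ArchiveItem:
    primary: PrimaryRecord
    # Supporting context from the social and web harvest, kept separately.
    supporting: list[dict] = field(default_factory=list)

    def add_context(self, platform: str, url: str, text: str) -> None:
        """Append harvested context without touching the primary record."""
        self.supporting.append({"platform": platform, "url": url, "text": text})


item = ArchiveItem(PrimaryRecord(source="telegram", caption="Original caption"))
item.add_context("x", "https://example.org/post/1", "Related public post")
```

Keeping the two kinds of evidence in separate fields means a reader or reviewer can always tell which text came from the original post and which was gathered later.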
flowchart LR
subgraph tel [Telegram ecosystem]
CH["Public and activist channels"]
CAP["Captions and thread context"]
end
subgraph mem [Structured memorial feeds]
RO["Curated roster metadata"]
PT["Portrait and identity hints"]
end
subgraph cross [Social and open web harvest]
HASH["Hashtag-led searches"]
XPM["Posts and media from X"]
WEBSRC["Web pages and mirrors"]
end
CH --> CAP
RO --> PT
CAP --> primaryIngest["Primary ingest"]
PT --> primaryIngest
HASH --> XPM
HASH --> WEBSRC
XPM --> socialCtx["Cross channel context"]
WEBSRC --> socialCtx
socialCtx --> primaryIngest
AI enrichment
Google Gemini powers vision-first passes over photos and sampled video frames: full descriptions in English and Persian, semantic tags, people and places, and event hints, including detection of جاویدنام (Javidnam) memorial content where appropriate. Separate text calls condense many captions into structured memorial profiles (age, death context, hometown, and narrative bio) so the جاویدنامان index can show more than a single photo. Evidence from the social and web harvest layer is folded into that same enrichment pass as supporting context.
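One way to keep model output assistive is to validate it against a fixed profile shape before it reaches the catalog. The sketch below assumes the text call returns JSON with the four fields named above; the key names, the `MemorialProfile` class, and the plausibility bounds on age are illustrative assumptions, and no model API is called here.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class MemorialProfile:
    age: Optional[int]
    death_context: Optional[str]
    hometown: Optional[str]
    bio: str


def parse_profile(raw_json: str) -> MemorialProfile:
    """Parse model output into a profile, dropping values that fail basic checks."""
    data = json.loads(raw_json)
    age = data.get("age")
    is_int = isinstance(age, int) and not isinstance(age, bool)
    if not (is_int and 0 <= age <= 120):
        age = None  # implausible or non-numeric ages are left empty, not guessed
    return MemorialProfile(
        age=age,
        death_context=data.get("death_context") or None,
        hometown=data.get("hometown") or None,
        bio=str(data.get("bio", "")).strip(),
    )


# Placeholder payload standing in for a model response.
profile = parse_profile(
    '{"age": "unknown", "hometown": "", "death_context": "Placeholder context", '
    '"bio": "  Placeholder biography.  "}'
)
```

Fields that fail validation stay empty so a reviewer can fill them in, which fits the principle that AI output does not bypass editorial judgment.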
AI output is treated as assistive: it speeds curation but does not bypass editorial judgment on politically or emotionally charged material.
Quality, deduplication, and naming
Near-duplicate media is flagged in two ways: with perceptual hashing (pHash, which is robust to recompression and minor crops) and with OpenCLIP-style image embeddings compared in a FAISS-style vector index to find semantic “same scene” neighbors. Candidate groups and tag disagreements surface in a dedicated browser review tool; approved decisions are written back into the catalog, never silently.
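The hash side of this step can be sketched as follows, assuming each item already has a 64-bit perceptual hash stored as an integer. Hashes are compared by Hamming distance and items within a threshold are grouped for review. The threshold of 8 bits, the item ids, and the hash values are illustrative assumptions; computing the hash itself is out of scope here.

```python
from itertools import combinations


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def candidate_groups(hashes: dict[str, int], max_distance: int = 8) -> list[set[str]]:
    """Group item ids whose hashes lie within max_distance bits (union-find)."""
    parent = {item: item for item in hashes}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in combinations(hashes, 2):
        if hamming(hashes[a], hashes[b]) <= max_distance:
            parent[find(a)] = find(b)

    groups: dict[str, set[str]] = {}
    for item in hashes:
        groups.setdefault(find(item), set()).add(item)
    # Only groups with more than one member need a reviewer's attention.
    return [g for g in groups.values() if len(g) > 1]


# Hypothetical hashes: the first two differ by 2 bits, the third is unrelated.
sample = {
    "item-001": 0xF0F0F0F0F0F0F0F0,
    "item-002": 0xF0F0F0F0F0F0F0F3,
    "item-003": 0x0123456789ABCDEF,
}
groups = candidate_groups(sample)
```

The pairwise loop is quadratic and only suitable for small batches. The output is a list of candidates for the review tool, not a merge decision.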
Person names in English often fragment across small spelling variants. Fuzzy matching (RapidFuzz-class scores) proposes merge groups; reviewers pick a canonical spelling and aligned Persian tags so people, tag slugs, and memorial pages stay consistent.
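A minimal sketch of how merge groups might be proposed. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for RapidFuzz-class scoring, so the scores will differ from what RapidFuzz produces. The 0.85 threshold, the greedy grouping strategy, and the sample names (common names chosen for illustration, not archive data) are assumptions.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def propose_merge_groups(names: list[str], threshold: float = 0.85) -> list[list[str]]:
    """Greedy grouping: each name joins the first group whose first member it resembles."""
    groups: list[list[str]] = []
    for name in names:
        for group in groups:
            if similarity(name, group[0]) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    # Single-member groups have nothing to merge, so they are not proposed.
    return [g for g in groups if len(g) > 1]


names = ["Mohammad Hosseini", "Mohamad Hosseini", "Mohammad Hoseini", "Zahra Karimi"]
proposals = propose_merge_groups(names)
```

The function only proposes groups. Choosing the canonical spelling and the aligned Persian tag remains a reviewer's decision, since two similar spellings can belong to two different people.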
Hand corrections to the exported catalog can be preserved across rebuilds so one-off editorial fixes are not lost the next time data is regenerated.
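One common way to achieve this is to keep hand corrections in a separate overrides file and layer them over the regenerated catalog on every build. The sketch below assumes both are dictionaries keyed by item id; the function name, field names, and sample values are hypothetical.

```python
import copy


def apply_overrides(generated: dict[str, dict], overrides: dict[str, dict]) -> dict[str, dict]:
    """Return a new catalog in which override fields win over regenerated fields."""
    merged = copy.deepcopy(generated)  # never mutate the freshly generated data
    for item_id, fields in overrides.items():
        if item_id in merged:  # overrides for items that no longer exist are skipped
            merged[item_id].update(fields)
    return merged


generated = {
    "item-001": {"title": "Untitled", "tags": ["protest"]},
    "item-002": {"title": "Street scene", "tags": []},
}
overrides = {"item-001": {"title": "Vigil at the university gate"}}
catalog = apply_overrides(generated, overrides)
```

Because the overrides live outside the generated data, rerunning the pipeline replaces `generated` while the editorial fixes are reapplied unchanged.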
flowchart TB
subgraph mediaTrack [Near duplicate media]
PH["Perceptual hashes"]
EM["Image embeddings CLIP class"]
VX["Vector similarity search"]
RV1["Human review UI"]
PH --> candidates["Candidate groups"]
EM --> candidates
VX --> candidates
candidates --> RV1
end
subgraph nameTrack [Person name consistency]
FZ["Fuzzy English name match"]
RV2["Human merge and canonical pick"]
FZ --> RV2
end
subgraph editTrack [Editorial]
MAN["Hand corrections to catalog"]
end
RV1 --> canonCat["Canonical catalog"]
RV2 --> canonCat
MAN --> canonCat
canonCat --> pubSite["Public archive"]
Publication
The public site is a static build: pre-rendered HTML, no server-side database in front of visitors. Browsing runs through the archive grid, tag index, event index, and the جاویدنامان memorial roll with profile pages.
Glossary
- Javidnam / جاویدنام: Eternal-name memorial framing for people killed in connection with the protest movement; items and profiles can carry this designation.
- Perceptual hash (pHash): A fingerprint resilient to minor editing; useful for clustering visually similar images even when the files are not byte-identical.
- Embedding (CLIP-class): A numeric vector summarizing image content so that “same scene” pairs can be found even when pixels differ strongly.
- Vector index (FAISS-style): Approximate nearest-neighbor search over millions of vectors to retrieve similar images quickly.
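To make the last two glossary entries concrete, the sketch below compares embeddings by cosine similarity and returns the closest neighbors of a query vector. It is an exact brute-force search in plain Python, standing in for what an approximate vector index does at scale; the three-dimensional vectors and item ids are toy values, and real CLIP-class embeddings have hundreds of dimensions.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(query: list[float], vectors: dict[str, list[float]], k: int = 2) -> list[tuple[str, float]]:
    """Return the k item ids most similar to the query, best first."""
    scored = sorted(
        ((cosine(query, vec), item_id) for item_id, vec in vectors.items()),
        reverse=True,
    )
    return [(item_id, score) for score, item_id in scored[:k]]


# Toy embeddings: "a" and "b" point in nearly the same direction, "c" does not.
vectors = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.9, 0.1, 0.0],
    "c": [0.0, 1.0, 0.0],
}
neighbors = nearest([1.0, 0.0, 0.0], vectors)
```

Brute force compares the query with every stored vector, which is why a dedicated index becomes necessary as the collection grows.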