How this archive is built
This project connects grassroots documentation to a durable public collection. Media and text move through scraping, ingestion, AI-assisted enrichment, and heuristic, rules-based curation. Human review follows, including side-by-side visual inspection of near-duplicate items and deliberate decisions on how people are named, before anything is published. The aim is traceable provenance, bilingual consistency, and one coherent catalog that stays open to readers everywhere, so that witness to repression and violence is harder to lose to the next news cycle and harder to quietly forget.
Overview
Every item is meant to be findable by people, place, event, and theme, across English and Persian. Machine learning accelerates labeling, but sensitive merges and conflict resolution stay with reviewers so the catalog does not silently absorb mistakes. Items from link intake and similar sources do not enter the public export until a curator marks them verified.
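The verification gate described above can be pictured as a simple filter at export time. This is a minimal sketch, not the project's code: the `Item` fields, the source labels, and the `GATED_SOURCES` set are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

# Assumption: these labels stand in for "link intake and similar sources".
GATED_SOURCES = {"link_inbox", "discovery_bot"}


@dataclass
class Item:
    item_id: str
    source: str
    curator_verified: bool = False


def exportable(items):
    """Keep items from ungated sources, plus gated items a curator has verified."""
    return [
        item
        for item in items
        if item.source not in GATED_SOURCES or item.curator_verified
    ]


items = [
    Item("a1", "telegram_channel"),
    Item("b2", "link_inbox"),
    Item("c3", "link_inbox", curator_verified=True),
]
print([item.item_id for item in exportable(items)])  # ['a1', 'c3']
```

Keeping the gate as a filter over the export, and leaving unverified rows in the working catalog, matches the idea that unverified rows wait in the queue until reviewed.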
Ingestion is context-aware: each link or post is normalized, then matched to domain-specific prompt packs and to retrieval hits from the published catalog before vision tagging runs. That keeps labeling consistent with source type (Telegram channel, memorial roster, X post, open-web image) and with what the archive already documents.
%%{init: {"flowchart": {"padding": 20, "nodeSpacing": 50, "rankSpacing": 58}}}%%
flowchart TB
subgraph sources [Sources]
TG["Grassroots<br/>channels"]
MR["Memorial<br/>rosters"]
SOC["Hashtag<br/>harvest"]
LI["Link<br/>intake"]
end
subgraph corePipe [Core pipeline]
IN["Ingest"]
ST["Catalog"]
RAG["RAG<br/>index"]
ROUTE["Prompt<br/>routing"]
TAG["Vision<br/>tagger"]
FACE["Face<br/>match"]
end
subgraph qa [Quality assurance]
CUR["Curator<br/>review"]
DUP["Dedup<br/>review"]
NAMES["Name<br/>review"]
EDIT["Overrides"]
end
subgraph out [Publication]
PKG["Export"]
WEBP["Static<br/>site"]
SRCH["Search<br/>index"]
GRAPH["Graph<br/>beta"]
end
TG --> IN
MR --> IN
SOC --> IN
LI --> IN
IN --> ST
ST --> RAG
ST --> ROUTE
RAG --> ROUTE
ROUTE --> TAG
TAG --> FACE
FACE --> CUR
CUR --> DUP
CUR --> NAMES
DUP --> EDIT
NAMES --> EDIT
EDIT --> PKG --> WEBP
PKG --> SRCH
PKG --> GRAPH
Sources and connectors
The archive draws heavily on Telegram, where much firsthand Iranian protest media is posted with captions and threads. Long-running public channels remain a backbone; a separate discovery intake path (monitored bot and backfill) queues URLs and attachments submitted from the field into the same catalog as bulk scrapes.
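As an illustration of what URL normalization for the discovery intake might look like, the sketch below classifies a submitted link by host and strips the query string and fragment so the same post is not queued twice. The host lists, the connector labels, and the decision to drop every query parameter are assumptions made for this example, not documented behavior of the pipeline.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed file extensions that mark a direct image link.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".webp", ".gif")


def classify_url(url):
    """Return (connector, normalized_url) for an absolute http(s) URL."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")

    if host in ("t.me", "telegram.me"):
        connector = "telegram"
    elif host in ("x.com", "twitter.com"):
        connector = "x"
    elif path.lower().endswith(IMAGE_EXTENSIONS):
        connector = "image"
    else:
        connector = "web"

    # Drop query and fragment so tracking parameters do not create duplicates.
    return connector, urlunsplit((parts.scheme, host, path, "", ""))


print(classify_url("https://T.me/example_channel/123?single"))
print(classify_url("https://www.twitter.com/user/status/1?s=20"))
```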
Additional structured memorial or roster sources supply identity fields and portrait imagery in a more uniform way than typical chats. Those feeds can be enqueued like grassroots links and then follow the same ingest, enrich, and review steps.
Cross-platform harvest is part of the same architecture: tracked hashtags drive searches that pull in public posts and media from networks such as X (Twitter), alongside pages from the open web. That material is merged into memorial and item context so profiles are grounded in more than a single channel, but it always sits alongside the primary Telegram or roster record and its caption, never in place of it.
%%{init: {"flowchart": {"padding": 20, "nodeSpacing": 50, "rankSpacing": 58}}}%%
flowchart TB
subgraph tel [Telegram]
CH["Public<br/>channels"]
CAP["Captions"]
BOT["Discovery<br/>intake"]
end
subgraph mem [Memorial feeds]
RO["Roster data"]
PT["Portraits"]
end
subgraph cross [Social and web]
HASH["Hashtag<br/>search"]
XPM["Posts<br/>from X"]
WEB["Web pages"]
end
CH --> CAP
RO --> PT
BOT --> urlDec["URL<br/>normalize"]
urlDec --> linkQ["Link<br/>inbox"]
CAP --> catalog["Catalog"]
PT --> catalog
HASH --> XPM
HASH --> WEB
XPM --> ctx["Cross-channel<br/>context"]
WEB --> ctx
ctx --> catalog
linkQ --> catalog
Storage
A working catalog on the operator side holds every media item, captions, AI fields, review flags, and provenance (connector, channel, message id). That database is the system of record while material is being curated; the public site never queries it directly.
Originals and thumbnails live in CDN-backed object storage. Stable public URLs in the exported JSON point readers at that media, not at a single machine.
Export snapshots turn the verified catalog into bilingual JSON for items, tags, events, people, places, and جاویدنامان profiles. Versioned snapshots support reproducible static builds and incremental rebuilds when only part of the collection changed.
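One way to support incremental rebuilds is to fingerprint each exported record and rebuild only the pages whose fingerprint changed between snapshots. The sketch below assumes each snapshot is a dictionary keyed by item id; the function names and record fields are hypothetical.

```python
import hashlib
import json


def fingerprint(record):
    """Stable hash of one export record; key order does not affect the result."""
    payload = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def changed_ids(previous, current):
    """Ids that are new or modified in `current` relative to `previous`."""
    old = {item_id: fingerprint(rec) for item_id, rec in previous.items()}
    return sorted(
        item_id
        for item_id, rec in current.items()
        if old.get(item_id) != fingerprint(rec)
    )


previous = {"a": {"title_en": "Vigil"}, "b": {"title_en": "March"}}
current = {
    "a": {"title_en": "Vigil"},
    "b": {"title_en": "March, evening"},
    "c": {"title_en": "Memorial"},
}
print(changed_ids(previous, current))  # ['b', 'c']
```

Removed items would need a separate pass (ids present in `previous` but not in `current`); that is left out here to keep the example short.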
Retrieval chunks (embeddings over export text) live beside the catalog so ingestion and verification can query similar memorial and item context without shipping the full archive into every prompt.
AI enrichment
Google Gemini powers vision-first passes over photos and sampled video frames: full descriptions in English and Persian, semantic tags, people and places, and event hints, including detection of جاویدنام (Javidnam) memorial content where appropriate. Separate text calls condense many captions into structured memorial profiles (age, death context, hometown, and narrative bio) so the جاویدنامان index can show more than a single photo. Evidence from the social and web harvest layer is folded into that same enrichment pass as supporting context.
Context routing works like selecting specialized editorial skills for each item. After a URL is decoded (Telegram, X, or direct image), the pipeline records connector, channel, and caption on the catalog row. A router then layers the right instruction packs: a shared bilingual tagging base, plus add-ons for link-inbox Telegram or X, memorial rosters, era-specific Twitter rules, or named channels with their own naming and event conventions. The result is one composed prompt per item, not a single generic template.
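The layering of instruction packs can be sketched as a small rule-based router. Everything below is illustrative: the pack texts, the metadata keys (`connector`, `channel`, `link_inbox`, `caption`), and the channel name are placeholders, since the real packs are not shown here.

```python
# Placeholder pack texts; the actual instruction packs are not part of this page.
BASE_PACK = "Base: tag every item in English and Persian."
CHANNEL_PACKS = {
    "example_channel": "Channel: follow this channel's naming and event conventions.",
}


def select_packs(meta):
    """Pick instruction packs from catalog metadata, base pack first."""
    packs = [BASE_PACK]
    connector = meta.get("connector")
    if meta.get("link_inbox") and connector in ("telegram", "x"):
        packs.append(f"Link inbox: treat {connector} captions as unverified.")
    if connector == "memorial_roster":
        packs.append("Roster: prefer structured identity fields.")
    channel_pack = CHANNEL_PACKS.get(meta.get("channel"))
    if channel_pack:
        packs.append(channel_pack)
    return packs


def compose_prompt(meta):
    """Join the selected packs and the caption into one prompt string."""
    sections = select_packs(meta)
    sections.append(f"Caption: {meta.get('caption', '')}")
    return "\n\n".join(sections)


meta = {
    "connector": "telegram",
    "channel": "example_channel",
    "link_inbox": True,
    "caption": "Evening vigil",
}
print(compose_prompt(meta))
```

Rules of this kind are easy to audit: a reviewer can read off which packs shaped a given item's prompt from its catalog row alone.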
In parallel, a RAG-style index built from published export text is queried on ingest. Top matching chunks (existing items, memorial rows, related captions) support duplicate checks, link-inbox verification, and reviewer context. Retrieval and prompt packs address the same goal, grounding new material in prior catalog evidence, through vector search and through explicit source-specific rules.
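Retrieval of top matching chunks comes down to ranking stored vectors by similarity to a query vector. The pure-Python sketch below uses cosine similarity over toy three-dimensional vectors; real embeddings would come from an embedding model and a vector store, neither of which is specified here, and the chunk ids are made up.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, chunks, k=3):
    """Return the k chunk ids most similar to the query, best first."""
    scored = sorted(
        ((cosine(query, vec), chunk_id) for chunk_id, vec in chunks.items()),
        reverse=True,
    )
    return [chunk_id for _, chunk_id in scored[:k]]


# Toy vectors standing in for embedded export text.
chunks = {
    "item-1": [1.0, 0.0, 0.0],
    "memorial-7": [0.9, 0.1, 0.0],
    "caption-3": [0.0, 1.0, 0.0],
}
print(top_k([1.0, 0.0, 0.0], chunks, k=2))  # ['item-1', 'memorial-7']
```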
The vision tagger receives that composed prompt, the media (photo or sampled video frames), and optional curator hints. It returns structured English and Persian descriptions, tags, people, places, and event fields that are merged with caption-derived facts and post-processing rules before anything is stored.
%%{init: {"flowchart": {"padding": 20, "nodeSpacing": 50, "rankSpacing": 58}}}%%
flowchart TB
subgraph intake [Intake]
URL["URL or<br/>caption"]
DEC["Normalize<br/>and fetch"]
META["Item<br/>metadata"]
end
subgraph context [Context assembly]
RAG["RAG top-k<br/>chunks"]
PACK["Prompts<br/>Skills"]
MERGE["Composed<br/>prompt"]
end
subgraph model [Vision tagging]
GEM["Gemini<br/>vision pass"]
OUT["Labels<br/>and text"]
end
URL --> DEC --> META
META --> PACK
META --> RAG
PACK --> MERGE
RAG --> MERGE
MERGE --> GEM --> OUT
AI output is treated as assistive: it speeds curation but does not bypass editorial judgment on politically or emotionally charged material, and it does not replace curator sign-off on sensitive intake.
Quality, deduplication, and naming
Face detection and matching run on incoming photos and sampled video frames: faces are embedded and compared to a catalog gallery built from verified portraits and memorial imagery. Matches surface in verification reports and link-inbox review as hints (likely same person, roster overlap, or duplicate portrait), alongside text retrieval from the RAG index. They assist reviewers; they do not auto-publish sensitive material.
Link inbox review surfaces unverified items across the archive with verification reports, duplicate hints, and face or name panels. Reviewers can verify, edit, remove, or hide items for later without dropping them from the underlying queue.
Near-duplicate media is flagged with perceptual hashing (pHash, robust to recompression and minor crops) and with OpenCLIP-style image embeddings compared in a FAISS-style vector index for semantic “same scene” neighbors. Candidate groups and tag disagreements surface in a unified browser review workspace; approved decisions are written back into the catalog, never silently.
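For the perceptual-hash side of that check, candidate pairs are typically found by counting differing bits between two hashes (the Hamming distance). The sketch below assumes 64-bit hashes have already been computed elsewhere; the threshold of 8 bits and the item ids are illustrative choices, not values taken from the project.

```python
def hamming(a, b):
    """Number of bit positions where two integer hashes differ."""
    return bin(a ^ b).count("1")


def candidate_pairs(hashes, max_distance=8):
    """Return (id_a, id_b, distance) for every pair within the threshold."""
    ids = sorted(hashes)
    pairs = []
    for i, left in enumerate(ids):
        for right in ids[i + 1:]:
            distance = hamming(hashes[left], hashes[right])
            if distance <= max_distance:
                pairs.append((left, right, distance))
    return pairs


# Made-up 64-bit hashes: "a" and "b" differ in one bit, "c" is unrelated.
hashes = {
    "a": 0xFFFF0000FFFF0000,
    "b": 0xFFFF0000FFFF0001,
    "c": 0x0000FFFF0000FFFF,
}
print(candidate_pairs(hashes))  # [('a', 'b', 1)]
```

The pairwise loop is quadratic, which is fine for a sketch; pairs produced this way are only candidates for the human review step, not merge decisions.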
Person names in English often fragment across small spelling variants. Fuzzy matching (RapidFuzz-class scores) proposes merge groups; reviewers pick a canonical spelling and aligned Persian tags so people, tag slugs, and memorial pages stay consistent.
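To show how merge groups might be proposed, the sketch below uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for RapidFuzz-class scoring. The 0.85 threshold, the greedy grouping strategy, and the example names (which are invented) are all assumptions.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def propose_merge_groups(names, threshold=0.85):
    """Greedily group names whose similarity to a group's first member passes the threshold."""
    groups = []
    for name in names:
        for group in groups:
            if similarity(name, group[0]) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    # Only groups with more than one spelling need a reviewer's decision.
    return [group for group in groups if len(group) > 1]


names = ["Sara Ahmadi", "Reza Karimi", "Sarah Ahmadi", "Ali Hosseini", "Reza Karimy"]
for group in propose_merge_groups(names):
    print(group)
```

The output is a proposal only: a reviewer still picks the canonical spelling and the aligned Persian tag for each group.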
Hand corrections to the exported catalog can be preserved across rebuilds so one-off editorial fixes are not lost the next time data is regenerated.
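One plausible way to keep hand corrections across rebuilds is to store them as a separate overrides layer and reapply it after each regeneration. The sketch below assumes both the export and the overrides are dictionaries keyed by item id; the field names and ids are hypothetical.

```python
def apply_overrides(export, overrides):
    """Return a new export where override fields win over regenerated fields."""
    merged = {}
    for item_id, record in export.items():
        merged[item_id] = {**record, **overrides.get(item_id, {})}
    return merged


regenerated = {
    "item-1": {"title_en": "Vigl", "place": "Tehran"},
    "item-2": {"title_en": "March", "place": "Shiraz"},
}
# A one-off editorial fix that must survive the next rebuild.
overrides = {"item-1": {"title_en": "Vigil"}}

print(apply_overrides(regenerated, overrides)["item-1"])
# {'title_en': 'Vigil', 'place': 'Tehran'}
```

Because the overrides live apart from the generated data, rebuilding the export never touches them, and the function leaves its inputs unmodified.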
%%{init: {"flowchart": {"padding": 20, "nodeSpacing": 50, "rankSpacing": 58}}}%%
flowchart TB
subgraph inboxTrack [Link inbox]
VER["Auto<br/>checks"]
FD["Face<br/>detect"]
FM["Gallery<br/>match"]
RV0["Human<br/>review"]
VER --> RV0
FD --> FM --> RV0
end
subgraph mediaTrack [Near duplicates]
PH["pHash"]
EM["CLIP<br/>embed"]
VX["FAISS<br/>search"]
RV1["Human<br/>review"]
PH --> cand["Candidates"]
EM --> cand
VX --> cand
cand --> RV1
end
subgraph nameTrack [Names]
FZ["Fuzzy<br/>match"]
RV2["Canonical<br/>pick"]
FZ --> RV2
end
subgraph editTrack [Editorial]
MAN["Hand<br/>edits"]
end
RV0 --> canonCat["Catalog"]
RV1 --> canonCat
RV2 --> canonCat
MAN --> canonCat
canonCat --> pubSite["Public site"]
Publication
The public site is a static build: pre-rendered HTML, no server-side database in front of visitors. Browsing runs through the archive grid, tag index, event index, item detail pages, and the جاویدنامان memorial roll with profile pages. Bilingual fields in the export drive English and Persian labels where both exist.
Full-text search uses an index generated from exported content and hosted on the CDN alongside media. It is kept separate from the HTML deploy so search can be refreshed without rebuilding every page.
A graph network (beta) is exported from co-occurrence and shared-item links between people, places, and archived media. Explore views on جاویدنامان profiles show related names and connections; the feature is experimental and may change as the catalog grows.
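Co-occurrence edges of this kind can be derived by counting how often two names appear on the same item. The sketch below covers people only and assumes each item carries a list of person tags; the item ids and names are invented for the example.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(items):
    """Count, for each unordered pair of people, the items they share."""
    edges = Counter()
    for people in items.values():
        # Deduplicate and sort so each pair has one canonical ordering.
        for a, b in combinations(sorted(set(people)), 2):
            edges[(a, b)] += 1
    return edges


items = {
    "item-1": ["Sara Ahmadi", "Reza Karimi"],
    "item-2": ["Reza Karimi", "Sara Ahmadi", "Ali Hosseini"],
    "item-3": ["Ali Hosseini"],
}
edges = cooccurrence_edges(items)
print(edges[("Reza Karimi", "Sara Ahmadi")])  # 2
```

Edge weights like these could drive the "related names" shown on a profile, with places and media added as further node types.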
Deployments flow from the project repository to edge-hosted static pages, with media served from object storage. Curated JSON and HTML are committed after human review passes complete, keeping the live site aligned with verified catalog state.
Glossary
- Javidnam / جاویدنام: Eternal-name memorial framing for people killed in connection with the protest movement; items and profiles can carry this designation.
- Link inbox: Intake queue for new URLs and media awaiting curator review before they are included in the public export.
- Curator verified: Human approval flag; required for link-inbox-class and similar intake sources in the published build.
- Domain prompt pack: Source-specific instruction text (channel, connector, era) composed with the shared tagging base before a vision model run.
- Context routing: Rule-based selection of prompt packs from catalog metadata after URL normalization and fetch, analogous to choosing an editorial skill for the item.
- RAG index (ingestion): Embedded text chunks from the published catalog, retrieved during ingest to supply neighborhood context for verification and review.
- Perceptual hash (pHash): A fingerprint resilient to minor editing; useful for clustering visually similar images that are not byte-identical.
- Embedding (CLIP-class): A numeric vector summarizing image content so “same scene” pairs can be found even when the pixels differ substantially.
- Vector index (FAISS-style): Approximate nearest-neighbor search over millions of vectors to retrieve similar images quickly.
- Face gallery match: Detected faces compared to embedded portraits in the catalog for verification hints and memorial linkage.
- Graph network (beta): Exported links between people and shared archive items, shown on Explore memorial views; still evolving.