wikifuse¶
Merge Wikipedia articles across languages into one comprehensive, source-attributed page.
The Problem¶
Wikipedia articles vary dramatically across languages. A politician’s English page might have 3 references while the French version has 25. A scientist’s Hindi page might cover their early life in detail while English focuses on achievements. wikifuse merges these perspectives into a single, richer article with full source attribution.
Quick Start¶
pip install wikifuse
# Compare English-only vs merged English+French for Rachida Dati
wikifuse diff --qid Q27182 --base en --compare en,fr --out ./rachida_dati/ --no-llm
Example output:
$ wikifuse diff --qid Q27182 --base en --compare en,fr --out ./rachida_dati/ --no-llm
Base (en only): 3,245 words, 12 references
Merged (en+fr): 5,891 words, 47 references
Gain: +82% words, +292% references
See example diff output comparing Rachida Dati’s English vs English+French articles.
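The Gain line is a relative increase over the base article. Here is a minimal sketch of that arithmetic, using the counts from the example output. The helper name `percent_gain` is hypothetical, and it rounds to the nearest whole percent; wikifuse's own rounding rule is not documented here, so the last digit may differ.

```python
def percent_gain(base: int, merged: int) -> int:
    """Relative increase from base to merged, rounded to a whole percent."""
    if base <= 0:
        raise ValueError("base count must be positive")
    return round((merged - base) / base * 100)


# Counts taken from the example output above.
words_gain = percent_gain(3245, 5891)
refs_gain = percent_gain(12, 47)
print(f"Gain: +{words_gain}% words, +{refs_gain}% references")
```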
Commands¶
diff - Compare base vs merged article¶
Shows what you gain by merging across languages:
wikifuse diff --qid Q27182 --base en --compare en,fr --out ./output/
fetch - Download articles¶
wikifuse fetch --qid Q1058 --languages en,hi --out ./out/Q1058
merge - Combine across languages¶
wikifuse merge --qid Q1058 --languages en,hi --out ./out/Q1058
render - Output wikitext¶
wikifuse render --ir ./out/Q1058/wikifuse.ir.json --out ./out/Q1058/wikifuse.wikitext
preview - HTML preview¶
wikifuse preview --ir ./out/Q1058/wikifuse.ir.json --out ./out/Q1058/preview.html
How It Works¶
Fetch: Download the article from each requested language Wikipedia, located by its Wikidata QID
Translate: Non-English text is translated to English so claims can be aligned
Align: Sentence embeddings cluster semantically similar claims
Merge: Duplicate claims are collapsed while unique content and references are preserved
Render: Output wikitext or HTML with full provenance
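The Align and Merge steps can be illustrated with a small, self-contained sketch. It is not wikifuse's implementation: a bag-of-words vector stands in for a real sentence embedding, clustering is a simple greedy pass, and the `Claim` class, the `merge_claims` function, and the 0.7 threshold are all assumptions made for illustration.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Claim:
    text: str                     # sentence, already translated to English
    lang: str                     # language edition it came from
    refs: list[str] = field(default_factory=list)


def embed(text: str) -> Counter:
    # Stand-in for a sentence embedding: a bag-of-words count vector.
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_claims(claims: list[Claim], threshold: float = 0.7) -> list[dict]:
    """Greedy clustering: each claim joins the first similar cluster, else starts one."""
    clusters: list[dict] = []
    for claim in claims:
        vec = embed(claim.text)
        for cluster in clusters:
            if cosine(vec, cluster["vec"]) >= threshold:
                # Duplicate claim: keep the first wording, pool the provenance.
                if claim.lang not in cluster["langs"]:
                    cluster["langs"].append(claim.lang)
                for ref in claim.refs:
                    if ref not in cluster["refs"]:
                        cluster["refs"].append(ref)
                break
        else:
            clusters.append({"text": claim.text, "vec": vec,
                             "langs": [claim.lang], "refs": list(claim.refs)})
    return [{k: c[k] for k in ("text", "langs", "refs")} for c in clusters]


merged = merge_claims([
    Claim("The bridge opened in 1932.", "en", ["en-ref-1"]),
    Claim("In 1932 the bridge opened.", "fr", ["fr-ref-1"]),
    Claim("It carries eight lanes of traffic.", "fr", ["fr-ref-2"]),
])
for item in merged:
    print(item["langs"], item["refs"], item["text"])
```

The first two toy sentences collapse into one claim that carries references from both languages, while the third survives as unique content. A real embedding model would also catch paraphrases that share few words, which this word-count stand-in cannot.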
Output Files¶
wikifuse.ir.json - Intermediate Representation with sections, claims, and attribution
wikifuse.wikitext - MediaWiki wikitext ready for review
preview.html - HTML preview
diff.html - Side-by-side comparison (from diff command)
Configuration¶
# wikifuse.yaml
qid: Q1058
languages: [en, hi]
base_language: en
max_refs_per_claim: 3
emit: [ir, wikitext, html]
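A sketch of how the keys above could be sanity-checked before a run. The key names come from the example; the `FuseConfig` class and every validation rule below are assumptions for illustration, not wikifuse's documented behaviour.

```python
from dataclasses import dataclass

# The three emit values shown in the example; the full accepted set is an assumption.
ALLOWED_EMIT = {"ir", "wikitext", "html"}


@dataclass
class FuseConfig:
    qid: str
    languages: list[str]
    base_language: str
    max_refs_per_claim: int
    emit: list[str]

    def validate(self) -> None:
        # Wikidata item IDs are the letter Q followed by digits.
        if not (self.qid.startswith("Q") and self.qid[1:].isdigit()):
            raise ValueError(f"not a Wikidata QID: {self.qid!r}")
        # Assumed rule: the base language must be one of the merged languages.
        if self.base_language not in self.languages:
            raise ValueError("base_language must be one of languages")
        if self.max_refs_per_claim < 1:
            raise ValueError("max_refs_per_claim must be at least 1")
        unknown = set(self.emit) - ALLOWED_EMIT
        if unknown:
            raise ValueError(f"unknown emit targets: {sorted(unknown)}")


# Same values as the wikifuse.yaml example, written as a plain dict.
raw = {
    "qid": "Q1058",
    "languages": ["en", "hi"],
    "base_language": "en",
    "max_refs_per_claim": 3,
    "emit": ["ir", "wikitext", "html"],
}
config = FuseConfig(**raw)
config.validate()
print("config ok:", config.qid, config.languages)
```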
Installation¶
pip install wikifuse
For LLM-powered merging (uses OpenAI):
pip install wikifuse
export OPENAI_API_KEY=your-key
wikifuse merge --qid Q1058 --languages en,hi --out ./output/
Without LLM (basic text merge):
wikifuse merge --qid Q1058 --languages en,hi --out ./output/ --no-llm
Licensing & Attribution¶
Wikipedia text is CC BY-SA 4.0; remixes must include attribution
Generated ATTRIBUTION.md includes source language and revision IDs
Wikidata statements are under compatible open licenses
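To show what per-revision attribution can look like, here is a hedged sketch that builds attribution text from source languages and revision IDs. The layout is hypothetical and is not the actual format of the generated ATTRIBUTION.md. The `SourceRevision` class, the article title, and the revision IDs are placeholders. The links use MediaWiki's `Special:PermanentLink` page, which resolves a revision ID to that exact revision.

```python
from dataclasses import dataclass


@dataclass
class SourceRevision:
    lang: str    # Wikipedia language code, e.g. "en"
    title: str   # article title in that edition
    revid: int   # revision ID the text was taken from


def attribution_line(src: SourceRevision) -> str:
    url = f"https://{src.lang}.wikipedia.org/wiki/Special:PermanentLink/{src.revid}"
    return (f'- "{src.title}" ({src.lang}.wikipedia.org, '
            f"revision {src.revid}), CC BY-SA 4.0: {url}")


def build_attribution(sources: list[SourceRevision]) -> str:
    header = "# Attribution\n\nText remixed from these Wikipedia revisions:\n\n"
    return header + "\n".join(attribution_line(s) for s in sources) + "\n"


# Placeholder title and revision IDs, for illustration only.
text = build_attribution([
    SourceRevision("en", "Example article", 1001),
    SourceRevision("hi", "Example article", 2002),
])
print(text)
```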
Contributing¶
Issues and PRs welcome. Focus areas:
Enhanced translation service integration
Better cross-lingual alignment models
Performance optimization for large articles
License¶
MIT