wikifuse

Requires Python 3.11+. License: MIT.

Merge Wikipedia articles across languages into one comprehensive, source-attributed page.

The Problem

Wikipedia articles vary dramatically across languages. A politician’s English page might have 3 references while the French version has 25. A scientist’s Hindi page might cover their early life in detail while English focuses on achievements. wikifuse merges these perspectives into a single, richer article with full source attribution.

Quick Start

pip install wikifuse

# Compare English-only vs merged English+French for Rachida Dati
wikifuse diff --qid Q27182 --base en --compare en,fr --out ./rachida_dati/ --no-llm

Example output:

$ wikifuse diff --qid Q27182 --base en --compare en,fr --out ./rachida_dati/ --no-llm

Base (en only):     3,245 words, 12 references
Merged (en+fr):     5,891 words, 47 references
Gain:               +82% words, +292% references

See example diff output comparing Rachida Dati’s English vs English+French articles.
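
The gain figures read as relative increases over the base article. A minimal sketch of that arithmetic, assuming this is how the percentages are derived; the function name `relative_gain` is illustrative and not part of wikifuse:

```python
def relative_gain(base: int, merged: int) -> float:
    """Return the percentage increase of `merged` over `base`."""
    if base <= 0:
        raise ValueError("base must be positive")
    return (merged - base) / base * 100


# Counts taken from the example output above.
print(f"words: +{relative_gain(3245, 5891):.1f}%")    # words: +81.5%
print(f"references: +{relative_gain(12, 47):.1f}%")   # references: +291.7%
```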

Commands

diff - Compare base vs merged article

Shows what you gain by merging across languages:

wikifuse diff --qid Q27182 --base en --compare en,fr --out ./output/

fetch - Download articles

wikifuse fetch --qid Q1058 --languages en,hi --out ./out/Q1058
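
This README does not show how wikifuse resolves a QID internally. One plausible approach, sketched here as an assumption, is to ask the Wikidata `wbgetentities` API for the item's sitelinks and read off the article title for each language. The helper names are hypothetical, and the sample response is hand-written in the shape the API returns rather than fetched live:

```python
import json
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def sitelinks_url(qid: str, languages: list[str]) -> str:
    """Build a wbgetentities request for the item's sitelinks in the given languages."""
    params = {
        "action": "wbgetentities",
        "ids": qid,
        "props": "sitelinks",
        "sitefilter": "|".join(f"{lang}wiki" for lang in languages),
        "format": "json",
    }
    return f"{WIKIDATA_API}?{urlencode(params)}"


def titles_by_language(response: dict, qid: str) -> dict[str, str]:
    """Map language code -> article title from a wbgetentities response."""
    sitelinks = response["entities"][qid].get("sitelinks", {})
    return {
        site.removesuffix("wiki"): link["title"]
        for site, link in sitelinks.items()
        if site.endswith("wiki")
    }


# Hand-written sample in the shape of a wbgetentities response (not live data).
sample = json.loads(
    '{"entities": {"Q27182": {"sitelinks": {'
    '"enwiki": {"site": "enwiki", "title": "Rachida Dati"},'
    '"frwiki": {"site": "frwiki", "title": "Rachida Dati"}}}}}'
)
print(sitelinks_url("Q27182", ["en", "fr"]))
print(titles_by_language(sample, "Q27182"))
```

The sketch only builds the URL and parses a response, so it runs offline. Language codes containing hyphens map to site IDs with underscores, which this simple suffix-stripping does not handle.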

merge - Combine across languages

wikifuse merge --qid Q1058 --languages en,hi --out ./out/Q1058

render - Output wikitext

wikifuse render --ir ./out/Q1058/wikifuse.ir.json --out ./out/Q1058/wikifuse.wikitext

preview - HTML preview

wikifuse preview --ir ./out/Q1058/wikifuse.ir.json --out ./out/Q1058/preview.html

How It Works

  1. Fetch: Download articles from multiple language Wikipedias using a Wikidata QID

  2. Translate: Non-English text is translated to English for alignment

  3. Align: Sentence embeddings cluster semantically similar claims

  4. Merge: Deduplicate while preserving unique content and references

  5. Render: Output wikitext or HTML with full provenance
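
The align and merge steps above can be illustrated with a small, self-contained sketch. It is not wikifuse's implementation: a bag-of-words vector stands in for real sentence embeddings, the clustering is a simple greedy pass, and the `Claim` class, the 0.6 threshold, and the reference cap are all assumptions made for illustration. The sample sentences are placeholders.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Claim:
    text: str   # sentence, already in English
    lang: str   # language edition it came from
    refs: list[str] = field(default_factory=list)


def embed(text: str) -> Counter:
    """Toy stand-in for a sentence embedding: a bag-of-words count vector."""
    return Counter(text.lower().replace(".", "").replace(",", "").split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster(claims: list[Claim], threshold: float = 0.6) -> list[list[Claim]]:
    """Greedy clustering: join the first cluster whose first claim is similar enough."""
    clusters: list[list[Claim]] = []
    for claim in claims:
        vec = embed(claim.text)
        for group in clusters:
            if cosine(vec, embed(group[0].text)) >= threshold:
                group.append(claim)
                break
        else:
            clusters.append([claim])
    return clusters


def merge(group: list[Claim], base_lang: str = "en", max_refs: int = 3) -> Claim:
    """Keep the base-language wording when available and pool unique references."""
    keep = next((c for c in group if c.lang == base_lang), group[0])
    refs = list(dict.fromkeys(r for c in group for r in c.refs))  # de-duplicate, keep order
    return Claim(keep.text, keep.lang, refs[:max_refs])


# Placeholder claims, not taken from any real article.
claims = [
    Claim("The bridge opened to traffic in 1932.", "en", ["ref-a"]),
    Claim("The bridge opened in 1932 to traffic.", "fr", ["ref-b", "ref-a"]),
    Claim("It was designed by a local engineer.", "fr", ["ref-c"]),
]
groups = cluster(claims)
merged = [merge(g) for g in groups]
for claim in merged:
    print(claim.lang, claim.refs, claim.text)
```

The first two claims collapse into one English sentence that carries both references, while the French-only claim survives on its own, which is the "deduplicate while preserving unique content" behavior in miniature.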

Output Files

  • wikifuse.ir.json - Intermediate Representation with sections, claims, and attribution

  • wikifuse.wikitext - MediaWiki wikitext ready for review

  • preview.html - HTML preview

  • diff.html - Side-by-side comparison (from diff command)

Configuration

# wikifuse.yaml
qid: Q1058
languages: [en, hi]
base_language: en
max_refs_per_claim: 3
emit: [ir, wikitext, html]
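
To make the fields concrete, here is a hypothetical in-memory mirror of the file with a few sanity checks. The class name, the defaults, and the validation rules are assumptions for illustration; wikifuse's own loader and constraints are not documented in this README.

```python
import re
from dataclasses import dataclass, field


@dataclass
class FuseConfig:
    """Hypothetical mirror of wikifuse.yaml; not wikifuse's own class."""
    qid: str
    languages: list[str]
    base_language: str = "en"
    max_refs_per_claim: int = 3
    emit: list[str] = field(default_factory=lambda: ["ir", "wikitext", "html"])

    def problems(self) -> list[str]:
        """Return human-readable problems; an empty list means nothing was flagged."""
        found = []
        if not re.fullmatch(r"Q[1-9][0-9]*", self.qid):
            found.append(f"qid {self.qid!r} is not of the form Q<digits>")
        if self.base_language not in self.languages:
            found.append("base_language should be one of languages")  # assumed rule
        if self.max_refs_per_claim < 1:
            found.append("max_refs_per_claim should be at least 1")   # assumed rule
        return found


config = FuseConfig(qid="Q1058", languages=["en", "hi"])
print(config.problems())  # []
```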

Installation

pip install wikifuse

For LLM-powered merging (uses OpenAI):

pip install wikifuse
export OPENAI_API_KEY=your-key
wikifuse merge --qid Q1058 --languages en,hi --out ./output/

Without LLM (basic text merge):

wikifuse merge --qid Q1058 --languages en,hi --out ./output/ --no-llm

Licensing & Attribution

  • Wikipedia text is CC BY-SA 4.0; remixes must include attribution and be shared under the same license

  • Generated ATTRIBUTION.md includes source language and revision IDs

  • Wikidata statements are released under CC0, which is compatible with reuse in merged articles

Contributing

Issues and PRs welcome. Focus areas:

  • Enhanced translation service integration

  • Better cross-lingual alignment models

  • Performance optimization for large articles

License

MIT
