API Reference

wikifuse.api

API clients for fetching Wikipedia and Wikidata content.

class wikifuse.api.ArticleFetcher[source]

Bases: object

Fetch Wikipedia articles across languages for a given QID.

fetch_all(qid: str, languages: list[str], output_dir: str) dict[str, Any][source]
class wikifuse.api.WikidataClient(base_url: str = 'https://www.wikidata.org/w/api.php')[source]

Bases: object

Client for interacting with the Wikidata API.

get_entity(qid: str, languages: list[str] | None = None) Entity[source]

Fetch entity data from Wikidata.
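
As a rough illustration of the data involved, the sketch below flattens a wbgetentities-style response into the label, description, and alias mappings that Entity holds. The helper name entity_fields_from_response and the sample payload are hypothetical; how WikidataClient itself parses responses is not documented here.

```python
from typing import Any


def entity_fields_from_response(data: dict[str, Any], qid: str) -> dict[str, Any]:
    """Flatten a wbgetentities-style payload into plain per-language maps."""
    raw = data.get("entities", {}).get(qid, {})
    return {
        "qid": qid,
        "labels": {lang: v["value"] for lang, v in raw.get("labels", {}).items()},
        "descriptions": {
            lang: v["value"] for lang, v in raw.get("descriptions", {}).items()
        },
        "aliases": {
            lang: [alias["value"] for alias in items]
            for lang, items in raw.get("aliases", {}).items()
        },
    }


# Hand-written sample payload (illustrative, not a recorded API response).
sample = {
    "entities": {
        "Q42": {
            "labels": {"en": {"language": "en", "value": "Douglas Adams"}},
            "descriptions": {"en": {"language": "en", "value": "sample description"}},
            "aliases": {"en": [{"language": "en", "value": "Douglas Noel Adams"}]},
        }
    }
}
fields = entity_fields_from_response(sample, "Q42")
```

The resulting mappings have the same shapes as the labels, descriptions, and aliases fields of wikifuse.models.Entity.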

get_entity_claims(qid: str, properties: list[str] | None = None) dict[str, list[dict[str, Any]]][source]

Get claims (statements) for a Wikidata entity.

Get sitelinks (Wikipedia article titles) for an entity.

class wikifuse.api.WikipediaClient[source]

Bases: object

Client for fetching Wikipedia article content.

get_article_extract(title: str, language: str, sentences: int = 10) dict[str, Any][source]
get_article_length(title: str, language: str) int[source]

Get the length (in bytes) of a Wikipedia article.

Parameters:
  • title – Article title.

  • language – Language code.

Returns:

Article length in bytes, or 0 if not found.
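
One way such a lookup could work is through the MediaWiki Action API, where action=query with prop=info reports a page's size in bytes in a length field. Whether WikipediaClient uses this endpoint is an assumption; the sketch only shows how a response of that shape maps to the documented return value, including the 0 fallback. The helper name and sample payloads are hypothetical.

```python
from typing import Any


def length_from_info_response(data: dict[str, Any]) -> int:
    """Return the 'length' of the first page, or 0 if the page is missing."""
    pages = data.get("query", {}).get("pages", {})
    for page in pages.values():
        if "missing" in page:
            return 0
        return int(page.get("length", 0))
    return 0  # no pages in the response at all


# Hand-written sample payloads (illustrative only).
found = {"query": {"pages": {"123": {"pageid": 123, "title": "Example", "length": 4567}}}}
missing = {"query": {"pages": {"-1": {"title": "No such page", "missing": ""}}}}

found_length = length_from_info_response(found)
missing_length = length_from_info_response(missing)
```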

get_article_sections(title: str, language: str) list[dict[str, Any]][source]
get_article_wikitext(title: str, language: str) dict[str, Any][source]
search_articles(query: str, language: str, limit: int = 10) list[dict[str, Any]][source]
wikifuse.api.select_top_languages(qid: str, top_n: int = 2, always_include: list[str] | None = None) list[str][source]

Select the top N languages by article size for a given entity.

Parameters:
  • qid – Wikidata QID of the entity.

  • top_n – Number of top languages to select.

  • always_include – Languages to always include even if not in top N.

Returns:

List of language codes for the selected languages.
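
The selection rule described above can be sketched over a precomputed size table. The function pick_top_languages and the byte counts are hypothetical; the real function takes a QID and presumably looks the sizes up itself, and how it orders always_include entries is not specified.

```python
from typing import Optional


def pick_top_languages(
    sizes: dict[str, int],
    top_n: int = 2,
    always_include: Optional[list[str]] = None,
) -> list[str]:
    """Pick the top_n largest languages, then append any forced extras."""
    ranked = sorted(sizes, key=lambda lang: sizes[lang], reverse=True)
    selected = ranked[:top_n]
    for lang in always_include or []:
        if lang not in selected:
            selected.append(lang)
    return selected


# Made-up article sizes in bytes, keyed by language code.
sizes = {"en": 120_000, "de": 95_000, "fr": 150_000, "it": 40_000}
chosen = pick_top_languages(sizes, top_n=2, always_include=["en"])
with_extra = pick_top_languages(sizes, top_n=2, always_include=["it"])
```

In this sketch, languages from always_include that fall outside the top N are appended after the ranked ones, so the result can be longer than top_n.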

wikifuse.cli

Command line interface for wikifuse.

wikifuse.cli.main() None[source]

wikifuse.models

Core data models for the Intermediate Representation (IR).

class wikifuse.models.Claim(id: str, type: str = 'claim', lang: str = 'en', text: str = '', text_en: str | None = None, sources: list[str] = <factory>, provenance: Provenance | None = None, confidence: float | None = None)[source]

Bases: object

A textual claim from a Wikipedia article.

confidence: float | None = None
id: str
lang: str = 'en'
provenance: Provenance | None = None
sources: list[str]
text: str = ''
text_en: str | None = None
to_dict() dict[str, Any][source]

Convert to dictionary for JSON serialization.

type: str = 'claim'
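
To show how the Claim and Provenance fields fit together, the sketch below mirrors both signatures with stand-in dataclasses (ClaimSketch and ProvenanceSketch are hypothetical names, not part of wikifuse). Its to_dict uses dataclasses.asdict, which is an assumption; the library's own to_dict may omit empty fields or use different keys.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Optional


@dataclass
class ProvenanceSketch:
    wiki: str
    title: str
    rev_id: int


@dataclass
class ClaimSketch:
    id: str
    type: str = "claim"
    lang: str = "en"
    text: str = ""
    text_en: Optional[str] = None
    sources: list[str] = field(default_factory=list)
    provenance: Optional[ProvenanceSketch] = None
    confidence: Optional[float] = None

    def to_dict(self) -> dict[str, Any]:
        # asdict recurses into the nested provenance dataclass.
        return asdict(self)


claim = ClaimSketch(
    id="c1",
    lang="fr",
    text="Exemple.",
    text_en="Example.",
    sources=["r1"],
    provenance=ProvenanceSketch(wiki="frwiki", title="Exemple", rev_id=1),
)
as_dict = claim.to_dict()
encoded = json.dumps(as_dict, ensure_ascii=False)
```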
class wikifuse.models.Entity(qid: str, labels: dict[str, str]=<factory>, descriptions: dict[str, str]=<factory>, aliases: dict[str, list[str]]=<factory>)[source]

Bases: object

Wikidata entity information.

aliases: dict[str, list[str]]
descriptions: dict[str, str]
labels: dict[str, str]
qid: str
class wikifuse.models.Fact(id: str, type: str = 'fact', property: str = '', value: str | dict[str, ~typing.Any]='', qualifiers: dict[str, ~typing.Any]=<factory>, sources: list[str] = <factory>, from_source: str = 'wikidata')[source]

Bases: object

A structured fact (typically from infobox or Wikidata).

from_source: str = 'wikidata'
id: str
property: str = ''
qualifiers: dict[str, Any]
sources: list[str]
to_dict() dict[str, Any][source]

Convert to dictionary for JSON serialization.

type: str = 'fact'
value: str | dict[str, Any] = ''
class wikifuse.models.IntermediateRepresentation(entity: Entity, sections: list[Section] = <factory>, content: dict[str, ~wikifuse.models.Claim | ~wikifuse.models.Fact]=<factory>, references: dict[str, ~wikifuse.models.Reference]=<factory>, metadata: dict[str, ~typing.Any]=<factory>)[source]

Bases: object

Complete IR for a merged Wikipedia article.

content: dict[str, Claim | Fact]
entity: Entity
classmethod from_dict(data: dict[str, Any]) IntermediateRepresentation[source]

Create IR from dictionary.

classmethod from_json(json_str: str) IntermediateRepresentation[source]

Create IR from JSON string.

metadata: dict[str, Any]
references: dict[str, Reference]
sections: list[Section]
to_dict() dict[str, Any][source]

Convert to dictionary for JSON serialization.

to_json(indent: int = 2) str[source]

Serialize to JSON string.

class wikifuse.models.Provenance(wiki: str, title: str, rev_id: int)[source]

Bases: object

Source provenance information for content.

rev_id: int
title: str
to_dict() dict[str, Any][source]

Convert to dictionary for JSON serialization.

wiki: str
class wikifuse.models.Reference(id: str, doi: str | None = None, url: str | None = None, title: str | None = None, date: str | None = None, author: str | None = None, publisher: str | None = None)[source]

Bases: object

A reference/citation for claims and facts.

author: str | None = None
date: str | None = None
doi: str | None = None
id: str
publisher: str | None = None
title: str | None = None
to_dict() dict[str, Any][source]

Convert to dictionary for JSON serialization.

url: str | None = None
class wikifuse.models.Section(id: str, title: dict[str, str]=<factory>, items: list[str] = <factory>, level: int = 2)[source]

Bases: object

A section in the merged article.

id: str
items: list[str]
level: int = 2
title: dict[str, str]
to_dict() dict[str, Any][source]

Convert to dictionary for JSON serialization.

wikifuse.parse

Utilities for parsing Wikipedia wikitext into structured components.

class wikifuse.parse.ParsedArticle(sections: dict[str, str], images: list[str], infobox: dict[str, str], references: list[str])[source]

Bases: object

Container for parsed article components.

images: list[str]
infobox: dict[str, str]
references: list[str]
sections: dict[str, str]
wikifuse.parse.parse_wikitext(wikitext: str) ParsedArticle[source]

Parse raw wikitext into sections, images, infobox, and references.

This parser uses wikitextparser for a lightweight extraction that is sufficient for merging content across languages. It does not aim to fully replicate MediaWiki parsing but instead exposes the pieces of an article that the merge pipeline cares about.
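
To make the shape of ParsedArticle.sections concrete, here is a minimal sketch of heading-based splitting. It deliberately uses only the standard re module instead of wikitextparser, covers section headings only (no images, infobox, or references), and keys the text before the first heading with an empty string; all of these are assumptions rather than the library's behaviour.

```python
import re

# A heading line: two to six '=' signs, a title, then the same run of '='.
HEADING = re.compile(r"^(={2,6})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)


def split_sections(wikitext: str) -> dict[str, str]:
    """Map each heading to its body text; lead text is stored under ''."""
    sections: dict[str, str] = {}
    last_title, last_end = "", 0
    for match in HEADING.finditer(wikitext):
        sections[last_title] = wikitext[last_end:match.start()].strip()
        last_title, last_end = match.group(2), match.end()
    sections[last_title] = wikitext[last_end:].strip()
    return sections


sample = (
    "Intro text.\n\n"
    "== History ==\nFounded long ago.\n\n"
    "== Legacy ==\nStill around.\n"
)
parsed = split_sections(sample)
```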

wikifuse.merge

Utilities for merging parsed Wikipedia articles.

class wikifuse.merge.ImageMerger[source]

Bases: object

Simple merger that unions images from multiple articles.

static merge(image_lists: Sequence[list[str]]) list[str][source]
class wikifuse.merge.InfoboxMerger[source]

Bases: object

Merge infobox dictionaries by unioning parameter values.

static merge(boxes: Sequence[dict[str, str]]) dict[str, list[str]][source]
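
The two union-style mergers can be pictured with the following sketch, which follows the documented input and output types. The function names union_images and union_infoboxes are hypothetical, and preserving first-seen order while dropping exact duplicates is an assumption about the intended behaviour.

```python
from collections.abc import Sequence


def union_images(image_lists: Sequence[list[str]]) -> list[str]:
    """Union image names, keeping the first-seen order."""
    seen: dict[str, None] = {}
    for images in image_lists:
        for name in images:
            seen.setdefault(name, None)
    return list(seen)


def union_infoboxes(boxes: Sequence[dict[str, str]]) -> dict[str, list[str]]:
    """Collect every distinct value seen for each infobox parameter."""
    merged: dict[str, list[str]] = {}
    for box in boxes:
        for key, value in box.items():
            values = merged.setdefault(key, [])
            if value not in values:
                values.append(value)
    return merged


images = union_images([["A.jpg", "B.png"], ["B.png", "C.svg"]])
infobox = union_infoboxes(
    [
        {"born": "1952", "country": "UK"},
        {"born": "1952", "country": "United Kingdom"},
    ]
)
```

Keeping conflicting values side by side, as in the country entry here, leaves the choice between them to a later step.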
class wikifuse.merge.TextMerger(llm_service: Any | None = None, entity_name: str = '')[source]

Bases: object

Merge article sections, optionally using an LLM.

When an LLM service is available and enabled, it is used to merge sections from multiple language versions. Otherwise the merger falls back to sentence-level merging.

Parameters:
  • llm_service – Optional LLMService or MockLLMService instance.

  • entity_name – Name of the entity being merged.

merge(sections_list: Sequence[tuple[str, dict[str, str]]], target_lang: str = 'en') dict[str, str][source]

Merge sections from multiple language versions.

Parameters:
  • sections_list – Sequence of (language, sections) tuples.

  • target_lang – Target language for the merged output.

Returns:

Dictionary mapping section headings to merged text.

static merge_static(sections_list: Sequence[tuple[str, dict[str, str]]], target_lang: str = 'en') dict[str, str][source]

Static variant of merge(), kept for backward compatibility.
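
The sentence-level fallback is not specified in detail here. One possible reading, given purely as a sketch, is a per-heading union of sentences with the target language first. The function merge_by_sentence is hypothetical, performs no translation, and assumes headings already match across languages, which the real merger may handle differently.

```python
import re
from collections.abc import Sequence


def merge_by_sentence(
    sections_list: Sequence[tuple[str, dict[str, str]]],
    target_lang: str = "en",
) -> dict[str, str]:
    """Union sentences per heading, dropping exact duplicates."""
    # Stable sort: the target language comes first, others keep their order.
    ordered = sorted(sections_list, key=lambda item: item[0] != target_lang)
    merged: dict[str, list[str]] = {}
    for _lang, sections in ordered:
        for heading, text in sections.items():
            kept = merged.setdefault(heading, [])
            # Split after sentence-ending punctuation followed by whitespace.
            for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
                if sentence and sentence not in kept:
                    kept.append(sentence)
    return {heading: " ".join(parts) for heading, parts in merged.items()}


sections_list = [
    ("fr", {"History": "Founded in 1900. Expanded later."}),
    ("en", {"History": "Founded in 1900.", "Legacy": "Still active."}),
]
result = merge_by_sentence(sections_list, target_lang="en")
```

The input has the same (language, sections) tuple shape that merge() documents for sections_list.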

wikifuse.merge.merge_article(qid: str, languages: list[str], target_lang: str = 'en', use_llm: bool = True, llm_api_key: str | None = None, llm_model: str = 'gpt-4o-mini') IntermediateRepresentation[source]

High level pipeline: fetch, parse and merge article versions.

Parameters:
  • qid – Wikidata QID of the entity to merge.

  • languages – Languages to retrieve.

  • target_lang – Language used for the merged text.

  • use_llm – Whether to use LLM for intelligent text merging.

  • llm_api_key – OpenAI API key. If None, reads from OPENAI_API_KEY env var.

  • llm_model – LLM model to use for merging.

wikifuse.render

Renderers that convert the IR to MediaWiki wikitext or HTML.

class wikifuse.render.HTMLRenderer(language: str = 'en')[source]

Bases: object

Renders IntermediateRepresentation to simple HTML.

render(ir: IntermediateRepresentation) str[source]

Render the IR to HTML with Wikipedia-style formatting.

class wikifuse.render.WikitextRenderer(language: str = 'en')[source]

Bases: object

Renders IR to MediaWiki wikitext format.

render(ir: IntermediateRepresentation) str[source]

Render complete IR to wikitext.

Parameters:
  • ir – Intermediate Representation to render.

Returns:

Complete wikitext string
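
As an illustration of the output format, the sketch below renders section headings and their items as wikitext. It works on plain dictionaries shaped like the Section fields (title, items, level). The function render_sections, the texts lookup, and the fallback to any available title are all assumptions; the only fixed point is that MediaWiki writes a level-2 heading as == Title ==.

```python
from typing import Any


def render_sections(
    sections: list[dict[str, Any]],
    texts: dict[str, str],
    language: str = "en",
) -> str:
    """Render headings plus the text of each referenced item."""
    lines: list[str] = []
    for section in sections:
        marker = "=" * section.get("level", 2)
        titles = section["title"]
        # Prefer the requested language, else fall back to any title.
        title = titles.get(language) or next(iter(titles.values()), "")
        lines.append(f"{marker} {title} {marker}")
        lines.extend(texts[item] for item in section["items"] if item in texts)
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


sections = [
    {"id": "s1", "title": {"en": "History", "fr": "Histoire"}, "items": ["c1", "c2"], "level": 2},
    {"id": "s2", "title": {"fr": "Héritage"}, "items": ["c3"], "level": 3},
]
texts = {"c1": "Founded in 1900.", "c2": "Expanded later.", "c3": "Still active."}
wikitext = render_sections(sections, texts, language="en")
```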

wikifuse.translate

Translation service integration for cross-lingual alignment.

class wikifuse.translate.TextCleaner[source]

Bases: object

Clean and normalize text content.

static clean_sentence(text: str) str[source]

Clean a sentence for processing.

static extract_plain_text(wikitext: str) str[source]

Extract plain text from wikitext, removing all markup.

static normalize_reference_text(ref_text: str) str[source]

Normalize reference text for deduplication.
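
Normalizing references for deduplication usually means mapping superficially different strings to the same key. The rules in this sketch (strip tags, lowercase, turn punctuation into spaces, collapse whitespace) are assumptions chosen for illustration, and normalize_reference is a hypothetical stand-in rather than the library's implementation.

```python
import re


def normalize_reference(ref_text: str) -> str:
    """Reduce a reference string to a comparison key."""
    text = re.sub(r"<[^>]+>", " ", ref_text)      # drop <ref>-style tags
    text = re.sub(r"[^\w\s]", " ", text.lower())  # punctuation becomes spaces
    return " ".join(text.split())                 # collapse runs of whitespace


key_a = normalize_reference('<ref>Smith, J. (2020). "A Study".</ref>')
key_b = normalize_reference("smith j 2020 A  study")
is_duplicate = key_a == key_b
```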

class wikifuse.translate.TranslationService[source]

Bases: object

Translation service for converting text to English.

batch_translate(texts: list[str], source_lang: str) list[tuple[str, float]][source]

Translate multiple texts in batch for efficiency.

Parameters:
  • texts – List of texts to translate

  • source_lang – Source language code

Returns:

List of (translated_text, confidence) tuples

translate_claims(claims: list[Claim]) list[Claim][source]

Translate a list of claims to English for alignment.

Parameters:
  • claims – List of claims to translate

Returns:

List of claims with English translations

translate_to_english(text: str, source_lang: str) tuple[str, float][source]

Translate text to English.

Parameters:
  • text – Text to translate

  • source_lang – Source language code

Returns:

Tuple of (translated_text, confidence_score)

wikifuse.llm

LLM service for intelligent content merging.

class wikifuse.llm.LLMService(api_key: str | None = None, model: str = 'gpt-4o-mini', provider: str = 'openai')[source]

Bases: object

LLM service for merging Wikipedia sections intelligently.

Parameters:
  • api_key – API key for the LLM provider. If None, reads from OPENAI_API_KEY environment variable.

  • model – Model identifier to use.

  • provider – LLM provider (currently only "openai" is supported).

client: Any
merge_sections(sections: list[tuple[str, str, str]], entity_name: str, section_heading: str) str[source]

Merge sections from different Wikipedia language versions.

Parameters:
  • sections – List of (source_wiki, lang, content) tuples, e.g. [("enwiki", "en", "…"), ("frwiki", "fr", "…")].

  • entity_name – Name of the entity being described.

  • section_heading – The section heading being merged.

Returns:

Merged text combining information from all sources.

class wikifuse.llm.MockLLMService(**_kwargs: Any)[source]

Bases: object

Mock LLM service for testing without API calls.

merge_sections(sections: list[tuple[str, str, str]], entity_name: str, section_heading: str) str[source]

Return combined content from all sections for testing.
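
A stand-in with the same merge_sections signature might look like the sketch below. JoiningLLMStub is a hypothetical name, and joining the contents with blank lines is an assumption about what "combined content" means.

```python
from typing import Any


class JoiningLLMStub:
    """Offline stand-in that concatenates section contents."""

    def __init__(self, **_kwargs: Any) -> None:
        # Accept and ignore any configuration, like a real service would take.
        pass

    def merge_sections(
        self,
        sections: list[tuple[str, str, str]],
        entity_name: str,
        section_heading: str,
    ) -> str:
        # Each tuple is (source_wiki, lang, content); only content is used.
        return "\n\n".join(content for _wiki, _lang, content in sections if content)


stub = JoiningLLMStub(model="ignored")
merged = stub.merge_sections(
    [("enwiki", "en", "English text."), ("frwiki", "fr", "Texte français.")],
    entity_name="Example",
    section_heading="History",
)
```

Because only the merge_sections method matters to callers, any object with this shape can stand in wherever an llm_service is accepted, which keeps tests free of API calls.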

wikifuse.diff

Utilities for comparing base-only vs merged Wikipedia articles.

class wikifuse.diff.ArticleStats(section_count: int = 0, reference_count: int = 0, word_count: int = 0, image_count: int = 0, section_names: list[str] = <factory>)[source]

Bases: object

Statistics about an article version.

image_count: int = 0
reference_count: int = 0
section_count: int = 0
section_names: list[str]
word_count: int = 0
class wikifuse.diff.ComparisonResult(qid: str, entity_name: str, base_lang: str, compare_langs: list[str], base_stats: ~wikifuse.diff.ArticleStats, merged_stats: ~wikifuse.diff.ArticleStats, new_sections: list[str] = <factory>, section_diffs: dict[str, ~wikifuse.diff.SectionDiff] = <factory>)[source]

Bases: object

Result of comparing base-only vs merged articles.

base_lang: str
base_stats: ArticleStats
compare_langs: list[str]
entity_name: str
merged_stats: ArticleStats
new_sections: list[str]
qid: str
section_diffs: dict[str, SectionDiff]
class wikifuse.diff.SectionDiff(title: str, base_word_count: int = 0, merged_word_count: int = 0, base_text: str = '', merged_text: str = '', is_new: bool = False)[source]

Bases: object

Diff information for a single section.

base_text: str = ''
base_word_count: int = 0
is_new: bool = False
merged_text: str = ''
merged_word_count: int = 0
title: str
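
The per-section comparison can be sketched as follows, with a stand-in dataclass that mirrors the SectionDiff fields. SectionDiffSketch and diff_sections are hypothetical names, and counting words by splitting on whitespace is an assumption.

```python
from dataclasses import dataclass


@dataclass
class SectionDiffSketch:
    title: str
    base_word_count: int = 0
    merged_word_count: int = 0
    base_text: str = ""
    merged_text: str = ""
    is_new: bool = False


def diff_sections(
    base: dict[str, str], merged: dict[str, str]
) -> dict[str, SectionDiffSketch]:
    """Build one diff record per section of the merged article."""
    diffs: dict[str, SectionDiffSketch] = {}
    for title, merged_text in merged.items():
        base_text = base.get(title, "")
        diffs[title] = SectionDiffSketch(
            title=title,
            base_word_count=len(base_text.split()),
            merged_word_count=len(merged_text.split()),
            base_text=base_text,
            merged_text=merged_text,
            is_new=title not in base,
        )
    return diffs


base = {"History": "Founded in 1900."}
merged = {"History": "Founded in 1900. Expanded later.", "Legacy": "Still active."}
diffs = diff_sections(base, merged)
new_sections = [title for title, diff in diffs.items() if diff.is_new]
```

The new_sections list built at the end corresponds to the field of the same name on ComparisonResult.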
wikifuse.diff.compare_articles(qid: str, base_lang: str, compare_langs: list[str], use_llm: bool = True, llm_model: str = 'gpt-4o-mini') ComparisonResult[source]

Generate both base-only and merged versions and compute differences.

Parameters:
  • qid – Wikidata QID of the entity.

  • base_lang – Base language (typically “en”).

  • compare_langs – Languages to include in the merged version.

  • use_llm – Whether to use LLM for intelligent text merging.

  • llm_model – LLM model to use.

Returns:

ComparisonResult with statistics and diffs.

wikifuse.diff.generate_diff_html(comparison: ComparisonResult, output_path: str) None[source]

Generate side-by-side HTML comparison.

Parameters:
  • comparison – ComparisonResult from compare_articles().

  • output_path – Path to write the HTML file.

wikifuse.diff.print_stats(comparison: ComparisonResult) str[source]

Generate terminal-friendly statistics output.

Parameters:
  • comparison – ComparisonResult from compare_articles().

Returns:

Formatted string for terminal display.