API Reference
wikifuse.api
API clients for fetching Wikipedia and Wikidata content.
- class wikifuse.api.ArticleFetcher
Bases: object
Fetch Wikipedia articles across languages for a given QID.
- class wikifuse.api.WikidataClient(base_url: str = 'https://www.wikidata.org/w/api.php')
Bases: object
Client for interacting with the Wikidata API.
- get_entity(qid: str, languages: list[str] | None = None) → Entity
Fetch entity data from Wikidata.
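As a rough illustration of what fetching entity data involves, the sketch below builds query parameters for the public Wikidata `wbgetentities` action and maps a response payload onto the documented Entity fields. `EntitySketch`, `build_params` and `parse_entity` are hypothetical names used only here; how `get_entity` is actually implemented is not shown in this reference.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class EntitySketch:
    """Hypothetical stand-in mirroring the documented Entity fields."""
    qid: str
    labels: dict[str, str] = field(default_factory=dict)
    descriptions: dict[str, str] = field(default_factory=dict)
    aliases: dict[str, list[str]] = field(default_factory=dict)


def build_params(qid: str, languages: list[str] | None = None) -> dict[str, str]:
    """Query parameters for the Wikidata wbgetentities action."""
    params = {
        "action": "wbgetentities",
        "ids": qid,
        "props": "labels|descriptions|aliases",
        "format": "json",
    }
    if languages:
        # The API accepts several language codes separated by "|".
        params["languages"] = "|".join(languages)
    return params


def parse_entity(qid: str, payload: dict[str, Any]) -> EntitySketch:
    """Flatten the nested {"language": ..., "value": ...} records."""
    raw = payload["entities"][qid]
    return EntitySketch(
        qid=qid,
        labels={lang: rec["value"] for lang, rec in raw.get("labels", {}).items()},
        descriptions={lang: rec["value"] for lang, rec in raw.get("descriptions", {}).items()},
        aliases={lang: [rec["value"] for rec in recs] for lang, recs in raw.get("aliases", {}).items()},
    )
```

The parameters from `build_params` would be sent to the client's `base_url` with any HTTP library; the network call is left out so the sketch stays self-contained.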
- class wikifuse.api.WikipediaClient
Bases: object
Client for fetching Wikipedia article content.
- wikifuse.api.select_top_languages(qid: str, top_n: int = 2, always_include: list[str] | None = None) → list[str]
Select the top N languages by article size for a given entity.
- Parameters:
qid – Wikidata QID of the entity.
top_n – Number of top languages to select.
always_include – Languages to always include even if not in top N.
- Returns:
List of language codes for the selected languages.
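The selection rule described above can be sketched on a plain mapping of language code to article size. `pick_languages` and the `sizes` argument are hypothetical; how wikifuse measures article size, breaks ties, or orders the `always_include` entries is not stated here, so those details are assumptions.

```python
from __future__ import annotations


def pick_languages(
    sizes: dict[str, int],
    top_n: int = 2,
    always_include: list[str] | None = None,
) -> list[str]:
    """Return the top_n largest languages plus any always_include extras."""
    # Largest first; ties are broken alphabetically so the result is deterministic.
    ranked = sorted(sizes, key=lambda lang: (-sizes[lang], lang))
    selected = ranked[:top_n]
    # Assumption: forced languages are appended and do not count against top_n.
    for lang in always_include or []:
        if lang not in selected:
            selected.append(lang)
    return selected
```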
wikifuse.cli
Command-line interface for wikifuse.
wikifuse.models
Core data models for the Intermediate Representation (IR).
- class wikifuse.models.Claim(id: str, type: str = 'claim', lang: str = 'en', text: str = '', text_en: str | None = None, sources: list[str] = <factory>, provenance: Provenance | None = None, confidence: float | None = None)
Bases: object
A textual claim from a Wikipedia article.
- confidence: float | None = None
- id: str
- lang: str = 'en'
- provenance: Provenance | None = None
- sources: list[str]
- text: str = ''
- text_en: str | None = None
- type: str = 'claim'
- class wikifuse.models.Entity(qid: str, labels: dict[str, str] = <factory>, descriptions: dict[str, str] = <factory>, aliases: dict[str, list[str]] = <factory>)
Bases: object
Wikidata entity information.
- aliases: dict[str, list[str]]
- descriptions: dict[str, str]
- labels: dict[str, str]
- qid: str
- class wikifuse.models.Fact(id: str, type: str = 'fact', property: str = '', value: str | dict[str, Any] = '', qualifiers: dict[str, Any] = <factory>, sources: list[str] = <factory>, from_source: str = 'wikidata')
Bases: object
A structured fact (typically from an infobox or Wikidata).
- from_source: str = 'wikidata'
- id: str
- property: str = ''
- qualifiers: dict[str, Any]
- sources: list[str]
- type: str = 'fact'
- value: str | dict[str, Any] = ''
- class wikifuse.models.IntermediateRepresentation(entity: Entity, sections: list[Section] = <factory>, content: dict[str, Claim | Fact] = <factory>, references: dict[str, Reference] = <factory>, metadata: dict[str, Any] = <factory>)
Bases: object
Complete IR for a merged Wikipedia article.
- classmethod from_dict(data: dict[str, Any]) → IntermediateRepresentation
Create an IR from a dictionary.
- classmethod from_json(json_str: str) → IntermediateRepresentation
Create an IR from a JSON string.
- metadata: dict[str, Any]
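The `from_dict` / `from_json` pair follows a common dataclass round-trip pattern. The sketch below shows that pattern on two deliberately reduced classes, `MiniEntity` and `MiniIR`, which are hypothetical and carry only a subset of the documented fields; it is not the wikifuse implementation.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class MiniEntity:
    """Hypothetical, reduced version of Entity."""
    qid: str
    labels: dict[str, str] = field(default_factory=dict)


@dataclass
class MiniIR:
    """Hypothetical, reduced version of IntermediateRepresentation."""
    entity: MiniEntity
    metadata: dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "MiniIR":
        # Nested dataclasses must be rebuilt explicitly from plain dicts.
        return cls(
            entity=MiniEntity(**data["entity"]),
            metadata=dict(data.get("metadata", {})),
        )

    @classmethod
    def from_json(cls, json_str: str) -> "MiniIR":
        return cls.from_dict(json.loads(json_str))


ir = MiniIR(MiniEntity("Q42", {"en": "Douglas Adams"}), {"target_lang": "en"})
# asdict() recursively converts the dataclasses into JSON-compatible dicts.
restored = MiniIR.from_json(json.dumps(asdict(ir)))
```

Because dataclasses compare field by field, `restored == ir` holds after the round trip.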
- class wikifuse.models.Provenance(wiki: str, title: str, rev_id: int)
Bases: object
Source provenance information for content.
- rev_id: int
- title: str
- wiki: str
- class wikifuse.models.Reference(id: str, doi: str | None = None, url: str | None = None, title: str | None = None, date: str | None = None, author: str | None = None, publisher: str | None = None)
Bases: object
A reference/citation for claims and facts.
- author: str | None = None
- date: str | None = None
- doi: str | None = None
- id: str
- publisher: str | None = None
- title: str | None = None
- url: str | None = None
wikifuse.parse
Utilities for parsing Wikipedia wikitext into structured components.
- class wikifuse.parse.ParsedArticle(sections: dict[str, str], images: list[str], infobox: dict[str, str], references: list[str])
Bases: object
Container for parsed article components.
- images: list[str]
- infobox: dict[str, str]
- references: list[str]
- sections: dict[str, str]
- wikifuse.parse.parse_wikitext(wikitext: str) → ParsedArticle
Parse raw wikitext into sections, images, infobox, and references.
This parser uses wikitextparser for a lightweight extraction that is sufficient for merging content across languages. It does not aim to fully replicate MediaWiki parsing but instead exposes the pieces of an article that the merge pipeline cares about.
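To give a feel for the section-splitting part of such a lightweight extraction, here is a standard-library stand-in that splits wikitext on `== Heading ==` lines with a regular expression. It deliberately does not use wikitextparser, and `split_sections`, the "Lead" key and the flattening of nested headings are assumptions made for this illustration only.

```python
import re

# A heading line: 2 to 6 "=" signs, a title, then the same run of "=" signs.
HEADING = re.compile(r"^(={2,6})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)


def split_sections(wikitext: str, lead_name: str = "Lead") -> dict[str, str]:
    """Map each heading to the text that follows it, up to the next heading."""
    sections: dict[str, str] = {}
    name, start = lead_name, 0
    for match in HEADING.finditer(wikitext):
        # Close the section that was open before this heading.
        sections[name] = wikitext[start:match.start()].strip()
        name, start = match.group(2), match.end()
    sections[name] = wikitext[start:].strip()
    return sections
```

A real parser also has to deal with templates, comments and `<nowiki>` blocks that may contain heading-like lines, which is one reason to rely on a dedicated library.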
wikifuse.merge
Utilities for merging parsed Wikipedia articles.
- class wikifuse.merge.ImageMerger
Bases: object
Simple merger that unions images from multiple articles.
- class wikifuse.merge.InfoboxMerger
Bases: object
Merge infobox dictionaries by unioning parameter values.
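Both mergers are described as unions, which can be sketched in a few lines. `union_images`, `union_infoboxes` and the `"; "` separator are hypothetical; the classes' real method names and the way conflicting infobox values are combined are not documented in this reference.

```python
from __future__ import annotations


def union_images(image_lists: list[list[str]]) -> list[str]:
    """Union image names across articles, keeping first-seen order."""
    # dict.fromkeys drops duplicates while preserving insertion order.
    return list(dict.fromkeys(img for images in image_lists for img in images))


def union_infoboxes(infoboxes: list[dict[str, str]], sep: str = "; ") -> dict[str, str]:
    """Union infobox parameters; distinct values for one key are joined."""
    collected: dict[str, list[str]] = {}
    for infobox in infoboxes:
        for key, value in infobox.items():
            values = collected.setdefault(key, [])
            # Skip empty values and exact repeats from other languages.
            if value and value not in values:
                values.append(value)
    return {key: sep.join(values) for key, values in collected.items()}
```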
- class wikifuse.merge.TextMerger(llm_service: Any | None = None, entity_name: str = '')
Bases: object
Merge article sections, using an LLM when one is available.
When an LLM is available and enabled, it is used to merge sections from multiple language versions. Otherwise the merger falls back to sentence-level merging.
- Parameters:
llm_service – Optional LLMService or MockLLMService instance.
entity_name – Name of the entity being merged.
- merge(sections_list: Sequence[tuple[str, dict[str, str]]], target_lang: str = 'en') → dict[str, str]
Merge sections from multiple language versions.
- Parameters:
sections_list – Sequence of (language, sections) tuples.
target_lang – Target language for the merged output.
- Returns:
Dictionary mapping section headings to merged text.
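One plausible reading of the sentence-level fallback is: split each section into sentences and keep every distinct sentence once, in the order encountered. The sketch below implements that reading. `merge_sentences` is hypothetical, it assumes the texts are already in the same language, and it ignores `target_lang`; the actual fallback may work differently.

```python
from __future__ import annotations

import re
from collections.abc import Sequence

# Split after ".", "!" or "?" followed by whitespace (a deliberately naive rule).
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def merge_sentences(sections_list: Sequence[tuple[str, dict[str, str]]]) -> dict[str, str]:
    """Merge same-named sections by unioning their sentences."""
    merged: dict[str, list[str]] = {}
    seen: dict[str, set[str]] = {}
    for _lang, sections in sections_list:
        for heading, text in sections.items():
            kept = merged.setdefault(heading, [])
            known = seen.setdefault(heading, set())
            for sentence in SENTENCE_END.split(text.strip()):
                key = sentence.casefold()  # case-insensitive duplicate check
                if sentence and key not in known:
                    known.add(key)
                    kept.append(sentence)
    return {heading: " ".join(sentences) for heading, sentences in merged.items()}
```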
- wikifuse.merge.merge_article(qid: str, languages: list[str], target_lang: str = 'en', use_llm: bool = True, llm_api_key: str | None = None, llm_model: str = 'gpt-4o-mini') → IntermediateRepresentation
High-level pipeline: fetch, parse, and merge article versions.
- Parameters:
qid – Wikidata QID of the entity to merge.
languages – Languages to retrieve.
target_lang – Language used for the merged text.
use_llm – Whether to use an LLM for text merging.
llm_api_key – OpenAI API key. If None, the key is read from the OPENAI_API_KEY environment variable.
llm_model – LLM model to use for merging.
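The key lookup described for `llm_api_key` amounts to "explicit argument first, environment variable second". A minimal sketch of that rule follows; `resolve_api_key` is a hypothetical helper, and returning None when no key is found (so the caller can fall back to non-LLM merging) is an assumption.

```python
from __future__ import annotations

import os


def resolve_api_key(llm_api_key: str | None = None, use_llm: bool = True) -> str | None:
    """Return the key to use, or None when the LLM path should be skipped."""
    if not use_llm:
        return None
    # An explicit argument wins; otherwise fall back to the environment.
    return llm_api_key or os.environ.get("OPENAI_API_KEY")
```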
wikifuse.render
Renderers that convert the IR to MediaWiki wikitext or HTML.
- class wikifuse.render.HTMLRenderer(language: str = 'en')
Bases: object
Renders an IntermediateRepresentation to HTML.
- render(ir: IntermediateRepresentation) → str
Render the IR to HTML with Wikipedia-style formatting.
- class wikifuse.render.WikitextRenderer(language: str = 'en')
Bases: object
Renders an IR to MediaWiki wikitext format.
- render(ir: IntermediateRepresentation) → str
Render the complete IR to wikitext.
- Parameters:
ir – Intermediate Representation to render.
- Returns:
Complete wikitext string.
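As a small illustration of the wikitext target format, the sketch below renders a heading-to-text mapping into level-2 MediaWiki sections. `render_wikitext`, the plain-dict input and the "Lead" key are assumptions; the real renderer takes a full IntermediateRepresentation whose Section type is not documented here.

```python
def render_wikitext(sections: dict[str, str], lead_name: str = "Lead") -> str:
    """Render sections as '== Heading ==' blocks separated by blank lines."""
    parts: list[str] = []
    for heading, text in sections.items():
        # The lead section has no heading in MediaWiki markup.
        if heading != lead_name:
            parts.append(f"== {heading} ==")
        if text:
            parts.append(text)
    return "\n\n".join(parts) + "\n"
```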
wikifuse.translate
Translation service integration for cross-lingual alignment.
- class wikifuse.translate.TextCleaner
Bases: object
Clean and normalize text content.
- class wikifuse.translate.TranslationService
Bases: object
Translation service for converting text to English.
- batch_translate(texts: list[str], source_lang: str) → list[tuple[str, float]]
Translate multiple texts in batch for efficiency.
- Parameters:
texts – List of texts to translate.
source_lang – Source language code.
- Returns:
List of (translated_text, confidence) tuples.
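The return shape of `batch_translate` can be illustrated with a wrapper around a single-text translator. Everything here is hypothetical: `batch_translate_sketch`, the injected `translate_one` callable, the pass-through with confidence 1.0 for English input, and the reuse of results for repeated texts are assumptions, not the service's documented behavior.

```python
from __future__ import annotations

from collections.abc import Callable

# Hypothetical single-text translator: (text, source_lang) -> (translation, confidence).
TranslateOne = Callable[[str, str], tuple[str, float]]


def batch_translate_sketch(
    texts: list[str],
    source_lang: str,
    translate_one: TranslateOne,
) -> list[tuple[str, float]]:
    """Translate texts in order, returning (translated_text, confidence) pairs."""
    if source_lang == "en":
        # Assumption: English input needs no translation.
        return [(text, 1.0) for text in texts]
    cache: dict[str, tuple[str, float]] = {}
    results: list[tuple[str, float]] = []
    for text in texts:
        # Repeated texts are translated once and then reused.
        if text not in cache:
            cache[text] = translate_one(text, source_lang)
        results.append(cache[text])
    return results
```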
wikifuse.llm
LLM service for intelligent content merging.
- class wikifuse.llm.LLMService(api_key: str | None = None, model: str = 'gpt-4o-mini', provider: str = 'openai')
Bases: object
LLM service for merging Wikipedia sections.
- Parameters:
api_key – API key for the LLM provider. If None, the key is read from the OPENAI_API_KEY environment variable.
model – Model identifier to use.
provider – LLM provider (currently only "openai" is supported).
- client: Any
- merge_sections(sections: list[tuple[str, str, str]], entity_name: str, section_heading: str) → str
Merge sections from different Wikipedia language versions.
- Parameters:
sections – List of (source_wiki, lang, content) tuples, e.g. [("enwiki", "en", "…"), ("frwiki", "fr", "…")].
entity_name – Name of the entity being described.
section_heading – The section heading being merged.
- Returns:
Merged text combining information from all sources.
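To show how the (source_wiki, lang, content) tuples might be turned into a single request, the sketch below assembles a plain-text prompt. `build_merge_prompt` and the prompt wording are invented for illustration; the prompt that `merge_sections` actually sends is not documented here.

```python
from __future__ import annotations


def build_merge_prompt(
    sections: list[tuple[str, str, str]],
    entity_name: str,
    section_heading: str,
) -> str:
    """Assemble one prompt that labels each source version."""
    lines = [
        f'Merge the following versions of the "{section_heading}" section about {entity_name}.',
        "Keep every distinct fact once and drop repetition.",
        "",
    ]
    for source_wiki, lang, content in sections:
        # Label each block so the model can tell the sources apart.
        lines.append(f"[{source_wiki} | {lang}]")
        lines.append(content.strip())
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"
```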
wikifuse.diff
Utilities for comparing base-only vs merged Wikipedia articles.
- class wikifuse.diff.ArticleStats(section_count: int = 0, reference_count: int = 0, word_count: int = 0, image_count: int = 0, section_names: list[str] = <factory>)
Bases: object
Statistics about an article version.
- image_count: int = 0
- reference_count: int = 0
- section_count: int = 0
- section_names: list[str]
- word_count: int = 0
- class wikifuse.diff.ComparisonResult(qid: str, entity_name: str, base_lang: str, compare_langs: list[str], base_stats: ArticleStats, merged_stats: ArticleStats, new_sections: list[str] = <factory>, section_diffs: dict[str, SectionDiff] = <factory>)
Bases: object
Result of comparing base-only vs merged articles.
- base_lang: str
- base_stats: ArticleStats
- compare_langs: list[str]
- entity_name: str
- merged_stats: ArticleStats
- new_sections: list[str]
- qid: str
- section_diffs: dict[str, SectionDiff]
- class wikifuse.diff.SectionDiff(title: str, base_word_count: int = 0, merged_word_count: int = 0, base_text: str = '', merged_text: str = '', is_new: bool = False)
Bases: object
Diff information for a single section.
- base_text: str = ''
- base_word_count: int = 0
- is_new: bool = False
- merged_text: str = ''
- merged_word_count: int = 0
- title: str
- wikifuse.diff.compare_articles(qid: str, base_lang: str, compare_langs: list[str], use_llm: bool = True, llm_model: str = 'gpt-4o-mini') → ComparisonResult
Generate both base-only and merged versions and compute differences.
- Parameters:
qid – Wikidata QID of the entity.
base_lang – Base language (typically "en").
compare_langs – Languages to include in the merged version.
use_llm – Whether to use an LLM for text merging.
llm_model – LLM model to use.
- Returns:
ComparisonResult with statistics and diffs.
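The per-section part of such a comparison can be sketched from two heading-to-text mappings. `SectionDiffSketch` mirrors a subset of the documented SectionDiff fields and `diff_sections` is a hypothetical helper; counting words by whitespace splitting is an assumption about how word counts are computed.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SectionDiffSketch:
    """Hypothetical subset of the documented SectionDiff fields."""
    title: str
    base_word_count: int = 0
    merged_word_count: int = 0
    is_new: bool = False


def diff_sections(base: dict[str, str], merged: dict[str, str]) -> dict[str, SectionDiffSketch]:
    """Compare each merged section with its base-language counterpart."""
    diffs: dict[str, SectionDiffSketch] = {}
    for title, text in merged.items():
        base_text = base.get(title, "")
        diffs[title] = SectionDiffSketch(
            title=title,
            base_word_count=len(base_text.split()),
            merged_word_count=len(text.split()),
            # A section is new when the base article has no such heading.
            is_new=title not in base,
        )
    return diffs
```

A list like `new_sections` then follows as `[title for title, d in diffs.items() if d.is_new]`.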
- wikifuse.diff.generate_diff_html(comparison: ComparisonResult, output_path: str) → None
Generate a side-by-side HTML comparison.
- Parameters:
comparison – ComparisonResult from compare_articles().
output_path – Path to write the HTML file.
- wikifuse.diff.print_stats(comparison: ComparisonResult) → str
Generate terminal-friendly statistics output.
- Parameters:
comparison – ComparisonResult from compare_articles().
- Returns:
Formatted string for terminal display.
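A terminal-friendly summary of this kind is often just a fixed-width table. The sketch below formats base and merged counts with a signed delta column. `format_stats`, the column widths and the plain-dict inputs are assumptions; the layout that `print_stats` actually produces is not specified here.

```python
def format_stats(base: dict[str, int], merged: dict[str, int]) -> str:
    """Format per-metric counts as aligned columns with a signed delta."""
    rows = [f"{'metric':<16}{'base':>8}{'merged':>8}{'delta':>8}"]
    for metric in base:
        b, m = base[metric], merged[metric]
        # ">+8" right-aligns in 8 columns and always shows the sign.
        rows.append(f"{metric:<16}{b:>8}{m:>8}{m - b:>+8}")
    return "\n".join(rows)
```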