WikipediaExtractor 1.0.0

dotnet add package WikipediaExtractor --version 1.0.0
NuGet\Install-Package WikipediaExtractor -Version 1.0.0
This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.
<PackageReference Include="WikipediaExtractor" Version="1.0.0" />
For projects that support PackageReference, copy this XML node into the project file to reference the package.
paket add WikipediaExtractor --version 1.0.0
#r "nuget: WikipediaExtractor, 1.0.0"
The #r directive can be used in F# Interactive and Polyglot Notebooks. Copy it into the interactive tool or the script's source code to reference the package.
// Install WikipediaExtractor as a Cake Addin
#addin nuget:?package=WikipediaExtractor&version=1.0.0

// Install WikipediaExtractor as a Cake Tool
#tool nuget:?package=WikipediaExtractor&version=1.0.0

Wikipedia Extractor

Wikipedia Extractor is a lightweight C# library which can be used to extract XML page data from a Wikipedia data dump. It makes use of the index file included with the compressed data dump to find the position of the page and quickly retrieve it from the archive. It was developed using Visual Studio 2022.
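To illustrate why the index makes lookups fast, here is a minimal sketch of reading one line of a multistream index. It assumes the line layout documented by Wikimedia for multistream dumps, `offset:pageId:title`, where the offset is the byte position of the compressed stream that holds the page. `ParseIndexLine` is a hypothetical helper written for this illustration, not part of the WikipediaExtractor API, and the sample line is made up. The snippet uses top-level statements, so it assumes C# 9 or later.

```csharp
using System;
using System.Globalization;

// Hypothetical helper (not part of WikipediaExtractor): splits one index line
// of the assumed form "offset:pageId:title" into its three fields.
static (long Offset, int PageId, string Title) ParseIndexLine(string line)
{
    // Limit the split to 3 parts because page titles may themselves contain ':'.
    string[] parts = line.Split(new[] { ':' }, 3);
    if (parts.Length != 3)
    {
        throw new FormatException("Expected offset:pageId:title but got: " + line);
    }

    long offset = long.Parse(parts[0], CultureInfo.InvariantCulture);
    int pageId = int.Parse(parts[1], CultureInfo.InvariantCulture);
    return (offset, pageId, parts[2]);
}

// Illustrative line only; real offsets and IDs come from the downloaded index.
var entry = ParseIndexLine("1000:42:Help:Contents");
Console.WriteLine(entry.Offset + " / " + entry.PageId + " / " + entry.Title);
// prints "1000 / 42 / Help:Contents"
```

With the offset known, a reader can seek straight to that position in the `.bz2` file and decompress only the one stream that contains the page, rather than the whole archive.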

The current URL for the data dumps is https://dumps.wikimedia.org/enwiki/. You will need to download both files (the multistream data dump and its index), extract the index but not the dump, and enter the correct paths so that the library can find the files.

The test project can be run without the data dump, because all of the index and page contents are created in memory.

This library does not parse the XML page elements; instead, it just returns an object containing the XML. There are other projects on GitHub for parsing the XML.
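For simple cases, the returned XML can be read with the standard `System.Xml.Linq` types. The following is a minimal sketch, assuming the page XML is available as a string containing a single `<page>` element with the usual `title` and `revision/text` children and no XML namespace. `ReadPage` and the sample input are hypothetical and are not part of the WikipediaExtractor API. The snippet uses top-level statements, so it assumes C# 9 or later.

```csharp
using System;
using System.Xml.Linq;

// Hypothetical helper (not part of WikipediaExtractor): pulls the title and
// wikitext out of a single <page> element supplied as a string.
static (string Title, string Text) ReadPage(string pageXml)
{
    XElement page = XElement.Parse(pageXml);

    // Assumes the fragment carries no XML namespace. A full dump declares one
    // on its <mediawiki> root, in which case element names must include it.
    string title = (string)page.Element("title") ?? "";
    string text = (string)page.Element("revision")?.Element("text") ?? "";
    return (title, text);
}

// Illustrative sample input, trimmed to the elements used above.
string sample =
    "<page><title>JavaScript</title>" +
    "<revision><text>'''JavaScript''' is a programming language.</text></revision></page>";

var parsed = ReadPage(sample);
Console.WriteLine(parsed.Title + ": " + parsed.Text.Length + " characters of wikitext");
```

The `text` element holds raw wikitext, so turning it into plain text or HTML still needs a separate wikitext parser.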

Here are some screenshots of the library running:

<img align='left' src='https://drive.google.com/uc?id=1d5y_9GKCelsbyn61Ui7oHYZYQhCB1MKG' width='240'> <img src='https://drive.google.com/uc?id=1IQeyd8hGIURlNH6VW9GjyjnShMoV9GYF' width='240'>

Example

// Titles of the pages to look up in the index.
var pageTitles = new List<string>
{
	"Software development",
	"Microsoft Visual Studio",
	"JavaScript"
};

// Search the extracted index file for the requested titles.
using (var indexSearcher = new PageIndexSearcher(@"F:\enwiki-20190701-pages-articles-multistream-index.txt"))
{
	var pageIndexItems = indexSearcher.Search(pageTitles);
	foreach (PageIndexItem pii in pageIndexItems)
	{
		Console.WriteLine(pii.PageId + ": " + pii.PageTitle);
	}

	// Use the index entries to read the matching pages from the compressed dump.
	using (var dataDumpReader = new DataDumpReader(@"F:\enwiki-20190701-pages-articles-multistream.xml.bz2"))
	{
		var results = dataDumpReader.Search(pageIndexItems);
		foreach (var result in results)
		{
			Console.WriteLine(result.Name + ": " + result.Value);
		}
	}
}

Compatible and additional computed target framework versions, by product:

.NET: net5.0, net5.0-windows, net6.0, net6.0-android, net6.0-ios, net6.0-maccatalyst, net6.0-macos, net6.0-tvos, net6.0-windows, net7.0, net7.0-android, net7.0-ios, net7.0-maccatalyst, net7.0-macos, net7.0-tvos, net7.0-windows, net8.0, net8.0-android, net8.0-browser, net8.0-ios, net8.0-maccatalyst, net8.0-macos, net8.0-tvos, and net8.0-windows were computed.
.NET Core: netcoreapp2.1 is compatible; netcoreapp2.2, netcoreapp3.0, and netcoreapp3.1 were computed.

NuGet packages

This package is not used by any NuGet packages.

GitHub repositories

This package is not used by any popular GitHub repositories.

Version Downloads Last updated
1.0.0 95 3/30/2024