X.Web.MetaExtractor 2.1.1

dotnet add package X.Web.MetaExtractor --version 2.1.1

NuGet\Install-Package X.Web.MetaExtractor -Version 2.1.1

This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.

<PackageReference Include="X.Web.MetaExtractor" Version="2.1.1" />

For projects that support PackageReference, copy this XML node into the project file to reference the package.

For projects that support Central Package Management (CPM), copy this XML node into the solution Directory.Packages.props file to version the package:

<PackageVersion Include="X.Web.MetaExtractor" Version="2.1.1" />

Then reference the package, without a version, in the project file:

<PackageReference Include="X.Web.MetaExtractor" />

paket add X.Web.MetaExtractor --version 2.1.1

The NuGet Team does not provide support for this client. Please contact its maintainers for support.

#r "nuget: X.Web.MetaExtractor, 2.1.1"

The #r directive can be used in F# Interactive and Polyglot Notebooks. Copy it into the interactive tool or the source code of the script to reference the package.

Install X.Web.MetaExtractor as a Cake Addin:

#addin nuget:?package=X.Web.MetaExtractor&version=2.1.1

Install X.Web.MetaExtractor as a Cake Tool:

#tool nuget:?package=X.Web.MetaExtractor&version=2.1.1

The NuGet Team does not provide support for this client. Please contact its maintainers for support.

X.Web.MetaExtractor

X.Web.MetaExtractor is a library for extracting meta information, such as the title, description, keywords, language, and links, from a web page given its URL. It offers several content loaders, each of which performs the HTTP request with a different library.

Breaking Changes

Metadata class changed: The Content field has been removed from the Metadata class. If your code used the Content field, update it accordingly.
Description extraction logic: The Extractor class now extracts the description only from meta tags and no longer attempts to parse the page content.
New WebPage model: The library now returns a WebPage model with comprehensive information, including the links found on the page.
Link extraction: Added support for extracting and processing all hyperlinks from web pages.

Features

Extract meta information from any web page URL.
Extract and process hyperlinks from web pages.
Support for multiple HTTP libraries:
- Flurl
- FsHttp
- HttpClient
- RestSharp
Detect the language of the page content.

Installation

To install the library, use the following command:

dotnet add package X.Web.MetaExtractor

Usage

Here is a basic example of how to use the X.Web.MetaExtractor library:

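using System;
using System.Threading;
using X.Web.MetaExtractor;
using X.Web.MetaExtractor.ContentLoaders;
using X.Web.MetaExtractor.LanguageDetectors;

// Create instances of the necessary components.
// FlurlContentLoader is provided by the separate X.Web.MetaExtractor.ContentLoaders.Flurl package.
IContentLoader contentLoader = new FlurlContentLoader();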
ILanguageDetector languageDetector = new LanguageDetector();
string defaultImage = "https://example.com/example.jpg";

// Create an instance of the Extractor
IExtractor extractor = new Extractor(defaultImage, contentLoader, languageDetector);

// Extract information from a URL
var webPage = await extractor.Extract(new Uri("https://example.com"), CancellationToken.None);

// Display the extracted information
Console.WriteLine($"Title: {webPage.Title}");
Console.WriteLine($"Description: {webPage.Description}");
Console.WriteLine($"Keywords: {webPage.Keywords}");
Console.WriteLine($"Language: {webPage.Language}");

// Process links
if (webPage.Links != null)
{
    Console.WriteLine($"Found {webPage.Links.Count} links:");
    foreach (var link in webPage.Links)
    {
        Console.WriteLine($"- {link.Title}: {link.Value}");
    }
}
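
The example above passes CancellationToken.None, so a slow server can hold the call open indefinitely. A minimal sketch of bounding the call with a timeout follows. RunWithTimeout is a hypothetical helper, not part of X.Web.MetaExtractor, and it assumes the chosen content loader honors the token and surfaces cancellation as an OperationCanceledException.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class ExtractionHelpers
{
    // Hypothetical helper (not part of X.Web.MetaExtractor).
    // Runs an async operation under a timeout by linking a timed
    // CancellationTokenSource to the caller's token.
    public static async Task<T?> RunWithTimeout<T>(
        Func<CancellationToken, Task<T>> operation,
        TimeSpan timeout,
        CancellationToken callerToken = default) where T : class
    {
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(callerToken);
        cts.CancelAfter(timeout);

        try
        {
            return await operation(cts.Token);
        }
        catch (OperationCanceledException) when (!callerToken.IsCancellationRequested)
        {
            // The timeout fired (the caller did not cancel): report "no result".
            return null;
        }
    }
}

// Possible usage with the extractor from the example above:
//
// var webPage = await ExtractionHelpers.RunWithTimeout(
//     ct => extractor.Extract(new Uri("https://example.com"), ct),
//     TimeSpan.FromSeconds(10));
// if (webPage == null) Console.WriteLine("Timed out.");
```

Cancellation requested by the caller still propagates as an exception; only the timeout is translated into a null result. If the operation ignores the token, the timeout has no effect.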

Interfaces and Classes

IExtractor

IExtractor defines the interface for extracting web page information, returning a comprehensive WebPage model.

ILanguageDetector

ILanguageDetector defines the interface for detecting the language of the page content.

IContentLoader

IContentLoader defines the interface for loading the content of a web page asynchronously.

WebPage

WebPage is the main model containing extracted information from a web page, including metadata, links, and source information.

Link

Link is a record that represents a hyperlink extracted from HTML content with Title and Value properties.
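
Extracted links are often relative, duplicated, or non-web (for example mailto:), so a post-processing step is usually useful. The sketch below shows one way to resolve and filter them. LinkSketch is a stand-in record whose shape (two string properties) is an assumption made for illustration; check the library's actual Link type, whose Value may be typed differently.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in for the library's Link record; the string-typed shape is assumed.
public record LinkSketch(string Title, string Value);

public static class LinkTools
{
    // Resolves each link against the page URL, keeps only http/https
    // targets, and removes duplicates.
    public static IReadOnlyList<Uri> ResolveHttpLinks(Uri pageUrl, IEnumerable<LinkSketch> links)
    {
        return links
            // Uri.TryCreate(base, relative, out result) handles both absolute
            // and relative values; unparsable values become null.
            .Select(l => Uri.TryCreate(pageUrl, l.Value, out var absolute) ? absolute : null)
            .Where(u => u != null &&
                        (u.Scheme == Uri.UriSchemeHttp || u.Scheme == Uri.UriSchemeHttps))
            .Select(u => u!)
            .Distinct()
            .ToList();
    }
}
```

Note that Uri equality ignores the fragment, so links that differ only by an in-page anchor collapse into one entry.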

Source

Source is a record that contains information about the origin of web content, including the original URL and raw page content.

Extractors

The library architecture supports multiple specialized extractors that work together to build a complete representation of a web page:

MetaDocumentExtractor - Extracts metadata from HTML <meta> tags
OpenGraphDocumentExtractor - Extracts Open Graph protocol metadata
TitleDocumentExtractor - Extracts the page title
ImageDocumentExtractor - Extracts image URLs from the document
LinksDocumentExtractor - Extracts all hyperlinks from HTML documents, converting them to strongly-typed Link objects
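
To make the division of labor concrete, here is a rough sketch of the kind of work such extractors perform, written directly against HtmlAgilityPack (listed among this package's dependencies). It is an illustration under assumptions, not the library's implementation: the ExtractionSketch class and its method names are hypothetical, and the XPath queries are simply one reasonable way to read these values.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ExtractionSketch
{
    // Page title: the text of the first <title> element, if any.
    public static string? ReadTitle(HtmlDocument doc) =>
        doc.DocumentNode.SelectSingleNode("//title")?.InnerText.Trim();

    // Meta value: the content attribute of <meta {attribute}="{key}">.
    // Use attribute "name" for classic meta tags and "property" for Open Graph.
    // The key comparison is case-sensitive in this sketch.
    public static string? ReadMeta(HtmlDocument doc, string attribute, string key)
    {
        var node = doc.DocumentNode.SelectSingleNode($"//meta[@{attribute}='{key}']");
        return node?.GetAttributeValue("content", string.Empty);
    }

    // Hyperlinks: every <a> element that carries an href attribute.
    public static List<(string Title, string Href)> ReadLinks(HtmlDocument doc)
    {
        var result = new List<(string Title, string Href)>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null) return result;

        foreach (var anchor in anchors)
        {
            var title = HtmlEntity.DeEntitize(anchor.InnerText).Trim();
            result.Add((title, anchor.GetAttributeValue("href", string.Empty)));
        }
        return result;
    }
}
```

For example, after calling doc.LoadHtml(html) on a new HtmlDocument, ReadMeta(doc, "name", "description") reads the classic description tag and ReadMeta(doc, "property", "og:title") reads the Open Graph title.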

Content Loaders

Flurl

X.Web.MetaExtractor.ContentLoaders.Flurl provides a content loader using the Flurl HTTP library.

FsHttp

X.Web.MetaExtractor.ContentLoaders.FsHttp leverages the FsHttp library to load content.

HttpClient

X.Web.MetaExtractor.ContentLoaders.HttpClient utilizes the HttpClient class to load content.

RestSharp

X.Web.MetaExtractor.ContentLoaders.RestSharp uses the RestSharp library for content loading.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License. See the LICENSE file for more details.

Compatible and additional computed target framework versions

.NET
- Compatible: net8.0, net9.0
- Computed (net8.0): net8.0-android, net8.0-browser, net8.0-ios, net8.0-maccatalyst, net8.0-macos, net8.0-tvos, net8.0-windows
- Computed (net9.0): net9.0-android, net9.0-browser, net9.0-ios, net9.0-maccatalyst, net9.0-macos, net9.0-tvos, net9.0-windows

Dependencies

net8.0
- HtmlAgilityPack (>= 1.12.0)
- Microsoft.Extensions.Http (>= 9.0.3)
- System.Collections.Immutable (>= 9.0.3)
net9.0
- HtmlAgilityPack (>= 1.12.0)
- Microsoft.Extensions.Http (>= 9.0.3)
- System.Collections.Immutable (>= 9.0.3)

NuGet packages (4)

Showing the top 4 NuGet packages that depend on X.Web.MetaExtractor:

Package	Description	Downloads
X.Bluesky	Simple client for posting to Bluesky	5.1K
X.Web.MetaExtractor.ContentLoaders.RestSharp	Uses the RestSharp library for content loading, providing an intuitive and powerful way to handle HTTP requests for extracting meta information from any page URL.	351
X.Web.MetaExtractor.ContentLoaders.FsHttp	Leverages the FsHttp library to load content, facilitating robust and type-safe HTTP request execution for extracting meta information from any page URL.	324
X.Web.MetaExtractor.ContentLoaders.Flurl	X.Web.MetaExtractor allows extracting meta information from any page URL.	220

GitHub repositories

This package is not used by any popular GitHub repositories.

Version	Downloads	Last updated
2.1.1	198	3/23/2025
2.0.4	3,897	8/1/2024
2.0.2	11,286	7/9/2024
1.8.0	12,191	3/5/2023
1.7.0	562	9/8/2022
1.5.7688.23013	765	1/18/2021
1.4.7641.34535	451	12/2/2020
1.4.7312.26620	724	1/8/2020
1.1.7147.30235	656	9/26/2019
1.1.7147.30212	564	7/27/2019
1.0.12	908	11/24/2018
1.0.11	764	11/9/2018
1.0.10	725	11/5/2018
1.0.9	895	9/3/2018
1.0.8	822	8/31/2018
1.0.7	1,122	4/26/2018
1.0.5	1,047	4/11/2018
1.0.4	1,016	4/4/2018
1.0.3	1,152	1/31/2018
1.0.2	1,246	8/20/2017
1.0.0	973	7/23/2017