ProSol.WebScrap 2.0.0

There is a newer version of this package available.
See the version list below for details.
.NET CLI:
dotnet add package ProSol.WebScrap --version 2.0.0

Package Manager (run this in the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package):
NuGet\Install-Package ProSol.WebScrap -Version 2.0.0

PackageReference (for projects that support PackageReference, copy this XML node into the project file):
<PackageReference Include="ProSol.WebScrap" Version="2.0.0" />

Paket CLI:
paket add ProSol.WebScrap --version 2.0.0

Script & Interactive (the #r directive can be used in F# Interactive and Polyglot Notebooks; copy it into the interactive tool or the script source):
#r "nuget: ProSol.WebScrap, 2.0.0"

Cake:
// Install ProSol.WebScrap as a Cake Addin
#addin nuget:?package=ProSol.WebScrap&version=2.0.0

// Install ProSol.WebScrap as a Cake Tool
#tool nuget:?package=ProSol.WebScrap&version=2.0.0

ProSol.WebScrap

An HTML parser for extracting text from web pages with CSS selectors.

Purpose

This library extracts the essential data from a web page and returns it to the user in JSON format.

The extracted data can then be used for:

  1. Analyzing the essential data, for example in charts, diagrams, or plain tables.
  2. Tracking the history of the essential data, such as sale prices, currency rates, or user activity.
  3. Searching for specific data across multiple HTML resources, such as a word, a movie title, a product name, or any other mention.

Usage

Let's make a console demo and install the package:

dotnet new console -n WebScrap.Demo.CLI
cd WebScrap.Demo.CLI
dotnet add package ProSol.WebScrap --version 2.0.0

And try the following code:

using ProSol.WebScrap;

var request = "https://en.wikipedia.org/wiki/Food_energy";

// Download the html:
using var client = new HttpClient();
using var response = await client.GetAsync(request);
var html = await response.Content.ReadAsStringAsync();

// Run the WebScrapper:
var css = "#firstHeading";
var result = WebScrapper
    .Run(html, css)
    .ToJsonString();

// Get the results:
Console.WriteLine(result);
// OUTPUT:
// [{"key":"#firstHeading","values":[{"value":"Food energy"}]}]
Console.Read();
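
The JSON string returned by the demo can be read back with System.Text.Json. The following is a minimal sketch that assumes the output always has the shape shown in the OUTPUT comment above (an array of objects with a "key" and a "values" array); the package may expose other shapes that are not covered here.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// JSON copied from the OUTPUT comment of the demo; the shape is assumed from that one sample.
var json = """[{"key":"#firstHeading","values":[{"value":"Food energy"}]}]""";

using var document = JsonDocument.Parse(json);
var lines = new List<string>();

// Each array entry pairs a selector ("key") with the text values it matched.
foreach (var entry in document.RootElement.EnumerateArray())
{
    var key = entry.GetProperty("key").GetString();
    foreach (var item in entry.GetProperty("values").EnumerateArray())
    {
        lines.Add($"{key}: {item.GetProperty("value").GetString()}");
    }
}

Console.WriteLine(string.Join(Environment.NewLine, lines));
// prints: #firstHeading: Food energy
```

GetProperty throws if a property is missing, so TryGetProperty may be a safer choice when the shape is not guaranteed.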

Known Issues

The project is currently under active development and has some known issues. Some of them are obvious but are not a priority right now.

CSS

  • Multiple comma-separated CSS selectors in one entry are not supported.
  • Attribute-based CSS selectors are not supported.
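
Until comma-separated selectors are supported, one possible workaround is to split the list on the caller's side and run each selector separately. The sketch below is an assumption, not part of the package: RunEach is a hypothetical helper, and the scraper call is passed in as a delegate so the example runs without the package installed. The naive split on ',' would break selectors that contain commas themselves.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of ProSol.WebScrap): runs one call per selector.
static List<string> RunEach(string html, string selectorList, Func<string, string, string> run)
{
    var results = new List<string>();
    var selectors = selectorList.Split(
        ',', StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries);

    foreach (var selector in selectors)
    {
        results.Add(run(html, selector));
    }
    return results;
}

// Stand-in delegate so the sketch is runnable on its own. In the demo project,
// pass (h, css) => WebScrapper.Run(h, css).ToJsonString() instead.
Func<string, string, string> fakeRun = (h, css) => $"ran {css}";

var output = RunEach("<h1 id=\"firstHeading\">Food energy</h1>", "#firstHeading, h1", fakeRun);
Console.WriteLine(string.Join(Environment.NewLine, output));
```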

HTML

  • The object model returns tags in reverse order.
  • Non-Unicode text is not converted.

Goals

The goal of this project is to extract text from HTML efficiently.

Extract text

  • Plain text: The tool must extract plain text from HTML.
  • User-defined result structure: The amount of text and its structure are defined by the user via multiple CSS selectors.

Performance

  • Parallel processing: All CSS selectors should process the HTML in parallel.
  • Stream-based processing: Parts of the HTML that have already been processed should be released from memory.
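
The parallel-processing goal can be illustrated from the caller's side with Task.WhenAll. This is only a sketch under assumptions: extract is a hypothetical stand-in for a per-selector extraction call, and nothing in the source states whether WebScrapper.Run is safe to call from several threads at once or how the library intends to parallelize internally.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical stand-in for a per-selector extraction call; it only echoes its input.
Func<string, string, string> extract = (page, selector) => $"{selector}:{page.Length}";

var html = "<h1>Food energy</h1>";
var selectors = new[] { "h1", "p", "a" };

// Start one task per selector, then wait for all of them to finish.
var tasks = selectors
    .Select(css => Task.Run(() => extract(html, css)))
    .ToArray();

// Task.WhenAll returns the results in the same order as the tasks were supplied.
string[] results = await Task.WhenAll(tasks);
Console.WriteLine(string.Join(", ", results));
```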

Footnote

  • Versioning complies with SemVer 2.0.0. Please refer to semver.org for details.
  • Please refer to the Changelog for the progress.

Product: .NET
Compatible and additional computed target framework versions: net8.0 is compatible. net8.0-android, net8.0-browser, net8.0-ios, net8.0-maccatalyst, net8.0-macos, net8.0-tvos, and net8.0-windows were computed.

This package has no dependencies.

NuGet packages

This package is not used by any NuGet packages.

GitHub repositories

This package is not used by any popular GitHub repositories.

Version  Downloads  Last updated
2.0.2    591        12/8/2023
2.0.1    512        12/8/2023
2.0.0    502        12/8/2023
1.0.0    546        11/5/2023