ProSol.WebScrap
2.0.0
dotnet add package ProSol.WebScrap --version 2.0.0
NuGet\Install-Package ProSol.WebScrap -Version 2.0.0
<PackageReference Include="ProSol.WebScrap" Version="2.0.0" />
paket add ProSol.WebScrap --version 2.0.0
#r "nuget: ProSol.WebScrap, 2.0.0"
// Install ProSol.WebScrap as a Cake Addin
#addin nuget:?package=ProSol.WebScrap&version=2.0.0
// Install ProSol.WebScrap as a Cake Tool
#tool nuget:?package=ProSol.WebScrap&version=2.0.0
ProSol.WebScrap
An HTML parser for extracting text from web pages using CSS selectors.
Purpose
This library extracts the essential data from a web page and returns it to the user in JSON format.
The extracted data can then be used for:
- Analyzing the essential data, for example in charts, diagrams, or plain tables.
- Tracking the history of the essential data, such as sale prices, currency rates, or user activity.
- Searching for specific data across multiple HTML resources, such as a movie title, a product name, or any other mention.
Usage
Let's make a console demo and install the package:
dotnet new console -n WebScrap.Demo.CLI
cd WebScrap.Demo.CLI
dotnet add package ProSol.WebScrap --version 2.0.0
And try the following code:
using ProSol.WebScrap;
var request = "https://en.wikipedia.org/wiki/Food_energy";
// Download the html:
using var client = new HttpClient();
using var response = await client.GetAsync(request);
var html = await response.Content.ReadAsStringAsync();
// Run the WebScrapper:
var css = "#firstHeading";
var result = WebScrapper
    .Run(html, css)
    .ToJsonString();
// Get the results:
Console.WriteLine(result);
// OUTPUT:
// [{"key":"#firstHeading","values":[{"value":"Food energy"}]}]
Console.Read();
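To work with the result in code rather than print it, the JSON string can be read with System.Text.Json. The sketch below assumes the output keeps the shape shown in the OUTPUT comment above: an array of objects with a `key` and a `values` array, where each item carries a `value` string. The `ScrapOutput` helper is a hypothetical name used for illustration and is not part of ProSol.WebScrap.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper, not part of ProSol.WebScrap.
public static class ScrapOutput
{
    // Maps each selector key to the list of text values extracted for it.
    // Assumes the JSON shape: [{"key":"...","values":[{"value":"..."}]}]
    public static Dictionary<string, List<string>> Parse(string json)
    {
        var result = new Dictionary<string, List<string>>();
        using var doc = JsonDocument.Parse(json);

        foreach (var entry in doc.RootElement.EnumerateArray())
        {
            var key = entry.GetProperty("key").GetString() ?? "";
            var values = new List<string>();

            foreach (var item in entry.GetProperty("values").EnumerateArray())
            {
                values.Add(item.GetProperty("value").GetString() ?? "");
            }

            result[key] = values;
        }

        return result;
    }
}
```

With the demo above, `ScrapOutput.Parse(result)["#firstHeading"][0]` would give the heading text, provided the output format matches the assumption.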
Known Issues
The project is currently under active development. It has some known issues, a few of them obvious, that are not a priority right now.
CSS
- Multiple comma-separated CSS selectors in a single entry are not supported.
- Attribute-based CSS selectors are not supported.
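Until comma-separated selector lists are supported, one possible workaround is to split the list and run each selector on its own. This is a minimal sketch under that assumption, not a feature of the library: `SelectorList.Split` is a hypothetical helper, and its naive split on commas does not handle commas inside functional pseudo-classes such as `:is(a, b)`.

```csharp
using System;

// Hypothetical helper, not part of ProSol.WebScrap.
public static class SelectorList
{
    // Splits "a, b, c" into ["a", "b", "c"], trimming whitespace
    // and dropping empty entries.
    public static string[] Split(string selectorList) =>
        selectorList.Split(
            ',',
            StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries);
}
```

Each resulting selector can then be passed to `WebScrapper.Run(html, selector)` in a loop, as in the Usage example. How the separate results are merged is left to the caller.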
HTML
- The object model returns tags in reverse order.
- Non-Unicode text is not converted.
Goals
This project aims to extract text from HTML in a performant way.
Extract text
Plain text
: This tool must extract plain text from HTML.
User-defined result structure
: The amount of text and its structure are defined by the user via multiple CSS selectors.
Performance
Parallel processing
: All CSS selectors should process the HTML in parallel.
Stream-based processing
: The processed parts of the HTML should be released from memory.
Footnote
- Versioning complies with SemVer 2.0.0. Please refer to semver.org for details.
- Please refer to the Changelog for progress.
Product | Versions (compatible and additional computed target framework versions) |
---|---|
.NET | net8.0 is compatible. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. |
This package has no dependencies.