SmartReader 0.9.6
A library to extract the main content of a web page, removing ads, sidebars, etc.
What and Why
This library targets .NET Standard 2.0. The core algorithm is a port of the Mozilla Readability library. The original library is stable and used in production inside Firefox, so we can piggyback on Mozilla's hard and well-tested work.
SmartReader also adds some improvements over the original library, extracting more and better metadata:
- site name
- an author and publication date
- the language
- the excerpt of the article
- the featured image
- a list of the images found (it can optionally also download them and store them as data URIs)
- an estimate of the time needed to read the article
Some of these fields are now present in the original library.
It also lets you perform custom operations before and after extracting the article.
Feel free to suggest new features.
Installation
Install the library from its NuGet package.
PM> Install-Package SmartReader
Usage
There are two main ways to use the library:
- The first is to create a new `Reader` object, with the URI as the argument, and then call the `GetArticle` method to obtain the extracted `Article`.
- The second is to use one of the static `ParseArticle` methods of `Reader` directly, which return an `Article`.
Both ways are also available through async methods, called `GetArticleAsync` and `ParseArticleAsync` respectively.
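As a minimal sketch of the async variants, the following calls both `GetArticleAsync` and `ParseArticleAsync`. The URL is a placeholder, and the snippet assumes that the SmartReader NuGet package is referenced and that the code runs as C# top-level statements.

```csharp
using System;
using SmartReader;

// Placeholder URL, used only for illustration.
const string url = "https://example.com/some-article";

// Instance style: create a Reader, then await the extraction.
var reader = new Reader(url);
Article fromInstance = await reader.GetArticleAsync();

// Static style: a single call, without the chance to set options.
Article fromStatic = await Reader.ParseArticleAsync(url);

Console.WriteLine($"Instance: readable = {fromInstance.IsReadable}");
Console.WriteLine($"Static: readable = {fromStatic.IsReadable}");
```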
The advantage of using an object, instead of the static method, is that it gives you the chance to set some options.
You can also directly parse a `String` or `Stream` that you have obtained in some other way. This is available either through one of the `ParseArticle` methods or through the appropriate `Reader` constructor. In either case, you also need to provide the original URI. The library will not re-download the text, but it needs the URI to make some checks and to fix the links present on the page. If you cannot provide the original URI, you can use a fake one, like `https://localhost`.
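A small sketch of parsing text you already hold, using the `Reader` constructor. The argument order (URI first, then the text) is an assumption based on the description above, and the HTML sample is made up for illustration.

```csharp
using System;
using SmartReader;

// HTML obtained by some other means (here a tiny inline sample).
string html = "<html><head><title>Sample</title></head>" +
              "<body><article><p>Some text of the article.</p></article></body></html>";

// Fake base URI: nothing is downloaded, it is only used for checks
// and for fixing the links found in the page.
var reader = new Reader("https://localhost", html);
Article article = reader.GetArticle();

// A document this short will likely not count as readable.
Console.WriteLine($"Readable: {article.IsReadable}");
```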
If the library fails to extract an article, the returned `Article` object will have the field `IsReadable` set to `false`.
If fetching the resource fails, the library will catch the `HttpRequestException`, set `IsReadable` to `false`, set `Completed` to `false` and add an Exception to the list of `Errors`.
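The two failure cases can be told apart by checking `Completed` before `IsReadable`. This is only a sketch: the URL is a placeholder and the messages are illustrative.

```csharp
using System;
using SmartReader;

// Placeholder URL, used only for illustration.
Article article = Reader.ParseArticle("https://example.com/some-article");

if (!article.Completed)
{
    // The process hit an exception, for instance a failed HTTP request.
    foreach (Exception error in article.Errors)
        Console.WriteLine($"Error: {error.Message}");
}
else if (!article.IsReadable)
{
    Console.WriteLine("The process completed, but no article was found.");
}
else
{
    Console.WriteLine($"Article title: {article.Title}");
}
```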
The content of the article is unstyled, but it is wrapped in a `div` with the id `readability-content` that you can style yourself.
The library tries to detect the correct encoding of the text, if the correct tags are present in the text.
Getting Images
On the `Article` object you can call `GetImagesAsync` to obtain a Task for a list of `Image` objects, representing the images found in the extracted article. The method is async because it makes HEAD requests to obtain the size of the images, and it only returns the ones that are bigger than the specified size. The default size is 75KB. This is done to exclude things such as images used in the UI.
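A hedged sketch of calling `GetImagesAsync` with a custom threshold. It assumes that the minimum size is the first argument and is expressed in bytes; the URL and the 100,000-byte value are placeholders.

```csharp
using System;
using System.Linq;
using SmartReader;

// Placeholder URL, used only for illustration.
var reader = new Reader("https://example.com/some-article");
Article article = await reader.GetArticleAsync();

if (article.IsReadable)
{
    // Assumption: the threshold is given in bytes as the first argument.
    var images = await article.GetImagesAsync(100_000);
    Console.WriteLine($"Images above the threshold: {images.Count()}");
}
```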
On the `Article` object you can also call `ConvertImagesToDataUriAsync` to inline the images found in the article using the data URI scheme. The method is async. This will insert the images into the `Content` property of the `Article`. This may significantly increase the size of `Content`.
The data URI scheme is not efficient, because it uses Base64 to encode the bytes of the image. Base64-encoded data is approximately 33% larger than the original data. The purpose of this method is to provide an offline article that can be fully stored long term. This is useful in case the original article is no longer accessible. The method only converts the images that are bigger than the specified size. The default size is 75KB. This is done to exclude things such as images used in the UI.
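The 33% figure follows from Base64 turning every 3 bytes into 4 characters. The sketch below checks this with the standard `Convert.ToBase64String` on a payload of 75,000 bytes (taking 75KB to mean 75,000 bytes, which is an assumption). `Base64Length` is a helper written for this example and is not part of SmartReader.

```csharp
using System;

// Length of the Base64 text for a payload of the given size:
// every group of 3 input bytes (rounded up) becomes 4 output characters.
static int Base64Length(int byteCount) => 4 * ((byteCount + 2) / 3);

int original = 75_000; // bytes
int encoded = Convert.ToBase64String(new byte[original]).Length;
double overhead = 100.0 * (encoded - original) / original;

Console.WriteLine($"{original} bytes -> {encoded} Base64 characters");
Console.WriteLine($"Formula gives {Base64Length(original)} characters");
Console.WriteLine($"Overhead: {overhead:F1}%");
```

The data URI prefix (media type and `base64,` marker) adds a few more characters on top of this.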
Notice that this method will not store other external elements that are not images, such as embedded videos.
If fetching an image fails, the library will throw an `HttpRequestException`, so you should handle the exception.
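One way to handle that exception is a plain `try`/`catch` around the call, as in this sketch. The URL is a placeholder, and calling `ConvertImagesToDataUriAsync` without arguments relies on the default size mentioned above.

```csharp
using System;
using System.Net.Http;
using SmartReader;

// Placeholder URL, used only for illustration.
Article article = await Reader.ParseArticleAsync("https://example.com/some-article");

if (article.IsReadable)
{
    try
    {
        // Inlines the images into article.Content as data URIs.
        await article.ConvertImagesToDataUriAsync();
        Console.WriteLine($"Content length after inlining: {article.Content.Length}");
    }
    catch (HttpRequestException ex)
    {
        // One of the image requests failed.
        Console.WriteLine($"Could not fetch an image: {ex.Message}");
    }
}
```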
Examples
Using the `GetArticle` method.
SmartReader.Reader sr = new SmartReader.Reader("https://arstechnica.com/information-technology/2017/02/humans-must-become-cyborgs-to-survive-says-elon-musk/");
// Enable debug output and send the log messages to the console
sr.Debug = true;
sr.LoggerDelegate = Console.WriteLine;
SmartReader.Article article = sr.GetArticle();
// GetImagesAsync returns a Task, so await it to get the images
var images = await article.GetImagesAsync();
if (article.IsReadable)
{
    // do something with it
}
Using the `ParseArticle` static method.
SmartReader.Article article = SmartReader.Reader.ParseArticle("https://arstechnica.com/information-technology/2017/02/humans-must-become-cyborgs-to-survive-says-elon-musk/");
if (article.IsReadable)
{
    Console.WriteLine($"Article title {article.Title}");
}
Settings
The following settings on the `Reader` class can be modified.
- `int` `MaxElemsToParse`: Max number of nodes supported by this parser. Default: 0 (no limit)
- `int` `NTopCandidates`: The number of top candidates to consider when analyzing how tight the competition is among candidates. Default: 5
- `bool` `Debug`: Set the Debug option. If set to true, the library writes the data to the logger. Default: false
- `Action<string>` `LoggerDelegate`: Delegate of a function that accepts a string as its argument; it will receive log messages. Default: does not do anything
- `ReportLevel` `Logging`: Level of information written with the `LoggerDelegate`. The valid values are the ones of the enum `ReportLevel`: Issue or Info. The first level logs only errors or issues that could prevent correctly obtaining an article. The second level logs all the information needed for debugging a problematic article. Default: ReportLevel.Issue
- `bool` `ContinueIfNotReadable`: The library tries to determine if it will find an article before actually trying to do it. This option decides whether to continue if the library heuristics fail. This value is ignored if Debug is set to true. Default: true
- `int` `CharThreshold`: The minimum number of characters an article must have in order to return a result. Default: 500
- `bool` `KeepClasses`: Whether to preserve or clean CSS classes. Default: false
- `String[]` `ClassesToPreserve`: The CSS classes that must be preserved in the article, if we opt not to keep all of them. Default: ["page"]
- `bool` `DisableJSONLD`: The library looks first at JSON-LD to determine metadata. This setting gives you the option of disabling it. Default: false
- `Dictionary<string, int>` `MinContentLengthReadearable`: The minimum node content length used to decide if the document is readerable (i.e., the library will find something useful). You can provide a dictionary with values based on language. Default: 140
- `int` `MinScoreReaderable`: The minimum cumulated 'score' used to determine if the document is readerable. Default: 20
- `Func<IElement, bool>` `IsNodeVisible`: The function used to determine if a node is visible. Used in the process of determining if the document is readerable. Default: NodeUtility.IsProbablyVisible
- `bool` `ForceHeaderEncoding`: Whether to force the encoding provided in the response header. Default: false
- `int` `AncestorsDepth`: The default level of depth a node must have to be used for scoring. Nodes without as many ancestors as this level are not counted. Default: 5
- `int` `ParagraphThreshold`: The default number of characters a node must have in order to be used for scoring. Default: 25
- `double` `linkDensityModifier`: A number that is added to the base link density threshold during the shadiness checks. This can be used to penalize nodes with a high link density or vice versa. Default: 0.0
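A short sketch of setting a few of these options through an object initializer. The URL and the chosen values are placeholders, not recommendations, and the snippet assumes that the settings are exposed as public settable properties of `Reader`.

```csharp
using System;
using SmartReader;

// Placeholder URL and illustrative values.
var reader = new Reader("https://example.com/some-article")
{
    Debug = true,
    LoggerDelegate = Console.WriteLine,
    Logging = ReportLevel.Info,
    // Accept shorter articles than the default of 500 characters.
    CharThreshold = 300,
    // Keep one extra (hypothetical) CSS class besides the default "page".
    ClassesToPreserve = new[] { "page", "caption" }
};

Article article = reader.GetArticle();
Console.WriteLine($"Readable: {article.IsReadable}");
```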
Settings Notes
The settings `MinScoreReaderable`, `CharThreshold` and `MinContentLengthReadearable` are used in the process of determining if an article is readerable or if the result found is valid.
The scoring algorithm assigns a score to each valid node, then it determines the best node depending on their relationships, i.e., what scores the ancestors and descendants of the node have. The settings `NTopCandidates`, `AncestorsDepth` and `ParagraphThreshold` can help you customize this process. It makes sense to change them if you are interested in sites that use a particular style or design of coding.
The settings `ParagraphThreshold`, `MinContentLengthReadearable` and `CharThreshold` should be customized for content written in non-alphabetical languages.
Article Model
A brief overview of the Article model returned by the library.
- `Uri` `Uri`: Original URI
- `String` `Title`: Title
- `String` `Byline`: Byline of the article, usually containing author and publication date
- `String` `Dir`: Direction of the text
- `String` `FeaturedImage`: The main image of the article
- `String` `Content`: HTML content of the article
- `String` `TextContent`: The plain text of the article with basic formatting
- `String` `Excerpt`: A summary of the article, based on metadata or first paragraph
- `String` `Language`: Language string (e.g., 'en-US')
- `Dictionary<string, Uri>` `AlternativeLanguageUris`: Contains URIs for pages in alternative languages, where the key is the language code (e.g., 'en-US': 'https://www.example.com/en')
- `String` `Author`: Author of the article
- `String` `SiteName`: Name of the site that hosts the article
- `int` `Length`: Length of the text of the article
- `TimeSpan` `TimeToRead`: Average time needed to read the article
- `DateTime?` `PublicationDate`: Date of publication of the article
- `bool` `IsReadable`: Indicates whether an article was successfully found
- `bool` `Completed`: Indicates whether the process completed without getting an Exception (for instance, the HTTP request returned 403 Forbidden)
- `List<Exception>` `Errors`: The list of errors generated during the process
It's important to be aware that the fields Byline, Author and PublicationDate are found independently of each other. So there might be some inconsistencies and unexpected data. For instance, Byline may be a string in the form "@Date by @Author" or "@Author, @Date" or any other combination used by the publication.
The TimeToRead calculation is based on the research found in "Standardized Assessment of Reading Performance: The New International Reading Speed Texts IReST". It should be accurate if the article is written in one of the languages covered by the research, but it is just an educated guess for other languages.
The FeaturedImage property holds the image indicated by the Open Graph or Twitter meta tags. If neither of these is present, and you called the `GetImagesAsync` method, it will be set to the first image found.
The TextContent property is based on the pure text content of the HTML (i.e., the concatenation of the text nodes). Then we apply some basic formatting, like removing double spaces or the newlines left by the formatting of the HTML code. We also add meaningful newlines for P and BR nodes.
The IsReadable property will be false if no article was extracted, whatever the reason (i.e., the algorithm did not find anything valuable or the request failed). The property Completed just indicates whether the process completed correctly or not. Previously we left it to the user of the library to manage exceptions, but now we try to handle them ourselves.
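To give an idea of the kind of clean-up described for TextContent, here is a self-contained sketch using regular expressions. `NormalizeText` is a hypothetical helper written for this example; it is not the library's implementation and it does not cover the newlines added for P and BR nodes.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of SmartReader.
static string NormalizeText(string raw)
{
    // Collapse runs of spaces and tabs into a single space.
    string text = Regex.Replace(raw, @"[ \t]+", " ");
    // Drop the spaces that surround line breaks.
    text = Regex.Replace(text, @" ?\n ?", "\n");
    // Collapse three or more newlines into a single blank line.
    text = Regex.Replace(text, @"\n{3,}", "\n\n");
    return text.Trim();
}

Console.WriteLine(NormalizeText("First  paragraph.\n\n\n\n   Second   paragraph. "));
```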
Exceptions
The library could throw some exceptions. These should be caught by the library and reported in the `Errors` property, with `Completed` set to false.
If you set a value for `MaxElemsToParse` larger than 0, the library will throw a standard `Exception` if the threshold is exceeded.
If fetching an HTTP resource fails, the library will throw an `HttpRequestException`. This will happen both when trying to fetch the whole article and when trying to fetch an image.
Project Structure
This project has the following directory structure.
Folder | Description |
---|---|
docfx_project/ | Contains the DocFx project that generates the documentation website |
src/ | The main source folder |
src/SmartReader | Source for the SmartReader library |
src/SmartReaderTests | Source for the Tests |
src/SmartReaderConsole | Source for example console project |
src/SmartReader.WebDemo | Source for the demo web project |
Demo
You can see the web demo live, so you can test for yourself how effective the library can be for you.
There is also a Docker project for the web demo. You can build and run it with the usual docker commands.
docker build -t smartreader-webdemo .
docker run -it -p 5000:5000 smartreader-webdemo
The second command forwards traffic from port 5000 on your local host to port 5000 of the Docker container. This means that you will be able to access the web demo by visiting http://localhost:5000.
Documentation
This README contains the information needed to get started with the library. If you want to know about more advanced options, the API reference, etc., read the documentation on the main website.
Product | Compatible and additional computed target framework versions |
---|---|
.NET | net5.0 was computed. net5.0-windows was computed. net6.0 was computed. net6.0-android was computed. net6.0-ios was computed. net6.0-maccatalyst was computed. net6.0-macos was computed. net6.0-tvos was computed. net6.0-windows was computed. net7.0 was computed. net7.0-android was computed. net7.0-ios was computed. net7.0-maccatalyst was computed. net7.0-macos was computed. net7.0-tvos was computed. net7.0-windows was computed. net8.0 was computed. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. |
.NET Core | netcoreapp2.0 was computed. netcoreapp2.1 was computed. netcoreapp2.2 was computed. netcoreapp3.0 was computed. netcoreapp3.1 was computed. |
.NET Standard | netstandard2.0 is compatible. netstandard2.1 is compatible. |
.NET Framework | net461 was computed. net462 was computed. net463 was computed. net47 was computed. net471 was computed. net472 was computed. net48 was computed. net481 was computed. |
MonoAndroid | monoandroid was computed. |
MonoMac | monomac was computed. |
MonoTouch | monotouch was computed. |
Tizen | tizen40 was computed. tizen60 was computed. |
Xamarin.iOS | xamarinios was computed. |
Xamarin.Mac | xamarinmac was computed. |
Xamarin.TVOS | xamarintvos was computed. |
Xamarin.WatchOS | xamarinwatchos was computed. |
- .NETStandard 2.0
  - AngleSharp (>= 1.1.2)
  - System.Text.Json (>= 8.0.5)
- .NETStandard 2.1
  - AngleSharp (>= 1.1.2)
  - System.Text.Json (>= 8.0.5)
NuGet packages (2)
Showing the top 2 NuGet packages that depend on SmartReader:
Package | Description |
---|---|
SuperMemoAssistant.Plugins.Import | Package Description |
Drastic.Feed.Parser.SmartReader | Drastic.Feed.Parser.SmartReader is an implementation of IArticleParserService for Drastic.Feed, using SmartReader. |
GitHub repositories (1)
Showing the top 1 popular GitHub repositories that depend on SmartReader:
Repository | Description |
---|---|
Richasy/FantasyCopilot | A new-age AI desktop tool |
Version | Downloads | Last updated |
---|---|---|
0.9.6 | 4,486 | 10/9/2024 |
0.9.5 | 12,310 | 6/2/2024 |
0.9.4 | 20,306 | 8/27/2023 |
0.9.3 | 9,641 | 4/15/2023 |
0.9.2 | 10,060 | 2/7/2023 |
0.9.1 | 9,127 | 10/23/2022 |
0.9.0 | 37,064 | 8/28/2022 |
0.8.1 | 3,071 | 6/29/2022 |
0.8.0 | 12,163 | 10/19/2021 |
0.7.5 | 12,360 | 10/31/2020 |
0.7.4 | 14,280 | 9/7/2020 |
0.7.3 | 525 | 9/5/2020 |
0.7.2 | 1,371 | 5/10/2020 |
0.7.1 | 1,688 | 3/8/2020 |
0.7.0 | 5,599 | 10/29/2019 |
0.6.3 | 2,380 | 8/18/2019 |
0.6.2 | 1,167 | 5/25/2019 |
0.6.1 | 1,056 | 4/20/2019 |
0.6.0 | 658 | 4/20/2019 |
0.5.2 | 965 | 1/12/2019 |
0.5.1 | 900 | 8/27/2018 |
0.5.0 | 961 | 8/13/2018 |
0.3.1 | 1,889 | 3/3/2018 |
0.3.0 | 1,016 | 2/17/2018 |
0.2.0 | 1,120 | 1/15/2018 |
0.1.3 | 1,100 | 11/27/2017 |
0.1.2 | 1,034 | 10/17/2017 |
0.1.1 | 1,246 | 9/26/2017 |