SmartReader 0.2.0

A .NET Standard library to extract the main content of a web page based on a port of the Readability library by Mozilla

There is a newer version of this package available.
See the version list below for details.
Package Manager: Install-Package SmartReader -Version 0.2.0
.NET CLI: dotnet add package SmartReader --version 0.2.0
PackageReference (for projects that support it, copy this XML node into the project file): <PackageReference Include="SmartReader" Version="0.2.0" />
Paket CLI: paket add SmartReader --version 0.2.0

<img src="https://raw.github.com/strumenta/SmartReader/master/logo.png" width="64">

SmartReader is a .NET Standard 1.3 library to extract the main content of a web page, based on a port of the Readability library by Mozilla, which in turn is based on the famous original Readability library.

Installation

You can install it the standard way, through the NuGet package.

Install-Package SmartReader

Why You May Want To Use It

There are already other good, similar projects, but they don't support .NET Core and they are based on an old version of Readability. The original library is already quite stable, but there are always improvements to be made. So by relying on an original library maintained by such a competent organization, we can piggyback on their hard work and user base.

There are also some minor improvements: it returns an author and publication date, together with the default byline, the language of the article and an indication of the time needed to read it. The time is considered accurate for all languages that use an alphabet, so, for instance, it isn't valid for Chinese.

I plan to add some features, like returning a list of the images in the article or, optionally, transforming them into data URIs. But at the moment the Smart in SmartReader is more of an aspiration than a statement. Feel free to suggest new features.

Usage

There are two main ways to use the library. The first is to create a new Reader object, with the URI as the argument, and then call the GetArticle method to obtain the extracted Article. The second is to call one of the static ParseArticle methods of Reader directly, which returns an Article. Both ways are also available as async methods, called GetArticleAsync and ParseArticleAsync respectively.
The advantage of using an object, instead of the static method, is that it gives you the chance to set some options.
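As a rough illustration of the async variant, the sketch below awaits GetArticleAsync on a Reader object. It assumes that GetArticleAsync takes no arguments and returns a Task of Article, which this page does not state explicitly, and the address is a hypothetical placeholder.

```csharp
using System;
using System.Threading.Tasks;

class AsyncExample
{
    static void Main()
    {
        // Block on the async method so the sketch also compiles
        // with compilers that do not support an async Main.
        RunAsync().GetAwaiter().GetResult();
    }

    static async Task RunAsync()
    {
        // Hypothetical address, used only for illustration.
        var reader = new SmartReader.Reader("https://example.com/some-article");

        // Assumption: GetArticleAsync takes no arguments and returns Task<Article>.
        SmartReader.Article article = await reader.GetArticleAsync();

        if (article.IsReadable)
        {
            Console.WriteLine(article.Title);
        }
    }
}
```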

There is also the option to directly parse a String or Stream that you have obtained in some other way. This is available either through the ParseArticle methods or by using the appropriate Reader constructor. In either case, you also need to give the original URI. The library will not re-download the text, but it needs the URI to make some checks and modifications on the links present on the page. If you cannot provide the original URI, you can use a fake one, like http://localhost.
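A minimal sketch of parsing text you already hold might look like the following. The argument order (URI first, then the text) is an assumption, as is the tiny HTML string, which is only a stand-in and is probably too short to be recognized as an article.

```csharp
using System;

class ParseStringExample
{
    static void Main()
    {
        // Stand-in for HTML obtained elsewhere (a file, a cache, another HTTP client).
        string html = "<html><body><article><p>Some text.</p></article></body></html>";

        // Assumption: the constructor overload takes the URI first, then the text.
        // The fake URI is only used to check and fix the links in the page.
        var reader = new SmartReader.Reader("http://localhost", html);

        SmartReader.Article article = reader.GetArticle();

        // IsReadable is false when the extraction fails.
        Console.WriteLine(article.IsReadable ? article.Title : "No article found");
    }
}
```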

If the extraction fails, the returned Article object will have the field IsReadable set to false.

The content of the article is unstyled, but it is wrapped in a div with the id readability-content that you can style yourself.

The library tries to detect the correct encoding of the text, if the correct tags are present in the text.

Examples

Using the GetArticle method.

using System.IO; // needed for StringWriter

SmartReader.Reader sr = new SmartReader.Reader("https://arstechnica.co.uk/information-technology/2017/02/humans-must-become-cyborgs-to-survive-says-elon-musk/");

// write the debug data to an in-memory logger
sr.Debug = true;
sr.Logger = new StringWriter();

SmartReader.Article article = sr.GetArticle();

if (article.IsReadable)
{
	// do something with it
}

Using the ParseArticle static method.


SmartReader.Article article = SmartReader.Reader.ParseArticle("https://arstechnica.co.uk/information-technology/2017/02/humans-must-become-cyborgs-to-survive-says-elon-musk/");

if(article.IsReadable)
{
	// do something with it
}

Options

  • int MaxElemsToParse: Maximum number of nodes supported by this parser. Default: 0 (no limit)
  • int NTopCandidates: The number of top candidates to consider when analysing how tight the competition is among candidates. Default: 5
  • bool Debug: Set the Debug option. If set to true, the library writes the debug data to Logger. Default: false
  • TextWriter Logger: Where the debug data is going to be written. Default: null
  • bool ContinueIfNotReadable: The library tries to determine whether it will find an article before actually trying to extract it. This option decides whether to continue if that heuristic fails. This value is ignored if Debug is set to true. Default: true
  • int WordThreshold: The minimum number of words an article must have in order to return a result. Default: 500
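The options above could be set on a Reader object roughly as follows. This is a sketch that assumes every option is a settable member, as Debug and Logger are in the examples; the address and the chosen values are arbitrary placeholders.

```csharp
using System;

class OptionsExample
{
    static void Main()
    {
        // Hypothetical address, used only for illustration.
        var reader = new SmartReader.Reader("https://example.com/some-article");

        // Arbitrary values, chosen only to show the syntax.
        reader.MaxElemsToParse = 10000;       // 0 would mean no limit
        reader.NTopCandidates = 5;            // the documented default
        reader.WordThreshold = 200;           // accept shorter articles than the default 500

        // Stop early when the preliminary heuristic fails.
        // Debug stays false here, because this option is ignored when Debug is true.
        reader.ContinueIfNotReadable = false;

        SmartReader.Article article = reader.GetArticle();
        Console.WriteLine(article.IsReadable);
    }
}
```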

Article Model

  • Uri Uri: Original URI
  • String Title: Title
  • String Byline: Byline of the article, usually containing author and publication date
  • String Dir: Direction of the text
  • String Content: HTML content of the article
  • String TextContent: The pure text of the article
  • String Excerpt: A summary of the article, based on metadata or the first paragraph
  • String Language: Language string (e.g. 'en-US')
  • int Length: Length of the text of the article
  • TimeSpan TimeToRead: Average time needed to read the article
  • DateTime? PublicationDate: Date of publication of the article
  • bool IsReadable: Indicates whether an article was successfully found

It's important to be aware that the fields Byline, Author and PublicationDate are found independently of each other. So there might be some inconsistencies and unexpected data. For instance, Byline may be a string in the form "@Date by @Author" or "@Author, @Date" or any other combination used by the publication.
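To make the model more concrete, here is a sketch that prints a few of the members listed above. The address is a hypothetical placeholder, and the null checks are a precaution rather than documented behaviour, since each value is found independently and any of them may be missing.

```csharp
using System;

class ArticleFieldsExample
{
    static void Main()
    {
        // Hypothetical address, used only for illustration.
        SmartReader.Article article =
            SmartReader.Reader.ParseArticle("https://example.com/some-article");

        if (!article.IsReadable)
        {
            Console.WriteLine("No article found");
            return;
        }

        Console.WriteLine("Title: " + article.Title);
        Console.WriteLine("Byline: " + (article.Byline ?? "(none)"));
        Console.WriteLine("Language: " + (article.Language ?? "(unknown)"));
        Console.WriteLine("Length: " + article.Length);
        Console.WriteLine("Time to read: " + article.TimeToRead.TotalMinutes.ToString("F1") + " min");

        // PublicationDate is nullable, so check it before use.
        if (article.PublicationDate.HasValue)
        {
            Console.WriteLine("Published: " + article.PublicationDate.Value.ToString("yyyy-MM-dd"));
        }
    }
}
```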

Demo & Console Projects

The demo project is a simple ASP.NET Core webpage that allows you to input an address and see the results of the library.

The console project is a Console program that allows you to see the results of the library on a random test page.

Creating The NuGet Package

If you want to build the NuGet package yourself, you cannot use the standard nuget pack command because of a bug related to .NET Core. Instead, use the dotnet pack command.

dotnet pack --configuration Release --output "..\nupkgs"

The command must be issued inside the src/SmartReader folder, otherwise it will generate NuGet packages for all projects.

License

The project uses the Apache License.

Version History

Version Downloads Last updated
0.7.1 90 3/8/2020
0.7.0 2,239 10/29/2019
0.6.3 1,639 8/18/2019
0.6.2 677 5/25/2019
0.6.1 241 4/20/2019
0.6.0 187 4/20/2019
0.5.2 350 1/12/2019
0.5.1 241 8/27/2018
0.5.0 241 8/13/2018
0.3.1 767 3/3/2018
0.3.0 285 2/17/2018
0.2.0 357 1/15/2018
0.1.3 363 11/27/2017
0.1.2 306 10/17/2017
0.1.1 301 9/26/2017