DocNetExtended 0.5.0.1

Package Manager:
Install-Package DocNetExtended -Version 0.5.0.1

.NET CLI:
dotnet add package DocNetExtended --version 0.5.0.1

PackageReference (for projects that support it, copy this XML node into the project file):
<PackageReference Include="DocNetExtended" Version="0.5.0.1" />

Paket CLI:
paket add DocNetExtended --version 0.5.0.1

Script & Interactive (the #r directive works in F# Interactive, C# scripting and .NET Interactive; copy it into the interactive tool or the script source):
#r "nuget: DocNetExtended, 0.5.0.1"

Cake:
// Install DocNetExtended as a Cake Addin
#addin nuget:?package=DocNetExtended&version=0.5.0.1

// Install DocNetExtended as a Cake Tool
#tool nuget:?package=DocNetExtended&version=0.5.0.1

DocNetExtended

DocNetExtended is a small extension library built on the DocNet library. It extracts text from PDFs in a readable order.

Features

  • Get text
  • Get lines of text
  • Get words
  • Split lines of text into blocks

Usage

Extracting all text

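// Open the PDF. 2480 x 3508 are the pixel dimensions of an A4 page at 300 DPI.
using (var docReader = DocLib.Instance.GetDocReader(pdfFileName, new PageDimensions(2480, 3508)))
{
    // Read the first page (index 0).
    using (var pageReader = new OrderedPageTextReader(docReader, 0))
    {
        Console.WriteLine(pageReader.GetTextInReadableOrder());
    }
}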

Extracting lines of text

using (var docReader = DocLib.Instance.GetDocReader(pdfFileName, new PageDimensions(2480, 3508)))
{
    using (var pageReader = new OrderedPageTextReader(docReader, 0))
    {
        var textLines = pageReader.GetTextLines();

        foreach (var textLine in textLines)
        {
            Console.WriteLine(textLine.Text);
        }
    }
}

Extracting all words

using (var docReader = DocLib.Instance.GetDocReader(pdfFileName, new PageDimensions(2480, 3508)))
{
    using (var pageReader = new OrderedPageTextReader(docReader, 0))
    {
        var words = pageReader.GetWords();

        foreach (var word in words)
        {
            Console.WriteLine(word.Value);
        }
    }
}

Extracting blocks of text

When extracting text from a PDF, you may only be interested in a certain section of the page.

The GetTextBlocks method splits lines of text into blocks. It divides the page width by the block size, then checks the position of each word to determine which block the word belongs to.

Note: Blocks are currently calculated per TextLine.

using (var docReader = DocLib.Instance.GetDocReader(pdfFileName, new PageDimensions(2480, 3508)))
{
    using (var pageReader = new OrderedPageTextReader(docReader, 0))
    {
        var textBlocks = pageReader.GetTextBlocks(300);

        foreach (var textBlock in textBlocks)
        {
            Console.WriteLine(textBlock.Text);
        }
    }
}
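
The block splitting described above can be illustrated with a minimal, self-contained sketch. It is not the library's actual implementation: the Word record, the SplitLine helper, and the rule of assigning a word by the x-coordinate of its left edge are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a word and the x-coordinate of its left edge.
public record Word(string Value, double Left);

public static class BlockSketch
{
    // Groups the words of a single line into fixed-width blocks, left to right.
    public static List<string> SplitLine(IEnumerable<Word> line, double pageWidth, double blockSize)
    {
        // Number of blocks across the page; the last one may be narrower.
        int blockCount = (int)Math.Ceiling(pageWidth / blockSize);

        return line
            // Assumption: a word belongs to the block containing its left edge.
            .GroupBy(w => Math.Min((int)(w.Left / blockSize), blockCount - 1))
            .OrderBy(g => g.Key)
            .Select(g => string.Join(" ", g.OrderBy(w => w.Left).Select(w => w.Value)))
            .ToList();
    }

    public static void Main()
    {
        var line = new[]
        {
            new Word("Invoice", 100), new Word("No.", 260),
            new Word("Total:", 1900), new Word("42.00", 2050),
        };

        // Same page width and block size as the usage example above.
        foreach (var block in SplitLine(line, pageWidth: 2480, blockSize: 300))
        {
            Console.WriteLine(block);
        }
    }
}
```

With these sample coordinates, the first two words fall into the leftmost block and the last two into a block further right, so the sketch prints "Invoice No." and "Total: 42.00" on separate lines. Under this reading, a smaller block size separates columns that sit close together, while a larger one keeps nearby words in the same block.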

Disclaimer

Whilst every attempt is made to extract data in the order it appears in the PDF, this is very much a work in progress and may not support the structure of all PDFs.

Credit

This project wouldn't be possible without the work done by the DocNet team.

Version    Downloads    Last updated
0.5.0.1    65           11/12/2021
0.5.0      113          9/25/2021
0.4.0      133          9/25/2021
0.3.2      152          9/25/2021
0.3.1      151          9/25/2021
0.3.0      94           9/23/2021
0.2.0      108          9/22/2021
0.1.0.1    93           9/22/2021
0.1.0      105          9/22/2021