Markitdown.CSharp 1.0.2

dotnet add package Markitdown.CSharp --version 1.0.2
                    
NuGet\Install-Package Markitdown.CSharp -Version 1.0.2
                    
This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.
<PackageReference Include="Markitdown.CSharp" Version="1.0.2" />
                    
For projects that support PackageReference, copy this XML node into the project file to reference the package.
<PackageVersion Include="Markitdown.CSharp" Version="1.0.2" />
                    
Directory.Packages.props
<PackageReference Include="Markitdown.CSharp" />
                    
Project file
For projects that support Central Package Management (CPM), copy this XML node into the solution Directory.Packages.props file to version the package.
paket add Markitdown.CSharp --version 1.0.2
                    
#r "nuget: Markitdown.CSharp, 1.0.2"
                    
#r directive can be used in F# Interactive and Polyglot Notebooks. Copy this into the interactive tool or source code of the script to reference the package.
#:package Markitdown.CSharp@1.0.2
                    
#:package directive can be used in C# file-based apps starting in .NET 10 preview 4. Copy this into a .cs file before any lines of code to reference the package.
#addin nuget:?package=Markitdown.CSharp&version=1.0.2
                    
Install as a Cake Addin
#tool nuget:?package=Markitdown.CSharp&version=1.0.2
                    
Install as a Cake Tool

Markitdown.CSharp

Convert any document to Markdown in .NET — Word, Excel, PDF, PowerPoint, HTML, RTF, images (with optional OCR) and more. A community MIT-licensed implementation for C#, multi-targeting .NET 8, 9 and 10, with extensible converters and dependencies under MIT or Apache 2.0 (see DEPENDENCIES-LICENSE.md).

This library offers a single API for local files, streams and URLs; built-in support for Office, PDF, HTML, RTF and images (including Tesseract OCR); and a pluggable converter model so you can add custom formats without forking. No vendor lock-in, no external services required for core conversions.


Why this package

  • MIT license — use in closed-source and commercial projects without legal friction.
  • Multi-target — one package for .NET 8, 9 and 10; the right build is selected automatically.
  • Wide format coverage — DOCX, XLSX, XLS, PPTX, PDF, HTML, RTF, images (with OCR), CSV, EPUB, MSG, Jupyter, plain text and more, with a single ConvertLocalAsync / ConvertStreamAsync / ConvertUriAsync API.
  • Extensible — implement IDocumentConverter, register your converter, and the engine routes by extension or MIME type.
  • Tested in real use — conversions for the formats listed below are exercised with sample files and continuous validation.

File format status

Tested formats

These extensions have been verified with sample files and are part of the regular conversion pipeline:

Format Extensions Notes
Word .docx ECMA and ISO OOXML namespaces; tables; Open XML SDK + fallback.
Excel .xlsx, .xls Sheets to markdown tables.
PowerPoint .pptx Slide text and structure.
PDF .pdf Text extraction with layout (lines, spacing).
HTML .html, .htm HTML to markdown (ReverseMarkdown).
RTF .rtf RTF → HTML → markdown (RtfPipe).
Images .png, .jpg, .jpeg Metadata; optional Tesseract OCR for text.
Plain text .txt, .md, .json, .jsonl Pass-through or charset detection.

Untested but possibly compatible

These formats are supported by built-in converters but are not part of the standard test set. Behavior may vary with real-world files:

Format Extensions Notes
E-books .epub EPUB package extraction.
Email .msg Outlook MSG (MsgReader).
Notebooks .ipynb Jupyter notebook cells.
Data .csv CSV to markdown table.
Archives .zip Extracts and converts contents.
Audio (via converter) Optional transcription service.
Web/special URLs, RSS, Wikipedia, YouTube, Bing SERp Dedicated converters when URL is passed.

Basic usage

Install

dotnet add package Markitdown.CSharp

Convert a local file

using MarkItDown.CSharp;

using var markItDown = new MarkItDown();
var result = await markItDown.ConvertLocalAsync("document.docx");
Console.WriteLine(result.Markdown);

Convert and save to disk

using var markItDown = new MarkItDown();
var result = await markItDown.ConvertLocalAsync("report.xlsx");
await File.WriteAllTextAsync("report.md", result.Markdown);

Convert from URL

using var markItDown = new MarkItDown();
var result = await markItDown.ConvertUriAsync("https://example.com/page.html");
Console.WriteLine(result.Title);
Console.WriteLine(result.Markdown);

Convert from stream (e.g. upload)

await using var fileStream = File.OpenRead("document.pdf");
using var markItDown = new MarkItDown();
var result = await markItDown.ConvertStreamAsync(fileStream,
    new StreamInfo { FileName = "document.pdf", Extension = ".pdf" });
// Use result.Markdown

Usage with Tesseract (OCR)

To extract text from images, Tesseract language data (e.g. eng.traineddata) must be available in a folder named tessdata. Use either the Tesseract installer layout or download tessdata / tessdata_fast and place the tessdata folder somewhere. The path you configure is the parent of that tessdata folder.

In code

Pass the parent of tessdata via MarkItDownOptions:

var options = new MarkItDownOptions
{
    TesseractTessDataPath = @"C:\Program Files\Tesseract-OCR",  // or e.g. t:\tools\Tesseract-OCR
    TesseractLang = "eng",  // optional; default is "eng"
};
using var markItDown = new MarkItDown(options);
var result = await markItDown.ConvertLocalAsync("screenshot.png");
// result.Markdown includes "### Text in image (OCR)" when OCR runs

For stream/URL conversions, the same options are applied via the engine; you can also pass tesseract_tessdata_path and tesseract_lang in ConversionOptions when calling the lower-level APIs.

With MkTool (command line)

Set the environment variable TESSDATA_PREFIX to the parent folder of tessdata:

  • Windows (CMD): set TESSDATA_PREFIX=C:\Program Files\Tesseract-OCR
  • PowerShell: $env:TESSDATA_PREFIX = "C:\Program Files\Tesseract-OCR"

Then run: dotnet run --project MkTool -- image.png

Optional: TESSERACT_LANG (e.g. eng, por) to override the default language.

If you convert an image and no text is extracted, MkTool prints a hint to set TESSDATA_PREFIX. The path must be the parent of the tessdata folder (e.g. C:\Program Files\Tesseract-OCR, not C:\Program Files\Tesseract-OCR\tessdata).


Extensibility

The component supports adding new conversion modules: you can implement a converter for any file type and plug it into the engine. The engine selects the converter by file extension (or MIME type) and calls it when you use ConvertLocalAsync, ConvertStreamAsync, or ConvertUriAsync. No changes to the library are required.

Creating a new converter (overview)

flowchart LR
  subgraph Step1[1. Implement]
    A[IDocumentConverter]
    A1[Accepts]
    A2[ConvertAsync]
    A --> A1
    A --> A2
  end
  subgraph Step2[2. Register]
    B[RegisterConverter or Builder.AddConverter]
  end
  subgraph Step3[3. Use]
    C[ConvertLocalAsync / Stream / Uri]
    D[Engine picks converter by extension]
    E[Your converter runs]
    C --> D
    D --> E
  end
  Step1 --> Step2
  Step2 --> Step3

Example: add a custom converter

Implement the interface (or inherit DocumentConverter), declare the extension, and register:

using MarkItDown.CSharp;
using MarkItDown.CSharp.Interfaces;

// 1. Implement the converter (e.g. for .log files)
public class LogConverter : DocumentConverter
{
    private static readonly HashSet<string> Extensions = new(StringComparer.OrdinalIgnoreCase) { ".log" };

    public override IReadOnlyList<string>? SupportedExtensions => new[] { ".log" };

    public override bool Accepts(Stream fileStream, StreamInfo streamInfo, ConversionOptions options) =>
        streamInfo.Extension != null && Extensions.Contains(streamInfo.Extension);

    public override async Task<DocumentConverterResult> ConvertAsync(
        Stream fileStream, StreamInfo streamInfo, ConversionOptions options,
        CancellationToken cancellationToken = default)
    {
        using var reader = new StreamReader(fileStream, leaveOpen: true);
        var text = await reader.ReadToEndAsync(cancellationToken);
        var markdown = "```\n" + text + "\n```";
        return new DocumentConverterResult(markdown);
    }
}

// 2. Register and use
using var markItDown = new MarkItDown();
markItDown.RegisterConverter(new LogConverter());
var result = await markItDown.ConvertLocalAsync("app.log");  // Uses LogConverter

Alternatively, use the builder: new MarkItDownBuilder().AddConverter(new LogConverter()).Build();

For step-by-step instructions, a full example, and more diagrams, see docs/custom-converters.md.


Requirements

  • .NET 8, .NET 9, or .NET 10

License

MIT. Use freely in personal or commercial projects.


Inspired by Microsoft MarkItDown.

Product Compatible and additional computed target framework versions.
.NET net8.0 is compatible.  net8.0-android was computed.  net8.0-browser was computed.  net8.0-ios was computed.  net8.0-maccatalyst was computed.  net8.0-macos was computed.  net8.0-tvos was computed.  net8.0-windows was computed.  net9.0 is compatible.  net9.0-android was computed.  net9.0-browser was computed.  net9.0-ios was computed.  net9.0-maccatalyst was computed.  net9.0-macos was computed.  net9.0-tvos was computed.  net9.0-windows was computed.  net10.0 is compatible.  net10.0-android was computed.  net10.0-browser was computed.  net10.0-ios was computed.  net10.0-maccatalyst was computed.  net10.0-macos was computed.  net10.0-tvos was computed.  net10.0-windows was computed. 
Compatible target framework(s)
Included target framework(s) (in package)
Learn more about Target Frameworks and .NET Standard.

NuGet packages

This package is not used by any NuGet packages.

GitHub repositories

This package is not used by any popular GitHub repositories.

Version Downloads Last Updated
1.0.2 680 2/22/2026

1.0.2: RTF converter (RtfPipe), DOCX namespace/table fix, Tesseract OCR path fix, MkTool default tessdata path. Multi-target net8.0/net9.0/net10.0.