Robots.Txt.Parser
1.0.0
Overview
Parse robots.txt and sitemaps using dotnet. Supports RFC 9309 (the Robots Exclusion Protocol), as well as the following common, non-standard directives:
- Sitemap
- Host
- Crawl-delay
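For illustration, a hypothetical robots.txt using all three of these non-standard directives (the hostname and sitemap URL are placeholders) might look like:

```text
User-agent: *
Disallow: /private/
Crawl-delay: 10

Host: example.com
Sitemap: https://example.com/sitemap.xml
```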
Design Considerations
This library is based upon HttpClient, making it very familiar, easy to use and adaptable to your needs. Since you have full control over the HttpClient, you are able to configure custom message handlers to intercept outgoing requests and responses. For example, you may want to add custom headers on a request, configure additional logging or set up a retry policy.
Some websites can have very large sitemaps. For this reason, async streaming is supported as the preferred way of parsing sitemaps.
There is also the possibility to extend this library to support protocols other than HTTP, such as FTP.
Features
| Name | Supported | Priority |
|---|---|---|
| HTTP/HTTPS | ✔️ | |
| FTP/FTPS | ❌ | 0.1 |
| Wildcard (*) User-agent | ✔️ | |
| Allow & disallow rules | ✔️ | |
| End-of-match ($) and wildcard (*) paths | ✔️ | |
| Sitemap entries | ✔️ | |
| Host directive | ✔️ | |
| Crawl-delay directive | ✔️ | |
| RSS 2.0 feeds | ❌ | 0.8 |
| Atom 0.3/1.0 feeds | ❌ | 0.8 |
| Sitemaps XML format | ✔️ | |
| Simple text sitemaps | ✔️ | |
| Async streaming of sitemaps | ✔️ | |
| Cancellation token support | ✔️ | |
| Memory management | ✔️ | |
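To illustrate the end-of-match (`$`) and wildcard (`*`) path rules listed above, here is a minimal, self-contained sketch (not the library's implementation) of how such patterns can be evaluated by translating them into a regular expression:

```csharp
using System;
using System.Text.RegularExpressions;

// Translate a robots.txt path pattern into an anchored regular expression:
// '*' matches any sequence of characters; a trailing '$' anchors end-of-match.
static bool PathMatches(string pattern, string path)
{
    // Escape regex metacharacters, then restore the robots.txt wildcards
    var escaped = Regex.Escape(pattern).Replace(@"\*", ".*");
    var anchored = escaped.EndsWith(@"\$")
        ? escaped.Substring(0, escaped.Length - 2) + "$"
        : escaped;
    return Regex.IsMatch(path, "^" + anchored);
}

Console.WriteLine(PathMatches("/private/*", "/private/data")); // True
Console.WriteLine(PathMatches("/*.php$", "/index.php"));       // True
Console.WriteLine(PathMatches("/*.php$", "/index.php?x=1"));   // False
```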
Usage
Install the package via NuGet.
dotnet add package Robots.Txt.Parser
Minimal Example
First, create an instance of RobotWebClient.
With Dependency Injection
public void ConfigureServices(IServiceCollection services)
{
services.AddHttpClient<IRobotWebClient, RobotWebClient>();
}
Without Dependency Injection
using var httpClient = new HttpClient();
var robotWebClient = new RobotWebClient(httpClient);
Web Crawler Example
Optionally, add message handlers to customize the HTTP pipeline. For example, you may want to throttle the rate of your requests in order to crawl a large sitemap responsibly. You can achieve this by adding a custom DelegatingHandler to the pipeline.
public class ResponsibleCrawlerHttpClientHandler : DelegatingHandler
{
protected override async Task<HttpResponseMessage> SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
{
var response = await base.SendAsync(request, cancellationToken);
await Task.Delay(TimeSpan.FromSeconds(1), cancellationToken);
return response;
}
}
With Dependency Injection
public void ConfigureServices(IServiceCollection services)
{
services.TryAddTransient<ResponsibleCrawlerHttpClientHandler>();
services.AddHttpClient<IRobotWebClient, RobotWebClient>()
    // AddHttpMessageHandler registers a DelegatingHandler in the pipeline;
    // AddPrimaryHttpMessageHandler is reserved for the innermost handler
    .AddHttpMessageHandler<ResponsibleCrawlerHttpClientHandler>();
}
Without Dependency Injection
var httpClientHandler = new ResponsibleCrawlerHttpClientHandler()
{
InnerHandler = new HttpClientHandler
{
AutomaticDecompression = DecompressionMethods.All
}
};
using var httpClient = new HttpClient(httpClientHandler);
var robotWebClient = new RobotWebClient(httpClient);
Retrieving the Sitemap
var robotsTxt = await robotWebClient.LoadRobotsTxtAsync(new Uri("https://github.com"));
// providing a datetime only retrieves sitemap items modified since this datetime
var modifiedSince = new DateTime(2023, 01, 01);
// sitemaps are iterated asynchronously
// if robots.txt contains no Sitemap directive, a sitemap is looked for at {url}/sitemap.xml
await foreach (var item in robotsTxt.LoadSitemapAsync(modifiedSince))
{
    // process each sitemap item as it is streamed
}
Checking a Rule
var robotsTxt = await robotWebClient.LoadRobotsTxtAsync(new Uri("https://github.com"));
// if rules for the specific robot are not present, it falls back to the wildcard *
var hasAnyRulesDefined = robotsTxt.TryGetRules(ProductToken.Parse("SomeBot"), out var rules);
// even if no wildcard rules exist, an empty rule-checker is returned
var isAllowed = rules.IsAllowed("/some/path");
Getting Preferred Host
var robotsTxt = await robotWebClient.LoadRobotsTxtAsync(new Uri("https://github.com"));
// if no Host directive exists, the host falls back to the one provided in the request
var hasHostDirective = robotsTxt.TryGetHost(out var host);
Getting Crawl Delay
var robotsTxt = await robotWebClient.LoadRobotsTxtAsync(new Uri("https://github.com"));
// if rules for the specific robot are not present, it falls back to the wildcard *
// if no Crawl-delay directive exists, crawl delay will be 0
var hasCrawlDelayDirective = robotsTxt.TryGetCrawlDelay(ProductToken.Parse("SomeBot"), out var crawlDelay);
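Putting these pieces together, a polite crawl loop might look like the following sketch. It only combines the calls shown above, and it assumes each sitemap item exposes a `Location` URL (a hypothetical property name); the page handling itself is left as a placeholder.

```csharp
var robotsTxt = await robotWebClient.LoadRobotsTxtAsync(new Uri("https://github.com"));

robotsTxt.TryGetRules(ProductToken.Parse("SomeBot"), out var rules);
robotsTxt.TryGetCrawlDelay(ProductToken.Parse("SomeBot"), out var crawlDelay);

await foreach (var item in robotsTxt.LoadSitemapAsync(new DateTime(2023, 01, 01)))
{
    // skip paths the robot is not allowed to fetch
    if (!rules.IsAllowed(item.Location.AbsolutePath)) continue;

    // ... fetch and process the page here ...

    // honour the Crawl-delay directive between requests
    await Task.Delay(TimeSpan.FromSeconds(crawlDelay));
}
```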
Contributing
Issues and pull requests are encouraged. For large or breaking changes, please open an issue first to discuss the change before proceeding.
If you find this project useful, please give it a star.
| Product | Compatible and additional computed target frameworks |
|---|---|
| .NET | net10.0 is compatible. net10.0-android, net10.0-browser, net10.0-ios, net10.0-maccatalyst, net10.0-macos, net10.0-tvos and net10.0-windows were computed. |

Dependencies (net10.0): none.
| Version | Downloads | Last Updated |
|---|---|---|
| 1.0.0 | 91 | 1/28/2026 |
| 1.0.0-rc12 | 86 | 1/27/2026 |
| 1.0.0-rc11 | 83 | 1/18/2026 |
| 1.0.0-rc10 | 87 | 1/15/2026 |
| 1.0.0-rc9 | 94 | 1/10/2026 |
| 1.0.0-rc8 | 254 | 9/2/2023 |
| 1.0.0-rc7 | 205 | 8/28/2023 |
| 1.0.0-rc6 | 188 | 8/28/2023 |
| 1.0.0-rc5 | 189 | 8/27/2023 |
| 1.0.0-rc4 | 180 | 8/27/2023 |
| 1.0.0-rc3 | 187 | 8/27/2023 |
| 1.0.0-rc2 | 181 | 8/26/2023 |
| 1.0.0-rc1 | 197 | 8/26/2023 |