WebCrawlerApi 1.0.0
WebCrawler API .NET SDK
A .NET SDK for interacting with WebCrawlerAPI, a web crawling and scraping service.
To use the API, you need an API key from WebCrawlerAPI.
Read the documentation at WebCrawlerAPI Docs for more information.
Requirements
- .NET 7.0 or higher
Installation
Install the package via NuGet:
dotnet add package WebCrawlerApi
Usage
using WebCrawlerApi;
using WebCrawlerApi.Models;
// Initialize the client
var crawler = new WebCrawlerApiClient("YOUR_API_KEY");
// Synchronous crawling (blocks until completion)
var job = await crawler.CrawlAndWaitAsync(
    url: "https://example.com",
    scrapeType: "markdown",
    itemsLimit: 10,
    webhookUrl: "https://yourserver.com/webhook",
    allowSubdomains: false,
    maxPolls: 100 // Optional: maximum number of status checks
);
Console.WriteLine($"Job completed with status: {job.Status}");
// Access job items and their content
foreach (var item in job.JobItems)
{
    Console.WriteLine($"Page title: {item.Title}");
    Console.WriteLine($"Original URL: {item.OriginalUrl}");
    Console.WriteLine($"Item status: {item.Status}");

    // Get the content based on job's scrape_type
    // Returns null if item is not in "done" status
    var content = await item.GetContentAsync();
    if (content != null)
    {
        Console.WriteLine($"Content length: {content.Length}");
        Console.WriteLine($"Content preview: {content[..Math.Min(200, content.Length)]}...");
    }
    else
    {
        Console.WriteLine("Content not available or item not done");
    }
}
// Or use asynchronous crawling
var response = await crawler.CrawlAsync(
    url: "https://example.com",
    scrapeType: "markdown",
    itemsLimit: 10,
    webhookUrl: "https://yourserver.com/webhook",
    allowSubdomains: false
);
// Get the job ID from the response
var jobId = response.Id;
Console.WriteLine($"Crawling job started with ID: {jobId}");
// Check job status and get results
job = await crawler.GetJobAsync(jobId);
Console.WriteLine($"Job status: {job.Status}");
// Access job details
Console.WriteLine($"Crawled URL: {job.Url}");
Console.WriteLine($"Created at: {job.CreatedAt:yyyy-MM-dd HH:mm:ss}");
Console.WriteLine($"Number of items: {job.JobItems.Count}");
// Cancel a running job if needed
var message = await crawler.CancelJobAsync(jobId);
Console.WriteLine($"Cancellation response: {message}");
API Methods
CrawlAndWaitAsync()
Starts a new crawling job and waits for its completion. This method will continuously poll the job status until:
- The job reaches a terminal state (done, error, or cancelled)
- The maximum number of polls is reached (default: 100)
The polling interval is determined by the server's RecommendedPullDelayMs or defaults to 5 seconds.
CrawlAsync()
Starts a new crawling job and returns immediately with a job ID. Use this when you want to handle polling and status checks yourself, or when using webhooks.
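If you manage polling yourself, a loop like the one below is one way to do it. This is a minimal sketch that reuses the crawler client from the Usage section; it assumes Status is the plain string state listed under CrawlAndWaitAsync() and that the job model exposes RecommendedPullDelayMs as a numeric property (both are assumptions about the model, so adjust to the actual types):

// Start the job without waiting for completion
var asyncResponse = await crawler.CrawlAsync(
    url: "https://example.com",
    scrapeType: "markdown",
    itemsLimit: 10
);

// Poll until the job reaches a terminal state or the poll budget runs out
var polledJob = await crawler.GetJobAsync(asyncResponse.Id);
var polls = 1;
while (polledJob.Status != "done" && polledJob.Status != "error" && polledJob.Status != "cancelled" && polls < 100)
{
    // Use the server's recommended delay when present, otherwise fall back to 5 seconds
    // (RecommendedPullDelayMs as a job property is an assumption)
    var delayMs = polledJob.RecommendedPullDelayMs > 0 ? polledJob.RecommendedPullDelayMs : 5000;
    await Task.Delay(delayMs);

    polledJob = await crawler.GetJobAsync(asyncResponse.Id);
    polls++;
}

// Optionally give up and cancel a job that did not finish within the budget
if (polledJob.Status != "done" && polledJob.Status != "error" && polledJob.Status != "cancelled")
{
    await crawler.CancelJobAsync(asyncResponse.Id);
}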
GetJobAsync()
Retrieves the current status and details of a specific job.
CancelJobAsync()
Cancels a running job. Items that are neither in progress nor already completed will be marked as cancelled and will not be charged.
Parameters
Crawl Methods (CrawlAndWaitAsync and CrawlAsync)
- url (required): The seed URL where the crawler starts. Can be any valid URL.
- scrapeType (default: "html"): The type of scraping you want to perform. Can be "html", "cleaned", or "markdown".
- itemsLimit (default: 10): The crawler will stop when it reaches this limit of pages for this job.
- webhookUrl (optional): The URL where the server will send a POST request once the task is completed.
- allowSubdomains (default: false): If true, the crawler will also crawl subdomains.
- whitelistRegexp (optional): A regular expression to whitelist URLs. Only URLs that match the pattern will be crawled (see the example after this list).
- blacklistRegexp (optional): A regular expression to blacklist URLs. URLs that match the pattern will be skipped.
- maxPolls (optional, CrawlAndWaitAsync only): Maximum number of status checks before returning (default: 100)
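For example, here is a minimal sketch (continuing with the crawler client from the Usage section) that combines whitelistRegexp and blacklistRegexp to keep a crawl on a specific path; the URL and regular expressions are illustrative only:

// Crawl only pages under /blog/ and skip tag listing pages
var blogJob = await crawler.CrawlAndWaitAsync(
    url: "https://example.com/blog",
    scrapeType: "markdown",
    itemsLimit: 50,
    whitelistRegexp: @"^https://example\.com/blog/.*",
    blacklistRegexp: @".*/tag/.*"
);
Console.WriteLine($"Crawled {blogJob.JobItems.Count} blog pages");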
JobItem Properties
Each JobItem object represents a crawled page and contains the following (a short example follows the list):
- Id: The unique identifier of the item
- JobId: The parent job identifier
- OriginalUrl: The URL of the page
- PageStatusCode: The HTTP status code of the page request
- Status: The status of the item (new, in_progress, done, error)
- Title: The page title
- CreatedAt: The date when the item was created
- Cost: The cost of the item in $
- ReferredUrl: The URL where the page was referred from
- LastError: Any error message if the item failed
- GetContentAsync(): Method to get the page content based on the job's ScrapeType (html, cleaned, or markdown). Returns null if the item's status is not "done" or if content is not available. Content is automatically fetched and cached when accessed.
- RawContentUrl: URL to the raw content (if available)
- CleanedContentUrl: URL to the cleaned content (if ScrapeType is "cleaned")
- MarkdownContentUrl: URL to the markdown content (if ScrapeType is "markdown")
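As a short example of reading these properties (assuming a completed job from the Usage section; comparing Status to a plain string is an assumption about the model):

// Summarize each crawled item once the job has finished
foreach (var item in job.JobItems)
{
    if (item.Status == "error")
    {
        Console.WriteLine($"{item.OriginalUrl} failed: {item.LastError}");
        continue;
    }

    Console.WriteLine($"{item.Title} ({item.PageStatusCode}) cost ${item.Cost}, referred from {item.ReferredUrl}");
}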
License
MIT License
Product | Compatible and additional computed target framework versions |
---|---|
.NET | net9.0 is compatible. net9.0-android, net9.0-browser, net9.0-ios, net9.0-maccatalyst, net9.0-macos, net9.0-tvos, and net9.0-windows were computed. |
Dependencies
- net9.0: No dependencies.
Version | Downloads | Last updated |
---|---|---|
1.0.0 | 107 | 1/1/2025 |