WebFlux 0.1.9
WebFlux
A .NET SDK for preprocessing web content for RAG (Retrieval-Augmented Generation) systems.
Overview
WebFlux processes web content into chunks optimized for RAG systems. It handles web crawling, content extraction, and intelligent chunking with support for multiple content formats.
Installation
dotnet add package WebFlux
Quick Start
using WebFlux;
using Microsoft.Extensions.DependencyInjection;
var services = new ServiceCollection();
// Register your AI service implementations
services.AddScoped<ITextEmbeddingService, YourEmbeddingService>();
services.AddScoped<ITextCompletionService, YourLLMService>(); // Optional
// Add WebFlux
services.AddWebFlux();
var provider = services.BuildServiceProvider();
var processor = provider.GetRequiredService<IWebContentProcessor>();
// Process a website
await foreach (var result in processor.ProcessWithProgressAsync("https://example.com"))
{
    if (result.IsSuccess && result.Result != null)
    {
        foreach (var chunk in result.Result)
        {
            Console.WriteLine($"Chunk {chunk.ChunkIndex}: {chunk.Content}");
        }
    }
}
Features
- Interface-Based Design: Bring your own AI services (OpenAI, Anthropic, Azure, local models)
- Multiple Chunking Strategies: Auto, Smart, Semantic, Intelligent, MemoryOptimized, Paragraph, FixedSize, DomStructure
- Content Formats: HTML, Markdown, JSON, XML, PDF
- Web Standards: robots.txt, sitemap.xml, ai.txt, llms.txt, manifest.json
- Streaming: Process large websites incrementally with IAsyncEnumerable
- Parallel Processing: Concurrent crawling and processing
- Rich Metadata: Web document metadata extraction (SEO, Open Graph, Schema.org, Twitter Cards)
- Progress Tracking: Real-time batch crawling progress with detailed statistics
Chunking Strategies
| Strategy | Use Case |
|---|---|
| Auto | Automatically selects the best strategy based on the content |
| Smart | Structured HTML documentation |
| Semantic | General web pages and articles |
| Intelligent | Blogs and knowledge bases |
| MemoryOptimized | Large documents with memory constraints |
| Paragraph | Markdown with natural boundaries |
| FixedSize | Uniform chunks for testing |
| DomStructure | HTML DOM structure-based chunking preserving semantic boundaries |
Core Interfaces
WebFlux uses the Interface Provider pattern. You provide AI service implementations, and WebFlux handles crawling, extraction, and chunking.
Required AI Services
ITextEmbeddingService (Required)
Vector embedding generation for semantic chunking:
public interface ITextEmbeddingService
{
    Task<float[]> GetEmbeddingAsync(string text, CancellationToken cancellationToken = default);
    Task<IReadOnlyList<float[]>> GetEmbeddingsAsync(IReadOnlyList<string> texts, CancellationToken cancellationToken = default);
    int MaxTokens { get; }
    int EmbeddingDimension { get; }
}
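As a rough illustration of what an implementation of this interface can look like, here is a minimal test double that needs no external AI provider. Everything except the interface members is an assumption: `HashingEmbeddingService` is a hypothetical name, the `MaxTokens` and `EmbeddingDimension` values are arbitrary, and the character-bucket "embedding" is a placeholder with no semantic meaning. The interface is re-declared only so the sketch compiles on its own; in a real project, drop that declaration and use the one shipped in the WebFlux package.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Copied from the README so this sketch is self-contained.
// Remove this declaration when referencing the WebFlux package.
public interface ITextEmbeddingService
{
    Task<float[]> GetEmbeddingAsync(string text, CancellationToken cancellationToken = default);
    Task<IReadOnlyList<float[]>> GetEmbeddingsAsync(IReadOnlyList<string> texts, CancellationToken cancellationToken = default);
    int MaxTokens { get; }
    int EmbeddingDimension { get; }
}

// Hypothetical test double: deterministic, offline, not semantically meaningful.
public sealed class HashingEmbeddingService : ITextEmbeddingService
{
    public int MaxTokens => 8192;         // assumed value, for illustration only
    public int EmbeddingDimension => 64;  // assumed value, for illustration only

    public Task<float[]> GetEmbeddingAsync(string text, CancellationToken cancellationToken = default)
    {
        cancellationToken.ThrowIfCancellationRequested();

        // Count characters into buckets chosen by code point.
        var vector = new float[EmbeddingDimension];
        foreach (char c in text)
        {
            vector[c % EmbeddingDimension] += 1f;
        }

        // L2-normalize so that non-empty inputs yield unit-length vectors.
        double sumOfSquares = 0;
        foreach (float v in vector)
        {
            sumOfSquares += v * v;
        }
        if (sumOfSquares > 0)
        {
            float norm = (float)Math.Sqrt(sumOfSquares);
            for (int i = 0; i < vector.Length; i++)
            {
                vector[i] /= norm;
            }
        }

        return Task.FromResult(vector);
    }

    public async Task<IReadOnlyList<float[]>> GetEmbeddingsAsync(IReadOnlyList<string> texts, CancellationToken cancellationToken = default)
    {
        // Embed each text in order; a real provider would typically batch the request.
        var results = new List<float[]>(texts.Count);
        foreach (string text in texts)
        {
            results.Add(await GetEmbeddingAsync(text, cancellationToken));
        }
        return results;
    }
}
```

A class like this could be registered in place of `YourEmbeddingService` in the Quick Start, for example `services.AddScoped<ITextEmbeddingService, HashingEmbeddingService>();`, to exercise the pipeline in tests before wiring up a real embedding provider.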
Optional AI Services
ITextCompletionService (Optional)
LLM text completion for multimodal processing and content reconstruction:
public interface ITextCompletionService
{
    Task<string> CompleteAsync(string prompt, TextCompletionOptions? options = null, CancellationToken cancellationToken = default);
    IAsyncEnumerable<string> CompleteStreamAsync(string prompt, TextCompletionOptions? options = null, CancellationToken cancellationToken = default);
    Task<bool> IsAvailableAsync(CancellationToken cancellationToken = default);
}
IImageToTextService (Optional)
Image-to-text conversion for multimodal content:
public interface IImageToTextService
{
    Task<string> ConvertImageToTextAsync(string imageUrl, ImageToTextOptions? options = null, CancellationToken cancellationToken = default);
    Task<string> ExtractTextFromImageAsync(string imageUrl, CancellationToken cancellationToken = default);
    Task<bool> IsAvailableAsync(CancellationToken cancellationToken = default);
}
Main Processor
IWebContentProcessor
The main entry point for all web content processing:
// Single URL processing
var chunks = await processor.ProcessUrlAsync("https://example.com");
// Website crawling (streaming)
await foreach (var chunk in processor.ProcessWebsiteAsync(url, crawlOptions, chunkOptions))
{
    // Process chunk
}
// Batch processing
var results = await processor.ProcessUrlsBatchAsync(urls, chunkOptions);
Extensibility
IChunkingStrategy
Implement custom chunking strategies:
public interface IChunkingStrategy
{
    string Name { get; }
    string Description { get; }
    Task<IReadOnlyList<WebContentChunk>> ChunkAsync(ExtractedContent content, ChunkingOptions? options = null, CancellationToken cancellationToken = default);
}
IProgressReporter & IEventPublisher
Monitor processing progress and subscribe to system events:
// Progress monitoring
await foreach (var progress in progressReporter.MonitorProgressAsync(jobId))
{
    Console.WriteLine($"Progress: {progress.Progress:P0}");
}
// Event subscription
eventPublisher.Subscribe<PageProcessedEvent>(async evt => await LogEvent(evt));
For detailed implementation examples, see the Tutorial.
Configuration
var options = new CrawlOptions
{
    MaxDepth = 3,
    MaxPages = 100,
    RespectRobotsTxt = true,
    UserAgent = "MyBot/1.0"
};
var chunkOptions = new ChunkingOptions
{
    Strategy = "Auto",
    MaxChunkSize = 512,
    OverlapSize = 64
};
await foreach (var result in processor.ProcessWithProgressAsync(url, options, chunkOptions))
{
    // Handle results
}
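To make the interaction between `MaxChunkSize` and `OverlapSize` more concrete, the following sketch shows a plain sliding window with overlap. It is not WebFlux's implementation: `OverlapChunker.Split` is a hypothetical helper, and it assumes both sizes are counted in characters, which the README does not specify (they may well be tokens).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper illustrating fixed-size chunking with overlap.
// Assumption: sizes are measured in characters.
public static class OverlapChunker
{
    public static List<string> Split(string text, int maxChunkSize, int overlapSize)
    {
        if (maxChunkSize <= 0)
            throw new ArgumentOutOfRangeException(nameof(maxChunkSize));
        if (overlapSize < 0 || overlapSize >= maxChunkSize)
            throw new ArgumentOutOfRangeException(nameof(overlapSize));

        var chunks = new List<string>();

        // Each window starts (maxChunkSize - overlapSize) characters after the previous one,
        // so consecutive chunks share overlapSize characters.
        int step = maxChunkSize - overlapSize;

        for (int start = 0; start < text.Length; start += step)
        {
            int length = Math.Min(maxChunkSize, text.Length - start);
            chunks.Add(text.Substring(start, length));

            // Stop once the window has reached the end of the text.
            if (start + length >= text.Length)
                break;
        }

        return chunks;
    }
}
```

Under these assumptions, the values from the configuration above (512 and 64) give a step of 448, so a 1,000-character input would produce three chunks of 512, 512, and 104 characters.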
Documentation
- Tutorial - Step-by-step guide with practical examples
- Architecture - System design and pipeline
- Interfaces - API contracts and implementation guide
- Chunking Strategies - Detailed strategy guide
- Changelog - Version history and release notes
License
MIT License - see LICENSE file for details.
Support
- Issues: GitHub Issues
- Package: NuGet
| Product | Versions Compatible and additional computed target framework versions. |
|---|---|
| .NET | net10.0 is compatible. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed. |
- net10.0
  - AngleSharp (>= 1.4.0)
  - HtmlAgilityPack (>= 1.12.4)
  - Markdig (>= 0.44.0)
  - Microsoft.Extensions.Caching.Abstractions (>= 10.0.2)
  - Microsoft.Extensions.Caching.Memory (>= 10.0.2)
  - Microsoft.Extensions.Configuration (>= 10.0.2)
  - Microsoft.Extensions.Configuration.Abstractions (>= 10.0.2)
  - Microsoft.Extensions.Configuration.Binder (>= 10.0.2)
  - Microsoft.Extensions.DependencyInjection (>= 10.0.2)
  - Microsoft.Extensions.DependencyInjection.Abstractions (>= 10.0.2)
  - Microsoft.Extensions.Http (>= 10.0.2)
  - Microsoft.Extensions.Logging (>= 10.0.2)
  - Microsoft.Extensions.Logging.Abstractions (>= 10.0.2)
  - Microsoft.Playwright (>= 1.57.0)
  - Polly (>= 8.6.5)
  - Polly.Extensions.Http (>= 3.0.0)
  - YamlDotNet (>= 16.3.0)
NuGet packages (1)
Showing the top 1 NuGet packages that depend on WebFlux:
| Package | Downloads |
|---|---|
| FluxIndex.SDK: FluxIndex SDK - Complete RAG infrastructure with FileFlux integration, FluxCurator preprocessing, and FluxImprover quality enhancement. AI providers are externally injectable. | |
GitHub repositories
This package is not used by any popular GitHub repositories.