ByteMux 1.0.0
ByteMux
A low-latency outbound multiplexer for high-throughput .NET servers that need to merge targeted and broadcast traffic into per-connection TCP streams.
What It Is
ByteMux is a purpose-built outbound convergence layer for workloads where many independent producers need to inject data into the same connection.
It combines:
- targeted per-connection delivery
- shared broadcast delivery
- lock-free hot-path signaling
- shard-owned draining
- socket-oriented batching
- near-zero steady-state allocations
ByteMux is not a generic queue library or messaging framework. It is a complete solution for one specific problem: per-connection fan-in with low-latency egress.
When to Use ByteMux
ByteMux is a good fit when:
- multiple threads or subsystems need to inject packets into the same TCP connection
- you need both targeted traffic and shared broadcast traffic
- latency and GC pressure matter more than general-purpose flexibility
- you want explicit control over scheduling, batching, and drop behavior
- you prefer shard-owned serialized egress over ad hoc per-source flushing
ByteMux is probably not the right fit when:
- standard async socket writes are already sufficient
- you only need a generic queue, channel, or ring buffer
- your workload does not have true multi-producer fan-in
- simplicity matters more than extreme control over the hot path
- you need a full networking framework or actor runtime
Why It Exists
ByteMux was built for a high-throughput game proxy that sits between players and a game server. The proxy forwards packets in both directions, but it also needs to inject new packets into a player's stream from outside the normal packet flow: targeted messages for individual connections and broadcasts for large groups of subscribers.
That requirement becomes difficult under real constraints. The proxy handles thousands of concurrent connections. Each connection has its own processing pipeline. Injected packets must merge into that pipeline without stalling it, with tight latency targets, minimal GC pressure, and no locks on the hot path.
We evaluated existing approaches before building ByteMux. LMAX Disruptor for .NET is excellent, but it is a concurrency primitive rather than a complete outbound convergence model. Its shape did not match our workload: many independent producers converging on one connection, not one producer writing to many consumers.
We also explored channel-based and pipeline-based designs, but for this latency budget they still left too much coordination, scheduling, and policy overhead on the hot path. We did not need a better queue in isolation. We needed a purpose-built solution for per-connection fan-in and broadcast delivery.
That is what ByteMux provides.
Key Ideas
- One TxMux per connection: each live connection owns its own outbound multiplexer.
- Two delivery paths: ByteQueue handles targeted traffic. ByteTopic handles broadcast traffic.
- Dirty-bit signaling: producers mark sources as ready using per-connection dirty bitmasks.
- Shard-owned draining: dedicated shard workers drain signaled muxes and serialize socket egress.
- Coalesced output: drainers batch frames before writing to the sink.
Status
ByteMux is currently source-only and intended for advanced .NET networking scenarios where the workload and latency targets justify a specialized outbound pipeline.
Non-Goals
ByteMux is not:
- a drop-in replacement for System.Threading.Channels
- a generic ring buffer package
- a general-purpose async messaging framework
- an actor runtime
- a full socket framework
Architecture
The execution model is simple:
- producers enqueue or append data
- they signal readiness through dirty bitmasks
- shard workers drain signaled muxes
- drainers coalesce frames
- the sink writes the batch

    External Thread A          External Thread B  Topic Writer
            |                          |                |
     QueuePublisher             QueuePublisher   TopicPublisher
     .NotifyDirty()             .NotifyDirty()      .Append()
            |                          |                |
            +----------+---------------+     signals all subscribers
                       |                                |
                       v                                v
    +---------------------------------------------------------+
    | TxMux (one per connection)                              |
    | queueDirtyMask : uint32 (1 bit per registered queue)    |
    | topicDirtyMask : uint32 (1 bit per subscribed topic)    |
    +---------------------+-----------------------------------+
                          | ShardWork (gate + latch)
                          v
                 +------------------+
                 | TxShard          |  (dedicated worker thread)
                 | worker loop      |
                 +--------+---------+
                          |
              TxMuxDrainer.DrainOnce()
            (coalesce frames into batch)
                          |
                  ITxSink.TrySend()
            (SocketSaeaTxSink -> socket)

Components
| Component | Responsibility |
|---|---|
| TxFabric | Owns the shard pool and assigns connections to shards |
| TxMux | Per-connection multiplexer with queue/topic dirty masks |
| TxShard | Dedicated worker thread that drains scheduled muxes |
| ShardWork | Gate and latch used to avoid duplicate scheduling and capture signal-during-drain races |
| ByteQueue | SPSC ring buffer for targeted packet delivery |
| ByteTopic | Single-writer, multi-reader append log for broadcast delivery |
| QueuePublisher | Enqueue and signal handle for a ByteQueue |
| TopicPublisher | Append and fan-out handle for a ByteTopic |
| TopicSubscription | Subscription handle for a topic; disposing it unsubscribes and releases the slot |
| TxMuxDrainer | Drain policy abstraction for batching and fairness |
| ITxSink | Pluggable output backend such as socket, stream, or null sink |
| IEgressTransformer | Optional in-place transform hook for encryption, framing, or compression |
Hard Limits
Each TxMux can hold at most:
- 32 registered queues
- 32 topic subscriptions
These limits are fixed by the internal SlotTable32 used for queue and topic slot management. They cannot be expanded at runtime.
Both TryAddQueue and TrySubscribe return false when their respective table is full. They do not throw and do not emit an error automatically.
That means their return values must always be checked. Ignoring them silently drops the source registration.
If a single connection needs more than 32 independent targeted sources, or more than 32 topic subscriptions, the design should be revisited. Common approaches include multiplexing multiple upstream producers into fewer ByteQueue instances or sharing a queue across multiple producers before data reaches ByteMux.
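To make the fixed limit concrete, here is a minimal sketch of how a 32-slot table can be backed by a single 32-bit word. It is not the library's SlotTable32, whose source is not shown here; the type SlotTable32Sketch and its members are hypothetical. It only illustrates why a full table can be reported with a plain false return, and why an unreleased slot keeps counting against the limit.

```csharp
using System;
using System.Numerics;
using System.Threading;

// Hypothetical illustration only: one bit per slot in a single uint.
public sealed class SlotTable32Sketch
{
    private uint _used; // bit i set => slot i is taken

    // Claims the lowest free slot. Returns false (no exception) when all 32 are taken.
    public bool TryAcquire(out int slot)
    {
        while (true)
        {
            uint used = Volatile.Read(ref _used);
            if (used == uint.MaxValue)
            {
                slot = -1;
                return false;
            }

            // ~used is non-zero here, so the result is in the range 0..31.
            int candidate = BitOperations.TrailingZeroCount(~used);
            uint updated = used | (1u << candidate);

            // Retry if another thread changed the word in the meantime.
            if (Interlocked.CompareExchange(ref _used, updated, used) == used)
            {
                slot = candidate;
                return true;
            }
        }
    }

    // Frees a slot so it can be claimed again. A slot that is never
    // released stays counted against the 32-slot limit.
    public void Release(int slot)
    {
        if ((uint)slot >= 32u) throw new ArgumentOutOfRangeException(nameof(slot));
        Interlocked.And(ref _used, ~(1u << slot));
    }
}
```

Because the whole table is one word, there is nothing to grow at runtime, which matches the fixed limit described above.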
Getting Started
ByteMux is currently source-only. Add the src/ByteMux project to your solution.
A typical setup looks like this:
1. start a TxFabric
2. attach one TxMux per live connection
3. register one or more ByteQueue instances for targeted traffic
4. register shared ByteTopic instances for broadcast traffic
5. let shard workers drain signaled muxes into the configured sink
Targeted Queue Example
Use a queue when one connection needs targeted packets from one or more external producers.

    // Create the fabric (owns shard threads)
    var fabric = new TxFabric(new TxFabricOptions { ShardCount = 4 });
    fabric.Start();

    // Create a drainer and attach it to a connection
    var sink = new SocketSaeaTxSink(socket);
    var drainer = new FairCoalescedTxMuxDrainer(
        sink,
        new FairCoalescedDrainerOptions
        {
            BatchSize = 64 * 1024
        });
    TxMux mux = fabric.AttachDrainer(drainer);

    // Register a queue on this connection
    var queue = new ByteQueue(new ByteQueueOptions
    {
        ByteCapacity = 64 * 1024,
        DescCapacity = 1024
    });
    if (!mux.TryAddQueue(queue, out QueuePublisher? publisher))
        throw new InvalidOperationException("TxMux queue table is full.");

    // From the queue's producer thread (ByteQueue is SPSC, so one producer per queue):
    // enqueue and signal
    ReadOnlySpan<byte> packet = ...;
    if (queue.TryEnqueue(packet))
        publisher!.NotifyDirty();

Broadcast Topic Example
Use a topic when one payload should be written once and delivered to many subscribed connections.

    // Create a shared topic (one writer, many readers)
    var topic = new ByteTopic(new ByteTopicOptions
    {
        ByteCapacity = 256 * 1024,
        DescCapacity = 4096
    });
    var topicPublisher = new TopicPublisher(topic);

    // Subscribe each connection's mux to the topic
    if (!topicPublisher.TrySubscribe(
            mux,
            out TopicPublisher.TopicSubscription subscription))
    {
        throw new InvalidOperationException("TxMux topic table is full.");
    }

    // From the writer thread: append once, all subscribers are signaled
    ReadOnlySpan<byte> broadcastPacket = ...;
    topicPublisher.Append(broadcastPacket);

Shard Selection

    // The three calls below are alternatives; use one per connection.

    // Round-robin (default)
    TxMux roundRobinMux = fabric.AttachDrainer(drainer);

    // Pin by remote endpoint (stable assignment per client IP:port)
    TxMux pinnedMux = fabric.AttachDrainer(
        drainer,
        ShardHint.FromEndPoint(remoteEndPoint));

    // Explicit shard index
    TxMux indexedMux = fabric.AttachDrainer(
        drainer,
        ShardHint.FromIndex(2));

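How ShardHint.FromEndPoint maps an endpoint to a shard is not described in this README. As a rough illustration of what "stable assignment per client IP:port" can mean, the sketch below hashes the address bytes and port with FNV-1a and reduces the result modulo the shard count. The type ShardPickerSketch and the choice of hash are assumptions made for this example, not the library's behavior.

```csharp
using System;
using System.Net;

// Hypothetical illustration only: deterministic endpoint-to-shard mapping.
public static class ShardPickerSketch
{
    public static int PickShard(IPEndPoint endPoint, int shardCount)
    {
        if (endPoint is null) throw new ArgumentNullException(nameof(endPoint));
        if (shardCount <= 0) throw new ArgumentOutOfRangeException(nameof(shardCount));

        unchecked
        {
            uint hash = 2166136261u; // FNV-1a 32-bit offset basis

            // Mix in the address bytes (4 for IPv4, 16 for IPv6).
            foreach (byte b in endPoint.Address.GetAddressBytes())
                hash = (hash ^ b) * 16777619u;

            // Mix in the port, low byte first.
            hash = (hash ^ (uint)(endPoint.Port & 0xFF)) * 16777619u;
            hash = (hash ^ (uint)((endPoint.Port >> 8) & 0xFF)) * 16777619u;

            return (int)(hash % (uint)shardCount);
        }
    }
}
```

The same IP:port always lands on the same index for a given shard count, so a reconnecting client with an unchanged endpoint would be served by the same worker.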
Lifecycle and Disposal
ByteMux has a strict ownership hierarchy and disposal order.
Ownership model:
- TxFabric owns the shard worker threads
- TxFabric does not own TxMux instances
- the caller owns each TxMux
- TxMux must be disposed before TxFabric
Disposing TxFabric first stops the shard threads while a TxMux may still hold references to shard-owned scheduling state. Always dispose muxes before disposing the fabric.
Recommended disposal order:
1. QueuePublisher.Dispose()
2. TopicSubscription.Dispose()
3. TopicPublisher.Dispose()
4. TxMux.Dispose()
5. TxFabric.Dispose()
Failing to dispose a QueuePublisher or TopicSubscription leaks a slot in the corresponding table, which continues to count against the per-TxMux 32-slot limit.
TopicSubscription is itself IDisposable. Unsubscribing is done by disposing the subscription, not by calling a removal method on TopicPublisher.
Teardown Example

    // Teardown in reverse ownership order
    publisher?.Dispose();
    subscription.Dispose();
    topicPublisher.Dispose();
    mux.Dispose();
    fabric.Dispose();

Core Concepts
Queue Path
ByteQueue is a lock-free SPSC ring buffer. One producer enqueues frames while the shard worker dequeues and drains them. The ring is backed by pooled memory, so the hot path performs no allocation. When the ring is full, TryEnqueue returns false and the caller decides how to handle backpressure.
Zero-Copy Queue Writes
TryEnqueue(ReadOnlySpan<byte>) is the simplest way to publish a frame, but it copies the caller's data into the ring.
For callers that want to construct frames directly inside the ring, or already hold data in a form such as ReadOnlySequence<byte>, ByteQueue also supports an acquire/commit path that avoids that copy.

    // 1. Reserve space in the ring
    if (!queue.TryAcquire(payloadLength, out FrameLease lease))
    {
        // ring is full
        return;
    }

    // 2. Write directly into the lease
    lease.CopyFrom(sourceSpan);

    // 3. Commit the frame
    queue.Commit(in lease);

    // 4. Signal as normal
    publisher!.NotifyDirty();

FrameLease is a two-segment write window over the circular ring:
- First
- Second
When the frame fits without wrapping the ring boundary, Second is empty and the lease is effectively single-segment. When the frame crosses the ring boundary, the payload is split across both spans.
FrameLease.CopyFrom(ReadOnlySpan<byte>) and FrameLease.CopyFrom(in ReadOnlySequence<byte>) handle that split automatically.
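The split across the ring boundary can be pictured with a small standalone sketch. It does not use FrameLease itself; WrapCopySketch and CopyIntoRing are hypothetical names, and the ring is a plain byte array. The two spans play the roles described above for First and Second.

```csharp
using System;

// Hypothetical illustration only: a two-segment copy into a circular buffer.
public static class WrapCopySketch
{
    // Copies source into ring starting at offset, wrapping to index 0
    // when the payload crosses the end of the array.
    public static void CopyIntoRing(byte[] ring, int offset, ReadOnlySpan<byte> source)
    {
        if ((uint)offset >= (uint)ring.Length)
            throw new ArgumentOutOfRangeException(nameof(offset));
        if (source.Length > ring.Length)
            throw new ArgumentException("Payload is larger than the ring.", nameof(source));

        // First segment: from offset up to the end of the ring (or the whole payload).
        int firstLength = Math.Min(source.Length, ring.Length - offset);
        Span<byte> first = ring.AsSpan(offset, firstLength);

        // Second segment: whatever is left, starting at index 0. Empty when no wrap occurs.
        Span<byte> second = ring.AsSpan(0, source.Length - firstLength);

        source.Slice(0, firstLength).CopyTo(first);
        source.Slice(firstLength).CopyTo(second);
    }
}
```

With an 8-byte ring, writing 4 bytes at offset 6 puts two bytes at indices 6 and 7 and the remaining two at indices 0 and 1. Writing the same payload at offset 2 leaves the second segment empty.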
Prefer acquire/commit when:
- the frame is being constructed in-place
- the source data already exists as a ReadOnlySequence<byte>
- eliminating the extra copy matters
For callers that cannot handle split writes, TryAcquireContiguous is available as a stricter variant that either returns a guaranteed single-segment lease or indicates that a wrap would occur.
For advanced lease access, LeasePrimitives provides wrap-aware helpers for copying and typed reads, including operations such as:
- TryCopyOut
- TryReadUInt16LE
- TryReadUInt32LE
- TryReadInt32LE
These helpers handle boundary crossing without heap allocation.
Topic Path
ByteTopic is an append-only circular log with a single writer and any number of independent readers. Each subscriber owns its own cursor. The writer publishes once, and each subscriber advances independently.
If a slow subscriber is lapped by the writer, FastForwardIfLapped advances that subscriber to the oldest retained message and records the drop count. This keeps the writer unblocked.
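The cursor arithmetic behind a fast-forward can be sketched as follows. This is a simplified model that tracks message sequence numbers only and assumes retention is limited by a message count; the real topic is also bounded by ByteCapacity. LapSketch and FastForward are hypothetical names for this example.

```csharp
using System;

// Hypothetical illustration only: sequence-number model of a lapped subscriber.
public static class LapSketch
{
    // writerNext: sequence number the writer will assign next.
    // capacity:   number of messages the log retains.
    // Returns the number of messages the subscriber missed (0 if not lapped)
    // and moves the cursor to the oldest retained message when needed.
    public static long FastForward(ref long cursor, long writerNext, int capacity)
    {
        long oldestRetained = Math.Max(0, writerNext - capacity);
        if (cursor >= oldestRetained)
            return 0; // still within the retained window

        long dropped = oldestRetained - cursor;
        cursor = oldestRetained;
        return dropped;
    }
}
```

For example, with a capacity of 4 and a writer about to assign sequence 10, messages 6 through 9 are retained. A subscriber whose cursor is at 2 is moved to 6 and records 4 dropped messages, and the writer never has to wait for it.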
TopicStartMode controls where a new subscriber begins:
| Value | Behavior |
|---|---|
| Head | Receive only future messages |
| OldestRetained | Replay everything currently retained |
Drainer Strategies
| Drainer | When to Use |
|---|---|
| FairCoalescedTxMuxDrainer | General-purpose option with fairness controls and IEgressTransformer support |
| CoalescedTxMuxDrainer | Simpler strategy that drains queues first, then topics, into one pooled buffer |
| DirectStreamTxMuxDrainer | Writes directly to a Stream when batching is handled elsewhere |
All drainers return a DrainResult that drives scheduling:
| Result | Meaning |
|---|---|
| Completed | Batch sent; may cool down if no pending signals remain |
| Pending | Async send is in flight; shard will be rescheduled on completion |
| Busy | Sink cannot accept data yet; requeue immediately |
| NoWork | Nothing to drain; park |
| Closed | Sink is closed; stop draining |
Egress Transforms
IEgressTransformer allows in-place packet transformation before a batch reaches the sink. Typical uses include encryption, framing, and compression.
Implement MaxOverhead to declare how many bytes the transform may prepend, and implement Transform to perform the operation. The drainer reserves space before each packet so the source and destination spans can overlap safely.
Configuration
TxFabricOptions
| Property | Default | Description |
|---|---|---|
| ShardCount | Environment.ProcessorCount | Number of shard worker threads |
TxShardOptions
| Property | Default | Description |
|---|---|---|
| ThreadPriority | Normal | Worker thread priority |
| PostWorkSpinDuration | (none) | Spin duration after drain before parking |
ByteQueueOptions
| Property | Default | Description |
|---|---|---|
| ByteCapacity | (required) | Ring buffer size in bytes; must be a power of two |
| DescCapacity | (required) | Maximum frames in flight; must be a power of two |
| EnableLatencyTracking | false | Record enqueue timestamps for latency measurement |
ByteTopicOptions
| Property | Default | Description |
|---|---|---|
| ByteCapacity | (required) | Log size in bytes; must be a power of two |
| DescCapacity | (required) | Maximum retained messages; must be a power of two |
| EnableLatencyTracking | false | Record append timestamps for latency measurement |
FairCoalescedDrainerOptions
| Property | Default | Description |
|---|---|---|
| BatchSize | (required) | Coalescing buffer size in bytes |
| QueueSharePercent | 50 | Budget percentage reserved for queues; the remainder goes to topics |
| MaxMsgsPerBatch | (none) | Optional cap on messages per drain cycle |
| MaxBytesPerBatch | (none) | Optional cap on bytes per drain cycle |
| SourceBurstLimit | (none) | Optional cap on messages drained per source per cycle |
Benchmarks and Simulation
tests/ByteMux.Simulation stress-tests the full stack under realistic load. The simulation outputs per-second P50 and P99 drain latency along with GC allocation per tick.
dotnet run --project tests/ByteMux.Simulation -c Release
The simulation project is also configured for Native AOT publish:
dotnet publish tests/ByteMux.Simulation -c Release -r win-x64
Simulation Results
The simulation suite exercises ByteMux under steady-state multi-connection load using FairCoalescedTxMuxDrainer with a NullSink.
Test environment: AMD Ryzen 5 5600X (6 cores / 12 threads), Windows 10 IoT Enterprise LTSC
Run profile: 5-minute steady-state, first 30 seconds excluded as warmup, 600 connections across 6 shards, 64-byte packets
These results are directional indicators of library overhead under controlled conditions. They are useful for regression tracking and internal comparison, not as universal end-to-end performance claims.
Scenario A — 1 queue per connection
600 connections × 1 queue × 20 msg/s = 12,000 messages/second
- P50 drain latency: low single-digit microseconds
- P99 drain latency: tens of microseconds, comfortably below 100 µs
Scenario B — 3 queues per connection, 3 shared topics

    600 connections × 3 queues × 20 msg/s
    + 3 topics × 600 subscribers × 20 msg/s
    = 36,000 messages/second

- P50 drain latency: single- to low double-digit microseconds
- P99 drain latency: generally below 200 µs, with tail variation influenced by OS scheduling
tests/ByteMux.Benchmarks contains focused BenchmarkDotNet benchmarks for queue throughput, drainer fairness, and socket-versus-stream comparisons.
dotnet run --project tests/ByteMux.Benchmarks -c Release
Design Notes
Bitmask Deduplication
Interlocked.Or on a 32-bit dirty mask is the full signaling mechanism. If multiple producers signal the same connection during the same window, those signals collapse into a single drain cycle with no queued notifications and no locking.
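A minimal sketch of this kind of signaling, assuming one bit per source slot, is shown below. DirtyMaskSketch and its members are hypothetical and are not taken from the library. The sketch only shows how Interlocked.Or lets repeated signals collapse and how the previous value tells a producer whether it was the one that turned the mask from clean to dirty.

```csharp
using System;
using System.Threading;

// Hypothetical illustration only: a 32-bit dirty mask with lock-free set and take.
public sealed class DirtyMaskSketch
{
    private uint _mask;

    // Producer side: set the bit for one source slot.
    // Returns true only when the mask was empty before this call, which is
    // the one case where the caller would need to schedule a drain.
    public bool MarkDirty(int slot)
    {
        if ((uint)slot >= 32u) throw new ArgumentOutOfRangeException(nameof(slot));
        uint previous = Interlocked.Or(ref _mask, 1u << slot);
        return previous == 0;
    }

    // Drainer side: atomically take every pending bit and reset the mask.
    public uint TakeAll() => Interlocked.Exchange(ref _mask, 0u);
}
```

Any number of MarkDirty calls between two TakeAll calls produce at most one "schedule" answer, and the drainer then sees every slot that was signaled in that window as a single bit pattern.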
Gate and Latch Scheduling
ShardWork ensures a TxMux appears in a shard run queue at most once at a time. Its gate prevents duplicate scheduling, and its latch captures signals that arrive while a drain is already in progress.
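One way to build such a gate and latch from two integers is sketched below. This is an assumption about the general technique, not the actual ShardWork implementation; GateLatchSketch, Signal, BeginDrain, and EndDrain are hypothetical names.

```csharp
using System;
using System.Threading;

// Hypothetical illustration only: gate prevents duplicate scheduling,
// latch remembers signals that arrive while a drain is running.
public sealed class GateLatchSketch
{
    private int _gate;  // 1 while the mux is queued or being drained
    private int _latch; // 1 when a signal has arrived and not yet been consumed

    // Producer side. Returns true when the caller must enqueue the mux
    // on its shard; false when it is already queued or being drained.
    public bool Signal()
    {
        Volatile.Write(ref _latch, 1);
        return Interlocked.CompareExchange(ref _gate, 1, 0) == 0;
    }

    // Worker side, called just before draining: consume the latch.
    public void BeginDrain() => Interlocked.Exchange(ref _latch, 0);

    // Worker side, called after draining. Opens the gate, then checks whether
    // a signal slipped in during the drain. Returns true when the worker has
    // re-taken the gate and must drain again.
    public bool EndDrain()
    {
        Interlocked.Exchange(ref _gate, 0); // full fence before reading the latch
        return Volatile.Read(ref _latch) == 1
            && Interlocked.CompareExchange(ref _gate, 1, 0) == 0;
    }
}
```

A signal that arrives mid-drain finds the gate closed and only sets the latch. When the worker finishes, either it sees the latch and runs again, or a later Signal call finds the gate open and schedules the mux, so the signal is not lost and the mux is never queued twice.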
SPSC Queue Semantics
ByteQueue uses Volatile.Read and Volatile.Write on head and tail indices. Producer and consumer state remain separate, and only head/tail visibility crosses the boundary.
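The pattern can be illustrated with a much smaller queue than ByteQueue. The sketch below stores fixed-size long values instead of variable-length frames and is an assumption-laden teaching example, not the library's code; SpscRingSketch is a hypothetical name. It also shows why a power-of-two capacity is convenient: the index wraps with a bit mask.

```csharp
using System;
using System.Threading;

// Hypothetical illustration only: single-producer, single-consumer ring of longs.
public sealed class SpscRingSketch
{
    private readonly long[] _items;
    private readonly int _mask;
    private long _head; // next position to read; written only by the consumer
    private long _tail; // next position to write; written only by the producer

    public SpscRingSketch(int capacity)
    {
        if (capacity <= 0 || (capacity & (capacity - 1)) != 0)
            throw new ArgumentException("Capacity must be a power of two.", nameof(capacity));
        _items = new long[capacity];
        _mask = capacity - 1;
    }

    // Producer thread only.
    public bool TryEnqueue(long value)
    {
        long tail = _tail; // own index: a plain read is enough
        if (tail - Volatile.Read(ref _head) == _items.Length)
            return false; // full: the caller decides how to handle backpressure

        _items[tail & _mask] = value;
        Volatile.Write(ref _tail, tail + 1); // publish the slot after it is written
        return true;
    }

    // Consumer thread only.
    public bool TryDequeue(out long value)
    {
        long head = _head; // own index: a plain read is enough
        if (head == Volatile.Read(ref _tail))
        {
            value = 0;
            return false; // empty
        }

        value = _items[head & _mask];
        Volatile.Write(ref _head, head + 1); // hand the slot back to the producer
        return true;
    }
}
```

Each side writes only its own index and reads the other side's index through Volatile.Read, which is the "only head/tail visibility crosses the boundary" property described above.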
Topic Publish Ordering
ByteTopic writes the descriptor Seq field last with a Volatile.Write release. Readers observe Seq using Volatile.Read acquire semantics. A reader that sees a valid sequence number is guaranteed to also see the fully written payload.
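The ordering rule can be shown on a single slot. The sketch below is a reduced model, assuming one int payload and a slot that is not reused while a reader is looking at it; detecting overwrites in a circular log is outside its scope. SeqSlotSketch is a hypothetical name, not a ByteTopic type.

```csharp
using System;
using System.Threading;

// Hypothetical illustration only: payload first, sequence number last.
public sealed class SeqSlotSketch
{
    private long _seq = -1; // -1 means nothing has been published yet
    private int _payload;

    // Writer thread only.
    public void Publish(long seq, int payload)
    {
        _payload = payload;            // 1. write the payload
        Volatile.Write(ref _seq, seq); // 2. release: the payload write cannot move after this
    }

    // Reader thread. Succeeds only when the slot holds the expected sequence number.
    public bool TryRead(long expectedSeq, out int payload)
    {
        // Acquire: the payload read below cannot move before this read.
        if (Volatile.Read(ref _seq) != expectedSeq)
        {
            payload = 0;
            return false;
        }

        payload = _payload;
        return true;
    }
}
```

Because the sequence number is written last with release semantics and read first with acquire semantics, a reader that observes the expected number also observes the payload written before it.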
Drop-Oldest Broadcast Semantics
Topics never block the writer. When the log wraps, slow subscribers are automatically fast-forwarded to the oldest retained message. For broadcast-style workloads, dropping older frames is often preferable to applying backpressure to the writer.
License
SPDX-License-Identifier: MPL-2.0
| Product | Versions Compatible and additional computed target framework versions. |
|---|---|
| .NET | net9.0 is compatible. net9.0-android was computed. net9.0-browser was computed. net9.0-ios was computed. net9.0-maccatalyst was computed. net9.0-macos was computed. net9.0-tvos was computed. net9.0-windows was computed. net10.0 was computed. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed. |
- net9.0: No dependencies.
| Version | Downloads | Last Updated |
|---|---|---|
| 1.0.0 | 115 | 4/11/2026 |
| 1.0.0-preview.1 | 52 | 4/11/2026 |