ParquetSharp.Dataset 0.1.0-beta3

This is a prerelease version of ParquetSharp.Dataset.

dotnet add package ParquetSharp.Dataset --version 0.1.0-beta3

NuGet\Install-Package ParquetSharp.Dataset -Version 0.1.0-beta3

This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.

<PackageReference Include="ParquetSharp.Dataset" Version="0.1.0-beta3" />

For projects that support PackageReference, copy this XML node into the project file to reference the package.

paket add ParquetSharp.Dataset --version 0.1.0-beta3

The NuGet Team does not provide support for this client. Please contact its maintainers for support.

#r "nuget: ParquetSharp.Dataset, 0.1.0-beta3"

The #r directive can be used in F# Interactive and Polyglot Notebooks. Copy this into the interactive tool or the script's source code to reference the package.

// Install ParquetSharp.Dataset as a Cake Addin
#addin nuget:?package=ParquetSharp.Dataset&version=0.1.0-beta3&prerelease

// Install ParquetSharp.Dataset as a Cake Tool
#tool nuget:?package=ParquetSharp.Dataset&version=0.1.0-beta3&prerelease

ParquetSharp.Dataset

This is a work in progress and is not yet ready for public use.

ParquetSharp.Dataset supports reading datasets consisting of multiple Parquet files, which may be partitioned with a partitioning strategy such as Hive partitioning. Data is read using the Apache Arrow format.

Note that ParquetSharp.Dataset does not use the Apache Arrow C++ Dataset library, but is implemented on top of ParquetSharp, which uses the Apache Arrow C++ Parquet library.

Usage

To begin with, you will need a dataset of Parquet files that have the same schema:

/my-dataset/data0.parquet
/my-dataset/data1.parquet

You can then create a DatasetReader and read data from it as a stream of Arrow RecordBatch instances:

using ParquetSharp.Dataset;

var dataset = new DatasetReader("/my-dataset");
using var arrayStream = dataset.ToBatches();
// Read batches until the stream is exhausted and returns null
while (await arrayStream.ReadNextRecordBatchAsync() is { } batch)
{
    using (batch)
    {
        // Use data in the batch
    }
}

Your dataset may be partitioned using Hive partitioning, where each directory name contains a field name and a value in the form name=value:

/my-dataset/part=a/data0.parquet
/my-dataset/part=a/data1.parquet
/my-dataset/part=b/data0.parquet
/my-dataset/part=b/data1.parquet
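
To make the naming convention concrete, the following is a minimal sketch of how field names and values could be pulled out of such paths. HivePathSketch and ParsePartitionValues are hypothetical names used only for illustration and are not part of ParquetSharp.Dataset; the sketch also ignores details such as escaped characters and null values, so it should not be read as how the library itself parses directories.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only, not part of ParquetSharp.Dataset.
public static class HivePathSketch
{
    // Takes a file path relative to the dataset root, for example
    // "part=a/data0.parquet", and returns the name=value pairs found
    // in its directory segments.
    public static Dictionary<string, string> ParsePartitionValues(string relativePath)
    {
        var values = new Dictionary<string, string>();
        var segments = relativePath.Split('/', StringSplitOptions.RemoveEmptyEntries);

        // The final segment is the file name, so only the directories are inspected.
        for (var i = 0; i < segments.Length - 1; i++)
        {
            var separator = segments[i].IndexOf('=');
            if (separator > 0)
            {
                var name = segments[i].Substring(0, separator);
                var value = segments[i].Substring(separator + 1);
                values[name] = value;
            }
        }

        return values;
    }
}
```

Under these assumptions, calling HivePathSketch.ParsePartitionValues("part=a/data0.parquet") would return a single entry mapping "part" to "a", while a path with no name=value directories would return an empty dictionary.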

To read Hive partitioned data, you can provide a HivePartitioning.Factory instance to the DatasetReader constructor, and the partitioning schema will be inferred by looking at the dataset directory structure:

var partitioningFactory = new HivePartitioning.Factory();
var dataset = new DatasetReader("/my-dataset", partitioningFactory);

Alternatively, you can specify the partitioning schema explicitly:

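using Apache.Arrow;
using Apache.Arrow.Types;

// The partitioning schema lists only the fields encoded in directory names
var partitioningSchema = new Apache.Arrow.Schema.Builder()
    .Field(new Field("part", new StringType(), nullable: false))
    .Build();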
var partitioning = new HivePartitioning(partitioningSchema);
var dataset = new DatasetReader("/my-dataset", partitioning);

When creating a DatasetReader, the schema from the first Parquet file found will be inspected to determine the full dataset schema. This can be avoided by providing the full dataset schema explicitly:

var datasetSchema = new Apache.Arrow.Schema.Builder()
    .Field(new Field("part", new StringType(), nullable: false))
    .Field(new Field("x", new Int32Type(), nullable: false))
    .Field(new Field("y", new FloatType(), nullable: false))
    .Build();
var dataset = new DatasetReader("/my-dataset", partitioning, datasetSchema);

Filtering data

When reading data from a dataset, you can specify the columns to include and filter rows based on field values. Row filters may apply to fields from data files or from the partitioning schema. When a filter excludes a partition directory, no files from that directory will be read.

var columns = new[] {"x", "y"};
var filter = Col.Named("part").IsIn(new[] {"a", "c"});
using var arrayStream = dataset.ToBatches(filter, columns);
while (await arrayStream.ReadNextRecordBatchAsync() is { } batch)
{
    using (batch)
    {
        // batch will only contain columns "x" and "y",
        // and only files in the selected partitions will be read.
    }
}

Product	Compatible and additional computed target framework versions.
.NET	net6.0 is compatible. net6.0-android was computed. net6.0-ios was computed. net6.0-maccatalyst was computed. net6.0-macos was computed. net6.0-tvos was computed. net6.0-windows was computed. net7.0 was computed. net7.0-android was computed. net7.0-ios was computed. net7.0-maccatalyst was computed. net7.0-macos was computed. net7.0-tvos was computed. net7.0-windows was computed. net8.0 was computed. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed.

net6.0
- Apache.Arrow (>= 15.0.1)
- ParquetSharp (>= 15.0.2-beta1)
