ParquetSharp.Dataset
0.1.0-beta4
dotnet add package ParquetSharp.Dataset --version 0.1.0-beta4
NuGet\Install-Package ParquetSharp.Dataset -Version 0.1.0-beta4
<PackageReference Include="ParquetSharp.Dataset" Version="0.1.0-beta4" />
paket add ParquetSharp.Dataset --version 0.1.0-beta4
#r "nuget: ParquetSharp.Dataset, 0.1.0-beta4"
// Install ParquetSharp.Dataset as a Cake Addin
#addin nuget:?package=ParquetSharp.Dataset&version=0.1.0-beta4&prerelease
// Install ParquetSharp.Dataset as a Cake Tool
#tool nuget:?package=ParquetSharp.Dataset&version=0.1.0-beta4&prerelease
ParquetSharp.Dataset
This is a work in progress and is not yet ready for public use.
ParquetSharp.Dataset supports reading datasets consisting of multiple Parquet files, which may be partitioned with a partitioning strategy such as Hive partitioning. Data is read using the Apache Arrow format.
Note that ParquetSharp.Dataset does not use the Apache Arrow C++ Dataset library, but is implemented on top of ParquetSharp, which uses the Apache Arrow C++ Parquet library.
Usage
To begin with, you will need a dataset of Parquet files that have the same schema:
/my-dataset/data0.parquet
/my-dataset/data1.parquet
You can then create a DatasetReader and read data from it as a stream of Arrow RecordBatch objects:
using ParquetSharp.Dataset;

var dataset = new DatasetReader("/my-dataset");
// Read the dataset as a stream of Arrow record batches
using var arrayStream = dataset.ToBatches();
while (await arrayStream.ReadNextRecordBatchAsync() is { } batch)
{
    // Dispose each batch once it has been processed
    using (batch)
    {
        // Use data in the batch
    }
}
Your dataset may be partitioned using Hive partitioning, where directories are named containing a field name and value:
/my-dataset/part=a/data0.parquet
/my-dataset/part=a/data1.parquet
/my-dataset/part=b/data0.parquet
/my-dataset/part=b/data1.parquet
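To make the directory naming convention concrete, the following is a minimal sketch of how field names and values could be extracted from such a path. It is an illustration only and is not how ParquetSharp.Dataset implements partitioning; the helper name ParseHiveSegments is hypothetical, and details such as URL-encoded values or typed (non-string) fields are ignored here.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of ParquetSharp.Dataset): collects the
// "name=value" directory segments found between the dataset root and the file.
static Dictionary<string, string> ParseHiveSegments(string datasetRoot, string filePath)
{
    var result = new Dictionary<string, string>();
    // Strip the root prefix so directories above the dataset are ignored.
    var relative = filePath.StartsWith(datasetRoot, StringComparison.Ordinal)
        ? filePath.Substring(datasetRoot.Length)
        : filePath;
    var segments = relative.Split('/', StringSplitOptions.RemoveEmptyEntries);
    // The last segment is the file name, so only directory segments are inspected.
    for (var i = 0; i < segments.Length - 1; i++)
    {
        var separator = segments[i].IndexOf('=');
        if (separator > 0)
        {
            var fieldName = segments[i].Substring(0, separator);
            result[fieldName] = segments[i].Substring(separator + 1);
        }
    }
    return result;
}

var partitionFields = ParseHiveSegments("/my-dataset", "/my-dataset/part=a/data0.parquet");
foreach (var (name, value) in partitionFields)
{
    Console.WriteLine($"{name} = {value}");
}
```

In this sketch every file under /my-dataset/part=a yields the same single pair, which is why a partition field can be treated as a column whose value is constant for all rows in a file.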
To read Hive partitioned data, you can provide a HivePartitioning.Factory instance to the DatasetReader constructor, and the partitioning schema will be inferred by looking at the dataset directory structure:
var partitioningFactory = new HivePartitioning.Factory();
var dataset = new DatasetReader("/my-dataset", partitioningFactory);
Alternatively, you can specify the partitioning schema explicitly:
// Requires: using Apache.Arrow; using Apache.Arrow.Types;
var partitioningSchema = new Apache.Arrow.Schema.Builder()
    .Field(new Field("part", new StringType(), nullable: false))
    .Build();
var partitioning = new HivePartitioning(partitioningSchema);
var dataset = new DatasetReader("/my-dataset", partitioning);
When creating a DatasetReader, the schema from the first Parquet file found will be inspected to determine the full dataset schema.
This can be avoided by providing the full dataset schema explicitly:
// Requires: using Apache.Arrow; using Apache.Arrow.Types;
var datasetSchema = new Apache.Arrow.Schema.Builder()
    .Field(new Field("part", new StringType(), nullable: false))
    .Field(new Field("x", new Int32Type(), nullable: false))
    .Field(new Field("y", new FloatType(), nullable: false))
    .Build();
var dataset = new DatasetReader("/my-dataset", partitioning, datasetSchema);
Filtering data
When reading data from a dataset, you can specify the columns to include and filter rows based on field values. Row filters may apply to fields from data files or from the partitioning schema. When a filter excludes a partition directory, no files from that directory will be read.
var columns = new[] {"x", "y"};
var filter = Col.Named("part").IsIn(new[] {"a", "c"});
using var arrayStream = dataset.ToBatches(filter, columns);
while (await arrayStream.ReadNextRecordBatchAsync() is { } batch)
{
    using (batch)
    {
        // batch will only contain columns "x" and "y",
        // and only files in the selected partitions will be read.
    }
}
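The effect of a partition filter on which files are opened can be sketched with plain string handling. This is a conceptual illustration, not the library's implementation: the helper SelectFiles is hypothetical, and it assumes string-valued partitions laid out as in the Hive example above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not part of ParquetSharp.Dataset): keeps only the paths
// that contain a "<field>=<value>" directory whose value is in the allowed set.
static List<string> SelectFiles(IEnumerable<string> paths, string field, ISet<string> allowed)
{
    var prefix = field + "=";
    return paths
        .Where(path => path.Split('/')
            .Where(segment => segment.StartsWith(prefix, StringComparison.Ordinal))
            .Any(segment => allowed.Contains(segment.Substring(prefix.Length))))
        .ToList();
}

var datasetFiles = new[]
{
    "/my-dataset/part=a/data0.parquet",
    "/my-dataset/part=a/data1.parquet",
    "/my-dataset/part=b/data0.parquet",
    "/my-dataset/part=b/data1.parquet",
};

// Mirrors the IsIn filter on "part" with values "a" and "c".
var selectedFiles = SelectFiles(datasetFiles, "part", new HashSet<string> { "a", "c" });
foreach (var selectedPath in selectedFiles)
{
    Console.WriteLine(selectedPath);
}
```

With the four example files, only the two under part=a are selected: part=b is excluded by the filter, and no part=c directory exists, so the value "c" matches nothing.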
Compatible and additional computed target framework versions:
Product | Compatible | Computed |
---|---|---|
.NET | net6.0 | net6.0-android, net6.0-ios, net6.0-maccatalyst, net6.0-macos, net6.0-tvos, net6.0-windows, net7.0, net7.0-android, net7.0-ios, net7.0-maccatalyst, net7.0-macos, net7.0-tvos, net7.0-windows, net8.0, net8.0-android, net8.0-browser, net8.0-ios, net8.0-maccatalyst, net8.0-macos, net8.0-tvos, net8.0-windows |
Dependencies
- net6.0
  - Apache.Arrow (>= 15.0.1)
  - ParquetSharp (>= 15.0.2-beta1)
Version | Downloads | Last updated |
---|---|---|
0.1.0-beta4 | 79 | 8/22/2024 |
0.1.0-beta3 | 73 | 4/10/2024 |
0.1.0-beta2 | 65 | 4/9/2024 |
0.1.0-beta1 | 63 | 4/4/2024 |