[Kernel][Defaults] Support reading parquet files with legacy 3-level repeated types #3083

vkorukanti · 2024-05-10T17:18:28Z

Description

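When legacy write mode is enabled in Spark, the physical layout of array types differs slightly from the standard format: the structure still has three levels, but the repeated group and the element field use different names.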

Standard mode (default):

optional group readerFeatures (LIST) {
  repeated group list {
    optional binary element (STRING);
  }
}

Legacy write mode (spark.sql.parquet.writeLegacyFormat = true):

optional group readerFeatures (LIST) {
  repeated group bag {
    optional binary array (STRING);
  }
}
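The difference between the two layouts is only in the names of the inner levels, so a reader can locate the element field by position rather than by name. The following is a minimal sketch of that idea, not the code from this PR: `LegacyListSketch`, `Node`, and `elementOf` are hypothetical names, and `Node` is a simplified stand-in for a real Parquet schema type. The 2-level check follows the backward-compatibility rules for LIST in the Parquet format specification, and such lists are rejected here because this PR does not handle them.

```java
import java.util.Arrays;
import java.util.List;

final class LegacyListSketch {

    /** Hypothetical, simplified schema node; a stand-in for a real Parquet type class. */
    static final class Node {
        final String name;
        final boolean repeated;
        final boolean primitive;
        final List<Node> fields;

        private Node(String name, boolean repeated, boolean primitive, Node... fields) {
            this.name = name;
            this.repeated = repeated;
            this.primitive = primitive;
            this.fields = Arrays.asList(fields);
        }

        static Node primitive(String name, boolean repeated) {
            return new Node(name, repeated, true);
        }

        static Node group(String name, boolean repeated, Node... fields) {
            return new Node(name, repeated, false, fields);
        }
    }

    /**
     * Returns the element field of a 3-level LIST group, whatever the inner
     * levels are called ("list"/"element" or the legacy "bag"/"array").
     */
    static Node elementOf(Node listGroup) {
        if (listGroup.primitive || listGroup.fields.size() != 1) {
            throw new IllegalArgumentException(
                "LIST group must contain exactly one field: " + listGroup.name);
        }
        Node repeatedField = listGroup.fields.get(0);
        if (!repeatedField.repeated) {
            throw new IllegalArgumentException(
                "Field inside a LIST group must be repeated: " + repeatedField.name);
        }
        // Parquet backward-compatibility rules: in these cases the repeated
        // field itself is the element, i.e. the list is 2-level.
        boolean twoLevel = repeatedField.primitive
            || repeatedField.fields.size() != 1
            || repeatedField.name.equals("array")
            || repeatedField.name.equals(listGroup.name + "_tuple");
        if (twoLevel) {
            throw new UnsupportedOperationException(
                "2-level list encoding is not handled: " + listGroup.name);
        }
        // 3-level list: the single child of the repeated group is the element.
        return repeatedField.fields.get(0);
    }

    public static void main(String[] args) {
        Node standard = Node.group("readerFeatures", false,
            Node.group("list", true, Node.primitive("element", false)));
        Node legacy = Node.group("readerFeatures", false,
            Node.group("bag", true, Node.primitive("array", false)));
        System.out.println(elementOf(standard).name);
        System.out.println(elementOf(legacy).name);
    }
}
```

Running `main` prints `element` for the standard schema and `array` for the legacy one, since both resolve to the single child of the repeated group.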

TODO: 2-level lists still need to be handled; that will come in a separate PR. The challenge is generating or finding Parquet files that contain 2-level lists.

How was this patch tested?

Added tests

Fixes #3082

vkorukanti · 2024-05-10T17:22:21Z

...rnel-defaults/src/main/java/io/delta/kernel/defaults/internal/parquet/ArrayColumnReader.java

@@ -52,15 +55,38 @@ public ColumnVector getDataColumnVector(int batchSize) {
        return arrayVector;
    }

+    /**


The Spark Parquet reader has the same code.

scovich

LGTM

vkorukanti added 2 commits May 10, 2024 10:09

[Kernel][Defaults] Support reading parquet files with legacy 3-level …

4addd13

…repeated types

clean up

8f1c14b

vkorukanti added the kernel label May 10, 2024

fix tests

a2a3f91

scovich approved these changes May 10, 2024

scottsand-db approved these changes May 10, 2024

vkorukanti merged commit a5d7c69 into delta-io:master May 10, 2024
10 checks passed
