-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SymlinkTextInputFormat Manifest Generation for Presto/Athena read support #76
Comments
@marmbrus is this feature available in 0.3.0 release? If not, will it be available anytime soon? |
Let me know if I can be of any help developing any features for this project. is this something need to be ported from data bricks? |
@marmbrus Any updates on this issue? |
@marmbrus if we generate the manifest based on Delta transaction log of the last version, does that works with Athena? My understanding is the last version should have all the files that required to create the manifest. I m banking on this approach until its available in Delta Lake. |
Yes, if you generate a manifest using the transaction log that should work with Athena. |
@marmbrus thanks for confirming |
We are working on pushing our existing manifest generation code and tests to OSS.. |
Right now, I m developing the code to generate that manifest. I will probably use it temporarily until its available in Delta Lake. please let me know if you have any ETA. Some of our Delta Lake tables got partition columns. do you know if the manifest generation works w/ partition columns as well? |
@kkr78 yeah, our manifest generation will work with partition columns. for partitioned tables, the manifest files are itself partitioned by the same columns, so query (at least in Presto/Athena) will read manifests for only the partitions that it will query. |
So, Today we can't query in Presto (PrestoDB or PrestoSQL), right? When #232 is resolved, presto can query delta lake? |
For anyone looking for a workaround this is what i'm using for now : https://gist.github.com/FabioBatSilva/d6b168a01cf4ba991d9e77881c5ea1e5 |
@FabioBatSilva So we can use |
@cozos Yes, That is what i'm using to get data into AWS Athena. |
@FabioBatSilva Do you mind telling me about the manifest workflow? Do you generate a new manifest file everytime the Delta Transaction Log is modified? |
@cozos See the sample python code below that I developed to generate manifest based on Delta Transaction Log. Basically the code traverses through all the JSON files in _delta_log directory, identifies the list of parquet files and writes them to symlink text. The manifest is regenerated when the Transaction Log is modified. This code only works on s3. Disclaimer: This code is not production-ready, not fully tested, this is just to give an idea of how it can be implemented. https://github.com/kkr78/delta-util The Athena table points to _symlink directory where symlink.txt files are located. For partitions, generate a symlink.txt file for each partition. |
Hey all, we are open-sourcing our Databricks Delta's manifest generation code in this PR #250 Thank you, everyone, for showing so much interest in Delta Lake and its compatibility with other engines. |
@tdas is this PR sufficient scope for a point release? |
By "point release", do you mean patch release ... as in 0.4.1? |
Hi. Sorry I did not see the 7 Dec milestone for the 0.5.0 point release. |
…tion for Presto/Athena support ## What changes were proposed in this pull request? This PR is the first in the sequence of PRs to add manifest file generation (SymlinkInputFormat) to OSS Delta for Presto/Athena read support (issue #76). Specifically, this PR adds the core functionality for manifest generation and rigorous tests to verify the contents of the manifest. Future PRs will add the public APIs for on-demand generation. - Added post-commit hooks to run tasks after a successful commit. - Added GenerateSymlinkManifest implementation of post-commit hook to generate the manifests. - Each manifest contains the name of data files to read for querying the whole table or partition - Non-partitioned table produces a single manifest file containing all the data files. - Partitioned table produces partitioned manifest files; same partition structured like the table, each partition directory containing one manifest file containing data files of that partition. This allows Presto/Athena partition-pruned queries to read only manifest files of the necessary partitions. - Each attempt to generate manifest will atomically (as much as possible) overwrite the manifest files in the directories (if they exist) and also delete manifest files of partitions that have been deleted from the table. Closes #250 Co-authored-by: Tathagata Das <tathagata.das1565@gmail.com> Co-authored-by: Rahul Mahadev <rahul.mahadev@databricks.com> Author: Tathagata Das <tathagata.das1565@gmail.com> Author: Rahul Mahadev <rahul.mahadev@databricks.com> #6910 is resolved by tdas/SC-25511. GitOrigin-RevId: a3e04f2fcdafb6ac29c3adcfb791a3d0611583dc
The core manifest generation code for symlink style manifest generation has been merged in b18ffba We are currently working on the Scala/Python/SQL APIs for the manifest generation. |
If someone would like to compile the project to test into the AWS Athena, the command to export the Symlink is: `import org.apache.spark.sql.delta._ val deltaLog = DeltaLog.forTable(spark, "s3a://....") GenerateSymlinkManifest.generateFullManifest(spark, deltaLog)` Table syntax: |
This is the PR for public APIs for manifest generation - #262 |
Great stuff, looking forward to getting this and #262 in 0.5.0! |
The public APIs were committed in 5b3e3eb |
Delta's use of MVCC causes external readers (presto, hive, etc) to see inconsistent or duplicate data. The
SymlinkTextInputFormat
allows these systems to read a set of manifest files, each containing a list of file paths, in order to determine which data files to read (rather that listing the files present on the file system).This issue tracks adding support for generating these manifest files from the Delta transaction log, automatically after each commit.
This feature would give support for reading Delta tables from Presto. Hive would require some addition work on the Hive side, as Hive does not use the file extension to determine the final
InputFormat
to use to decode the data (and as such interprets the files incorrectly as text).The text was updated successfully, but these errors were encountered: