Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refresh pre signed urls in Spark connector #69

Merged
merged 6 commits into from
Oct 27, 2021
Merged

Refresh pre signed urls in Spark connector #69

merged 6 commits into from
Oct 27, 2021

Conversation

zsxwing
Copy link
Member

@zsxwing zsxwing commented Oct 22, 2021

Currently, when reading a Delta Sharing table, we generate the pre signed urls first and use them to read. However, for a long running query, the pre signed urls may expire before we try to read them.

This PR adds a pre signed url cache in Spark driver. It will refresh pre signed urls in a background thread. Tasks running in Spark executors will talk to the Spark driver to fetch pre signed urls. This will ensure when we read a file in an executor, it will always get the latest valid pre signed url.

@zsxwing zsxwing requested a review from FX196 October 22, 2021 21:11
@FX196
Copy link
Collaborator

FX196 commented Oct 27, 2021

LGTM,

To summarize:

  • Instead of storing the pre-signed URLs, we make Spark store a path that references an entry in the cache. This way we can utilize Delta caching since the path looks unchanged to Spark.
  • To keep the paths valid, we have a background thread that refreshes the path -> pre-signed URL map.
  • We remove paths from the cache when there are no longer any WeakReferences to the table or the table hasn't been accessed for expireAfterAccessMs.

@FX196
Copy link
Collaborator

FX196 commented Oct 27, 2021

qq: It is correct that we don't need to worry about cache validity since Delta creates new files to replace the old files when there is an update?

@zsxwing
Copy link
Member Author

zsxwing commented Oct 27, 2021

qq: It is correct that we don't need to worry about cache validity since Delta creates new files to replace the old files when there is an update?

Yep

@zsxwing zsxwing merged commit bfc1068 into main Oct 27, 2021
@zsxwing zsxwing deleted the pre-signed branch October 27, 2021 07:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants