Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python rest client changes for CDF. #150

Merged
merged 4 commits into from
May 25, 2022
Merged

Python rest client changes for CDF. #150

merged 4 commits into from
May 25, 2022

Conversation

chakankardb
Copy link
Collaborator

This PR has the following changes:

  1. Implemented list_table_changes in the python rest client (made the method a POST on server side).
  2. Added integ test.

@chakankardb chakankardb self-assigned this May 24, 2022

# Add cdf options.
# We do not validate the CDF options here since the server will perform validations anyways.
if cdfOptions.starting_version is not None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

L282 to L290 is still assuming that one and only one starting parameter will be set, are you intended to do so?

I prefer we do not make such assumption?

Also for starting_timestamp and ending_timestamp, we need to encode them for the server to be able to parse, something I found on web:

import urllib.parse
urllib.parse.quote(query)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the assumption.

I tried your suggestion about urllib, but the server does not like it and it returns an error.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm, I think we need to do sth like this, so that & is correctly added.

params_list = []
if cdfOptions.starting_version is not None:
  params_list.append(f"startingVersion={cdfOptions.starting_version}")
query_str = "&".join(params_list))

For the urllib, I think we need and only need to apply urllib.parse.quote on cdfOptions.starting_timestamp, so that it doesn't encode =, and it encodes the space .
This is due the the server behavior, verified in both Delta Sharing OSS server and our rpc parser: https://github.com/databricks/universe/pull/174415

And needs to add a corresponding test case to test the encoding behavior is correct in our client.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done... good catch on the params!

cdf_table = Table(name="cdf_table_with_partition", share="share1", schema="default")
response = rest_client.list_table_changes(
cdf_table,
CdfOptions(starting_version=0, ending_version=3)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to add a test case with starting_timestamp, to just ensure it's correctly encoded(so that the server is able to parse it).

But one thing to note is that, seems that when running tests locally the server timezone is PST, and when github runs tests the server timezone is GMT. So the same string startingTimestamp will be parsed as different timestamp (long), which will map to different version of the table, thus results in different testing behavior.

One trick I did is use a very large startingTimestamp and expect an error message: in #146 , DeltaSharingRestClientSuite:getCDFFiles with Timestamp,

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a test for start timestamp.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned, needs to add a corresponding test case to test the encoding behavior is correct in our client, could you separate it out, and expect an error, similar to #146?

Though agree with you, we need to discuss whether it's safe to always expose DeltaStandaloneException, but in this case, we have to expose it because it's not internal error, and for the user/client to know that they need to set correct timestamp.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.
Also added a more specific exception mapping in the reader, which will be safer.

@linzhou-db
Copy link
Collaborator

The error is transient, it's gone..

Copy link
Collaborator

@linzhou-db linzhou-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with a minor comment on the server exception handling.

* illegal arg exception.
*/
private def getActiveCommitAtTime(ts: Timestamp): DeltaSharingHistoryManager.Commit = {
val commit = try {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you sync with the head, you don't need this change? The main concern is that history.getActiveCommitAtTime(ts) may throw other exceptions which may not be DeltaCDFIllegalArgumentException.

Or do you still prefer this change over what I did here? https://github.com/delta-io/delta-sharing/blob/main/server/src/main/scala/io/delta/sharing/server/DeltaSharingService.scala#L98

Neither is obviously better since both are not precisely catching and throwing the error.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The scope of getActiveCommitAt is a lot smaller, and looking at the code it will most commonly only throw timestamp related exception. So it is safer imho.

The other one seems too broad, so we should perhaps remove that one until we find a better solution.

@chakankardb chakankardb merged commit 4267337 into main May 25, 2022
@wchau wchau deleted the SC-101436 branch July 8, 2022 23:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants