Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Airflow SFTPToGCSOperator sftp file exist check #41472

Open
2 tasks done
kandharvishnu opened this issue Aug 14, 2024 · 5 comments
Open
2 tasks done

Airflow SFTPToGCSOperator sftp file exist check #41472

kandharvishnu opened this issue Aug 14, 2024 · 5 comments
Assignees
Labels
good first issue kind:feature Feature Requests provider:google Google (including GCP) related issues

Comments

@kandharvishnu
Copy link
Contributor

kandharvishnu commented Aug 14, 2024

Description

Currently, the operator fails if a file is not found on the SFTP server. To make this behavior more flexible, we could introduce a new parameter, fail_on_sftp_file_not_exist, allowing users to customize the operator's response in such scenarios. By setting this parameter, users can choose whether the operator should raise an error when a file is missing or simply log a message and continue processing.

Use case/motivation

This enhancement would give users greater control over the operator's behaviour, particularly in environments where missing files are expected or where continued processing is preferred.

Related issues

A similar issue has been opened for the SFTPToS3Operator.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@kandharvishnu kandharvishnu added kind:feature Feature Requests needs-triage label for new issues that we didn't triage yet labels Aug 14, 2024
@dosubot dosubot bot added the provider:google Google (including GCP) related issues label Aug 14, 2024
@potiuk potiuk added good first issue and removed needs-triage label for new issues that we didn't triage yet labels Aug 14, 2024
@geraj1010
Copy link
Contributor

Is there an issue with using an SFTPSensor for checking file existence, prior to running SFTPToGCSOperator?

@kandharvishnu
Copy link
Contributor Author

Is there an issue with using an SFTPSensor for checking file existence, prior to running SFTPToGCSOperator?

The SFTP sensor will wait until the file is detected. This issue is to determine whether the task should fail if the file is not found.

@geraj1010
Copy link
Contributor

Is there an issue with using an SFTPSensor for checking file existence, prior to running SFTPToGCSOperator?

The SFTP sensor will wait until the file is detected. This issue is to determine whether the task should fail if the file is not found.

I see. In that case, I don't think it makes sense to have the task succeed if a file is not found. A 'success' would not have much meaning anymore besides that the task simply ran. The interpretation of 'success' would become ambiguous i.e. how can you tell if a file was indeed moved or not without having to further dig into the logs? In addition, it doesn't feel very 'airflow-like' since we would be overloading the operator with functionality that already exists through a sensor.

I'm not sure what your DAG looks like, but perhaps you can use a SFTPSensor and trigger rules for your downstream tasks? There is soft_fail for sensors too, if you want the sensor to 'skip' instead of 'fail'.

@kandharvishnu
Copy link
Contributor Author

kandharvishnu commented Aug 21, 2024

Is there an issue with using an SFTPSensor for checking file existence, prior to running SFTPToGCSOperator?

The SFTP sensor will wait until the file is detected. This issue is to determine whether the task should fail if the file is not found.

I see. In that case, I don't think it makes sense to have the task succeed if a file is not found. A 'success' would not have much meaning anymore besides that the task simply ran. The interpretation of 'success' would become ambiguous i.e. how can you tell if a file was indeed moved or not without having to further dig into the logs? In addition, it doesn't feel very 'airflow-like' since we would be overloading the operator with functionality that already exists through a sensor.
There is soft_fail for sensors too, if you want the sensor to 'skip' instead of 'fail'.

I agree with your point, but this is an optional parameter. By default, it will fail the task if the file is not found. If the user sets the fail_on_sftp_file_not_exist to False, it would make some difference.

I'm not sure what your DAG looks like, but perhaps you can use a SFTPSensor and trigger rules for your downstream tasks?

The final task of my DAG is looking for a file, and the dag_run should not fail, though the file is not present. And I don't want to use the sensor to poke for a file.

@geraj1010
Copy link
Contributor

Is there an issue with using an SFTPSensor for checking file existence, prior to running SFTPToGCSOperator?

The SFTP sensor will wait until the file is detected. This issue is to determine whether the task should fail if the file is not found.

I see. In that case, I don't think it makes sense to have the task succeed if a file is not found. A 'success' would not have much meaning anymore besides that the task simply ran. The interpretation of 'success' would become ambiguous i.e. how can you tell if a file was indeed moved or not without having to further dig into the logs? In addition, it doesn't feel very 'airflow-like' since we would be overloading the operator with functionality that already exists through a sensor.
There is soft_fail for sensors too, if you want the sensor to 'skip' instead of 'fail'.

I agree with your point, but this is an optional parameter. By default, it will fail the task if the file is not found. If the user sets the fail_on_sftp_file_not_exist to False, it would make some difference.

I'm not sure what your DAG looks like, but perhaps you can use a SFTPSensor and trigger rules for your downstream tasks?

The final task of my DAG is looking for a file, and the dag_run should not fail, though the file is not present. And I don't want to use the sensor to poke for a file.

That's unfortunate you don't want to use a sensor for checking for file existence. If the sensor failed (i.e. file didn't exist after some time), you could have it soft_fail and it would skip (instead of fail) and you could choose which downstream tasks get skipped or ran based on trigger rules.

At any rate, I think as long as you use the trigger rule all_done for one (or more) of the downstream tasks, the dag_run will still be marked as successful. So you could have your SFTPToGCSOperator task fail (because the file doesn't exist on SFTP), and have the trigger rule for all other downstream tasks set to all_done, and they will still run and the dag_run will still be considered successful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue kind:feature Feature Requests provider:google Google (including GCP) related issues
Projects
None yet
Development

No branches or pull requests

3 participants