ENH: Pandas 2.0 with pyarrow engine add the argument like 'skip_bad_lines=True' #54480

Closed
1 of 3 tasks
sunflyfairy opened this issue Aug 10, 2023 · 5 comments · Fixed by #54643
Labels
Arrow pyarrow functionality Enhancement IO CSV read_csv, to_csv

Comments


sunflyfairy commented Aug 10, 2023

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

When I use pandas 2.0 with the pyarrow engine to read a CSV file that contains bad lines, it raises an error such as "pyarrow.lib.ArrowInvalid: CSV parse error: Expected 1 columns, got 2: ".

When I then added the argument 'skip_bad_lines=True', it raised another error: "TypeError: read_csv() got an unexpected keyword argument 'skip_bad_lines'".

So, could you add an argument like 'skip_bad_lines=True' for the pyarrow engine?

df_arrow = pd.read_csv(r"VBAP.CSV", engine='pyarrow', dtype_backend='pyarrow')
df_arrow = pd.read_csv(r"VBAP.CSV", engine='pyarrow', dtype_backend='pyarrow', skip_bad_lines=True)

Feature Description

df_arrow = pd.read_csv(r"VBAP.CSV", engine='pyarrow', dtype_backend='pyarrow', skip_bad_lines=True)

Alternative Solutions

df_arrow = pd.read_csv(r"VBAP.CSV", engine='pyarrow', dtype_backend='pyarrow', skip_bad_lines=True)

Additional Context

No response

@sunflyfairy sunflyfairy added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 10, 2023
@lithomas1
Member

This should now be possible with recent pyarrow versions, where invalid_row_handler is available as a ParseOption.

PRs are welcome for this.

@lithomas1 lithomas1 added IO CSV read_csv, to_csv Arrow pyarrow functionality and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 10, 2023
@sunflyfairy
Author

OK, thanks.
If it is released, please update this issue. Thanks.

@amithkk
Contributor

amithkk commented Aug 19, 2023

@lithomas1 I've created a pull request that uses invalid_row_handler. However, instead of introducing a new argument, I've integrated it into the existing on_bad_lines argument of read_csv when the pyarrow engine is selected. This allows flexible handling: an invalid line can trigger a warning, be skipped, or be passed to user-defined behavior (much like the standard implementation for the python engine).

To achieve the behavior described in the issue, the syntax would be:
pd.read_csv(r"test.csv", engine='pyarrow', dtype_backend='pyarrow', on_bad_lines='skip')
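For comparison, the python engine behavior that the comment refers to can be sketched as below. This is only an illustration under assumptions: it uses `engine="python"`, where `on_bad_lines` accepts both `'skip'` and a callable, and the sample data and helper names (`keep_first_two`, `read_skip`, `read_repair`) are hypothetical. Whether the pyarrow engine accepts the same values depends on running a pandas version that includes the linked pull request.

```python
import io

import pandas as pd

# Hypothetical sample data: the third line has one field too many.
CSV_TEXT = "a,b\n1,2\n3,4,5\n6,7\n"


def keep_first_two(fields):
    # With engine="python", the callable receives the bad line as a list of
    # strings. Returning a list keeps the repaired row; None drops the line.
    return fields[:2]


def read_skip(text: str) -> pd.DataFrame:
    # Silently drop any line with too many fields.
    return pd.read_csv(io.StringIO(text), engine="python", on_bad_lines="skip")


def read_repair(text: str) -> pd.DataFrame:
    # User-defined handling: truncate the bad line instead of dropping it.
    return pd.read_csv(
        io.StringIO(text), engine="python", on_bad_lines=keep_first_two
    )
```

`read_skip(CSV_TEXT)` drops the malformed line, while `read_repair(CSV_TEXT)` keeps it after truncating the extra field.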

@amithkk
Contributor

amithkk commented Aug 20, 2023

Merging/Review is blocked due to #54650

@amithkk
Contributor

amithkk commented Aug 26, 2023

take
