Pandas get_dummies validate "columns" input #28383

Closed
TonyCongqianWang opened this issue Sep 11, 2019 · 13 comments · Fixed by #28463
Labels
API Design Error Reporting Incorrect or improved errors from pandas Reshaping Concat, Merge/Join, Stack/Unstack, Explode


@TonyCongqianWang

Code Sample, a copy-pastable example if possible

import pandas

p_csv = pandas.read_csv("my_dir/myFile", index_col=0)

sepanames = sorted(p_csv["SEPARATOR"].unique())

# Insert one 0/1 indicator column per separator value, right after "SEPARATOR"
for i in range(0, 14):
    print(i)
    col = p_csv.columns.get_loc("SEPARATOR") + 1 + i
    p_csv.insert(col, "SEPARATOR_" + sepanames[i].upper(), p_csv["SEPARATOR"].apply(lambda x: int(x == sepanames[i])))

p_csv.to_csv("my_dir/new_file.csv")

''' p_csv= pandas.read_csv(("/myDir/myFile.csv"), index_col=0)
pandas.get_dummies(p_csv, prefix="SEPARATOR_", columns="SEPARATOR")
p_csv.to_csv("/myDir/myNew.csv")'''

FILES:
https://www.amazon.de/clouddrive/share/h37d1hqtrj5SrZTvKdrs9gXltVKUgo8Is9BxL8WH7Sf

Problem description

This is all of my code. The quoted part is what I tried first, but after 20 minutes it was terminated by a SIGKILL without producing any result. The required files are available for download. I would think that my loop does the equivalent in this particular case, and it finishes in under a minute.

Pandas version: 0.25.1
Pandas git version: '171c71611886aab8549a8620c5b0071a129ad685'

Expected Output

No error and changed csv file

@jbrockmendel
Member

Can you post an example that we can copy/paste to replicate the behavior?

@WillAyd WillAyd added the Needs Info Clarification about behavior needed to assess issue label Sep 11, 2019
@TonyCongqianWang
Author

I already posted the example in quotes. It is:

import pandas

p_csv= pandas.read_csv(("/myDir/myFile.csv"), index_col=0)
pandas.get_dummies(p_csv, prefix="SEPARATOR_", columns="SEPARATOR")
p_csv.to_csv("/myDir/myNew.csv")

@WillAyd
Member

WillAyd commented Sep 11, 2019

That code sample is not self-contained, so no one can copy / paste to replicate your issue. The below link might help:

http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

@TonyCongqianWang
Author

TonyCongqianWang commented Sep 11, 2019

Alright, here is a copy-pastable version with random data:

###CODE

import numpy as np
import pandas as pd

# 145 column names: random floats rendered as strings, two replaced by real names
columns = np.random.rand(145)
columns = columns.astype(str)
columns[110] = "SEPARATOR"
columns[0] = "INSTANCES"

separators = ["and", "clique", "gomory", "gomory1", "gomory2", "gomory3", "gomory4", "zerohalf"]
instances = ["021erkdjfiejrk", "5484ierdkj", "5487ehej"]

d = {}

for i in range(145):
    d[columns[i]] = np.random.rand(145000)

d["SEPARATOR"] = [np.random.choice(separators) for i in range(145000)]
d["INSTANCES"] = [np.random.choice(instances) for i in range(145000)]

p_csv = pd.DataFrame(d)
p_csv["SEPARATOR"]
print("getting dummies", p_csv.shape)
# columns is passed as a plain string here, not as a list
pd.get_dummies(p_csv, prefix="SEPARATOR_", columns="SEPARATOR")
print("success")

@jbrockmendel
Member

That is liable to produce a DataFrame with shape on the order of (145000, 145000), which I think is on the order of 80GB, which certainly won't fit on my laptop. SIGKILL is likely coming from the OOM killer.
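As a rough check on that estimate, the memory for a dense block of this shape can be computed directly. This is a back-of-the-envelope sketch only: the helper name `dense_bytes` is made up for illustration, and the per-cell sizes are assumptions, since the thread does not say which dtype the intermediate result would have.

```python
def dense_bytes(n_rows: int, n_cols: int, itemsize: int) -> int:
    """Bytes needed for a dense n_rows x n_cols block at `itemsize` bytes per cell."""
    return n_rows * n_cols * itemsize


n = 145_000
# Assumed cell sizes: 1 byte (uint8), 4 bytes (float32), 8 bytes (float64)
for label, itemsize in [("uint8", 1), ("float32", 4), ("float64", 8)]:
    gb = dense_bytes(n, n, itemsize) / 1e9
    print(f"{label:8s} {gb:8.1f} GB")
```

Depending on the cell size, this comes out between roughly 21 GB and 168 GB, so the result would not fit in a typical laptop's memory under any of these assumptions.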

@TonyCongqianWang
Author

That is what I guessed had happened. But why would the API function do that when there are only 6 different values for separator?

@jbrockmendel
Member

(edited your comment to make it copy/paste more neatly, LMK if you object and I'll revert)

@jbrockmendel
Member

Do you get the expected result if you pass columns=["SEPARATOR"] instead of columns="SEPARATOR"?

@TonyCongqianWang
Author

Yes! That worked, very fast too. Now that you mention it, I do see that the expected parameter is "list-like". Didn't expect that to be a problem, since I didn't get any error and also do get an error if I pass 'columns="fake_column"'
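For reference, a minimal sketch of the list form on a tiny frame. The data values are made up for illustration; only the column names come from the thread.

```python
import pandas as pd

# Tiny stand-in for the real data; the values are invented for this example.
df = pd.DataFrame(
    {
        "INSTANCES": ["a", "b", "a", "c"],
        "x": [0.1, 0.2, 0.3, 0.4],
        "SEPARATOR": ["and", "clique", "gomory", "and"],
    }
)

# columns is a list, so only the SEPARATOR column is encoded.
# get_dummies returns a new DataFrame; df itself is left unchanged.
encoded = pd.get_dummies(df, columns=["SEPARATOR"], prefix="SEPARATOR")

print(list(encoded.columns))
# ['INSTANCES', 'x', 'SEPARATOR_and', 'SEPARATOR_clique', 'SEPARATOR_gomory']
```

Two things are worth noting. The return value has to be assigned (or written out) for the encoding to have any effect on what is saved afterwards. And because the default prefix_sep is "_", passing prefix="SEPARATOR_" as in the original report would produce names with a double underscore, such as SEPARATOR__and.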

@WillAyd
Member

WillAyd commented Sep 12, 2019

I think this should raise if columns is a string to avoid confusion like this. @TonyCongqianWang any interest in submitting a PR for that?
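A minimal sketch of what such a check could look like, written as a user-side wrapper. The function name `get_dummies_checked` and the error message are hypothetical and are not taken from pandas or from the eventual fix; `pandas.api.types.is_list_like` is an existing public helper that returns False for plain strings.

```python
import pandas as pd
from pandas.api.types import is_list_like


def get_dummies_checked(data, columns=None, **kwargs):
    """Hypothetical wrapper: reject a non-list-like `columns` before encoding."""
    if columns is not None and not is_list_like(columns):
        raise TypeError(
            f"columns must be list-like, got {type(columns).__name__}; "
            f"did you mean columns=[{columns!r}]?"
        )
    return pd.get_dummies(data, columns=columns, **kwargs)


df = pd.DataFrame({"SEPARATOR": ["and", "clique"], "x": [1, 2]})

try:
    get_dummies_checked(df, columns="SEPARATOR")
except TypeError as exc:
    print("rejected:", exc)

# The list form passes the check and is forwarded to pd.get_dummies unchanged.
print(list(get_dummies_checked(df, columns=["SEPARATOR"]).columns))
```

Raising early, rather than warning, means the mistake surfaces immediately instead of after a long-running or memory-exhausting call.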

@WillAyd WillAyd changed the title from "Pandas get dummies sigkill" to "Pandas get_dummies validate "columns" input" Sep 12, 2019
@WillAyd WillAyd added API Design Error Reporting Incorrect or improved errors from pandas and removed Needs Info Clarification about behavior needed to assess issue labels Sep 12, 2019
@WillAyd WillAyd added this to the Contributions Welcome milestone Sep 12, 2019
@saurav2608

I will take a shot at this.

@R1j1t
Contributor

R1j1t commented Sep 17, 2019

@WillAyd I created a pull request for this validation, but I wanted to confirm one thing. If columns is not list-like, should it raise an error or a warning?
I referred to the code in reshape/utils.py, which raises an error in that case. I would like to verify that I am interpreting it correctly.

@WillAyd
Member

WillAyd commented Sep 17, 2019

Yea I think that would be good to emulate - nice find!

@jreback jreback added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Sep 18, 2019