Bug in read_csv with duplicated column names #7160

Closed
rafaljozefowicz opened this issue May 18, 2014 · 10 comments
Labels
Bug IO CSV read_csv, to_csv

@rafaljozefowicz

Tested on 0.13.0, 0.13.1 and 0.14.0rc1:

from io import StringIO  # on Python 2: from StringIO import StringIO
import pandas as pd

# this is correct
print(pd.DataFrame([[0, 1, 2], [3, 4, 5]], columns=["a", "b", "a"]))
# and this is fine as well
# (although it renames the columns to a, b, a.1)
print(pd.read_csv(StringIO("a,b,a\n0,1,2\n3,4,5")))
# but this is not correct: the first column is overwritten by the third
print(pd.read_csv(StringIO("0,1,2\n3,4,5"), names=["a", "b", "a"]))

The last one returns:

Out[5]: 
   a  b  a
0  2  1  2
1  5  4  5

I would expect all 3 methods to return the same DataFrame. I noticed this when I wanted to read a CSV file whose header is stored in a separate file (with a duplicated column name in it). BTW, is there a better way to do this than reading the header file first and passing the result to the 'names' parameter of read_csv?
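One possible alternative to passing `names`, sketched here under assumptions and not confirmed by anyone in this thread: join the header line and the data into a single stream, so the parser sees an ordinary CSV with its header in the first row. The two `StringIO` objects below are stand-ins for the header file and the data file, and their contents are made up for illustration.

```python
import csv
from io import StringIO

# Stand-ins for the two files on disk (hypothetical contents).
header_file = StringIO("a,b,a\n")
data_file = StringIO("0,1,2\n3,4,5\n")

# Prepend the header line to the data so a parser sees one CSV stream.
header_line = header_file.readline().rstrip("\r\n")
combined = StringIO(header_line + "\n" + data_file.read())

# Parse with the standard library just to show the resulting layout.
rows = list(csv.reader(combined))
header, records = rows[0], rows[1:]

# Rewind so the same stream can be handed to another reader afterwards.
combined.seek(0)
```

The rewound `combined` stream could then be passed to `pd.read_csv(combined)`, which would go through the same code path as the second example in the report (header read from the data itself).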

@jreback
Contributor

jreback commented May 18, 2014

read_csv needs to interpret many formats, which is why it renames columns so that there are no duplicates
(and some code assumes they are unique)

so this needs some work

marking as a bug for 0.15 - feel free to submit a PR

@jreback jreback added this to the 0.15.0 milestone May 18, 2014
@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 3, 2015
@gfyoung
Member

gfyoung commented Apr 17, 2016

@jreback : This bug still exists in master. What sort of output should be expected from this? Is the second one supposed to give the exact same output as the first and third (when patched) ones?

@jreback
Contributor

jreback commented Apr 17, 2016

yeh, I think the problem is that the names->columns mapping is passed back as a dict and not as 2 lists, so the duplicate gets lost. It's in the post-processing code in Python somewhere.

I would expect the behaviour the OP suggests. Note that this is different from the case where usecols has dupes. names is merely acting as the header here.
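A rough illustration of why a dict would lose the duplicate (plain Python, not the actual parser code; `names` and `parsed_columns` are made-up stand-ins for what the parser produces):

```python
names = ["a", "b", "a"]
parsed_columns = [[0, 3], [1, 4], [2, 5]]  # columns 0, 1, 2 of the data

# Keyed by name: the second "a" overwrites the first.
by_name = dict(zip(names, parsed_columns))

# Rebuilding the frame by looking up each name yields the last "a" twice.
lossy = [by_name[name] for name in names]

# Keeping names and columns as two parallel lists preserves every column.
paired = list(zip(names, parsed_columns))
```

Here `lossy` holds the third column in both "a" positions, which matches the output in the report, while `paired` keeps all three columns intact.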

@gfyoung
Member

gfyoung commented Apr 19, 2016

FYI: for the second example, that output is correct because mangle_dupe_cols defaults to True, meaning that all columns become unique with .{x} labelling as expected. The bug also surfaces if you set mangle_dupe_cols=False.
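For reference, the `.{x}` labelling can be sketched roughly like this. `mangle_names` is a hypothetical helper written for illustration, not the function pandas uses, and it assumes no mangled label collides with a name that is already present (e.g. a literal "a.1" column):

```python
def mangle_names(names):
    """Make names unique by appending .1, .2, ... to repeated names."""
    seen = {}    # name -> number of times it has appeared so far
    result = []
    for name in names:
        count = seen.get(name, 0)
        # First occurrence keeps its name; later ones get a numeric suffix.
        result.append(name if count == 0 else "{}.{}".format(name, count))
        seen[name] = count + 1
    return result


mangled = mangle_names(["a", "b", "a"])
```

Applying the same step to the `names` argument would make the third example consistent with the second one.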

@gfyoung
Member

gfyoung commented Apr 19, 2016

@jreback : Question, what does the as_recarray option do exactly? And, should we be raising a ValueError (as we do now in master) in the third example if we set as_recarray=True?

@jreback
Contributor

jreback commented Apr 19, 2016

as_recarray returns a numpy rec-array. But AFAIK it's pretty much unused / not tested very well / somewhat buggy. It was originally there for numpy compat (and of course would be important if we eventually extracted read_csv as a separate independent module, so that numpy could use it directly).

@gfyoung
Member

gfyoung commented Apr 19, 2016

Oh, okay. How about my second question?

EDIT: never mind - numpy says the answer is yes.

@gfyoung
Member

gfyoung commented Apr 19, 2016

@jreback : Question, what does mangle_dupe_cols mean exactly? It seems to only apply when the header is in the file but has no effect if names has duplicates (as in this issue).

@jreback
Contributor

jreback commented Apr 19, 2016

it's a way to turn duplicates into things like
A_1, A_2, etc.

we really don't need this anymore, but it's there so leave it I guess - the main issue is supporting duplicates properly

@gfyoung
Member

gfyoung commented Apr 19, 2016

That is true...as of right now, there is ZERO support for duplicates AFAICT. 😞

gfyoung added a commit to forking-repos/pandas that referenced this issue May 23, 2016
Deduplicates the 'names' parameter by default if
there are duplicate names. Also raises when
'mangle_dupe_cols' is False to prevent data overwrite.

Closes gh-7160.
Closes gh-9424.
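The "raises when 'mangle_dupe_cols' is False" part of the commit message could look something like the guard below. This is only a sketch of the idea: `check_duplicate_names` is a hypothetical helper, and the error message is invented rather than taken from the commit.

```python
from collections import Counter


def check_duplicate_names(names, mangle_dupe_cols=True):
    """Return the duplicated names; raise if they are not allowed to be mangled."""
    dupes = sorted(name for name, n in Counter(names).items() if n > 1)
    if dupes and not mangle_dupe_cols:
        # Without mangling, later columns would silently overwrite earlier ones.
        raise ValueError(
            "duplicate names {} would overwrite data".format(dupes)
        )
    return dupes


dupes = check_duplicate_names(["a", "b", "a"])
```

With the default `mangle_dupe_cols=True` the duplicates are only reported, so a deduplication step can rename them; with `False` the call fails early instead of returning overwritten data.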