Bug in read_csv with duplicated column names #7160

rafaljozefowicz · 2014-05-18T04:06:33Z

Tested on 0.13.0, 0.13.1 and 0.14.0rc1:

from StringIO import StringIO
import pandas as pd

# this is correct
print(pd.DataFrame([[0, 1, 2], [3, 4, 5]], columns=["a", "b", "a"]))
# and this is fine as well
# (although, changing the column names to a,b,a.1)
print(pd.read_csv(StringIO("a,b,a\n0,1,2\n3,4,5")))
# but this is not correct
print(pd.read_csv(StringIO("0,1,2\n3,4,5"), names=["a", "b", "a"]))

The last one returns:

Out[5]: 
   a  b  a
0  2  1  2
1  5  4  5

I would expect all 3 methods to return the same DataFrame. I noticed this when I wanted to read a CSV file whose header lives in a separate file (and contains a duplicated column name). BTW, is there a better way to do this than reading the header file first and passing the result to the 'names' parameter of read_csv?


jreback · 2014-05-18T17:01:14Z

read_csv needs to interpret many formats, so that's why it renames columns to avoid duplicates
(and some code assumes they are unique)

so this needs some work

marking as a bug for 0.15 - feel free to submit a pr

gfyoung · 2016-04-17T15:08:59Z

@jreback : This bug still exists in master. What sort of output should be expected from this? Is the second one supposed to give the exact same output as the first and third (when patched) ones?

jreback · 2016-04-17T15:31:55Z

yeah, I think the problem is that the names->columns are passed back as a dict and not as 2 lists, so the duplicate gets lost. It's in the post-processing code in python somewhere.

I would expect the behavior the OP suggests. Note this is different than if usecols has dupes. names is merely acting as the header here.
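To make the dict-versus-lists point concrete, here is a minimal pure-Python sketch. It is an illustration only and not the parser's actual code; the variable names are made up for the example. It shows how keying parsed columns by name reproduces the output reported above, while two parallel lists keep every column.

```python
# Illustration only: NOT pandas internals, just the failure mode described above.
names = ["a", "b", "a"]
rows = [[0, 1, 2], [3, 4, 5]]

# Transpose the rows into columns: (0, 3), (1, 4), (2, 5)
columns = list(zip(*rows))

# Keyed by name: the second "a" overwrites the first one.
by_name = dict(zip(names, columns))
clobbered = [by_name[name] for name in names]

# Kept as parallel (name, column) pairs: nothing is lost.
preserved = list(zip(names, columns))

print(clobbered)
print(preserved)
```

Both "a" columns in `clobbered` hold the values of the last "a" column, which matches the DataFrame shown in the original report.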

gfyoung · 2016-04-19T16:26:52Z

FYI: for the second example, that output is correct because mangle_dupe_cols defaults to True, meaning that all columns become unique with .{x} labelling as expected. The bug also surfaces if you set mangle_dupe_cols=False.
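For reference, the `.{x}` labelling can be sketched in a few lines of plain Python. The helper name `mangle_names` is hypothetical and this is not the pandas implementation; it only mirrors the a, b, a.1 renaming seen in the second example, and it does not handle a header that already contains a label such as "a.1".

```python
from collections import Counter


def mangle_names(names):
    """Hypothetical helper: append .1, .2, ... to repeated names."""
    seen = Counter()
    mangled = []
    for name in names:
        count = seen[name]
        # First occurrence keeps its name; later ones get a numeric suffix.
        mangled.append(name if count == 0 else "{}.{}".format(name, count))
        seen[name] += 1
    return mangled


print(mangle_names(["a", "b", "a"]))
```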

gfyoung · 2016-04-19T16:44:04Z

@jreback : Question, what does the as_recarray option do exactly? And, should we be raising a ValueError (as we do now in master) in the third example if we set as_recarray=True?

jreback · 2016-04-19T16:46:20Z

as_recarray returns a numpy rec-array. But AFAIK it's pretty much unused / not tested very well / somewhat buggy. Originally it was for numpy compat (and of course would be important if we eventually extracted read_csv as a separate independent module, so that numpy could use it directly).

gfyoung · 2016-04-19T16:47:08Z

Oh, okay. How about my second question?

EDIT: never mind - numpy says the answer is yes.

gfyoung · 2016-04-19T21:30:36Z

@jreback : Question, what does mangle_dupe_cols mean exactly? It seems to only apply when the header is in the file but has no effect if names has duplicates (as in this issue).

jreback · 2016-04-19T21:39:10Z

it's a way to turn duplicates into things like A_1, A_2, etc.

we really don't need this anymore, but it's there so leave it I guess. The main issue is supporting duplicates properly

gfyoung · 2016-04-19T21:45:49Z

That is true...as of right now, there is ZERO support for duplicates AFAICT. 😞

Deduplicates the 'names' parameter by default if there are duplicate names. Also raises when 'mangle_dupe_cols' is False to prevent data overwrite. Closes pandas-devgh-7160. Closes pandas-devgh-9424.
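The guard described in that commit message could look roughly like the sketch below. The function names `find_duplicates` and `check_names` and the error text are assumptions made for illustration; the real check lives in the pandas parser code and may differ in detail.

```python
def find_duplicates(names):
    """Return each repeated name once, in order of first repetition."""
    seen, dupes = set(), []
    for name in names:
        if name in seen and name not in dupes:
            dupes.append(name)
        seen.add(name)
    return dupes


def check_names(names, mangle_dupe_cols=True):
    """Hypothetical guard: refuse duplicates when mangling is disabled."""
    dupes = find_duplicates(names)
    if dupes and not mangle_dupe_cols:
        # Without mangling, later columns would overwrite earlier ones.
        raise ValueError(
            "duplicate names with mangle_dupe_cols=False: {}".format(dupes)
        )
    return dupes


print(check_names(["a", "b", "a"]))
```

Raising instead of silently overwriting is the safer choice here, since the clobbered result otherwise looks like a valid DataFrame.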

jreback added Bug labels May 18, 2014

jreback added this to the 0.15.0 milestone May 18, 2014

jreback modified the milestones: 0.16.0, Next Major Release Mar 3, 2015

jreback added Difficulty Intermediate labels Apr 17, 2016

gfyoung mentioned this issue Apr 20, 2016

BUG, ENH: Add support for parsing duplicate columns #12935

Closed

gfyoung mentioned this issue Apr 22, 2016

read_csv clobbers values of columns with duplicate names #9424

Closed

jreback closed this as completed in 9a6ce07 May 23, 2016
