Feature Request: row-level Merge Status Variable #8790

nickeubank · 2014-11-11T22:15:07Z

Hello All!

Moving into Pandas from R/Matlab/Stata. One feature I'm finding I really miss: a field generated during a merge that reports, for each row, whether that row came from the left dataset, the right dataset, or was successfully merged.

I do a lot of work in social science where our data is VERY dirty. For example, I'm often merging transliterated names from Pakistan, so after a merge I just want to see how many records successfully merged, and then easily pull out the records of each type to compare. I'm including a kludge I'm using below, but an in-line option would be so nice, and I think others in my area would also appreciate it.

Thanks!

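    import pandas as pd

    # Flag each source frame so the origin of every row survives the merge.
    df1['left'] = 1
    df2['right'] = 1

    mergedDF = pd.merge(df1, df2, how='outer', left_on='key1', right_on='key2')

    def mergeVar(x):
        # After an outer merge, a flag is NaN when the row had no match on that side.
        if x['left'] == 1 and x['right'] == 1:
            return 'both'
        elif x['left'] == 1:
            return 'leftOnly'
        else:
            return 'rightOnly'

    mergedDF['mergeReport'] = mergedDF.apply(mergeVar, axis=1)
    # drop() returns a new frame, so assign the result back.
    mergedDF = mergedDF.drop(['left', 'right'], axis=1)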

(I'd also be happy to help make it, but I'm relatively new to python, so would need some hand-holding to learn how to integrate a new option into an existing function...)

jreback · 2014-11-11T22:25:52Z

Mentioned here originally: #7412

but that's closed. Yep, this could be an option. I'll put it on the list. It's not hard, but just takes a bit of work.

nickeubank · 2014-11-11T22:27:48Z

Thanks!

nickeubank · 2014-11-11T22:56:18Z

As an aside, that would also fill another gap in Pandas, which is a lack of a "difference" function that reports what keys aren't shared between dataframes.

jreback · 2014-11-11T23:32:11Z

look at set operations on Index objects: http://pandas.pydata.org/pandas-docs/stable/indexing.html#set-operations-on-index-objects

.difference()/.union() etc.
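
To make the suggestion above concrete, here is a minimal sketch of comparing merge keys with Index set operations. The frames and the column names `key1`/`key2` are made up for illustration (they mirror the kludge earlier in the thread), so treat this as one possible approach rather than a recommended recipe.

```python
import pandas as pd

# Hypothetical example frames; only the key columns matter here.
df1 = pd.DataFrame({"key1": ["a", "b", "c"], "x": [1, 2, 3]})
df2 = pd.DataFrame({"key2": ["b", "c", "d"], "y": [4, 5, 6]})

# Wrap the key columns in Index objects to get set-style methods.
left_keys = pd.Index(df1["key1"])
right_keys = pd.Index(df2["key2"])

only_left = left_keys.difference(right_keys)    # keys present only in df1
only_right = right_keys.difference(left_keys)   # keys present only in df2
shared = left_keys.intersection(right_keys)     # keys present in both

print(list(only_left), list(only_right), list(shared))
```

This reports which keys fail to match, but unlike a per-row merge status it does not say how many rows carry each key.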

nickeubank · 2014-11-11T23:48:24Z

Brilliant! Thanks! You guys are the best!

jreback added API Design Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Nov 11, 2014

jreback added this to the 0.16.0 milestone Nov 11, 2014

jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015

nickeubank mentioned this issue May 3, 2015

ENH: Create merge indicator for obs from left, right, or both #10054

Merged

jreback modified the milestones: 0.17.0, Next Major Release May 10, 2015

jreback closed this as completed in #10054 Sep 4, 2015

chris-b1 mentioned this issue Sep 20, 2015

ENH: add merge indicator to DataFrame.merge #11154

Merged
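
For readers landing here later: in current pandas, the request in this issue is covered by the `indicator` argument of `pd.merge` / `DataFrame.merge`, which adds a `_merge` column with the values `left_only`, `right_only`, and `both`. The sketch below uses made-up frames and the key names from the kludge at the top of the thread; it is an illustration of the option, not the code from the linked pull requests.

```python
import pandas as pd

# Hypothetical example frames with partially overlapping keys.
df1 = pd.DataFrame({"key1": ["a", "b", "c"], "x": [1, 2, 3]})
df2 = pd.DataFrame({"key2": ["b", "c", "d"], "y": [4, 5, 6]})

# indicator=True adds a categorical "_merge" column describing each row's origin.
merged = pd.merge(
    df1, df2, how="outer", left_on="key1", right_on="key2", indicator=True
)

# Count how many rows matched, and pull out the unmatched left-hand rows.
counts = merged["_merge"].value_counts()
left_only = merged[merged["_merge"] == "left_only"]

print(counts)
print(left_only)
```

Passing a string instead of `True` (for example `indicator="mergeReport"`) names the column differently.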
