Feature Request: row-level Merge Status Variable #8790

Closed
nickeubank opened this issue Nov 11, 2014 · 5 comments · Fixed by #10054
Labels
API Design Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Milestone

Comments

@nickeubank
Contributor

Hello All!

Moving into Pandas from R/Matlab/Stata. One feature I'm finding I really miss: a field generated during a merge that reports, for each row, whether that row came from the left dataset, the right dataset, or was successfully merged.

I do a lot of work in social science where our data is VERY dirty. For example, I'm often merging transliterated names from Pakistan, so after a merge I just want to see how many records successfully merged, and then easily pull out the records of each type to compare. I'm including a kludge I'm using below, but an in-line option would be so nice, and I think others in my area would also appreciate it.

Thanks!

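    import pandas as pd

    # Flag each source frame before merging so the row's origin survives the merge.
    df1['left'] = 1
    df2['right'] = 1

    mergedDF = pd.merge(df1, df2, how='outer', left_on='key1', right_on='key2')

    def mergeVar(x):
        # After an outer merge a missing flag is NaN, so `== 1` is False for it.
        if x['left'] == 1 and x['right'] == 1:
            return 'both'
        elif x['left'] == 1 and x['right'] != 1:
            return 'leftOnly'
        else:
            return 'rightOnly'

    mergedDF['mergeReport'] = mergedDF.apply(mergeVar, axis=1)
    # drop() returns a new frame, so the result has to be assigned back.
    mergedDF = mergedDF.drop(['left', 'right'], axis=1)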

(I'd also be happy to help make it, but I'm relatively new to python, so would need some hand-holding to learn how to integrate a new option into an existing function...)
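For later readers, here is a rough sketch of what the requested in-line option can look like, assuming a pandas release in which `pd.merge` accepts the `indicator` argument (added after this request was filed). The two small frames and their contents are made up for illustration; only the key column names follow the kludge above.

```python
import pandas as pd

# Hypothetical example frames; only the key column names match the kludge above.
df1 = pd.DataFrame({"key1": ["a", "b", "c"], "x": [1, 2, 3]})
df2 = pd.DataFrame({"key2": ["b", "c", "d"], "y": [20, 30, 40]})

# indicator=True adds a categorical "_merge" column whose values are
# "left_only", "right_only", or "both" for each row of the result.
merged = pd.merge(
    df1, df2, how="outer", left_on="key1", right_on="key2", indicator=True
)

# How many rows fell into each category.
counts = merged["_merge"].value_counts()

# Pull out the rows of one category for inspection.
left_only = merged[merged["_merge"] == "left_only"]
right_only = merged[merged["_merge"] == "right_only"]
```

Passing a string instead of `True` (for example `indicator="mergeReport"`) names the status column, which avoids the temporary flag columns and the row-wise `apply`.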

@jreback
Contributor

jreback commented Nov 11, 2014

mentioned here originally: #7412

but that's closed. Yep, this could be an option. I'll put it on the list. It's not hard, but just takes a bit of work.

@jreback jreback added API Design Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Nov 11, 2014
@jreback jreback added this to the 0.16.0 milestone Nov 11, 2014
@nickeubank
Contributor Author

Thanks!

@nickeubank
Contributor Author

As an aside, that would also fill another gap in Pandas, which is a lack of a "difference" function that reports what keys aren't shared between dataframes.

@jreback
Contributor

jreback commented Nov 11, 2014

look at set operations on Index objects: http://pandas.pydata.org/pandas-docs/stable/indexing.html#set-operations-on-index-objects

.difference()/.union() etc.
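A small sketch of how those set operations could be used to see which keys are not shared between two frames. The frames and key values here are invented for illustration; wrapping the key columns in `pd.Index` is one way to get at the set methods when the keys are not already the index.

```python
import pandas as pd

# Hypothetical frames with partially overlapping keys.
df1 = pd.DataFrame({"key1": ["a", "b", "c"]})
df2 = pd.DataFrame({"key2": ["b", "c", "d"]})

# Wrap the key columns in Index objects to use the set operations.
left_keys = pd.Index(df1["key1"])
right_keys = pd.Index(df2["key2"])

only_left = left_keys.difference(right_keys)   # keys present only in df1
only_right = right_keys.difference(left_keys)  # keys present only in df2
shared = left_keys.intersection(right_keys)    # keys present in both
all_keys = left_keys.union(right_keys)         # every key from either frame
```

Note that these operate on unique key values, so they report which keys match but not how many rows each key contributes.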

@nickeubank
Contributor Author

Brilliant! Thanks! You guys are the best!

@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@jreback jreback modified the milestones: 0.17.0, Next Major Release May 10, 2015