Heckman #1935 (Open)

bert9bert wants to merge 38 commits into statsmodels:main from bert9bert:heckman
Conversation

bert9bert (Author)

Added a module to estimate the Heckman selection model using the Heckman two-step estimator. A test module checks the parameter estimates this module produces on an example dataset against the estimates Stata produces. If this fits well into the existing statsmodels code, I'd like to add an MLE estimation method as well; Stata currently offers both the two-step and the MLE estimator for the Heckman model.

self.nobs_uncensored = self.nobs = np.sum(self.treated)
self.nobs_censored = self.nobs_total - self.nobs_uncensored

# store variable names if data came in as Pandas objects
Member (inline review comment on the lines above):

This part is taken care of by super() for endog and exog. Variable names and some descriptive information about the data are stored in the model.data attribute.
Check dir(model.data); I don't know the attribute names by heart.
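A minimal sketch of that check (not part of this PR), using any model fit on pandas data; the attribute names ynames and xnames are from memory and may differ by statsmodels version:

import statsmodels.api as sm

# fit any model on pandas data to see what the data handler stores
data = sm.datasets.spector.load_pandas()
model = sm.OLS(data.endog, sm.add_constant(data.exog))

# variable names and related metadata live on model.data, not on the model itself
print([name for name in dir(model.data) if not name.startswith('_')])
print(model.data.ynames)  # endog variable name (attribute name assumed)
print(model.data.xnames)  # exog column names (attribute name assumed)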

@josef-pkt (Member)

I went quickly through this to get a rough idea.

Looks good so far; I don't think we will run into any serious problems with the inherited methods in this case.
I added a few inline comments based on that rough reading.

This will be for 0.7; it's already too late to add new models to 0.6.
0.7 will hopefully come out just a few months later.

BTW:
We have a Stata ado file that dumps results from Stata into a Python module; I use it quite heavily for writing unit tests.
My latest version, for Stata 12, is in PR #1716.
An example of its output is here:
https://github.com/statsmodels/statsmodels/blob/master/statsmodels/regression/tests/results/results_grunfeld_ols_robust_cluster.py
and it is used in the corresponding test file test_robustcov.py.

I'm not sure where I have a do-file, but here is part of what I used for GLM.fit_constrained.
The "boiler plate" part is meant to be copy-pasted and adjusted for model- or case-specific arrays.
All of ereturn list is included. (A sketch of how such a generated results module is typically consumed in a test follows the do-file.)



tempname filename
local filename = "results_glm_logit_constrained_.py"

insheet using "E:\Josef\eclipsegworkspace\statsmodels-git\statsmodels-all-new2_py27\statsmodels\statsmodels\datasets\spector\spector.csv", delimiter(" ")
constraint 1 gpa = 2.8
constraint 2 gpa = psi

glm grade  gpa tuce psi, family(binomial)
/* boiler plate, add matrices if needed */
tempname cov
tempname params_table
matrix cov = e(V)
matrix params_table = r(table)'
estat ic
matrix infocrit = r(S)
estmat2nparray params_table cov infocrit, saving(`filename') format("%16.0g") resname("noconstraint") replace
/*------------------*/

glm grade  gpa tuce psi, family(binomial) vce(robust)
/* boiler plate, add matrices if needed */
tempname cov
tempname params_table
matrix cov = e(V)
matrix params_table = r(table)'
estat ic
matrix infocrit = r(S)
estmat2nparray params_table cov infocrit, saving(`filename') format("%16.0g") resname("noconstraint_robust") append
/*------------------*/


glm grade  gpa tuce psi, family(binomial) constraints(1)
/* boiler plate, add matrices if needed */
tempname cov
tempname params_table
matrix cov = e(V)
matrix params_table = r(table)'
estat ic
matrix infocrit = r(S)
estmat2nparray params_table cov infocrit, saving(`filename') format("%16.0g") resname("constraint1") append
/*------------------*/
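For context (an editor sketch, not part of this PR), a results module generated by estmat2nparray as in the do-file above is usually consumed in a test roughly like this; the module and attribute names (results_glm_logit_constrained, params_table, cov) mirror the saving()/resname() arguments but are illustrative:

import numpy as np
from numpy.testing import assert_allclose

# from .results import results_glm_logit_constrained as res_stata  # generated file

def check_against_stata(fitted, res_stata):
    # Stata's transposed r(table) has one row per parameter: (b, se, z, pvalue, ll, ul, ...)
    table = np.asarray(res_stata.params_table)
    assert_allclose(fitted.params, table[:, 0], rtol=1e-6)
    assert_allclose(fitted.bse, table[:, 1], rtol=1e-6)
    assert_allclose(fitted.cov_params(), np.asarray(res_stata.cov), rtol=1e-6)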

@josef-pkt (Member)

About MLE:
You can currently call super().fit(), which has the nonlinear optimizers. Some of them only need loglike; others also need score (the first derivative of loglike), or score and hessian.

One design problem will be that you have two different kinds of parameter vectors (of different lengths), since MLE estimates the joint model.
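For orientation, an editor sketch (not this PR's API) of what calling the inherited LikelihoodModel.fit looks like; `mod` stands for any LikelihoodModel subclass instance:

# Nelder-Mead needs only loglike; gradient-based optimizers use score
# (numerically approximated if not implemented); Newton-type methods
# also use the hessian.
res_nm = mod.fit(method='nm', maxiter=2000, disp=False)
res_bfgs = mod.fit(method='bfgs', disp=False)
res_ncg = mod.fit(method='ncg', disp=False)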

@josef-pkt (Member)

I don't find the comments about missing-value handling in this PR anymore.

To the latest comment:
Extra arrays can currently only be either 1d or 2d-square (like covariance matrices).
I think it should be relatively easy to add a check for 2d arrays that are (nobs, k) in the corresponding code in base.data. We would already need this in several models, like GEE and IV2SLS, IIRC. (Needs a new issue.)

@coveralls

Coverage Status

Coverage increased (+0.05%) when pulling 2b64a40 on bert9bert:heckman into 957a43e on statsmodels:master.

@coveralls

Coverage Status

Coverage increased (+0.01%) when pulling 9762164 on bert9bert:heckman into 957a43e on statsmodels:master.

@josef-pkt (Member)

If you need numerical derivatives, either temporarily or because it's difficult to get an analytical expression, you could copy the score and hessian methods from base.model.GenericLikelihoodModel.

Another tip: if loglike and similar quantities are just the sum of the individual contributions, it's better to define the xxx_obs version and do only the summation in the main method: loglikeobs versus loglike,
score_obs versus score.
(score_obs is needed in general for sandwich covariance calculations and Lagrange multiplier/score tests.)
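A minimal sketch of that pattern (illustrative, not the PR's code), assuming a GenericLikelihoodModel-style subclass:

from statsmodels.base.model import GenericLikelihoodModel

class SomeSelectionModel(GenericLikelihoodModel):
    def loglikeobs(self, params):
        # return an (nobs,) array of per-observation log-likelihood contributions
        raise NotImplementedError

    def loglike(self, params):
        # the "main" method only sums the per-observation contributions
        return self.loglikeobs(params).sum()

    # score and hessian can be left to the numerical-derivative fallbacks in
    # GenericLikelihoodModel until analytical versions are available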

@coveralls

Coverage Status

Coverage decreased (-0.05%) when pulling a6b552d on bert9bert:heckman into 957a43e on statsmodels:master.

@josef-pkt (Member)

Our git workflow is to rebase branches on master, not to merge master into a feature branch. The commit list now shows many commits from master.

I can fix this in an interactive rebase before merging. Or, if you are comfortable enough with git, you could check out your branch at your second-to-last commit (before the merge of master) and rebase it on master instead of merging master, assuming you don't have any commits after the merge commit.

@josef-pkt (Member)

Looks good overall.

The main design problem that is left is how to handle the two different definitions of params in the Results class, which depend on which estimation method was chosen.
We don't yet have a precedent for that, but we will get more cases like this.
Ping me (add a comment) if there are questions about that; I don't know yet where we will run into problems with this kind of model.
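One purely hypothetical shape for that (an editor sketch, not a decision or the PR's design): keep the full estimated vector in params and expose named slices, with an attribute recording which estimator produced them.

class HeckmanResultsSketch:
    def __init__(self, params, k_outcome, k_select, method="twostep"):
        self.method = method                  # "twostep" or "mle"
        self.params = params                  # full parameter vector as estimated
        self.outcome_params = params[:k_outcome]
        self.select_params = params[k_outcome:k_outcome + k_select]
        # under MLE the joint vector additionally carries rho and sigma
        # (or transformations of them); under two-step those are derived quantities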

from numpy.testing import assert_
import pdb

import heckman

This comment was marked as resolved.

@josef-pkt (Member)

There is also a printed summary in the test log on Travis CI that contains NaNs in the bse column of the summary table. Is this intended?

@jseabold added this to the 0.7 milestone (Sep 20, 2014)
@bert9bert (Author)

The NaNs in the summary tables for the MLE estimates are not intentional. The LikelihoodModel class sometimes returns a covariance matrix with negative entries on the diagonal for the MLE estimates, and the summary table reports NaN because it takes the square roots of those entries to compute the standard errors.

Should I leave it as it is, since that's what the LikelihoodModel superclass returns, or would it be better to implement bootstrap standard errors?
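A small illustration (not from the PR) of how those NaNs arise: the summary takes the square root of the diagonal of the covariance matrix, and numpy returns NaN for negative entries.

import numpy as np

cov = np.array([[0.5, 0.1], [0.1, -0.2]])     # made-up covariance with a negative variance
print(np.sqrt(np.diag(cov)))                  # [0.707..., nan]  -> the NaN bse
print(np.all(np.linalg.eigvalsh(cov) > 0))    # False: not positive definite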

@josef-pkt (Member)

Bootstrap standard errors are a bonus.

But we need to check that we get a valid, positive definite Hessian in a regular case (when we don't have close-to-perfect multicollinearity and the parameters are identified).

Can you put the (or an) example with a non-positive-definite Hessian in a script? I can look at it in a few days; I've gotten pretty good at debugging optimization problems in most cases.

BTW: along the same lines as your MLE here, I recently saw a discussion on the Stata forum about doing MLE for a bivariate probit. Once this PR is finished and Heckman is working, it will be easier to add similar models following the same or a similar pattern.

bnds.append((-1,1))
bnds.append((0,None))

heckman_res = heckman_model.fit(method='mle', method_mle='ncg', disp=False)
Author (inline comment on the lines above):

This example of an MLE fit will produce a covariance estimate that is not positive-definite.

@josef-pkt (Member)

I'm trying to figure out what's going on in the MLE estimation.
So far I don't see why the results are not converging, or why the Hessian is zero.

A possible candidate is the log-likelihood defined in loglikeobs.
It looks different from the one in the Stata manual, and the values of llf look different.

What reference did you use for the log-likelihood?
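For comparison (an editor sketch, not the PR's implementation), the per-observation log-likelihood of the Stata manual's ML formulation of the Heckman model can be written as below; the parameter layout [beta, gamma, atanh(rho), ln(sigma)] is an assumption made only for this sketch:

import numpy as np
from scipy import stats

def heckman_loglikeobs(params, y, X, Z, selected, k_out, k_sel):
    # y, X, Z have one row per observation; y values for censored rows are ignored.
    # params = [beta (k_out), gamma (k_sel), atanh(rho), ln(sigma)]
    beta = params[:k_out]
    gamma = params[k_out:k_out + k_sel]
    rho = np.tanh(params[k_out + k_sel])        # keeps rho in (-1, 1)
    sigma = np.exp(params[k_out + k_sel + 1])   # keeps sigma > 0

    zg = Z @ gamma
    llobs = np.empty(Z.shape[0])

    # censored observations: only the selection probit term
    llobs[~selected] = stats.norm.logcdf(-zg[~selected])

    # uncensored observations: selection term conditional on the outcome
    # residual, plus the normal density of the outcome residual
    resid = y[selected] - X[selected] @ beta
    arg = (zg[selected] + resid * rho / sigma) / np.sqrt(1 - rho**2)
    llobs[selected] = (stats.norm.logcdf(arg)
                       - 0.5 * (resid / sigma)**2
                       - np.log(np.sqrt(2 * np.pi) * sigma))
    return llobs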

@josef-pkt (Member)

Two other things:

Can you rebase on current master and force-push, so we get rid of the unrelated test failures?

To check what's going on, I attached the full MLE results instance to the results, i.e. added one line at the end of _fit_mle:

results.results_mle = results_mle
return results

This will be useful in general for further analyzing the results.

@coveralls

coveralls commented Feb 23, 2017

Coverage Status

Coverage increased (+0.02%) to 90.325% when pulling 1a498b9 on bert9bert:heckman into 898ddfc on statsmodels:master.

@coveralls

coveralls commented Feb 26, 2017

Coverage Status

Coverage increased (+0.02%) to 90.325% when pulling ef251c0 on bert9bert:heckman into d206799 on statsmodels:master.

@coveralls

coveralls commented Feb 27, 2017

Coverage Status

Coverage increased (+0.02%) to 90.302% when pulling 4731560 on bert9bert:heckman into e139f6d on statsmodels:master.

@coveralls

Coverage Status

Coverage increased (+0.03%) to 90.307% when pulling 3154a33 on bert9bert:heckman into e139f6d on statsmodels:master.

(1 similar comment from @coveralls)

@coveralls

coveralls commented Mar 27, 2017

Coverage Status

Coverage increased (+0.03%) to 90.316% when pulling 92ea622 on bert9bert:heckman into e139f6d on statsmodels:master.

@josef-pkt modified the milestones: 0.8, 0.14 (Aug 10, 2021)
@josef-pkt (Member)

bump for 0.14

After zero-inflated and Beta regression, we have more experience with and support for multi-part models.
IIRC, this still needs a decision on what the API looks like when different estimators are used, MLE versus two-step.

@josef-pkt mentioned this pull request (Aug 10, 2021)
@josef-pkt modified the milestones: 0.14, 0.15 (Nov 30, 2022)
@lgtm-com

lgtm-com bot commented Nov 30, 2022

This pull request introduces 6 alerts when merging 92ea622 into 2d41628 - view on LGTM.com

new alerts:

  • 5 for Unused local variable
  • 1 for Unreachable code

Heads-up: LGTM.com's PR analysis will be disabled on the 5th of December, and LGTM.com will be shut down ⏻ completely on the 16th of December 2022. It looks like GitHub code scanning with CodeQL is already set up for this repo, so no further action is needed 🚀. For more information, please check out our post on the GitHub blog.
