Heckman #1935 (Open)

bert9bert wants to merge 38 commits into statsmodels:main from bert9bert:heckman
Conversation

bert9bert (Author)

Added a module to estimate the Heckman selection model using the Heckman two-step estimator. A test module checks the parameter estimates this module produces on an example dataset against the estimates Stata produces. If this fits well into the existing statsmodels code, I'd like to add an MLE estimation method as well; Stata currently offers both the two-step and the MLE estimator for the Heckman model.

self.nobs_uncensored = self.nobs = np.sum(self.treated)
self.nobs_censored = self.nobs_total - self.nobs_uncensored

# store variable names if data came in as Pandas objects
Member (inline review comment on the lines above):

This part is taken care of by super() for endog and exog. Variable names and some descriptive information about the data are stored in the model.data attribute.
Check dir(model.data); I don't know the attribute names by heart.
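A minimal sketch of that check (not part of this PR), using any model fit on pandas data; the attribute names ynames and xnames are from memory and may differ by statsmodels version:

import statsmodels.api as sm

# fit any model on pandas data to see what the data handler stores
data = sm.datasets.spector.load_pandas()
model = sm.OLS(data.endog, sm.add_constant(data.exog))

# variable names and related metadata live on model.data, not on the model itself
print([name for name in dir(model.data) if not name.startswith('_')])
print(model.data.ynames)  # endog variable name (attribute name assumed)
print(model.data.xnames)  # exog column names (attribute name assumed)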

@josef-pkt (Member)

I went quickly through this to get a rough idea.

Looks good so far; I don't think we will run into any serious problems with the inherited methods in this case.
I added a few inline comments based on that rough reading.

This will be for 0.7; it's already too late to add new models to 0.6.
0.7 will hopefully come out just a few months later.

BTW:
We have a Stata ado file that dumps results from Stata into a Python module; I use it quite heavily for writing unit tests.
My latest version, for Stata 12, is in PR #1716.
An example of its output is here:
https://github.com/statsmodels/statsmodels/blob/master/statsmodels/regression/tests/results/results_grunfeld_ols_robust_cluster.py
and it is used in the corresponding test file test_robustcov.py.

I'm not sure where I have a do-file, but here is part of what I used for GLM.fit_constrained.
The "boiler plate" part is meant to be copy-pasted and adjusted for model- or case-specific arrays.
All of ereturn list is included. (A sketch of how such a generated results module is typically consumed in a test follows the do-file.)



tempname filename
local filename = "results_glm_logit_constrained_.py"

insheet using "E:\Josef\eclipsegworkspace\statsmodels-git\statsmodels-all-new2_py27\statsmodels\statsmodels\datasets\spector\spector.csv", delimiter(" ")
constraint 1 gpa = 2.8
constraint 2 gpa = psi

glm grade  gpa tuce psi, family(binomial)
/* boiler plate, add matrices if needed */
tempname cov
tempname params_table
matrix cov = e(V)
matrix params_table = r(table)'
estat ic
matrix infocrit = r(S)
estmat2nparray params_table cov infocrit, saving(`filename') format("%16.0g") resname("noconstraint") replace
/*------------------*/

glm grade  gpa tuce psi, family(binomial) vce(robust)
/* boiler plate, add matrices if needed */
tempname cov
tempname params_table
matrix cov = e(V)
matrix params_table = r(table)'
estat ic
matrix infocrit = r(S)
estmat2nparray params_table cov infocrit, saving(`filename') format("%16.0g") resname("noconstraint_robust") append
/*------------------*/


glm grade  gpa tuce psi, family(binomial) constraints(1)
/* boiler plate, add matrices if needed */
tempname cov
tempname params_table
matrix cov = e(V)
matrix params_table = r(table)'
estat ic
matrix infocrit = r(S)
estmat2nparray params_table cov infocrit, saving(`filename') format("%16.0g") resname("constraint1") append
/*------------------*/
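For context (an editor sketch, not part of this PR), a results module generated by estmat2nparray as in the do-file above is usually consumed in a test roughly like this; the module and attribute names (results_glm_logit_constrained, params_table, cov) mirror the saving()/resname() arguments but are illustrative:

import numpy as np
from numpy.testing import assert_allclose

# from .results import results_glm_logit_constrained as res_stata  # generated file

def check_against_stata(fitted, res_stata):
    # Stata's transposed r(table) has one row per parameter: (b, se, z, pvalue, ll, ul, ...)
    table = np.asarray(res_stata.params_table)
    assert_allclose(fitted.params, table[:, 0], rtol=1e-6)
    assert_allclose(fitted.bse, table[:, 1], rtol=1e-6)
    assert_allclose(fitted.cov_params(), np.asarray(res_stata.cov), rtol=1e-6)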

@josef-pkt (Member)

About MLE:
You can currently call super().fit(), which has the nonlinear optimizers. Some of them only need loglike; others also need score (the first derivative of loglike), or score and hessian.

One design problem will be that you have two different kinds of parameter vectors (of different lengths), since MLE estimates the joint model.
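For orientation, an editor sketch (not this PR's API) of what calling the inherited LikelihoodModel.fit looks like; `mod` stands for any LikelihoodModel subclass instance:

# Nelder-Mead needs only loglike; gradient-based optimizers use score
# (numerically approximated if not implemented); Newton-type methods
# also use the hessian.
res_nm = mod.fit(method='nm', maxiter=2000, disp=False)
res_bfgs = mod.fit(method='bfgs', disp=False)
res_ncg = mod.fit(method='ncg', disp=False)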

@josef-pkt (Member)

I don't find the comments about missing-value handling in this PR anymore.

To the latest comment:
Extra arrays can currently only be either 1d or 2d-square (like covariance matrices).
I think it should be relatively easy to add a check for 2d arrays that are (nobs, k) in the corresponding code in base.data. We would already need this in several models, like GEE and IV2SLS, IIRC. (Needs a new issue.)

@coveralls

Coverage Status

Coverage increased (+0.05%) when pulling 2b64a40 on bert9bert:heckman into 957a43e on statsmodels:master.

@coveralls

Coverage Status

Coverage increased (+0.01%) when pulling 9762164 on bert9bert:heckman into 957a43e on statsmodels:master.

@josef-pkt (Member)

If you need numerical derivatives, either temporarily or because it's difficult to get an analytical expression, you could copy the score and hessian methods from base.model.GenericLikelihoodModel.

Another tip: if loglike and similar quantities are just the sum of the individual contributions, it's better to define the xxx_obs version and do only the summation in the main method: loglikeobs versus loglike,
score_obs versus score.
(score_obs is needed in general for sandwich covariance calculations and Lagrange multiplier/score tests.)
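A minimal sketch of that pattern (illustrative, not the PR's code), assuming a GenericLikelihoodModel-style subclass:

from statsmodels.base.model import GenericLikelihoodModel

class SomeSelectionModel(GenericLikelihoodModel):
    def loglikeobs(self, params):
        # return an (nobs,) array of per-observation log-likelihood contributions
        raise NotImplementedError

    def loglike(self, params):
        # the "main" method only sums the per-observation contributions
        return self.loglikeobs(params).sum()

    # score and hessian can be left to the numerical-derivative fallbacks in
    # GenericLikelihoodModel until analytical versions are available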

@coveralls

Coverage Status

Coverage decreased (-0.05%) when pulling a6b552d on bert9bert:heckman into 957a43e on statsmodels:master.

@josef-pkt (Member)

Our git workflow is to rebase branches on master, not to merge master into a feature branch. The commit list now shows many commits from master.

I can fix this in an interactive rebase before merging. Or, if you are comfortable enough with git, you could check out your branch at your second-to-last commit (before the merge of master) and rebase it on master instead of merging master, assuming you don't have any commits after the merge commit.

@josef-pkt (Member)

Looks good overall.

The main design problem that is left is how to handle the two different definitions of params in the Results class, which depend on which estimation method was chosen.
We don't yet have a precedent for that, but we will get more cases like this.
Ping me (add a comment) if there are questions about that; I don't know yet where we will run into problems with this kind of model.
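One purely hypothetical shape for that (an editor sketch, not a decision or the PR's design): keep the full estimated vector in params and expose named slices, with an attribute recording which estimator produced them.

class HeckmanResultsSketch:
    def __init__(self, params, k_outcome, k_select, method="twostep"):
        self.method = method                  # "twostep" or "mle"
        self.params = params                  # full parameter vector as estimated
        self.outcome_params = params[:k_outcome]
        self.select_params = params[k_outcome:k_outcome + k_select]
        # under MLE the joint vector additionally carries rho and sigma
        # (or transformations of them); under two-step those are derived quantities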

from numpy.testing import assert_
import pdb

import heckman

This comment was marked as resolved.

@josef-pkt (Member)

There is also a printed summary in the test log on Travis CI that contains NaNs in the bse column of the summary table. Is this intended?

@jseabold added this to the 0.7 milestone (Sep 20, 2014)
@bert9bert (Author)

The NaNs in the summary tables for the MLE estimates are not intentional. The LikelihoodModel class sometimes returns a covariance matrix with negative entries on the diagonal for the MLE estimates, and the summary table reports NaN because it takes the square roots of those entries to compute the standard errors.

Should I leave it as it is, since that's what the LikelihoodModel superclass returns, or would it be better to implement bootstrap standard errors?
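A small illustration (not from the PR) of how those NaNs arise: the summary takes the square root of the diagonal of the covariance matrix, and numpy returns NaN for negative entries.

import numpy as np

cov = np.array([[0.5, 0.1], [0.1, -0.2]])     # made-up covariance with a negative variance
print(np.sqrt(np.diag(cov)))                  # [0.707..., nan]  -> the NaN bse
print(np.all(np.linalg.eigvalsh(cov) > 0))    # False: not positive definite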

@josef-pkt (Member)

Bootstrap standard errors are a bonus.

But we need to check that we get a valid, positive definite Hessian in a regular case (when we don't have close-to-perfect multicollinearity and the parameters are identified).

Can you put the (or an) example with a non-positive-definite Hessian in a script? I can look at it in a few days; I've gotten pretty good at debugging optimization problems in most cases.

BTW: along the same lines as your MLE here, I recently saw a discussion on the Stata forum about doing MLE for a bivariate probit. Once this PR is finished and Heckman is working, it will be easier to add similar models following the same or a similar pattern.

bnds.append((-1,1))
bnds.append((0,None))

heckman_res = heckman_model.fit(method='mle', method_mle='ncg', disp=False)
Author (inline comment on the lines above):

This example of an MLE fit will produce a covariance estimate that is not positive-definite.

@josef-pkt (Member)

I'm trying to figure out what's going on in the MLE estimation.
So far I don't see why the results are not converging, or why the Hessian is zero.

A possible candidate is the log-likelihood defined in loglikeobs.
It looks different from the one in the Stata manual, and the values of llf look different.

What reference did you use for the log-likelihood?
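For comparison (an editor sketch, not the PR's implementation), the per-observation log-likelihood of the Stata manual's ML formulation of the Heckman model can be written as below; the parameter layout [beta, gamma, atanh(rho), ln(sigma)] is an assumption made only for this sketch:

import numpy as np
from scipy import stats

def heckman_loglikeobs(params, y, X, Z, selected, k_out, k_sel):
    # y, X, Z have one row per observation; y values for censored rows are ignored.
    # params = [beta (k_out), gamma (k_sel), atanh(rho), ln(sigma)]
    beta = params[:k_out]
    gamma = params[k_out:k_out + k_sel]
    rho = np.tanh(params[k_out + k_sel])        # keeps rho in (-1, 1)
    sigma = np.exp(params[k_out + k_sel + 1])   # keeps sigma > 0

    zg = Z @ gamma
    llobs = np.empty(Z.shape[0])

    # censored observations: only the selection probit term
    llobs[~selected] = stats.norm.logcdf(-zg[~selected])

    # uncensored observations: selection term conditional on the outcome
    # residual, plus the normal density of the outcome residual
    resid = y[selected] - X[selected] @ beta
    arg = (zg[selected] + resid * rho / sigma) / np.sqrt(1 - rho**2)
    llobs[selected] = (stats.norm.logcdf(arg)
                       - 0.5 * (resid / sigma)**2
                       - np.log(np.sqrt(2 * np.pi) * sigma))
    return llobs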

@josef-pkt (Member)

Two other things:

Can you rebase on current master and force-push, so we get rid of the unrelated test failures?

To check what's going on, I attached the full MLE results instance to the results, i.e. added one line at the end of _fit_mle:

results.results_mle = results_mle
return results

This will be useful in general for further analyzing the results.

@coveralls

coveralls commented Feb 23, 2017

Coverage Status

Coverage increased (+0.02%) to 90.325% when pulling 1a498b9 on bert9bert:heckman into 898ddfc on statsmodels:master.

@coveralls

coveralls commented Feb 26, 2017

Coverage Status

Coverage increased (+0.02%) to 90.325% when pulling ef251c0 on bert9bert:heckman into d206799 on statsmodels:master.

@coveralls

coveralls commented Feb 27, 2017

Coverage Status

Coverage increased (+0.02%) to 90.302% when pulling 4731560 on bert9bert:heckman into e139f6d on statsmodels:master.

@coveralls

Coverage Status

Coverage increased (+0.03%) to 90.307% when pulling 3154a33 on bert9bert:heckman into e139f6d on statsmodels:master.

(1 similar comment from @coveralls)

@coveralls

coveralls commented Mar 27, 2017

Coverage Status

Coverage increased (+0.03%) to 90.316% when pulling 92ea622 on bert9bert:heckman into e139f6d on statsmodels:master.

@josef-pkt modified the milestones: 0.8, 0.14 (Aug 10, 2021)
@josef-pkt (Member)

bump for 0.14

After zero-inflated and Beta regression, we have more experience with and support for multi-part models.
IIRC, this still needs a decision on what the API looks like when different estimators are used, MLE versus two-step.

@josef-pkt mentioned this pull request (Aug 10, 2021)
@josef-pkt modified the milestones: 0.14, 0.15 (Nov 30, 2022)
@lgtm-com

lgtm-com bot commented Nov 30, 2022

This pull request introduces 6 alerts when merging 92ea622 into 2d41628 - view on LGTM.com

new alerts:

  • 5 for Unused local variable
  • 1 for Unreachable code

Heads-up: LGTM.com's PR analysis will be disabled on the 5th of December, and LGTM.com will be shut down ⏻ completely on the 16th of December 2022. It looks like GitHub code scanning with CodeQL is already set up for this repo, so no further action is needed 🚀. For more information, please check out our post on the GitHub blog.
