This homework builds on what we studied in class. We are going to simulate from the very simple model of labor supply we considered.
The agent problem is
\[ \max_{c,h,e} c - \beta \frac{h^{1+\gamma}}{1+\gamma}\\ \text{s.t. } c = e \cdot \rho \cdot w\cdot h +(1-e)\cdot r \cdot h \] The individual takes his wage \(w\) as given, he chooses hours of work \(h\) and consumption \(c\). He also chooses whether to work (\(e=0\)) in the labor force (\(e=1\))or to work at home where he has an equivalent wage \(r\).
We are going to simulate a data set where agents will choose participation as well as the number of hours if they decide to work. This requires for us to specify how each of the individual specific variables are drawn. We then set the following:
\[ \begin{align*} \log W_i &= \eta X_i + Z_i + u_i \\ \log R_i &= \delta_0 + \log(W_i) + \delta Z_i + \xi_i \\ \log \beta_i &= \nu X_i +\epsilon_i + a \xi_i \\ \end{align*} \]
and finally \((X_i,Z_i,\epsilon_i,u_i,\xi_i)\) are independent normal draws. Given all of this we can simulate our data. The \(\xi\) enter both equations on purpose.
Question 1 What does the \(a\) parameter capture here?
set.seed(31130)
p = list(gamma = 0.8,beta=1,a=1,rho=1,eta=0.2,delta=-0.2,delta0=-0.1,nu=0.5) # parameters
N=10000 # size of the simulation
simdata = data.table(i=1:N,X=rnorm(N))
# simulating variables
simdata[,X := rnorm(N)]
simdata[,Z := rnorm(N)]
simdata[,u := rnorm(N)]
simdata[,lw := p$eta*X + Z + 0.2*u ] # log wage
simdata[,xi := rnorm(N)*0.2]
simdata[,lr := lw + p$delta0+ p$delta*Z + xi]; # log home productivity
simdata[,eps:=rnorm(N)*0.2]
simdata[,beta := exp(p$nu*X + p$a*xi + eps)]; # heterogenous beta coefficient
# compute decision variables
simdata[, lfp := log(p$rho) + lw >= lr] # labor force participation
simdata[, h := (p$rho * exp(lw)/beta)^(1/p$gamma)] # hours
simdata[lfp==FALSE,h:=NA][lfp==FALSE,lw:=NA]
simdata[,mean(lfp)]
We have now our simulated data.
Question 2 Simulate data with \(a=0\) and \(a=1\). Comment on the value of the coefficient of the regression of log hours on log wage and X.
As we have seen in class, Heckman (74) offers a way for us to correct the our regression in order to recover our structural parameters. We need to understand how the error term in the hour regression correlates with the labor participation decision.
Question 3 Following what we did in class, and using the class note, derive the expression for the Heckman correction term as a function of known parameters. In other words, derive \(E( a \xi_i + \epsilon_i | X_i, \log W_i, lfp=1)\).
Construction of this epxression requires us to recover the parameters \(\delta/\sigma_{\xi},\delta_0/\sigma_{\xi}\). We can get these by running a probit of participation on \(Z_i\).
fit2 = glm(lfp ~ Z,simdata,family = binomial(link = "probit"))
Question 4 Check that the regression does recover correctly the coefficients. Use them to construct the inverse Mills ratio. Use the correction you created and show that the regression with this extra term delivers the correct estimates for \(\gamma\) even in the case where \(a\neq 0\).
Lastly we want to replicate the approach of Blundell, Duncan and Meghir (1998) paper . The paper relies on the assumption that the endogeneity term can be expressed as a the sum of a term which is group specific and a term which is time specific. To achieve this structure we simply change the distribution of the \(\xi\) and make it an exponential, to be precise, assume that \(-\xi\) is distributed according to an exponential.
Question 5 Given that for the exponential distribution we have that \(E[ X | X > a] = 1+ a\), express the new conditional mean of \(\xi\) in the regression of log-hours on log-wages and X. Show that it can be writen at a linear term in \(\delta_0\) and \(Z_i\).
In estimation however we do not want to stricly rely on the linearity in \(Z\) and hence we are going to follow the procedure in the paper and bin the \(Z\) variable and rely on variation in time. The following code will simulate two cross-sections of data (not a panel). The difference from period 1 to 2 will be reflected in a change in the return to \(X\) and a change in the intercept of the labor force participation intercept.
simulate <- function(p,N,t) {
simdata = data.table(i=1:N,X=rnorm(N))
# simulating variables
simdata[,X := rnorm(N)]
simdata[,Z := rnorm(N)]
simdata[,u := rnorm(N)]
simdata[,lw := p$eta*X + Z + 0.2*u ] # log wage
simdata[,xi := -rexp(N)]
simdata[,lr := lw + p$delta0 + p$delta*Z + xi]; # log home productivity
simdata[,eps:=rnorm(N)*0.2]
simdata[,beta := exp(p$nu*X + p$a*xi + eps)]; # heterogenous beta coefficient
# compute decision variables
simdata[, lfp := log(p$rho) + lw >= lr] # labor force participation
simdata[, h := (p$rho * exp(lw)/beta)^(1/p$gamma)] # hours
# make hours and wages unobserved in case individual doesn't work
simdata[lfp==FALSE,h:=NA][lfp==FALSE,lw:=NA]
simdata[,mean(lfp)]
# store time
simdata[, t := t]
return(simdata)
}
p = list(gamma = 0.8,beta=1,a=1,rho=1,eta=0.2,delta=-1.0,delta0=-0.1,nu=0.5) # parameters
N=50000 # size of the simulation
# simulate period 1
sim1 = simulate(p,N,1);
p$eta = 0.4; p$delta0 = -1.1
# simulate period 2 with different eta and delta0 (incuding intercept shifts and variation in wages)
sim2 = simulate(p,N,2);
simdata = rbind(sim1,sim2) # combine the two periods
Question 6 Run the regression of log-hours on log-wages and X how describe the bias in the recovered parameters. Similarly compute the inverse Mills Ratio, include in the regression and comment on the value of the estimate parameters. Is it still biased?
Question 7 Finally slice the \(Z\) variables into deciles (or more splits) and regress log-hours on log-wage, X, dummies for each of the bins, and time \(t\). This mimics the procedure of Blundell Duncan Meghir, does this recovers a coefficient closer to the estimand of interest?