statistical analysis – Page 2 – Economic Thinking In Action

Younger workers today have slightly less attachment to the workforce than younger workers in the past

Big Data. Bureau of Labor Statistics. Survey data. Employment Big Data. Calculating worklife expectancy for U.S. workers requires all of these. Worklife expectancy is similar to life expectancy and indicates how long a person can be expected to be active in the workforce over their working life. The worklife expectancy figure takes into account the anticipated time out of the labor market due to unemployment, voluntary leaves, attrition, etc.

The goal of our recent work is to update the Millimet et al. (2002) worklife expectancy paper to account for more recent CPS data. Their paper uses data from the 1992 to 2000 time period. We re-estimate the Millimet et al. (2002) econometric worklife models using data from 2000 to 2013 to see whether the more recent data change the results of the 2002 paper in any substantive way.

Finding: Overall, the worklife expectancy estimated using the more recent 2000-2013 data is shorter than in the earlier (1992-2000) data set. This is true for younger workers (18 to early 40s): younger workers from the more recent cohorts have a shorter expected worklife than younger workers in the earlier cohorts. Conversely, older workers in their 40s and 50s have a slightly longer worklife expectancy in the later time period data set. We are in the process of determining the statistical significance of these differences.

Table 4. Comparison of Worklife Expectancy for 1992-2000 and 2001-2013 Time Periods
	1992-2000		2001-2013
Age	Less than High School	High School	Less than High School	High School
18	31.469	38.410	30.569	37.314
19	30.926	37.846	30.128	36.833
20	30.306	37.180	29.603	36.237
21	29.670	36.493	29.021	35.590
22	29.027	35.787	28.419	34.917
23	28.365	35.054	27.809	34.231
24	27.685	34.293	27.205	33.539
25	27.007	33.518	26.588	32.830
26	26.319	32.728	25.964	32.108
27	25.643	31.939	25.357	31.387
28	24.958	31.123	24.736	30.646
29	24.271	30.304	24.110	29.892
30	23.590	29.481	23.491	29.136
31	22.892	28.640	22.866	28.371
32	22.191	27.796	22.237	27.599
33	21.487	26.944	21.606	26.819
34	20.783	26.097	20.970	26.034
35	20.095	25.254	20.327	25.239
36	19.400	24.408	19.685	24.446
37	18.707	23.560	19.039	23.648
38	18.018	22.714	18.392	22.850
39	17.324	21.864	17.737	22.044
40	16.627	21.014	17.085	21.242
41	15.944	20.169	16.421	20.432
42	15.264	19.328	15.764	19.627
43	14.595	18.494	15.110	18.825
44	13.931	17.664	14.456	18.024
45	13.272	16.840	13.798	17.220
46	12.616	16.018	13.154	16.429
47	11.972	15.204	12.520	15.641
48	11.328	14.398	11.886	14.859
49	10.682	13.593	11.259	14.081
50	10.053	12.803	10.642	13.311
51	9.432	12.020	10.030	12.550
52	8.802	11.239	9.429	11.798
53	8.199	10.477	8.843	11.057
54	7.593	9.723	8.270	10.333
55	6.996	8.980	7.709	9.618
56	6.422	8.263	7.152	8.912
57	5.872	7.564	6.618	8.230
58	5.339	6.883	6.095	7.560
59	4.812	6.216	5.587	6.908
60	4.307	5.578	5.097	6.280
61	3.840	4.979	4.624	5.677
62	3.400	4.415	4.181	5.112
63	3.024	3.918	3.782	4.593
64	2.708	3.485	3.428	4.128
65	2.422	3.093	3.109	3.700
66	2.180	2.756	2.819	3.312
67	1.970	2.461	2.556	2.960
68	1.787	2.200	2.323	2.646
69	1.624	1.967	2.102	2.359
70	1.471	1.756	1.905	2.101
71	1.348	1.584	1.728	1.869
72	1.238	1.430	1.577	1.670
73	1.134	1.289	1.427	1.484
74	1.042	1.167	1.296	1.322
75	0.965	1.065	1.184	1.181
76	0.904	0.983	1.077	1.054
77	0.834	0.899	0.980	0.942
78	0.784	0.836	0.894	0.843
79	0.735	0.778	0.807	0.750
80	0.694	0.735	0.675	0.636

Notes:

The logistic equation includes independent variables for age, age squared, race, race by age interaction, race by age interaction squared, marital status, marital status by age, occupation dummies, year and year dummies.

The model is first estimated separately for each gender and education level combination for active persons. The model is then estimated again for inactive persons. The educational attainment variables used to estimate our model differ from those of Millimet et al. (2002). In our model, only individuals whose highest level of attainment is high school are included in the high school category. Millimet et al. (2002) include individuals with some college in the high school category.
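To see where the two periods cross over, the figures in Table 4 can be differenced directly. The sketch below is only an illustration: it copies a handful of rows from the high school columns of Table 4 and reports the later-period estimate minus the earlier-period estimate. The function and variable names are ours, not part of the original analysis.

```python
# Selected rows copied from Table 4 (high school columns):
# age -> (WLE in 1992-2000, WLE in 2001-2013)
high_school = {
    18: (38.410, 37.314),
    26: (32.728, 32.108),
    35: (25.254, 25.239),
    36: (24.408, 24.446),
    45: (16.840, 17.220),
    55: (8.980, 9.618),
}


def differences(table):
    """Return {age: later-period WLE minus earlier-period WLE}, rounded to 3 places."""
    return {age: round(later - earlier, 3) for age, (earlier, later) in table.items()}


def first_age_later_is_longer(table):
    """First listed age at which the later-period WLE exceeds the earlier one."""
    for age in sorted(table):
        earlier, later = table[age]
        if later > earlier:
            return age
    return None


diffs = differences(high_school)
for age in sorted(diffs):
    print(f"age {age}: {diffs[age]:+.3f} years")
print("later period first longer at age", first_age_later_is_longer(high_school))
```

A negative difference means the more recent cohort has the shorter worklife expectancy at that age. Whether any of these gaps is statistically significant is a separate question that this arithmetic does not address.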

Replication of the Millimet et al. (2002) work was sufficient and yielded similar results

Overall, the goal of our recent work is to update the Millimet et al. (2002) worklife expectancy paper to account for more recent CPS data. Their paper uses data from the 1992 to 2000 time period; we update it using data from 2000 to 2013. The main goal is to see whether estimating the Millimet et al. (2002) econometric worklife models with more recent data changes the results of the 2002 paper in any substantive way.

As for the results, there are several findings. First, we were able to create a matched CPS data set of 201,797 individuals, whereas Millimet et al. (2002) found 200,916 matched individuals.

Overall, we match their results very closely as well. For example, Millimet et al. (2002) found that a 26-year-old male with less than a high school education had 27.27 years of WLE remaining, while we found 26.319 years remaining based on our replication of their work. They found that a person of the same age with a high school education had 32.89 years remaining, while we found 32.728 years remaining. The replication was particularly good for both the less than high school and high school levels of educational attainment.

The WLE numbers are close, but not quite as close, for college and some college. This is primarily because we use different definitions of some college and college than Millimet et al. (2002) did.

Table 3. Comparison of Millimet et al. (2002) and Steward and Gaylor (2015) Active to Active Worklife Expectancy Probabilities
	Millimet et al (2002)		Steward and Gaylor (2015) Replication
Age	Less than High School	High School	Less than High School	High School
18	32.331	38.944	31.469	38.410
19	31.801	38.239	30.926	37.846
20	31.247	37.522	30.306	37.180
21	30.684	36.794	29.670	36.493
22	30.080	36.058	29.027	35.787
23	29.450	35.294	28.365	35.054
24	28.766	34.513	27.685	34.293
25	28.035	33.711	27.007	33.518
26	27.270	32.890	26.319	32.728
27	26.495	32.052	25.643	31.939
28	25.710	31.201	24.958	31.123
29	24.923	30.341	24.271	30.304
30	24.131	29.477	23.590	29.481
31	23.345	28.606	22.892	28.640
32	22.556	27.735	22.191	27.796
33	21.775	26.862	21.487	26.944
34	21.006	25.989	20.783	26.097
35	20.233	25.112	20.095	25.254
36	19.452	24.240	19.400	24.408
37	18.681	23.370	18.707	23.560
38	17.921	22.504	18.018	22.714
39	17.178	21.641	17.324	21.864
40	16.459	20.782	16.627	21.014
41	15.734	19.928	15.944	20.169
42	15.031	19.081	15.264	19.328
43	14.333	18.242	14.595	18.494
44	13.669	17.410	13.931	17.664
45	13.020	16.588	13.272	16.840
46	12.381	15.775	12.616	16.018
47	11.758	14.974	11.972	15.204
48	11.144	14.185	11.328	14.398
49	10.538	13.409	10.682	13.593
50	9.952	12.646	10.053	12.803
51	9.379	11.898	9.432	12.020
52	8.836	11.167	8.802	11.239
53	8.299	10.459	8.199	10.477
54	7.775	9.772	7.593	9.723
55	7.265	9.107	6.996	8.980
56	6.767	8.456	6.422	8.263
57	6.261	7.829	5.872	7.564
58	5.800	7.236	5.339	6.883
59	5.397	6.678	4.812	6.216
60	5.016	6.153	4.307	5.578
61	4.678	5.672	3.840	4.979
62	4.350	5.225	3.400	4.415
63	4.060	4.815	3.024	3.918
64	3.797	4.420	2.708	3.485
65	3.574	4.061	2.422	3.093
66	3.395	3.741	2.180	2.756
67	3.224	3.445	1.970	2.461
68	3.047	3.162	1.787	2.200
69	2.873	2.886	1.624	1.967
70	2.691	2.621	1.471	1.756
71	2.528	2.401	1.348	1.584
72	2.362	2.196	1.238	1.430
73	2.170	1.999	1.134	1.289
74	2.002	1.829	1.042	1.167
75	1.898	1.672	0.965	1.065
76	1.743	1.533	0.904	0.983
77	1.592	1.449	0.834	0.899
78	1.514	1.339	0.784	0.836
79	1.461	1.274	0.735	0.778
80	1.374	1.172	0.694	0.735
81	1.273	1.046	0.661	0.687
82	1.222	0.993	0.631	0.656
83	1.121	0.912	0.604	0.623
84	0.874	0.755	0.569	0.585
85	0.433	0.355	0.522	0.532

Notes:

The econometric model described by Millimet et al. (2002) and logistic regression equations by gender and education are used to calculate the worklife expectancy estimates. The model is estimated using matched CPS cohorts from the 1992–2000 time period as described in the Millimet et al. (2002) paper. The logistic equation includes independent variables for age, age squared, race, race by age interaction, race by age interaction squared, marital status, marital status by age, occupation dummies, year and year dummies. The model is first estimated separately for each gender and education level combination for active persons. The model is then estimated again for inactive persons.
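For readers who want a feel for how transition probabilities turn into a worklife expectancy figure, here is a rough two-state (active/inactive) sketch in Python. It is not the Millimet et al. (2002) procedure or our estimation code. The transition probabilities are made-up constants standing in for fitted logit predictions, mortality is ignored, and the mid-year averaging convention is an assumption chosen for simplicity.

```python
def worklife_expectancy(start_age, end_age, p_stay_active, p_become_active,
                        start_active=True):
    """Expected years active between start_age and end_age in a two-state chain.

    p_stay_active(age):   P(active at age + 1 | active at age)
    p_become_active(age): P(active at age + 1 | inactive at age)
    Both functions are hypothetical inputs; in an actual study they would come
    from logit models fitted separately for active and inactive persons.
    Mortality is ignored in this sketch.
    """
    prob_active = 1.0 if start_active else 0.0
    expected_years = 0.0
    for age in range(start_age, end_age):
        next_active = (prob_active * p_stay_active(age)
                       + (1.0 - prob_active) * p_become_active(age))
        # Credit each year with the average of start- and end-of-year activity
        # (an assumed convention, not taken from the paper).
        expected_years += 0.5 * (prob_active + next_active)
        prob_active = next_active
    return expected_years


# Illustrative, made-up probabilities (constant across ages for simplicity).
def stay(age):
    return 0.95


def enter(age):
    return 0.30


example = worklife_expectancy(26, 80, stay, enter)
print(f"Illustrative WLE for a person active at 26: {example:.2f} years")
```

Starting the same recursion with start_active=False gives the corresponding figure for a person who is currently inactive, which is why the transition model is estimated separately for active and inactive persons.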

Steward and Gaylor (2015) Matched CPS Sample Sizes for 1993-2013 time period

Overall the goal of our recent work is to update the Millimet et al (2002) worklife expectancy paper and account for more recent CPS data.

The data for all years is shown below. Ultimately there were over 590,000 data points used in the analysis.

Table 2. Matched CPS Sample Sizes 1993-2013
	Female				Male
Year	Less than High School	High School	Some College	College	Less than High School	High School	Some College	College	Total

1993	3,766	7,326	4,898	3,452	3,376	5,619	4,280	3,935	36,652
1994	3,539	7,019	5,357	3,619	3,097	5,477	4,411	4,013	36,532
1995	3,082	6,161	5,086	3,545	2,664	4,815	4,086	3,938	33,377
1997	3,079	6,172	4,771	3,488	2,723	4,857	3,926	3,723	32,739
1998	2,839	6,113	4,873	3,672	2,694	4,952	3,995	3,834	32,972
1999	2,709	6,027	4,987	3,770	2,513	4,830	4,134	3,923	32,893
2000	2,692	5,930	5,009	3,915	2,463	4,899	4,052	4,204	33,164
2001	2,545	5,806	4,971	3,901	2,458	4,919	4,232	4,016	32,848
2003	1,096	3,218	2,579	2,411	1,019	2,701	2,122	2,470	17,616
2004	2,579	6,372	5,803	5,009	2,394	5,307	4,745	4,819	37,028
2005	2,039	5,378	5,146	4,673	1,867	4,632	4,270	4,285	32,290
2006	2,297	5,500	5,608	4,657	2,131	4,953	4,263	4,389	33,798
2007	2,147	5,730	5,466	5,060	2,076	5,133	4,344	4,592	34,548
2008	2,159	5,659	5,787	5,281	2,040	5,212	4,593	4,826	35,557
2009	2,027	5,637	5,780	5,556	2,023	5,062	4,776	4,976	35,837
2011	1,845	4,844	5,106	5,136	1,786	4,603	4,176	4,432	31,928
2012	1,733	4,849	4,930	4,956	1,779	4,693	4,151	4,616	31,707
2013	1,658	4,542	5,061	5,109	1,668	4,579	4,271	4,650	31,538

Total	43,831	102,283	91,218	77,210	40,771	87,243	74,827	75,641	593,024

Notes:

The CPS data was matched using an algorithm similar to Millimet et al. (2002) and Peracchi and Welch (1995). Households in rotations 1-4 were matched using the household identifier number to the same household in rotations 5-8 of the following year. Individuals had to have the same sex and race and be a year older in rotations 5-8 to be determined a match.
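The matching rule in the note above can be sketched in a few lines of Python. This is a simplified illustration, not our production matching code: the field names (household_id, rotation, sex, race, age) are hypothetical, and dropping ambiguous matches (more than one candidate on either side) is an assumption made here for safety.

```python
from collections import defaultdict


def match_cps(year_t, year_t1):
    """Pair year-t records (rotations 1-4) with year-t+1 records (rotations 5-8).

    A pair matches when household id, sex and race agree and the person is
    exactly one year older in the second interview.
    """
    def group(records, rotations, age_shift):
        groups = defaultdict(list)
        for rec in records:
            if rec["rotation"] in rotations:
                key = (rec["household_id"], rec["sex"], rec["race"],
                       rec["age"] + age_shift)
                groups[key].append(rec)
        return groups

    first = group(year_t, range(1, 5), 1)    # rotations 1-4, age advanced a year
    second = group(year_t1, range(5, 9), 0)  # rotations 5-8, age as reported

    matches = []
    for key, recs in first.items():
        partners = second.get(key, [])
        # Keep only unambiguous one-to-one matches (an assumption of this sketch).
        if len(recs) == 1 and len(partners) == 1:
            matches.append((recs[0], partners[0]))
    return matches


# Tiny made-up example: only household H1 satisfies every condition.
year_t = [
    {"household_id": "H1", "rotation": 2, "sex": "F", "race": "W", "age": 40},
    {"household_id": "H2", "rotation": 3, "sex": "M", "race": "B", "age": 29},
    {"household_id": "H3", "rotation": 6, "sex": "M", "race": "W", "age": 50},
]
year_t1 = [
    {"household_id": "H1", "rotation": 6, "sex": "F", "race": "W", "age": 41},
    {"household_id": "H2", "rotation": 7, "sex": "M", "race": "B", "age": 31},
    {"household_id": "H3", "rotation": 8, "sex": "M", "race": "W", "age": 51},
]
matched = match_cps(year_t, year_t1)
print(len(matched), "matched individual(s)")
```

Once the pairs are built, the work status reported in each interview can be compared to classify the person as active or inactive across the year.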

Comparison of CPS matched data sets – Millimet et al. (2002) to Steward and Gaylor (2015)

Our approach is twofold. First, we matched the BLS data cohorts based on the Millimet et al. (2002) and Peracchi and Welch (1995) papers. In a nutshell, the CPS matching routine involves matching incoming and outgoing cohorts across a given year. Once the data is matched, we then look at the work status of the individuals to determine whether they were active or inactive across the year in which they were interviewed by the BLS. We were able to create a matched CPS data set of 201,797 individuals, whereas Millimet et al. (2002) found 200,916 matched individuals.

Table 1. Comparison of CPS cohort matched data sets
Year	Millimet et al. (2002)	Steward and Gaylor (2015)
1992/93	37,709	36,652
1994/95	34,418	33,377
1996/97	31,691	32,739
1997/98	32,276	32,972
1998/99	32,083	32,893
1999/2000	32,739	33,164
Total	200,916	201,797


STATA or R for data analysis in wage and hour cases?

In the stats world there is somewhat of a debate going on regarding which statistical analysis programs are “better”. Of course, the answer always depends on what you use it for. Some like the open-source, developing nature of R, while others like the established, tried-and-true STATA.

In the world of labor and employment economics, and in litigation matters that require analysis of large data sets, STATA wins hands down. The open-source nature of R is appealing in some settings, but the many decades of pre-written (and debugged) programs make STATA the best choice in most employment and wage and hour cases that require analysis of large data sets. Performing basic tabulations and data manipulations in R often requires many lines of code, while STATA often has the command built in.

Here are some interesting snippets from the web on the R v STATA debate:

http://www.researchgate.net/post/What_is_the_difference_between_SPSS_R_and_STATA_software

The main drawback of R is the learning curve: you need a few weeks just to be able to import data and create a simple plot, and you will not cease learning basic operations (e.g. for plotting) for many years. You will stumble upon weirdest problems all the time because you have missed the comma or because your data frame collapses to a vector if only one row is selected.

However, once you mastered this, you will have the full arsenal of modern cutting-edge statistical techniques at your disposal, along with in-depth manuals, references, specialized packages, graphical interface, a helpful community — and all at no cost. Also, you will be able to do stunning graphics.

http://forum.thegradcafe.com/topic/44595-stata-or-r-for-statistics-software/

http://www.econjobrumors.com/topic/r-vs-stata-is-like-a-mercedes-vs-a-bus

Selecting a weighted random sample in wage and hour analyses

In some wage and hour analyses a statistical random sample is needed to help address liability and damage issues. A sample may be required in an employer’s self-audit, regulatory investigation, or lawsuit involving FLSA, overtime, and wage and hour issues, such as unpaid meal periods.

In some instances, a weighted sampling routine may be appropriate. In this example, we are going to select a random sample of 100 employees for an employer’s self-audit of its wage and hour practices. The employer will assemble time and payroll data for the selected individuals.

The sample contains four different types of employees that work at the company. The goal is to have the employee sample be representative of the overall universe of employees at the company.

Roughly half of the employees are type I, 25% are type II, 20% are type III, and 5% are type IV. The employer maintains the data for each type of employee in separate modules of its database and must access each type of employee separately.

In this instance, some type of weighted sampling routine would be appropriate. For instance, the sample could be selected by first randomizing the employees of each type. Then a weighted sample based on the proportion of each type of employee at the company can be selected: 50 random employees of type I, 25 random employees of type II, 20 random employees of type III, and 5 random employees of type IV.
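The routine described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the employee IDs and the universe of 1,000 employees are made up, the function name is ours, and the fixed seed is there only so the draw can be reproduced and documented.

```python
import random


def proportional_sample(employees_by_type, allocation, seed=None):
    """Shuffle each employee type separately, then take the allotted number."""
    rng = random.Random(seed)
    sample = {}
    for emp_type, n in allocation.items():
        pool = list(employees_by_type[emp_type])
        if n > len(pool):
            raise ValueError(f"not enough type {emp_type} employees to sample {n}")
        rng.shuffle(pool)            # randomize the employees within this type
        sample[emp_type] = pool[:n]  # keep the first n after shuffling
    return sample


# Hypothetical universe of 1,000 employees split 50% / 25% / 20% / 5%.
universe = {
    "I": [f"I-{i}" for i in range(500)],
    "II": [f"II-{i}" for i in range(250)],
    "III": [f"III-{i}" for i in range(200)],
    "IV": [f"IV-{i}" for i in range(50)],
}
# Allocation from the example above: 100 employees in total.
allocation = {"I": 50, "II": 25, "III": 20, "IV": 5}

sample = proportional_sample(universe, allocation, seed=2015)
for emp_type, chosen in sample.items():
    print(emp_type, len(chosen))
```

Because each type is drawn separately, this approach also fits an employer that has to pull each type of employee from a different database module.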

Examining the distribution of industry payments made to medical doctors

The Centers for Medicare & Medicaid Services (CMS) new Open Payments database shows the consulting fees, research grants, travel and other reimbursements made by the medical industry to physicians in 2013.

There are 2,619,700 payments in the CMS data made to 356,190 physicians. The average payment made to physicians was $255.22. The median payment was $15.52.

Table 1: Summary of Payments – STATA output

The physicians received an average total of $1,877.11 in payments. The median total payment for the 356,190 physicians in the data was $94.15.

Table 2: Summary of Payments – STATA output

Below is the STATA code for the results:
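* physician_p and total_am are the abbreviated variable names typed in the original session
* Count all payment records
count
* Number each payment within physician; hj == 1 flags one row per physician
bysort physician_p: gen hj = _n
* Number of unique physicians
count if hj == 1
* Distribution of individual payment amounts (Table 1)
sum total_am, det
* Total payments received by each physician
bysort physician_p: egen hj2 = total(total_am)
* Distribution of per-physician totals, one row per physician (Table 2)
sum hj2 if hj == 1, det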
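For readers without STATA, the same two summaries can be sketched in Python. The payment records below are made up for illustration and are not CMS data; only the logic (summarize per payment, then total by physician and summarize again) mirrors the commands above.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical (physician id, payment amount) records, not actual CMS data.
payments = [
    ("A", 10.00), ("A", 30.00),
    ("B", 15.00),
    ("C", 200.00), ("C", 5.00), ("C", 45.00),
]

# Per-payment summary: one observation per payment.
amounts = [amount for _, amount in payments]
payment_mean = mean(amounts)
payment_median = median(amounts)

# Per-physician summary: total the payments for each physician first.
totals = defaultdict(float)
for physician, amount in payments:
    totals[physician] += amount
physician_mean = mean(totals.values())
physician_median = median(totals.values())

print(f"{len(amounts)} payments to {len(totals)} physicians")
print(f"per payment:   mean {payment_mean:.2f}, median {payment_median:.2f}")
print(f"per physician: mean {physician_mean:.2f}, median {physician_median:.2f}")
```

As in the CMS figures, the mean sits well above the median in both summaries when a few large payments dominate.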

Statistical random sampling messing time records in wage and hour litigation

Many employer time records are still in paper format that is not easily machine readable. Analyzing these data in a wage and hour case typically requires manual entry of the records. However, entering all the time records is not always feasible. In these situations a statistical random sampling of records can be useful and informative. A solid statistical sampling allows the researcher to calculate error rates, which are useful when making inferences concerning the time records.
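As a rough illustration of the kind of calculation a random sample supports, the Python sketch below estimates the share of sampled time records showing some issue and a normal-approximation margin of error around it. The counts are hypothetical, the function name is ours, and the sketch ignores any finite population correction, so treat it as a starting point rather than a recommended method for a particular case.

```python
import math


def sample_rate_interval(flagged, sample_size, z=1.96):
    """Sample proportion with a normal-approximation (Wald) confidence interval.

    flagged:     number of sampled records showing the issue being studied
    sample_size: number of records drawn at random
    z:           critical value; 1.96 corresponds to roughly 95% confidence
    """
    if sample_size <= 0 or not 0 <= flagged <= sample_size:
        raise ValueError("need 0 <= flagged <= sample_size and sample_size > 0")
    rate = flagged / sample_size
    std_error = math.sqrt(rate * (1.0 - rate) / sample_size)
    lower = max(0.0, rate - z * std_error)
    upper = min(1.0, rate + z * std_error)
    return rate, lower, upper


# Hypothetical audit: 12 of 200 randomly sampled time records show the issue.
rate, lower, upper = sample_rate_interval(12, 200)
print(f"estimated rate {rate:.1%}, approximate 95% interval {lower:.1%} to {upper:.1%}")
```

With small samples or rates near zero, an exact or Wilson interval would generally be a better choice than this approximation.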