Baseline Sample and Data Sources
Introduction
To mount our study, we assemble and work within a large baseline sample of high school graduates in Texas.
This baseline sample provides extensive common information on all students in our treatment cohorts (i.e., high school graduates in our baseline sample who eventually enter public colleges and universities in Texas) and on all individuals in our comparison groups (i.e., high school graduates in our baseline sample whom we select for our comparison groups).
Below, we describe the limitations of, and the restrictions we impose on, this baseline sample of high school graduates, and we also describe how we enrich this sample extensively with person-level high school data, person-level postsecondary data, and person-level wage data.
Sample Restrictions and Limitations
Our baseline sample of high school graduates, which contains all of our treatment cohorts and comparison groups, has certain unavoidable limitations because of gaps in the relevant data files.
In addition, we deliberately exclude certain high school graduates from our baseline sample (and therefore from our treatment cohorts and comparison groups). In making these exclusions, we balance two competing, legitimate interests.
One interest is in producing estimates that are as accurate as possible for the students we study and in maximizing the study’s internal validity. This interest leads us to exclude students for whom we have sub-optimal data.
The other interest is in producing estimates for as many students as possible in order to maximize the study’s generalizability and external validity. This interest leads us to include students on whom we have sub-optimal information.
The rules below set out these limitations and restrictions in turn.
Rule 1: Exclude High School Graduates from Private High Schools, from Out-of-State High Schools, or if Homeschooled
Our baseline sample of high school graduates includes only graduates from public high schools in Texas.
It does not include high school graduates from private high schools in Texas, graduates from out-of-state high schools, or graduates who were homeschooled.
Because of this limitation, we study only students who entered colleges and universities in Texas after graduating from public high schools in Texas.
Rule 2: Exclude High School Graduates Who Enrolled in Postsecondary Education Out-of-State
We exclude from our baseline sample of high school graduates any high school graduate who subsequently enrolled in postsecondary education out-of-state.
We remove these high school graduates from our baseline sample (and therefore from membership in any subsequent treatment cohort or comparison group) because they have a heightened likelihood of unobservable out-of-state earnings.
Including individuals with material unobserved earnings in our treatment cohorts or comparison groups would introduce error into our findings.
Rule 3: Exclude High School Graduates Who Graduated High School Prior to 2007-08
We exclude from our baseline sample of high school graduates anyone who graduated from high school prior to 2007-08.
We make this decision because Texas’ data for high school graduates prior to 2007-08 contain no information (via the National Student Clearinghouse) on whether they enrolled in out-of-state postsecondary education after high school, a vital data point for us (see Rule 2 above).
One consequence of this restriction is that we study only treatment cohorts that arose in Texas starting in 2008-09, the year after our baseline sample begins.
Another consequence is that, in any treatment cohort that we study (regardless of when it originates), we study only students who graduated from high school in 2007-08 or later (and who, as a result, were born in approximately 1990 or later).
The table below summarizes how our choice to exclude from our baseline sample high school graduates who completed high school prior to 2007-08 limits the entry years that we can study and limits the age range of students in our cohorts.
Exhibit SD1: Entry Years and Approximate Age Range of Treatment Cohorts in Study at Beginning of Follow-up Period

Rule 4: Exclude High School Graduates with No Listed Social Security Number
We exclude from our baseline sample of high school graduates any high school graduate who does not have a valid social security number listed in the Texas data.
We make this choice because social security numbers are the mechanism by which we observe earnings of individuals in Texas’ UI wage records.
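The four rules above can be read as a single inclusion test applied to each high school graduate. The following is a minimal sketch of that reading, not the study's actual code. All field names are hypothetical, and the nine-digit check is only a stand-in for whatever validity test is applied to social security numbers in the Texas data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Graduate:
    # Hypothetical field names; the actual variable names are not given in the text.
    student_id: str
    texas_public_hs: bool        # Rule 1: graduated from a public high school in Texas
    enrolled_out_of_state: bool  # Rule 2: any out-of-state postsecondary enrollment
    grad_year_start: int         # Rule 3: 2007 stands for the 2007-08 school year
    ssn: Optional[str]           # Rule 4: social security number, if listed


def has_valid_ssn(ssn: Optional[str]) -> bool:
    # Assumed stand-in for validity: present and exactly nine digits.
    return ssn is not None and len(ssn) == 9 and ssn.isdigit()


def in_baseline_sample(g: Graduate) -> bool:
    """Return True only if the graduate passes all four rules."""
    return (
        g.texas_public_hs
        and not g.enrolled_out_of_state
        and g.grad_year_start >= 2007
        and has_valid_ssn(g.ssn)
    )
```

Because the rules are combined with a logical "and", the order in which they are applied does not change who remains in the sample.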
Merging High School Data (TEA)
For each high school graduate in our baseline sample, we compile data from the Texas Education Agency (TEA).
Data Captured from Texas Education Agency (TEA)
- Date of birth
- Race / ethnicity
- Gender
- High school attended
- High school attendance record
- High school disciplinary record
- Share of high school courses failed
- Number of core AP (Advanced Placement) courses passed
- High school IEP (individualized education plan) status
- High school LEP (limited English proficiency) status
- High school “economically disadvantaged” status (based on eligibility for the US Department of Agriculture’s “free or reduced-price lunch” program, the US Department of Health and Human Services’ “Temporary Assistance for Needy Families” program, and other subsidies)
- English and math standardized test scores
For test scores, we capture data from the STAAR test of Algebra I (usually taken in 9th grade), the STAAR test of English II (usually taken in 10th grade), the TAKS test of 9th grade math, and the TAKS test of 10th grade reading.
We do not capture students’ high school GPAs because they are not included in the TEA data files.
Merging Postsecondary Data (THECB, NSC, IPEDS)
For each high school graduate in our baseline sample, we merge data from the Texas Higher Education Coordinating Board (THECB), from the National Student Clearinghouse (NSC), and from the Integrated Postsecondary Education Data System (IPEDS).
Data Captured from the Texas Higher Education Coordinating Board (THECB)
- Institutions entered and dates of enrollment (in Texas)
- Degrees and credentials attained and dates of attainment (in Texas)
- Course enrollments and credit loads, per semester (in Texas)
- Admissions information for students in certain 4-year institutions (in Texas)
- Federal, state, and institution-issued grant aid (THECB’s Financial Aid Database Report)
- Waivers or exemptions from tuition and fees (THECB’s Financial Aid Database Report)
Data Captured from the National Student Clearinghouse (NSC)
- Across the US, institutions entered and dates of enrollment
- Across the US, degrees and credentials attained and dates of attainment
Data Captured from the Integrated Postsecondary Education Data System (IPEDS)
- For any Texas institution attended, published tuition and fee schedules (in-state, out-of-state, and in-district). Some 2-year institutions in Texas have discounts for local (“in-district”) students.
- For any Texas institution attended, institution-wide selectivity, Pell Grant share, total enrollment, and other institution-level descriptors.
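The postsecondary merge combines two kinds of data: person-level records (THECB, NSC) and institution-level records (IPEDS). One way to picture this is sketched below, under the assumption that person-level records join on a person identifier and institution-level descriptors join on an institution identifier. The keys `person_id` and `institution_id`, and the function name, are hypothetical and do not come from the data files themselves.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def enrich_with_postsecondary(
    baseline: List[Record],
    enrollments_by_person: Dict[str, List[Record]],
    ipeds_by_institution: Dict[str, Record],
) -> List[Record]:
    """Attach enrollment records, each with institution descriptors, to every graduate."""
    enriched = []
    for person in baseline:
        record = dict(person)  # copy so the baseline input is not modified
        # Graduates with no postsecondary record are kept, with an empty list.
        enrollments = enrollments_by_person.get(person["person_id"], [])
        record["enrollments"] = [
            # Add institution-level fields (e.g. tuition) to each enrollment record.
            {**e, **ipeds_by_institution.get(e["institution_id"], {})}
            for e in enrollments
        ]
        enriched.append(record)
    return enriched
```

The sketch keeps every graduate in the baseline sample whether or not a postsecondary record is found, since graduates who never enroll remain in the sample from which comparison groups are selected.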
Merging Earnings Data (TWC)
For each high school graduate in our baseline sample, we merge quarterly earnings data available to us from the Texas Workforce Commission (TWC).
We tabulate earnings data for students starting in the quarter after their enrollment. This means, for example, that our first-year earnings tabulations for a student who enrolls in September of a given year and for a student who enrolls in January of a given year will both cover four quarters of earnings and will both start in the quarter immediately after their actual enrollment.
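The window logic can be sketched as follows. This is an illustrative reading rather than the study's actual procedure. It assumes calendar quarters, and it treats a quarter with no wage record as zero earnings, which is an assumption the text does not state. The function names are hypothetical.

```python
from datetime import date
from typing import Dict, List, Tuple

Quarter = Tuple[int, int]  # (year, quarter number 1-4)


def quarter_of(d: date) -> Quarter:
    """Calendar quarter containing the given date."""
    return d.year, (d.month - 1) // 3 + 1


def next_quarter(q: Quarter) -> Quarter:
    year, number = q
    return (year + 1, 1) if number == 4 else (year, number + 1)


def first_year_window(enroll_date: date) -> List[Quarter]:
    """The four quarters that follow the quarter of enrollment."""
    window = []
    q = quarter_of(enroll_date)
    for _ in range(4):
        q = next_quarter(q)
        window.append(q)
    return window


def first_year_earnings(enroll_date: date, earnings: Dict[Quarter, float]) -> float:
    # Assumption: quarters with no wage record count as zero earnings.
    return sum(earnings.get(q, 0.0) for q in first_year_window(enroll_date))
```

Under these assumptions, a student enrolling in September 2010 has a first-year window of 2010 Q4 through 2011 Q3, and a student enrolling in January 2011 has a window of 2011 Q2 through 2012 Q1. Both windows span four quarters. The dates are illustrative only.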
When students have extreme outlier earnings, we cap those earnings entries at the 99th percentile of earnings. We do this to mitigate the distortionary effect that occasional outlier earnings values would have on our estimates.
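Capping at the 99th percentile (often called top-coding or one-sided winsorizing) can be sketched as below. The text does not say how the percentile is defined or over which set of earnings entries it is computed (for example, per quarter or pooled), so the nearest-rank definition and the single pooled list used here are assumptions made for illustration.

```python
from typing import List


def percentile_nearest_rank(values: List[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of entries at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def cap_at_99th_percentile(earnings: List[float]) -> List[float]:
    """Replace any entry above the 99th percentile with the 99th percentile value."""
    cap = percentile_nearest_rank(earnings, 99)
    return [min(value, cap) for value in earnings]
```

Entries at or below the cap are left unchanged, so only the extreme upper tail is affected and no observations are dropped.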
Texas UI wage records, while they contain information on a large majority (>90%) of earnings and workers in Texas, have limitations. For example, they do not record earnings by the self-employed, earnings by federal employees, or out-of-state earnings.