Reading and preparation of the data from NLSY79
library(pacman)
p_load(tidyverse, magrittr, janitor, feather)
I read personal income from salaries and personal income from business, and add them together to create total income for a year.
df <- read_csv("data/nlsy79.csv") %>%
  rename(
    birthyear = R0000500,
    race = R0214700,
    gender = R0214800,
    afqt = R0618301,
    sat_math = R0619900,
    sat_verbal = R0620000,
    business98 = R6365001, wage98 = R6364601,
    business00 = R6911101, wage00 = R6909701,
    business02 = R7609000, wage02 = R7607800,
    business04 = R8318200, wage04 = R8316300,
    business06 = T0913900, wage06 = T0912400,
    business08 = T2078800, wage08 = T2076700,
    business10 = T3047500, wage10 = T3045300,
    business12 = T3979400, wage12 = T3977400,
    business14 = T4917800, wage14 = T4915800) %>%
  # Negative values in the business columns become 0
  mutate_at(vars(starts_with("business")), ~replace(., . < 0, 0)) %>%
  # Every remaining negative value becomes NA
  mutate_all(~replace(., . < 0, NA)) %>%
  # Total income per survey year = business income + wage income
  mutate(
    income98 = business98 + wage98,
    income00 = business00 + wage00,
    income02 = business02 + wage02,
    income04 = business04 + wage04,
    income06 = business06 + wage06,
    income08 = business08 + wage08,
    income10 = business10 + wage10,
    income12 = business12 + wage12,
    income14 = business14 + wage14)
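To make the two recoding steps concrete, here is a minimal sketch on made-up values. The toy data frame and its numbers are assumptions for illustration, not NLSY79 records, and the sketch uses `across()`, the newer dplyr spelling of `mutate_at()` and `mutate_all()`. Negative business values become 0, any other negative value becomes NA, and a missing wage makes the yearly total missing as well.

```r
library(dplyr)

# Hypothetical toy data; negative values stand in for the survey's negative codes
toy <- tibble(
  business98 = c(-4, 5000, -4),
  wage98     = c(30000, -2, 12000)
)

toy_clean <- toy %>%
  # Step 1: negative business income is treated as "no business income"
  mutate(across(starts_with("business"), ~ replace(.x, .x < 0, 0))) %>%
  # Step 2: any remaining negative value is treated as missing
  mutate(across(everything(), ~ replace(.x, .x < 0, NA))) %>%
  # Step 3: total income; NA in either component gives NA
  mutate(income98 = business98 + wage98)

toy_clean$income98
```

The second respondent ends up with a missing total even though business income is known, because `+` propagates NA.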
The AFQT is an intelligence test administered by the military. The participants took the test at different ages, some of them as young as 16, which influences their scores. I apply the same age correction as in this paper to account for this.
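df %<>%
  mutate(
    # Age when the test was taken, with 1980 as the test year
    age_at_afqt = 80 - birthyear,
    # Age-specific correction of the raw AFQT score
    afqt = case_when(
      age_at_afqt >= 20 ~ afqt - 13700,
      age_at_afqt == 19 ~ afqt - 10500,
      age_at_afqt == 18 ~ afqt - 9200,
      age_at_afqt == 17 ~ afqt - 8000,
      age_at_afqt <= 16 ~ afqt - 5200),
    # Rescale to mean 100, SD 15; as.numeric() drops the matrix returned by scale()
    iq = as.numeric(scale(afqt)) * 15 + 100)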
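As an illustration of the correction and the rescaling, here is a small sketch on invented scores. The five respondents and their identical raw scores are assumptions chosen so that only the age correction differs between rows. The offsets are the ones used in the step above.

```r
library(dplyr)

# Hypothetical respondents with the same raw score but different birth years
toy <- tibble(
  birthyear = c(57, 61, 62, 63, 64),
  afqt      = rep(60000, 5)
)

toy_iq <- toy %>%
  mutate(
    age_at_afqt = 80 - birthyear,            # 23, 19, 18, 17, 16
    afqt_adj = case_when(
      age_at_afqt >= 20 ~ afqt - 13700,
      age_at_afqt == 19 ~ afqt - 10500,
      age_at_afqt == 18 ~ afqt - 9200,
      age_at_afqt == 17 ~ afqt - 8000,
      age_at_afqt <= 16 ~ afqt - 5200),
    # Standardize within the sample, then map to the IQ scale
    iq = as.numeric(scale(afqt_adj)) * 15 + 100
  )

toy_iq$afqt_adj
c(mean = mean(toy_iq$iq), sd = sd(toy_iq$iq))
```

Because `scale()` standardizes within the data it is given, the resulting `iq` has mean 100 and standard deviation 15 in the sample itself, not relative to an external norm.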
I adjust the dollar values using the inflation measures found here. All amounts are expressed in 2014 dollars, so the 2014 value is set to 1 and income14 is left unchanged.
df %<>%
  mutate(
    # Multiply by (2014 index / survey-year index); income14 needs no adjustment
    income98 = income98 * 28.52 / 19.64,
    income00 = income00 * 28.52 / 20.75,
    income02 = income02 * 28.52 / 21.67,
    income04 = income04 * 28.52 / 22.76,
    income06 = income06 * 28.52 / 24.29,
    income08 = income08 * 28.52 / 25.94,
    income10 = income10 * 28.52 / 26.27,
    income12 = income12 * 28.52 / 27.66
  )
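The adjustment is a simple ratio of index values. The sketch below restates it as a lookup table plus a helper; `price_index` and `to_2014_dollars()` are hypothetical names introduced here, and the index values are copied from the step above with 28.52 as the 2014 reference.

```r
# Index values by survey year, copied from the adjustment step
price_index <- c(
  "98" = 19.64, "00" = 20.75, "02" = 21.67, "04" = 22.76, "06" = 24.29,
  "08" = 25.94, "10" = 26.27, "12" = 27.66, "14" = 28.52
)

# Hypothetical helper: convert nominal amounts from one survey year to 2014 dollars
# `year` is a single two-digit string such as "98"
to_2014_dollars <- function(amount, year) {
  amount * price_index[["14"]] / price_index[[year]]
}

to_2014_dollars(30000, "98")   # 30000 * 28.52 / 19.64, about 43,564
to_2014_dollars(30000, "14")   # unchanged, since the ratio is 1
```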
I calculate each participant's average income over the survey years recorded while they were in their 40s.
df %<>%
  mutate(
    # Average over the six survey years matching each birth cohort
    income = case_when(
      birthyear %in% c(57, 58) ~ rowMeans(select(df, c(income98, income00, income02, income04, income06, income08)), na.rm = TRUE),
      birthyear %in% c(59, 60) ~ rowMeans(select(df, c(income00, income02, income04, income06, income08, income10)), na.rm = TRUE),
      birthyear %in% c(61, 62) ~ rowMeans(select(df, c(income02, income04, income06, income08, income10, income12)), na.rm = TRUE),
      birthyear %in% c(63, 64) ~ rowMeans(select(df, c(income04, income06, income08, income10, income12, income14)), na.rm = TRUE))
  ) %>%
  # rowMeans() returns NaN when every year is missing; store that as NA
  mutate(income = ifelse(is.nan(income), NA, income))
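The last `mutate()` is needed because of how `rowMeans()` treats rows with no observed values. A minimal sketch with invented incomes for three respondents (the data frame is an assumption for illustration):

```r
# Hypothetical incomes over three survey years
incomes <- data.frame(
  income04 = c(40000, NA, NA),
  income06 = c(50000, 30000, NA),
  income08 = c(NA, 20000, NA)
)

# na.rm = TRUE averages over whichever years are present
avg <- rowMeans(incomes, na.rm = TRUE)
avg                                   # third row is NaN: nothing left to average

# Convert NaN to NA so the value is treated as ordinary missing data
avg <- ifelse(is.nan(avg), NA, avg)
avg
```

Note that with `na.rm = TRUE` an average may rest on as little as one observed year, so respondents' averages are not all based on the same number of observations.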
Many incomes are 0. These zeros may not be accurate and may instead reflect income sources that are not registered. To keep such low values from carrying too much weight, I set every average annual income below $8,000 to $8,000. (The specific value is completely arbitrary.)
df %<>% mutate(income = replace(income, income < 8000, 8000))
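A short sketch of what the floor does, on invented values. It also shows why some floor is needed before taking logs: `log(0)` is `-Inf`, which would dominate any average or regression on log income. The vector below is an assumption for illustration.

```r
# Hypothetical average incomes, including a zero and a missing value
income <- c(0, 5000, 8000, 52000, NA)

# Same call as in the step above: values below 8000 are raised to 8000
floored <- replace(income, income < 8000, 8000)
floored

# For a single floor value, pmax() gives the same result, and NA stays NA
all.equal(floored, pmax(income, 8000))

log(0)          # -Inf without a floor
log(floored)    # finite wherever income is observed
```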
Create a log-transformed income variable
df %<>% mutate(log_income = log(income))
Save the file
df %>% write_feather("data/nlsy79.f")