Each row represents a survey that a subject completed (or didn't complete). It can be very helpful when restructuring the NLS investigator extracts into a longitudinal dataset that's aligned by age (instead of by survey wave). The Age variables can help to align other response variables across subjects. While the 'SurveySource' indicates where to look for their responses.

These variables are useful to many types of analyses (not just behavior genetics), and are provided to save users time.

Format

A data frame with 580,752 observations on the following 6 variables.

  • SubjectTag see the variable of the same name in Links79Pair

  • SurveySource The location of that subject's survey responses that year. Values are NoInterview, Gen1, Gen2C or Gen2YA.

  • SurveyYear The year/wave of the survey.

  • Survey79 The exact date of the administered survey.

  • Age The subject's age at the time of the survey, rounded to 1 decimal. See Details below.

Source

Gen1 information comes from the Summer 2013 release of the NLSY79 sample. Gen2 information comes from the January 2015 release of the NLSY79 Children and Young Adults sample. Data were extracted with the NLS Investigator (https://www.nlsinfo.org/investigator/).

Details

Age uses a variable called AgeCalculateYears (according to a their own response, or their mother's response) if it's available, and uses AgeSelfReportYears (according to a their own response, or their mother's response) if not. The values usually agree, but not always. These two separated variables are available for download, but withheld from the R package (otherwise we would exceed CRAN's maximum size).

The exact date of birth isn't public (only the subject's month of birth). To balance the downward bias of two weeks, their birthday is set to the 15th day of the month to produce AgeCalculateYears.

In the Gen2 Child dataset, self-reported age is stated by month (eg, the child is 38 months old); a constant of 0.5 months has been added to balance the downward bias. In the Gen2 YA and Gen1 datasets, self-reported age is stated by year (eg, the subject is 52 years old); a constant of 0.5 years has been added.

Author

Will Beasley

Download CSV If you're using the NlsyLinks package in R, the dataset is automatically available. To use it in a different environment, download the csv, which is readable by all statistical software. links-metadata-2017-79.yml documents the dataset version information.

Examples

library(NlsyLinks) # Load the package into the current R session.

summary(Survey79)
#>    SubjectTag           SurveySource      SurveyYear        Age        
#>  Min.   :    100   NoInterview:257047   Min.   :1979   Min.   : 0.00   
#>  1st Qu.: 313900   Gen1       :236338   1st Qu.:1985   1st Qu.:18.50   
#>  Median : 619752   Gen2C      : 52038   Median :1990   Median :24.70   
#>  Mean   : 618530   Gen2YA     : 35329   Mean   :1992   Mean   :25.31   
#>  3rd Qu.: 914403                        3rd Qu.:1998   3rd Qu.:32.30   
#>  Max.   :1268600                        Max.   :2010   Max.   :54.90   
#>                                                        NA's   :257047  
table(Survey79$SurveyYear, Survey79$SurveySource)
#>       
#>        NoInterview  Gen1 Gen2C Gen2YA
#>   1979       11512 12686     0      0
#>   1980       12057 12141     0      0
#>   1981       12003 12195     0      0
#>   1982       12075 12123     0      0
#>   1983       11977 12221     0      0
#>   1984       12129 12069     0      0
#>   1985       13304 10894     0      0
#>   1986        8581 10655  4962      0
#>   1987       13713 10485     0      0
#>   1988        8064 10465  5669      0
#>   1989       13593 10605     0      0
#>   1990        7971 10436  5791      0
#>   1991       15180  9018     0      0
#>   1992        8674  9016  6508      0
#>   1993       15187  9011     0      0
#>   1994        8287  8889  6042    980
#>   1996        8494  8636  5396   1672
#>   1998        8764  8399  4896   2139
#>   2000        9755  8033  3385   3025
#>   2002        9047  7724  3189   4238
#>   2004        9058  7661  2455   5024
#>   2006        8944  7654  1756   5844
#>   2008        8936  7757  1200   6305
#>   2010        9742  7565   789   6102
table(is.na(Survey79$AgeSelfReportYears), is.na(Survey79$AgeCalculateYears))
#> < table of extent 0 x 0 >

if (require(ggplot2) & require(dplyr)) {
  dsSourceYear <- Survey79 %>%
    dplyr::count(SurveyYear, SurveySource) %>%
    dplyr::filter(SurveySource != "NoInterview")

  Survey79 %>%
    dplyr::filter(SurveySource != "NoInterview") %>%
    dplyr::group_by(SurveySource, SurveyYear) %>%
    dplyr::summarize(
      age_min = min(Age, na.rm = TRUE),
      age_max = max(Age, na.rm = TRUE)
    ) %>%
    dplyr::ungroup() %>%
    ggplot(aes(x = SurveyYear, ymin = age_min, ymax = age_max, color = SurveySource)) +
    geom_errorbar() +
    scale_color_brewer(palette = "Dark2") +
    theme_minimal() +
    theme(legend.position = c(0, 1), legend.justification = c(0, 1))
}
#> Loading required package: ggplot2
#> Loading required package: dplyr
#> 
#> Attaching package: ‘dplyr’
#> The following objects are masked from ‘package:stats’:
#> 
#>     filter, lag
#> The following objects are masked from ‘package:base’:
#> 
#>     intersect, setdiff, setequal, union
#> `summarise()` has grouped output by 'SurveySource'. You can override using the
#> `.groups` argument.
#> Warning: A numeric `legend.position` argument in `theme()` was deprecated in ggplot2
#> 3.5.0.
#>  Please use the `legend.position.inside` argument of `theme()` instead.