---
title: "Reading Legacy Census Redistricting Data"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Reading Legacy Census Redistricting Data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
options(tinytiger.curl_quiet=TRUE)
```

The US Census Bureau releases redistricting data (P.L. 94-171) following the decennial census. This article walks through the basics of reading that data and getting it into working form.

# The P.L. 94-171 Legacy File

The redistricting data mandated by P.L. 94-171 is composed of [six tables: five of population characteristics and one of housing characteristics](https://www2.census.gov/programs-surveys/decennial/2020/technical-documentation/complete-tech-docs/summary-file/2020Census_PL94_171Redistricting_StatesTechDoc_English.pdf). In the legacy data format, these six tables (and dozens of levels of geography) are split into four files: File01, File02, File03, and the Geographic Header Record. Files 01--03 have the actual decennial census data for each table, while the geographic header has geographic identifiers (FIPS code, precinct IDs, etc.) and information (area, water, etc.).

![P.L. 94-171 file layout](../man/figures/pl94171_layout.png)

These four files have the same rows, each of which is identified by a `LOGRECNO` number. Combining the columns of the four files, matched on `LOGRECNO`, yields the full P.L. 94-171 file.

The basic unit of Census geography is the block; all other geographies are constructed from blocks. But the P.L. 94-171 file is not composed of blocks alone. The Census Bureau has already tabulated the statistics in the six tables at every available level of geography. To get information for a single geography level (blocks, counties, or school districts, for example), one needs only to subset to the rows that correspond to that level, which is coded in the `SUMLEV` column.

So the basic process for working with legacy P.L.
94-171 data is:

1. Read in the four P.L. files (`pl_read()`, `pl_url()`)
1. Combine the four files into one, and subset to the desired geography level (`pl_subset()`; geography level codes listed in `pl_geog_levels`)
1. Select the desired variables from the six included tables (`pl_select_standard()`)
1. Optionally, combine the processed data with the corresponding [`tigris` shapefile](https://cran.r-project.org/package=tigris).

The utility function `pl_tidy_shp()` combines all of these steps into one function for the most common use case: tabulating basic redistricting information at the block level. This is demonstrated in [the README](../index.html).

# Using the `PL94171` package

```{r setup, message=FALSE, warning=FALSE}
library(PL94171)
```

The four components of the P.L. 94-171 file should be downloaded into their own directory. Here, we'll use the example data included in the package, from the 2018 end-to-end Census test in Providence County, Rhode Island, and read it into our R session. In general, you can provide a URL in place of a file path, and the package will read the data from the URL. The `pl_url()` function will automatically construct the URL to the data for a given state and year.

```{r}
# `extdata/ri2018_2020Style.pl` is a directory with the four P.L. 94-171 files
path <- system.file("extdata/ri2018_2020Style.pl", package = "PL94171")
pl_raw <- pl_read(path)
# try `pl_read(pl_url("RI", 2010))`
```

This creates a large list in which each P.L. 94-171 file component is a separate entry. If we look at the top of one of these entries, we see the same structure as in the schematic above: each redistricting variable is a column, the rows are indexed by `LOGRECNO`, and various levels of aggregation are all included as different sets of rows in the same table (notice the countywide population counts in the first two rows).


```{r}
head(pl_raw$`00003`)
```

To subset to a desired geography level, we must first identify the corresponding `SUMLEV` code.

```{r}
print(pl_geog_levels)
```

Here, we'll look at Census tracts, which are `SUMLEV=140`.

```{r}
pl <- pl_subset(pl_raw, sumlev="140")
print(dim(pl))
```

We see that all four components have been combined into one large table, with one row for each of the seven Census tracts in the example file. To extract [commonly-used variables](../reference/pl_select_standard.html) from the 397 columns, we can run the following:

```{r}
pl <- pl_select_standard(pl, clean_names = TRUE)
print(pl)
```

Above, we set `clean_names = TRUE`, which is the default. This creates a set of variables familiar to the `redist` family of packages.

To combine these data with a shapefile, we can use the `tinytiger` package. The `GEOID` column is shared between the P.L. 94-171 data and the TIGER shapefiles from `tinytiger`.

```{r message=FALSE, warning=FALSE, eval = FALSE}
library(tinytiger)
library(sf)
library(dplyr)
library(ggplot2)

ri_tracts <- tt_tracts("RI", county="Providence", year=2020)
```

```{r, echo = FALSE}
library(tinytiger)
library(sf)
library(dplyr)
library(ggplot2)

# Call `fn(...)`, retrying on failure up to `max_iter` more times.
# Returns NULL if every attempt fails.
with_retry <- function(fn, ..., max_iter = 5) {
  out <- NULL
  i <- 1
  try({out <- fn(...)}, silent = TRUE)
  while (i <= max_iter && is.null(out)) {
    Sys.sleep(0.5)
    try({out <- fn(...)}, silent = TRUE)
    i <- i + 1
  }
  out
}

ri_tracts <- with_retry(fn = tt_tracts, state = "RI", county = "Providence", year = 2020)
```

Then we can join the shapes and data:

```{r, eval = !is.null(ri_tracts)}
full_join(pl, ri_tracts, by="GEOID") %>%
  ggplot(aes(fill=pop, geometry=geometry)) +
  geom_sf(size=0) +
  theme_void()
```
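
Because a full join keeps unmatched rows from both tables, a tract whose `GEOID` appears in only one of them would end up with missing data or an empty geometry. The sketch below shows one way to check for such mismatches before plotting. It is only an illustration: `toy_pl` and `toy_shp` are made-up data frames with invented `GEOID` values, and base R's `merge()` stands in for `full_join()` so that the example does not depend on the objects created above.

```r
# Toy stand-ins for the P.L. table and the tract shapes.
# All GEOID and population values are invented for illustration.
toy_pl <- data.frame(
  GEOID = c("44007000101", "44007000102", "44007000200"),
  pop   = c(4100L, 3800L, 5200L),
  stringsAsFactors = FALSE
)
toy_shp <- data.frame(
  GEOID = c("44007000101", "44007000102", "44007000300"),
  stringsAsFactors = FALSE
)

# GEOIDs that appear in one table but not the other
only_in_pl  <- setdiff(toy_pl$GEOID, toy_shp$GEOID)
only_in_shp <- setdiff(toy_shp$GEOID, toy_pl$GEOID)

# all = TRUE keeps unmatched rows from both sides, like a full join
joined <- merge(toy_pl, toy_shp, by = "GEOID", all = TRUE)

print(only_in_pl)
print(only_in_shp)
print(joined)
```

With the real objects from this article, the analogous check would compare `pl$GEOID` against `ri_tracts$GEOID`, assuming both are stored as character vectors.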