Fun With R Coding

By Jenny Listman

This project uses a Recursive Partitioning And Regression Tree analysis from the rpart package to examine the effect of 49 gun law subcategories on annual State firearm death rates.

Ten subcategories were found to significantly predict increases or decreases in State gun death rates. Data files needed to run this code can be found in my GitHub repo associated with the project. Read the blog post about it here.

Load packages
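
The package list below is inferred from the functions called later in this post, so treat it as an assumption rather than the exact set used in the project. It is a minimal sketch that reports anything not yet installed and attaches the rest.

```r
# Packages inferred from the calls used in this post (an assumption):
# rpart (rpart, prune, printcp), dplyr (mutate, add_count, %>%),
# tibble (rowid_to_column), readxl (reading the .xlsx codebook),
# janitor (clean_names), Metrics (rmse), rlist (list.save)
pkgs <- c("rpart", "dplyr", "tibble", "readxl", "janitor", "Metrics", "rlist")

# Find anything that still needs install.packages()
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]
if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}

# Attach the packages that are available
for (p in setdiff(pkgs, missing_pkgs)) {
  library(p, character.only = TRUE)
}
```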


Obtain and read in datasets.

  1. Gun law data: firearm laws across US States from 1991 to 2017, downloaded from The State Firearm Laws Project together with its codebook.

Presence/absence of a given gun law in a state is coded as “1” or “0”. R reads these and “Year” as numeric, so they will need to be changed to factor variables. State is already read as a factor variable (in R 4.0 and later, read.csv reads it as character unless stringsAsFactors = TRUE is set).

  2. CDC data: firearm deaths by State from 1999 to 2016. On the CDC Wonder website, sort by State, Year, and cause of injury = Firearm. This produces a file that contains the crude death rate per 100,000 state residents due to firearms per state per year. This was saved as StateGunDeathRate.csv.

StateGunDeathRate <- read.csv("./data/StateGunDeathRate.csv")

gunlawdata <- read.csv("./data/gunlawdata.csv")
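
The numeric-to-factor conversion mentioned above can be sketched on a toy data frame. The column names here (felony, permit) are illustrative stand-ins, not a claim about the real file's layout.

```r
# Toy stand-in for gunlawdata; column names are illustrative only
toy_laws <- data.frame(
  state  = c("Alabama", "Alabama", "Alaska", "Alaska"),
  year   = c(1999, 2000, 1999, 2000),
  felony = c(1, 1, 0, 1),
  permit = c(0, 0, 0, 1),
  stringsAsFactors = TRUE
)

# Every column except state was read as numeric
num_cols <- names(toy_laws)[sapply(toy_laws, is.numeric)]

# Convert those columns to factors in one step
toy_laws[num_cols] <- lapply(toy_laws[num_cols], factor)

str(toy_laws)
```

One thing to keep in mind: rowSums needs numeric input, so a conversion like this would have to come after any step that adds up the 0/1 law columns.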

Gunlawdata contains data going back to 1991, but CDC data only goes back to 1999. Remove years before 1999 from gunlawdata.

gunlaws <- subset(gunlawdata, year > 1998)

Add a total for each category of gun law from the codebook and create new variables. Add the death rate per 100,000 to the dataframe as the outcome variable.

The codebook includes each law along with its category and subcategory, which have been coded by The State Firearm Laws Project research team. Some of the law categories are quite broad and the 133 individual laws may be redundant or correlated. I have used law subcategories, but also ran the same analysis using individual laws and again with categories. Using subcategories performs about the same as using individual laws as input, but using the 14 broader categories makes less accurate predictions. Perhaps the categories (14) lose information but the subcategories (49) keep most of the information provided by the individual laws.

Counts of gun laws present per State per year must be calculated for each subcategory, but the subcategories contain different numbers of laws. Make a list of lists that contains, for each subcategory name, a list of the column names for the laws in that subcategory. These will be used to sum only the columns that fall within a given subcategory.

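# The reader call is missing from this line; read_excel() from readxl is an assumption
codebook <- as.data.frame(readxl::read_excel("./data/gunlaw_codebook.xlsx"))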

codebook$`Sub-Category` <- as.factor(codebook$`Sub-Category`)

codebook <- codebook %>%
        add_count(`Sub-Category`) %>%
        rowid_to_column() %>%
        mutate(sublevel = as.numeric(`Sub-Category`))

singles <- subset(codebook, n == 1)
single_categories <- data.frame()

# Collect the subcategories that contain only one law
for (i in 1:nrow(singles)){
        single_categories <- rbind(single_categories, singles[i,c(3,4,6)])
}

single_categories <- single_categories %>%
        mutate(mutation = paste0(`Sub-Category`,"=" ,`Variable Name` ))

# For each of the 49 subcategories, store the names of the law columns it contains
codenames <- list()
for (i in 1:49){
        lawnames <- as.list(subset(codebook, 
                                   sublevel == i)$`Variable Name`)
        codenames[[length(codenames)+1]] <- lawnames
}

colnames <- names(gunlaws)
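
As a side note, the same kind of lookup can be built more compactly with split(). This is a sketch on a toy codebook with hypothetical law names (law_a, law_b, ...), not the code used for the analysis.

```r
# Toy codebook; the real columns are `Sub-Category` and `Variable Name`
toy_codebook <- data.frame(
  subcategory = c("Permitting", "Storage", "Permitting", "Felony", "Storage"),
  variable    = c("law_a", "law_b", "law_c", "law_d", "law_e"),
  stringsAsFactors = FALSE
)

# One character vector of law columns per subcategory, named by subcategory
law_lookup <- split(toy_codebook$variable, toy_codebook$subcategory)

# Subcategories that hold a single law
single_law <- names(law_lookup)[lengths(law_lookup) == 1]

law_lookup
single_law
```

A named list has the advantage that each subcategory can be looked up by name instead of by its position among the 49 levels.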

Now add firearm death rate data to gun law data. Then, using mutate and rowSums with the codenames list of lists, make new variables for the total number of laws within each subcategory by State and year. Use janitor clean_names to get rid of spaces and capital letters in variable names.

gunlawsevensplit <-  gunlaws %>%
        subset(year != 2017) %>%
        merge(StateGunDeathRate[,c(1,2,6)], by.x = c("state", "year"), by.y = c("State", "Year")) %>%
        mutate(age_restrictions = rowSums(.[,colnames(.) %in% unlist(codenames[[1]])])
               ) %>%
        mutate(Alcohol = rowSums(.[,colnames(.) %in% unlist(codenames[[2]])])
               ) %>%
        mutate(`Assault weapons ban` = rowSums(.[,colnames(.) %in% unlist(codenames[[3]])])
               ) %>%
        mutate(`Background check records`=backgroundpurge) %>%
        mutate(`Background checks` = rowSums(.[,colnames(.) %in% unlist(codenames[[5]])])
               ) %>%
        mutate(`Background checks mental health records`=mentalhealth) %>%
        mutate(`Background checks state records` = rowSums(.[,colnames(.) %in% unlist(codenames[[7]])])
               ) %>%
        mutate(`Background checks through permits` = rowSums(.[,colnames(.) %in% unlist(codenames[[8]])])
               ) %>%
        mutate(`Background checks time limit`=threedaylimit) %>%
        mutate(`Bulk purchase limit`=onepermonth) %>%
        mutate(`Campus carry` = rowSums(.[,colnames(.) %in% unlist(codenames[[11]])])
               ) %>%
        mutate(`Crime gun identification`=microstamp) %>%
        mutate(Drugs=drugmisdemeanor) %>%
        mutate(Felony=felony) %>%
        mutate(Fingerprinting=fingerprint) %>%
        mutate(`Firearm removal` = rowSums(.[,colnames(.) %in% unlist(codenames[[16]])])
               ) %>%
        mutate(`Gun shows` = rowSums(.[,colnames(.) %in% unlist(codenames[[17]])])
               ) %>%
        mutate(`Gun trafficking` = rowSums(.[,colnames(.) %in% unlist(codenames[[18]])])
               ) %>%
        mutate(`Gun violence restraining orders` = rowSums(.[,colnames(.) %in% unlist(codenames[[19]])])
               ) %>%
        mutate(Immunity=immunity) %>%
        mutate(Inspections=inspection) %>%
        mutate(`Junk guns`=junkgun) %>%
        mutate(`Large capacity magazine ban` = rowSums(.[,colnames(.) %in% unlist(codenames[[23]])])
               ) %>%
        mutate(Liability=liability) %>%
        mutate(Licensing = rowSums(.[,colnames(.) %in% unlist(codenames[[25]])])
               ) %>%
        mutate(Location=residential) %>%
        mutate(`Mental Health` = rowSums(.[,colnames(.) %in% unlist(codenames[[27]])])
               ) %>%
        mutate(`Misdemeanor crimes` = rowSums(.[,colnames(.) %in% unlist(codenames[[28]])])
               ) %>%
        mutate(`Open carry` = rowSums(.[,colnames(.) %in% unlist(codenames[[29]])])
               ) %>%
        mutate(Permitting = rowSums(.[,colnames(.) %in% unlist(codenames[[30]])])
               ) %>%
        mutate(`Personalized gun technology`=personalized) %>%
        mutate(Preemption = rowSums(.[,colnames(.) %in% unlist(codenames[[32]])])
               ) %>%
        mutate(Prohibitors=ammrestrict) %>%
        mutate(Recordkeeping = rowSums(.[,colnames(.) %in% unlist(codenames[[34]])])
               ) %>%
        mutate(Registration = rowSums(.[,colnames(.) %in% unlist(codenames[[35]])])
               ) %>%
        mutate(Reporting = rowSums(.[,colnames(.) %in% unlist(codenames[[36]])])
               ) %>%
        mutate(`Restraining order` = rowSums(.[,colnames(.) %in% unlist(codenames[[37]])])
               ) %>%
        mutate(`Safety locks` = rowSums(.[,colnames(.) %in% unlist(codenames[[38]])])
               ) %>%
        mutate(`Safety training`=training) %>%
        mutate(`School zones`=elementary) %>%
        mutate(Security=security) %>%
        mutate(Stalking=stalking) %>%
        mutate(`No stand your ground`=nosyg) %>%
        mutate(Storage = rowSums(.[,colnames(.) %in% unlist(codenames[[44]])])
               ) %>%
        mutate(`Straw purchase` = rowSums(.[,colnames(.) %in% unlist(codenames[[45]])])
               ) %>%
        mutate(`Theft reporting` = rowSums(.[,colnames(.) %in% unlist(codenames[[46]])])
               ) %>%
        mutate(`Universal background checks` = rowSums(.[,colnames(.) %in% unlist(codenames[[47]])])
               ) %>%
        mutate(`Violent Misdemeanor` = rowSums(.[,colnames(.) %in% unlist(codenames[[48]])])
               ) %>%
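        mutate(`Waiting period` = rowSums(.[,colnames(.) %in% unlist(codenames[[49]])])
               ) %>%
        # janitor: lower-case, underscore-separated names (e.g. "Waiting period" becomes waiting_period)
        clean_names()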
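
The long chain above spells out one mutate per subcategory, with single-law subcategories assigned directly because rowSums needs at least two dimensions. A loop over a named lookup is one possible alternative. The sketch below uses toy data and hypothetical law names, so it only illustrates the idea.

```r
# Toy data: three state-year rows and three 0/1 law columns (hypothetical names)
toy_laws <- data.frame(
  state = c("A", "A", "B"),
  year  = c(1999, 2000, 1999),
  law_a = c(1, 1, 0),
  law_b = c(0, 1, 0),
  law_c = c(1, 1, 1)
)

# Lookup from subcategory name to the law columns it contains
law_lookup <- list(
  Permitting = c("law_a", "law_b"),
  Felony     = "law_c"
)

for (subcat in names(law_lookup)) {
  # drop = FALSE keeps a one-column selection as a data frame,
  # so rowSums() also works for single-law subcategories
  law_cols <- toy_laws[, law_lookup[[subcat]], drop = FALSE]
  toy_laws[[subcat]] <- unname(rowSums(law_cols))
}

toy_laws
```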

By default, rpart runs a 10-fold cross-validation, so we don’t have to make a validation data set or add in a cross-validation step. Scale the input variables and separate the data into a train and test set. Typically, this is done by randomly selecting 20% of the data as a test set and 80% as a train set. However, for this data set, it is likely that there are correlations across States for a given year, across years for a given State, or among years within each small range of years. Therefore, the data set was split by odd or even year into two sets, each balanced for State and free of early/recent year bias. A repeat analysis with the typical 20/80 random split gave very similar results.

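ml_lawsubcategories <- gunlawsevensplit[,c(1:2,137:186)] %>%
        mutate(state_year = paste0(state, year)) %>%
        # z-score the 49 subcategory totals (columns 4 to 52)
        mutate(across(4:52, ~ as.vector(scale(.x))))

# Odd years form the train set and even years the test set.
# Assumption: the step after each subset() is missing from the source, so the identifier
# columns are dropped here to leave only the 49 law subcategories as predictors.
ml_lawsubcategories_train <- subset(ml_lawsubcategories, year %in% c(1999,2001,2003,2005,2007,2009,2011,2013,2015)) %>%
        select(-state, -year, -state_year)
ml_lawsubcategories_test <- subset(ml_lawsubcategories, year %in% c(2000,2002,2004,2006,2008,2010,2012,2014,2016)) %>%
        select(-state, -year, -state_year)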
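
The scale-then-split step can be tried on a toy data set. This sketch uses made-up numbers and splits on year parity with the modulo operator instead of listing the years; it assumes year is still numeric.

```r
# Toy data: two states, six years each (values are made up)
toy <- data.frame(
  state      = rep(c("A", "B"), each = 6),
  year       = rep(1999:2004, times = 2),
  crude_rate = c(14.1, 14.8, 15.0, 15.6, 16.2, 16.9,
                 19.8, 20.1, 19.5, 20.9, 21.3, 22.0),
  permitting = rep(c(0, 1, 2), times = 4)
)

# z-score the predictor; as.vector() drops the matrix attributes scale() adds
toy$permitting <- as.vector(scale(toy$permitting))

# year must be numeric here; for a factor use as.numeric(as.character(year))
toy_train <- subset(toy, year %% 2 == 1)   # odd years
toy_test  <- subset(toy, year %% 2 == 0)   # even years

# Each state contributes the same number of rows to both sets
table(toy_train$state)
table(toy_test$state)
```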

Run the model and look at the results. Predict values for the test set and calculate the root mean squared error (RMSE). The Metrics package includes an rmse function. Print the model and the cp (complexity parameter) table.

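# Fit a regression tree (method = "anova") on the train set
lawsubcategories_model <- rpart(formula = crude_rate ~ ., data = ml_lawsubcategories_train, method = "anova")

# Predict death rates for the held-out even years
lawsubcategories_pred <- predict(object = lawsubcategories_model,
                         newdata = ml_lawsubcategories_test)

rmse_subcategories <- rmse(actual = ml_lawsubcategories_test$crude_rate, 
     predicted = lawsubcategories_pred)
print(rmse_subcategories)

# Print the fitted tree and its cp table
print(lawsubcategories_model)
printcp(lawsubcategories_model)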
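
For reference, the root mean squared error is simple enough to compute by hand. The helper name rmse_by_hand and the numbers below are made up for illustration.

```r
# RMSE: square the errors, average them, take the square root
rmse_by_hand <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Made-up death rates and predictions
actual    <- c(10, 12, 9, 15)
predicted <- c(11, 12, 7, 14)

result <- rmse_by_hand(actual, predicted)
result
```

Because the errors are squared before averaging, the result is in the same units as the outcome (deaths per 100,000) and large misses weigh more than small ones.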


Prune the tree to find the optimal model that avoids overfitting: pick the cp value with the lowest cross-validated error (xerror) and prune at that value. It turns out to be the same as the above model.

opt_index <- which.min(lawsubcategories_model$cptable[, "xerror"])
cp_opt <- lawsubcategories_model$cptable[opt_index, "CP"]

lawsubcategories_model_opt <- prune(tree = lawsubcategories_model, 
                         cp = cp_opt)
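# text() adds labels to an existing plot, so draw the tree first
plot(lawsubcategories_model_opt, uniform = TRUE)
text(lawsubcategories_model_opt, use.n=TRUE, all=TRUE, cex = 0.5)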
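
To see what the cp selection does, here is a sketch on a hand-made cp table. The numbers are invented; only the column names match what rpart stores in cptable.

```r
# Toy cp table with the same column names rpart uses in model$cptable
toy_cptable <- cbind(
  CP          = c(0.40, 0.10, 0.05, 0.01),
  nsplit      = c(0, 1, 2, 4),
  "rel error" = c(1.00, 0.60, 0.50, 0.40),
  xerror      = c(1.02, 0.70, 0.62, 0.66),
  xstd        = c(0.08, 0.07, 0.06, 0.06)
)

# Row with the lowest cross-validated error
opt_index <- which.min(toy_cptable[, "xerror"])

# The cp value to prune at
cp_opt <- toy_cptable[opt_index, "CP"]

opt_index
cp_opt
```

In this toy table the lowest xerror is in the third row, so pruning would cut the tree back to two splits. When the lowest xerror sits in the last row instead, pruning at that cp would generally return the tree unchanged, which is consistent with the pruned model here matching the original one.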

I need to translate the z-scored values back to the original values they represent in order to make an infographic. The plotting options in rpart, visNetwork, and networkD3 either didn’t produce the exact tree that I wanted or didn’t have the editing options I was looking for. The split values in the tree are z-scores, which are now detached from the original law counts and don’t tell us anything useful on their own.

I saved the z-scored version of the test set with another name: z_scored_test_data. I sometimes make long-ish variable names to dummy-proof things for myself. Create a version of the test dataset without z-scores. Combine this with the z-scored data using cbind and sort by variable name using order so z-scored & original versions of each variable are adjacent.

Then create a decoder that is a list of dataframes, one dataframe for each law subcategory. Each dataframe in the list contains one column of z-scores and one column of original values (total count of laws in that subcategory). Use list.save from the rlist package to save the output to a file. A copy of the decoder, zscore_decoder.rds, can be found in the data folder of the associated GitHub repository.

z_scored_test_data <- ml_lawsubcategories_test

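# Assumption: columns 138:186 of gunlawsevensplit are the 49 subcategory totals
# (the same columns that were scaled). The steps after each dangling pipe are
# missing from the source, so the column selections below are a reconstruction.
subcategory_names <- names(gunlawsevensplit)[138:186]

no_z_scores <- gunlawsevensplit[,c(1:2,137:186)] %>% 
        subset(year %in% c(2000,2002,2004,2006,2008,2010,2012,2014,2016)) %>%
        .[ , subcategory_names]

names <- names(cbind(no_z_scores, z_scored_test_data[ , subcategory_names]))

# order() keeps ties in their original order, so each original column is
# followed directly by its z-scored twin
z_score_codes <- cbind(no_z_scores, z_scored_test_data[ , subcategory_names]) %>%
        .[ , order(names)]

# Store every value as a factor (via character) so unique() compares exact labels
z_score_codes[] <- lapply(z_score_codes, function(x) as.factor(as.character(x)))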

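decoder <- list()
j <- 1
k <- 2

# Columns come in (original, z-score) pairs, so step through them two at a time
for (i in 1:49){
        decoder[[i]] <- unique(z_score_codes[,c(j, k)])
        j <- j+2
        k <- k+2
}

list.save(decoder, "zscore_decoder.rds")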
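
A lookup table is one way to decode the z-scores. Another option, sketched here on made-up counts, is to invert the scaling directly with the mean and standard deviation that scale() stores as attributes. The helper name z_to_count is hypothetical.

```r
# Made-up law counts for one subcategory
law_counts <- c(0, 1, 3, 2, 0, 4)

# scale() records the mean and standard deviation it used
scaled <- scale(law_counts)
center <- attr(scaled, "scaled:center")
spread <- attr(scaled, "scaled:scale")
z      <- as.vector(scaled)

# Invert the z-score: original = z * sd + mean
z_to_count <- function(z_value) z_value * spread + center

# A split threshold from the tree can be mapped back the same way
z_to_count(0.2)

# Round trip: the recovered values match the original counts
recovered <- z_to_count(z)
all.equal(recovered, law_counts)
```

Since the variables in this analysis were scaled before the train/test split, the mean and standard deviation would need to come from the full data set, not from the test set alone.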