Fun With R Coding

By Jenny Listman

Analyze NYC 311 Open Data and Make an Animated Choropleth Map

New York City maintains an open data portal making data from many City agencies publicly available. If you’re looking for a good dataset to use for practicing data manipulation, analysis, or visualization techniques, it is a gold mine. NYC’s 311 system takes calls or online complaints and questions about any non-emergency matter, all of which are available on the NYC Open Data website. NYC 311 has a rodent sighting category and rat sub-category, which I used for this project.

You can read my Medium blog post about rats in NYC and some of the factors that might contribute to seasonal changes and Borough or neighborhood differences in rat 311 reports.

This post describes the R code used to process, analyse, and visualize the data.

Load packages used

You will also have to instal ImageMagick on your computer (not an R package) which is wrapped by magick for use in R. It took me a bit of time to figure this out. You might want to start here.

library(tidyverse)
library(data.table)
library(janitor)
library(ggplot2)
library(gganimate)
library(choroplethr)
library(choroplethrZip)
library(magick)
library(filesstrings)

You can use data stored in my GitHub repository associated with this project or you can download an updated version of 311 data from NYC Open Data.

Download NYC 311 data on rat and mouse sightings. I dowloaded data from January, 2010 through July, 2017 and saved it as Rat_and_Mouse_Sightings.csv.

This was read into R with data.table::fread, which is faster than read.csv.

RatMouseSightings <- fread("Rat_and_Mouse_Sightings.csv")

This results in a large data set with many variables not needed for this project.

Look at the first few rows to understand the variables we have.

View(head(RatMouseSightings))

Make a smaller dataframe by removing unneeded columns keeping only “Created Date”, “Descriptor”,“Incident Zip”, and “Borough”. Look at the first few rows again to make sure the correct variables were kept.

RatMouseData <- as.data.frame(RatMouseSightings[,c(2,7,9,25)])
View(head(RatMouseData))

Remove the super large file from memory, since we’re now working with a file that only contains 4 of the original 53 columns (variables).

rm(RatMouseSightings)

Fix the variable names to remove spaces using janitor::clean_names. Then turn multiple variables from character to factor, at the same time.

RatMouseData <- RatMouseData %>% clean_names()

cols <- c("Descriptor", "Incident_Zip", "Borough")

RatMouseData <- RatMouseData %>% mutate_at(cols, funs(factor(.)))

Make a new dataframe with the subset of rows for which Descriptor variable is “Rat Sighting”.

RatData <- subset(RatMouseData, Descriptor == "Rat Sighting")

Turn “Created_Date” variable, currently a string, into POSIXct date format. Then get Month, Year and Month_Year variables because you might want any/all of those for exploratory analysis or plotting. Create them as factor variables instead of numeric or character. Save the dataframe of rat sighting data as a .rds file to preserve variable formatting for the next time you want to use it.

RatData <- RatData %>% 
mutate(Created_Date = as.POSIXct(Created_Date, "%m/%d/%Y %I:%M:%S %p", tz="")) %>%
mutate(Year = as.factor(year(RatData$Created_Date))) %>%
mutate(Month = as.factor(month(RatData$Created_Date))) %>%
mutate(Month_Year = as.factor(format(RatData$Created_Date, "%Y-%m")))

saveRDS(RatData, "RatData.rds")

Read in census data NYCpopsizebyzip.csv (found in my GitHub repository associated with this project) for popoulation size by NYC zip code. Add population count (NYCpopsizebyzip$value) to RatData by matching zip codes. region variable in NYCpopsizebyzip must be changed from numeric to factor in order to use base::merge. Use droplevels to remove Incident_Zip and Borough typos, NAs, or errors from NYC 311 database. Rename value (population count variable from NYCpopsizebyzip) to be self-explanatory.

NYCpopsizebyzip <- read.csv("NYCpopsizebyzip.csv", header=TRUE)

NYCpopsizebyzip$region <- as.factor(NYCpopsizebyzip$region)

RatData <- RatData %>%
        merge(NYCpopsizebyzip, by.x="Incident_Zip", by.y="region") %>%
         rename("popsize" = value) %>%
        droplevels()

Make a tidy Zip dataframe including new variable of calls per Month_Year by zip code with dplyr::count. Rename the newly created variable (automatically named n by dplyr::count default) to be self-explanatory.

TidyRatDataZip <- RatData %>% 
        count(Incident_Zip, popsize, Month_Year) %>%
        rename("count_calls_Zip" = n)

For mapping with choroplethrZip, Zip needs to be named “region” and the mapped factor variable must be named “value” and needs labels for the legend to make sense.

It can be more useful to compare relative instead of absolute counts across regions, when making a map data visualization. Make new variables based on rat reports per population count per region. First make a numeric variable of rat reports per zip code population size, then bin it to create a factor variable which will be used as the mapped “value” in choroplethrZip.

To come up with reasonable breaks for the bins using the base R function cut, look at the distribution of the variable you want to map. Visually inspect the distribution of the factor variable to be mapped onto zip codes by printing a table. I had to change break values a couple of times before I was happy with the results. I then used label to make the bins appear in the map legend in a readable manner. Default labels can be ugly.

TidyRatDataZip <-  TidyRatDataZip %>%
        mutate(ReportPer10K = (count_calls_Zip/popsize)*10000) %>%
        mutate(value = cut(ReportPer10K,
                                     breaks = c(0.09,0.35,0.55, 0.8, 1.1, 1.5, 2.0,3,10,41), 
                                     labels = c("0.09-0.35","0.35-0.55","0.55-0.80","0.80-1.10","1.10-1.50","1.50-2.00", "2.00-3.00","3.00-10.00","10.00-41.00"))) %>%
        mutate(region = Incident_Zip)

table(TidyRatDataZip$value)

Make a smoothed line plot of median call density per Month_Year across all zips within each Borough to show seasonality.

First, create a new data frame with Month_Year in proper chronological order for plotting & create a new variable. I used median number of rat reports per 10,000 residents rather than mean because the distribution was quite skewed.

RatCallsByBoroughMedian <- RatData %>% 
        count(Incident_Zip, Borough, popsize, Month_Year) %>%
        rename("count_calls_Zip" = n) %>%
        mutate(ReportPer10K = (count_calls_Zip/popsize)*10000) %>%
        group_by(Borough, Month_Year) %>%
        summarize(MedianMonthYearReportPer10K = median(ReportPer10K))

Create a dummy variable, “Order”, to make plotting the x-axis easier. “Order” is a number from 1 to 92, which is the number of levels for the factor variable “Month_Year”. If you download your data from NYC Open Data rather than using my older data set, you’ll have a different number of levels for this variable and will have to adjust the code, accordingly. Change names of “Order” variable manually for x-axis using only January and July so x-axis is readable instead of crowded and seasonality is visible. The \n within each xlabel makes them print on two lines instead of a single, unreadable line. Again, I downloaded the data in July, 2017. If you do a new download now, instead of using my stored data, you’ll need to deal with Month_Year levels that are not addressed in my code.

RatCallsByBoroughMedian$Order <- seq_along(1:92) 
xlabels <- c("Jan \n2010","July \n2010","Jan \n2011","July \n2011","Jan \n2012",
            "July \n2012","Jan \n2013","July \n2013","Jan \n2014","July \n2014",
            "Jan \n2015","July \n2015","Jan \n2016","July \n2016","Jan \n2017","July \n2017")

Make a lineplot graph using ggplot::geom_smooth. This also took some fiddling with the parameters of the smoothing function to make it look reasonable. I used theme_minimal to get rid of lots of distracting lines and then used geom_vline to re-insert vertical lines only at January & July axis ticks for a cleaner look.

You can add a caption for the graph, with your Twitter handle or website and the data source, so if someone posts the image, everyone gets credit.

RatCallsByMonthBorough <- ggplot(RatCallsByBoroughMedian,aes(Order, MedianMonthYearReportPer10K)) +
        geom_smooth(aes(group=Borough, color=Borough), na.rm = TRUE, 
                    formula = y ~ s(x, k = 50), method = "gam", n= 100, se = FALSE)+
        theme_minimal() +
        xlab("\nMonth & Year") + 
        ylab("Rat Reports to 311 Per 10,000 Residents\n") + 
        labs(title="Rat Complaints to NYC's 311 By Borough: 2010-2017\n", 
             caption="@jblistman. Data source: NYC Open Data 311 Service Requests") +
        scale_y_continuous(breaks=c(0,1,2,3), labels= c("0","1","2","3")) +
        geom_vline(xintercept=seq(1,92,by=6), colour='gray') + 
        scale_x_continuous(breaks=seq(1,92,by=6),labels=xlabels) +
        theme(axis.text.x=element_text(size = 13), 
              axis.text.y=element_text(size = 13),
              axis.title.x=element_text (size = 14),
              axis.title.y=element_text (size = 14),
              plot.title = element_text(size = 15, hjust = 0.5),
              legend.text = element_text(size = 12),
              legend.title = element_text(size = 12))

RatCallsByMonthBorough

Make a dataframe with the subset of variables needed for the map.

TidyRatMapData <- TidyRatDataZip[,c(3,4,6,7)]

Create objects and order variables needed for maps. 1. Month_Year as list in chronologicall order with base::sort. Combine the sort with base::unique to get rid of duplicate Month_Year levels because each zip has an entry for each Month_Year.
2. Make vector of Hex color codes in a color-blind friendly map color palette derived from the colorbrewer website for 9 factor levels of “value”. I chose a diverging palette to emphasize the highs and lows. You can play around with this. 3. Make a vector of NYC zip codes instead of using nyc_fips (in the Census data set that we merged earlier to get population size) because county fips codes include some census zcta (close to zip codes, but not exactly) that include Long Island, which is definitely not part of NYC.

datesRats <- sort(unique(TidyRatMapData$Month_Year))
mapcolors <- c('#ffffd9','#edf8b1','#c7e9b4','#7fcdbb','#41b6c4','#1d91c0','#225ea8','#253494','#081d58')
nyc_zips <- NYCpopsizebyzip$region

Make maps using zip_choropleth which uses ggplot2 to make map objects and then choroplethr_animate to write png files of each map image.
These pngs can be used as-is with choroplethr_animate which creates code along with the pngs to make a link to animated maps or they can be turned into a gif with magick.

For each Month_Year create a map. This code was adapted from a tutorial by Kimberly Coffey. First, make an empty list that will then be filled with maps. The loop iterates over each Month_Year to create a map. The entire TidyRatMapData must be subsetted by Month_Year (temp.date) for each map. Include a variable totalrats that sums all rat complaints for a given Month_Year so this value can be included in the title of each image. In scale_fill_manual and guides, override the vector of color codes that correspond to binned values so that zip codes with no rat complaints for a given month are colored gray.

RatMapsAnimate <- list()

for (i in 1:length(datesRats)) {temp.date <- datesRats[i]
df <- subset(TidyRatMapData, Month_Year == temp.date)
totalrats <- sum(df$count_calls_Zip)
title = paste("Total Rat Reports to NYC's 311 in", temp.date, ":", totalrats)
RatMapsAnimate[[i]] = zip_choropleth(df, zip_zoom = nyc_zips, title=title, reference_map=FALSE) + 
        scale_fill_manual(values=mapcolors,drop=FALSE, 
                          na.translate = TRUE, na.value = "gray",
                          name="Rat Reports to \nNYC's 311 Per \n10,000 Residents")+
        labs(caption="@jblistman. Data source: NYC Open Data 311 Service Requests")+
        scale_colour_manual(values=NA) +              
        guides(colour=guide_legend("0", override.aes=list(colour="gray")))
} 

Use choroplethr_animate to write a png image for each map plus a file with html code to combine pngs as an animation. The animation can be viewed in a browser, but we’ll also turn the pngs into a gif.

choroplethr_animate(RatMapsAnimate)

choroplethr_animate creates files with numerical names that are out of order, alphabetically. Rename files choropleth_1.png … choropleth_9.png to choropleth_01.png … choropleth_09.png so when they are automatically turned into a gif with magick, they are ordered alphabetically as well as numerically.

filestrings::nice_file_nums will do this automatically for an entire directory. The current working directory already contains the png files that need to be renamed.

nice_file_nums(dir = ".", pattern = ".png")

Convert the .png files to a .gif with ImageMagick. First, change your working directory to the folder in which you’ve written the png files. I’ve named the resulting file ratmaps.gif and chose a delay of 40 between frames. Play with the delay depending on your preferences for time between frames and name your own file (or keep ratmaps, whichever). system() executes code as if you’re using the terminal.

system("convert -delay 40 *.png ratmaps.gif")

If you don’t want to save all the gigantic .png files, remove them from the working directory.

file.remove(list.files(pattern=".png"))