Map Sandbox Project

A sandbox for designing and implementing comparative thematic maps.

Motivation and Goals

This project is an attempt to build a tool for visualizing a side-by-side comparison of neighborhood livability metrics. My initial focus is on crime statistics, but what I’m building should be extensible to other kinds of socioeconomic attributes. In this post I describe my motivation and goals for the project.

Backdrop

I became interested in crime mapping about a year ago when I was relocating to St. Louis for a new job and did a bit of research weighing the pros and cons of different neighborhoods. St. Louis has a reputation for having a higher crime rate than most other cities in the U.S., consistently landing toward the top of nationwide rankings, and this certainly played on my mind at the time. I ended up moving into a downtown loft, a fifteen-minute walk from where I work. After living here for nearly a year, I’ve grown to love this diverse city and all it has to offer: a rapidly growing, supportive entrepreneurial community, a vibrant tech scene with a wide range of active meetup groups, a great selection of restaurants and microbreweries, the beautiful historic Forest Park filled with top-notch museums, frequent neighborhood festivals, a vital sports culture, and on and on….

St. Louis, then, seems to embody a glaring contradiction: how can such a vibrant city rank as one of the most dangerous cities in the U.S. to live in? It challenges the natural tendency of popular media to present an overly simplified, cut-and-dried narrative that is easy to comprehend. St. Louis residents, both long-time natives and recent transplants, generally feel the predominant national conversation about St. Louis is unfairly dominated by crime statistics, overshadowing all of the city’s other redeeming qualities, a skew that can have a tangible impact on the local economy by deterring businesses and individuals from moving here.

There is a recent push by community leaders to change the conversation by focusing on a critical flaw in how the statistics are reported: the numbers for St. Louis are aggregated only at the core city level and exclude the surrounding suburbs of the larger metropolitan area, in sharp contrast to most other cities, where the crime rate is calculated for the entire metropolitan area. City and county officials and community activists are exploring the possibility of an eventual city-county merger, but in the short term, regional police leaders are considering formally combining city and county crime reporting when compiling Missouri’s statistics for the FBI, whose annual figures feed the city crime rankings. Such combined reporting would have lowered the city’s 2012 rank from 3rd to 8th.

To some, combining city and county statistics may seem like statistical sleight of hand: improving the overall numbers without actually addressing the underlying problems behind elevated crime. It’s true that such a measure is aimed at the perception of crime, which is only a small part of law enforcement’s overall strategy; police are also actively exploring and deploying a variety of measures to lower crime, such as hot spot and neighborhood policing, alongside grassroots community initiatives like SiRV that aim to reduce the factors producing a culture of violence in communities.

A different approach

My interest in crime mapping is limited in scope to how crime statistics are calculated and conveyed: I believe that better tools for visualizing and understanding crime rates can be built and made freely available, and that people would benefit from such an effort.

The core problem I’d like to solve is: how can one meaningfully compare crime in one region to crime in some other region? If a national ranking is to be meaningful, then the parameters used for each city, such as the defined boundary, must be determined programmatically, in a way that ensures the resulting statistics are properly normalized across all cities. It shouldn’t be possible for an individual city to alter the way it reports crime in order to affect the results of the calculation. The effort in St. Louis to combine city and county reporting is an attempt to correct for this lack of normalization, which is a good thing, but it shouldn’t be necessary if the underlying calculation is properly normalized to begin with.
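
To make the boundary problem concrete, here is a minimal sketch of a per-capita rate calculation with invented, purely illustrative numbers (they are not real St. Louis figures); note how much the rate moves when the boundary is widened:

```python
# Hypothetical illustration of how the chosen boundary drives a
# normalized crime rate. All counts below are invented for the example.

def rate_per_100k(incidents, population):
    """Crime rate normalized per 100,000 residents."""
    return incidents / population * 100_000

city_incidents, city_pop = 6_000, 318_000          # core city only (made up)
county_incidents, county_pop = 4_000, 1_000_000    # surrounding county (made up)

print(rate_per_100k(city_incidents, city_pop))     # ~1887 per 100k (city only)
print(rate_per_100k(city_incidents + county_incidents,
                    city_pop + county_pop))        # ~759 per 100k (city + county)
```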

I feel crime rates aggregated over an entire city or metro area are not of much value. It’s much more useful to be able to compare areas on a human scale of several blocks within one’s neighborhood. Being able to compare neighborhoods between cities side by side is useful because it provides a relational context for better interpreting the meaning of the numerical values.

So, one of my central goals is to build a tool that lets the user choose a neighborhood in one city and compare it to a neighborhood in another city, and to do this comparison in a statistically meaningful, ‘apples-to-apples’ way, so that users can judge for themselves the relative safety of a neighborhood. Such a tool could also be extended to other types of data, such as socioeconomic data and livability metrics, to give a more holistic profile of a neighborhood.

This type of tool does not yet exist freely on the web or in a mobile app, and as a software engineer and physicist, I feel I am uniquely qualified to do the end-to-end groundwork required to build such a thing. For me, it’s critical to be very transparent about how the data is being manipulated: ideally this type of tool would allow users to trace the full computational path if they so desired, to find out which assumptions and choices led to the final visualization.

There are several sites that provide crime maps on the web: CrimeReports, CrimeMapping, RAIDS Online, and SpotCrime. As far as I can tell, the sole visualization mode offered by these sites is to show the crime incident data as scatter points on a map, which doesn’t provide much insight. The key service these sites provide, which is by no means trivial, is that they aggregate the crime data from many different municipalities into a unified interface. Simply showing the raw incident data has the virtue of not introducing any subjective biases into the data, which can happen when building a thematic map (e.g., a heat map or choropleth map) if one isn’t careful.

On the other end of the spectrum, there are several feature-rich GIS systems capable of advanced geographical data analysis, but these have a fairly steep learning curve that makes them more suitable for experts than for the average web or mobile app user. The leader in this space, by a wide margin, is the ArcGIS software, but there are open-source tools as well, such as QGIS.

I believe it’s possible to find a balance between these two extremes: something that gives the user more insight than can be gleaned from the raw incident data alone, but with a streamlined, elegant interface that doesn’t require expert-level skill to interpret the information being visualized.

There are a handful of innovative sites that have taken a step in this direction by offering more informative thematic maps. The real estate site Trulia provides a density map of crime aggregated over census blocks (as of Dec. ‘13, they don’t have data for St. Louis). The open government site AxisPhilly has created an interactive choropleth map showing how crime is changing over time in Philadelphia, aggregated by neighborhood, and the independent news site MinnPost provides a similar interactive map showing crime statistics broken down by neighborhood.

In the next section I’ll sketch an outline of what I hope to build.

Project goals

I’m doing this project ‘just for fun’, purely as an evening/weekend side-project. Aside from the reasons I outlined above, I’m drawn to this project because there is a mix of problems to be solved that engage different parts of my brain: data wrangling/modeling/analytics, infographics and UI/UX design, and software engineering.

My long term goal for this project is to build an iOS app for side-by-side comparison of the crime density between two different neighborhoods (either within the same city, or between two different cities). In due course, I will write out a set of user stories (probably using waffle.io, which is integrated with GitHub issues). However, realistically I am still a month or two away from embarking on that.

In the short term, this project will be a series of exploratory spikes as I make my way through the fundamentals of geographical analytics and figure out how to best achieve the desired features. I have already begun working my way through various articles and books – in a later post I’ll summarize these resources.

Here is a rough sketch of features and related issues that I’ll be exploring (rough code sketches for several of these items follow the list):

  • obtaining and working with crime incident data: be able to translate between shapefiles and GeoJSON files, and understand the hierarchy of crime categories.
  • transform between map projections: do computations in a planar distance-preserving projection, but then project into whatever projection is appropriate for map visualization (typically spheroidal lat/long coordinates).
  • binning of points on a discrete spatial grid: try to minimize subjectivity in the following (i.e. make determinations algorithmically if possible)
    • choice of grid geometry: square, hexagonal, Voronoi, or irregular polygons like census blocks.
    • choice of grid element size (e.g. 100 feet, 300 feet, 1000 feet, etc).
    • be able to project between grids (i.e. the Census Bureau has various indicator data assigned to block polygons, like population, so one needs to be able to translate these values onto other grids).
  • applying Gaussian smoothing to estimate incident density: what is the best choice of smoothing radius?
  • time binning and smoothing: what are good time interval values for binning and for smoothing (moving average)?
  • determine day/night for an incident, given time of day and lat/long.
  • be able to apply spatial clustering algorithms, such as computing the nearest neighbor index and the Getis-Ord Gi* statistic. As part of this, understand the notion of a random distribution as a null hypothesis.
  • also be able to cluster based on a crime-profile similarity metric
  • choice of color scale for visualizing density: probably the best choice will be to base it on quantiles of the density distribution for the region(s) of interest
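
To ground the first two items, here is a minimal sketch of reading a shapefile, transforming its projection, and writing GeoJSON with the open-source geopandas library; the file names and the choice of EPSG codes are placeholder assumptions for illustration:

```python
# Sketch: translate a shapefile to GeoJSON and transform between map
# projections with geopandas. File names and CRS codes are placeholders.
import geopandas as gpd

# Load incident points or boundary polygons from an Esri shapefile.
gdf = gpd.read_file("incidents.shp")

# Reproject from geographic lat/long (EPSG:4326) into a planar CRS
# suitable for local distance computations; UTM zone 15N (EPSG:32615)
# covers the St. Louis area.
gdf_planar = gdf.to_crs(epsg=32615)

# ... distance/area computations happen in the planar CRS ...

# Project back to lat/long for web-map visualization and save as GeoJSON.
gdf_planar.to_crs(epsg=4326).to_file("incidents.geojson", driver="GeoJSON")
```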
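
For the spatial binning and Gaussian smoothing items, a sketch along these lines shows the basic mechanics on a square grid; the coordinates are fake, and the cell size and smoothing radius defaults are exactly the parameters I’ll need to tune:

```python
# Sketch: bin incident points on a square grid and smooth with a
# Gaussian kernel to estimate a density surface. Coordinates are
# assumed to already be in a planar projection (meters); all values
# below are illustrative, not tuned.
import numpy as np
from scipy.ndimage import gaussian_filter

def density_grid(x, y, cell_size=100.0, smooth_radius=300.0):
    """Histogram points into square cells of `cell_size` meters,
    then smooth with a Gaussian of sigma `smooth_radius` meters."""
    x_edges = np.arange(x.min(), x.max() + cell_size, cell_size)
    y_edges = np.arange(y.min(), y.max() + cell_size, cell_size)
    counts, _, _ = np.histogram2d(x, y, bins=[x_edges, y_edges])
    # Convert the smoothing radius from meters into grid cells.
    return gaussian_filter(counts, sigma=smooth_radius / cell_size)

rng = np.random.default_rng(0)
x = rng.normal(0, 1000, size=500)  # fake incident x coordinates (m)
y = rng.normal(0, 1000, size=500)  # fake incident y coordinates (m)
density = density_grid(x, y)
```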
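
For time binning and smoothing, pandas makes both steps one-liners; the weekly bin and 4-week window below are arbitrary starting values, and the table layout (one row per incident with a datetime column) is an assumption:

```python
# Sketch: weekly time binning of incidents plus a moving average.
# Assumes one row per incident with a 'datetime' column; the fake data
# here is one incident per day for a year.
import pandas as pd

incidents = pd.DataFrame(
    {"datetime": pd.date_range("2013-01-01", periods=365, freq="D")}
)

weekly = (
    incidents.set_index("datetime")
    .resample("W")   # time binning: one bin per week
    .size()          # incident count per bin
)
smoothed = weekly.rolling(window=4, center=True).mean()  # 4-week moving average
```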
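
Determining day or night for an incident amounts to a sunrise/sunset calculation at the incident’s lat/long; one option among several is the astral library (the API shown is astral 2.x, and the sample coordinates are downtown St. Louis):

```python
# Sketch: classify an incident as day or night from its timestamp and
# lat/long using the astral library (API shown is astral 2.x).
from datetime import datetime
from zoneinfo import ZoneInfo
from astral import Observer
from astral.sun import sun

def is_daytime(when, lat, lon):
    """True if the aware datetime `when` falls between local sunrise and sunset."""
    s = sun(Observer(latitude=lat, longitude=lon),
            date=when.date(), tzinfo=when.tzinfo)
    return s["sunrise"] <= when <= s["sunset"]

# Example: a 10:30 pm incident in downtown St. Louis -> night.
when = datetime(2013, 12, 1, 22, 30, tzinfo=ZoneInfo("America/Chicago"))
print(is_daytime(when, 38.627, -90.199))  # False
```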
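
For the nearest neighbor index, the standard formulation compares the observed mean nearest-neighbor distance against the expectation under complete spatial randomness (the null hypothesis mentioned above); here is a sketch with SciPy, run on a fake spatially random point pattern:

```python
# Sketch: nearest neighbor index (NNI) for a planar point pattern.
# NNI = observed mean nearest-neighbor distance / expected distance
# under complete spatial randomness (CSR): NNI < 1 suggests
# clustering, NNI > 1 suggests dispersion.
import numpy as np
from scipy.spatial import cKDTree

def nearest_neighbor_index(points, area):
    """points: (n, 2) projected coordinates; area: study area in the same units^2."""
    tree = cKDTree(points)
    # k=2: the nearest result for each point is the point itself.
    distances, _ = tree.query(points, k=2)
    observed = distances[:, 1].mean()
    expected = 0.5 / np.sqrt(len(points) / area)  # CSR expectation
    return observed / expected

rng = np.random.default_rng(0)
pts = rng.uniform(0, 1000, size=(200, 2))      # fake, spatially random pattern
print(nearest_neighbor_index(pts, area=1e6))   # should come out near 1.0
```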
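
Finally, for the quantile-based color scale, class breaks can be read straight off the density distribution; a sketch with a fake density surface:

```python
# Sketch: quantile-based class breaks for a density color scale, so
# each color bucket covers roughly the same number of grid cells.
import numpy as np

def quantile_breaks(density, n_classes=5):
    """Class boundaries at evenly spaced interior quantiles of the density."""
    qs = np.linspace(0, 1, n_classes + 1)[1:-1]
    return np.quantile(density.ravel(), qs)

rng = np.random.default_rng(0)
density = rng.exponential(scale=1.0, size=(50, 50))  # fake density surface
breaks = quantile_breaks(density)
classes = np.digitize(density, breaks)  # class index 0..n_classes-1 per cell
```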

Blog goals

This blog is primarily for my own personal record keeping: it’s a running account as I figure out how to build this tool. At times my notes may have the feel of a tutorial, but anyone happening across this site should keep in mind that I’m learning this material as I go and do not pretend to be an expert in crime mapping or geoanalytics. On the other hand, sometimes it’s useful to see how someone else was able to make sense of a subject they are learning for the first time.

This blog is also a companion to a GitHub repository where I plan to make my code available. Initially I will be tapping into the scientific computing platform Mathematica (recently rebranded as the Wolfram Language), which is my go-to tool for these kinds of projects (I worked at Wolfram for nearly six years before pivoting my career into mobile app development). Eventually I’ll transition to coding everything in Python, since there are so many open-source geo-processing tools available. As a side effect, I think it will be useful to have two distinct implementations to help verify and validate results as I go along.