never rat on your friends, and always keep your mouth shut

Monday, January 11, 2010

Why the maps annoy me

The NYT Netflix maps are just the latest in a series of ways geography is being used "interestingly" to make a "point" of some sort that's never actually asserted by the cartographer, but, rather, inferred by the user. This kind of stuff annoys me probably in the same way my using stats annoys scientists (not to say I'm a geographer. I'm not. But I fret about these issues a whole lot). It's facile and leads to self-satisfied conclusions that the data doesn't, actually, support.

Problem 1: The areas are broken up by ZIP code. I think the problems this raises are highlighted by the case of Hyde Park, which straddles at least two ZIP codes. You force a kind of "neighborhood" upon the environment that doesn't actually exist there. In some ways this is good, since it randomizes (to a degree) the boundaries. But in other ways it's bad, since 60637 starts to mean something sociologically/anthropologically, not just, you know, postally.

In general, if you're trying to determine something about patterns of some sort, you want to split the study area into quadrats (the number of quadrats determined by comparing the total number of observations to the entire study area). That, of course, requires a certain amount of granularity in the data that Netflix might not provide (or that the NYT might not be interested in working through). But we can't assume that we know something about "a ZIP code" based on the fact that they rented x movie more than y.
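For the curious, here is a minimal sketch of what quadrat counting looks like, assuming point-level rental locations (which the NYT maps don't give us). The `suggested_grid` helper uses one textbook rule of thumb (quadrat area of roughly 2A/n); that rule is my assumption about what "comparing observations to study area" might mean in practice, and the points are made up. A variance-to-mean ratio near 1 is consistent with a random pattern, well above 1 suggests clustering, and well below 1 suggests dispersion.

```python
import math
from collections import Counter

def suggested_grid(n_points, width, height):
    """Rule of thumb (an assumption here): square quadrats of area ~ 2 * A / n."""
    side = math.sqrt(2 * width * height / n_points)
    return max(1, round(width / side)), max(1, round(height / side))

def quadrat_counts(points, width, height, nx, ny):
    """Count points per cell of an nx-by-ny grid laid over the study area."""
    counts = Counter()
    for x, y in points:
        # Clamp so points on the far edge land in the last cell.
        col = min(int(x / width * nx), nx - 1)
        row = min(int(y / height * ny), ny - 1)
        counts[(col, row)] += 1
    return [counts[(c, r)] for c in range(nx) for r in range(ny)]

def variance_mean_ratio(counts):
    """Sample variance of the quadrat counts divided by their mean."""
    mean = sum(counts) / len(counts)
    var = sum((c - mean) ** 2 for c in counts) / (len(counts) - 1)
    return var / mean

# Hypothetical rental locations, all bunched into one corner of a 2 x 2 area.
clustered = [(0.1, 0.1), (0.2, 0.2), (0.3, 0.1), (0.1, 0.3),
             (0.2, 0.4), (0.4, 0.2), (0.3, 0.3), (0.4, 0.4)]
nx, ny = suggested_grid(len(clustered), 2, 2)
print(variance_mean_ratio(quadrat_counts(clustered, 2, 2, nx, ny)))
```

The point is that the grid is chosen from the data and the study area, not inherited from the post office.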

(This returns to the one way in which ZIP codes are good, as I hinted above. They are largely, but not entirely, based on preexisting boundaries (of cities, towns, etc.), whereas ward boundaries are entirely so. In that sense, they shake up the possible sample you get in each code. The idea of gerrymandering a ZIP code only makes sense in LA.)

So I don't particularly think that ZIP codes are a revealing means of looking into what's going on. Fun, yes. Which leads me to point 2.

Problem 2: It's irresponsible to throw out data like this and let it sit to be played with, in my opinion, without another variable or something to provide context. What the NYT has provided us with is basically a big toy. As I said to Ben, it's interesting, but only like a crossword puzzle is interesting. I love crossword puzzles, and I love doing them, but I don't post about them or forward links to them, since I can't escape Postman's criticism of crossword puzzles as basically what overeducated and understimulated people do out of intellectual boredom. Is anyone surprised that the South Side of Chicago likes Tyler Perry? So what does it mean to point it out, other than to recycle something people would have already pretty much assumed? Without context, the analysis becomes circular, flattering the viewer into making conclusions he or she already suspected.

Problem 2a: There are no numbers that would help us analyze the data better. As in, we have no idea how many Netflix subscribers are in each ZIP code, either in toto, or as a percentage of population. Furthermore, we don't know how many movies, total, get shipped to each ZIP code. Finally, we have no idea how much space is between #1 and #5 in any ZIP code, yet those present fixed differences in coloring. That one lone ZIP code that really loves Rachel Getting Married? Maybe there's just one household in the entire ZIP with a one-at-a-time plan that has a serious erection for TV on the Radio. Again, this is a lack of context.

Problem 3: Autocorrelation. Basically, this means that similar observations tend to cluster. It's kind of a problem for geography, from my understanding, since one is always trying to figure out how much of the data is tainted by autocorrelation. If the point of the maps is to show that, yes, this shit is hell of spatially correlated, well, big deal. Again, there's nothing new in telling me that shit spatially correlated. I know that it's statistically very likely for adjacent ZIP codes to have similar renting patterns. I would like to know, in seeing this data, what kind of built-in issues it has with autocorrelation, etc. Here's where something like Moran's I comes in handy. It tells you whether the dataset is correlated or not, allowing you to then more comfortably make conclusions about the distribution of rental patterns. (I clean this up in the comments.)

And then, when there are breaks, like in HP, we're not equipped to understand if that's random or an actual blip, since, again, we're provided with such crappy data. We all *assume* that it's because of the UofC that Slumdoggy was so popular in 60637 (or at least that's what Mario Small suggested in his blog post that alerted me to the site in the first place), but we don't *know* that. And we have no way, with what we're given, to guess how much is the UofC or not.

Which leads, again, to circular and convenient conclusions that flatter our prejudices.

So I've got to go to a booze tasting now, but that's my crank attitude for now.

I'll add one last thing: the eye is a bad mathematician, and it is way too eager to see patterns where there are none, which is why it's so easy to lie with maps. Of course, I understand that this is all fun and a way to burn some time (see Postman and crosswords above), but I've seen the explosion of this kind of mapping shit lately as a threat to real geospatial analysis. I dunno. Forward this on to Conzen and see if he thinks I'm crazy. I'm willing to be told I am. But it doesn't change the fact that, in my work, I have to compete with jokes like the Google Books map that accompanies every novel.

OKOKK... last thing... What I would've liked is a per movie distribution as a hotzone, so, not bounded by ZIP codes. That would've been more interesting and sociologically useful, since it would help account for the variations in potential renting diversity within each ZIP.
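One way to read "hotzone" is as a kernel density surface: every rental of a given movie is a point, and each point spreads a little bump of "heat" around itself, with no ZIP boundaries involved. The sketch below is only that reading, under the assumption that point-level data existed. The renter coordinates, the grid, and the bandwidth are all invented for illustration.

```python
import math

def kernel_density(points, bandwidth, xs, ys):
    """Gaussian kernel density on a grid; rows follow ys, columns follow xs."""
    norm = 1.0 / (2 * math.pi * bandwidth ** 2 * len(points))
    surface = []
    for gy in ys:
        row = []
        for gx in xs:
            # Sum the contribution of every rental point at this grid node.
            total = sum(
                math.exp(-((gx - px) ** 2 + (gy - py) ** 2) / (2 * bandwidth ** 2))
                for px, py in points
            )
            row.append(norm * total)
        surface.append(row)
    return surface

# Hypothetical households that rented one movie: three close together, one far off.
renters = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (3.5, 3.5)]
grid = [0, 1, 2, 3, 4]
surface = kernel_density(renters, bandwidth=0.5, xs=grid, ys=grid)
for row in surface:
    print(" ".join(f"{value:.3f}" for value in row))
```

The bandwidth does a lot of work here: too small and every household is its own island, too large and the whole city is one warm blob. That choice would need defending, which is exactly the kind of context the ZIP maps skip.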


CZA said...

Agreed. Though this line of argument:
"If the point of the maps is to show that, yes, this shit is hell of spatially correlated, well, big deal. Again, there's nothing new in telling me that shit spatially correlated"

doesn't take you all that far. After all, couldn't one say there's nothing new in telling us that maps are not only imperfect representations of reality, but encode and reproduce power relationships? Or that our mainstream organs of knowledge making aren't intellectually rigorous?

I like to think that the good these mapping tools do to promote geographical literacy and get people to visualize their world in different ways outweighs the pitfalls. A lot of people will merrily go about the business of flattering themselves (a wonderful way of putting it, btw), me included, but I also think plenty of other people will bump up against the inadequacy of this first wave of tools; the people with tactical knowledge beyond the comprehension of the NYT/Google man on the skyscraper. They'll push for (perhaps get hired to develop) more refined tools that will threaten the real problem: the inaccessibility of ESRI and the theoretical expertise housed in higher education.

Basically, we can re-stage the debate around consumer culture: is it inherently insidious and depoliticizing, or do we give the audience a little more credit and say that there's a lot of play, some of which might develop into serious play? Your analysis has a little of the high/low culture elitism: it seems to assume that such gauche maps will only produce regressive, conservative meaning and never inspire any sort of higher order alternative thinking about the world we live in.

My counter crank: you need to come back from Paris soon and mingle with the masses.

Moacir said...

Yeah, I realise the inherent elitism in the comment, which is why I started with the idea of how scientists bristle when they see their techniques coopted in a loosey goosey way. I don't think the comparison to high/low culture is exactly right, though.

The issue is taking data (like from an xls), re-presenting it in a different way (spatially), and expecting conclusions to simply appear from the re-presentation. It doesn't work that way, since by presenting the data spatially, there are new concerns to take into account.

Part of me is happy to see geographic literacy increase. No, all of me is. But this is doing more than that. It's preaching the message that "if you map it, a conclusion will come." (If it wasn't, then it's totally just an idle plaything.) But, again, just mapping it isn't enough, just like turning a concordance into a Wordle isn't enough. It's a step in the analytical process, but not the last one.

I should've done a better job with autocorrelation, so let me start from scratch:

You have a field of polygons (ZIP codes of NYC) all colored different ways. The eye is famously bad at telling you if the colors are clustered, random, or dispersed. And *by how much*. Now, autocorrelation assumes, via the first law of geography, that everything that happens is related (no independent observations), but whatever happens close is more related. This means that we would expect data about the human world to be clustered.

Before doing statistical analysis on a set of geospatial data, however, you need to know how autocorrelated the data is: how much more clustered or dispersed than random it is, since most statistical tools (regression, etc.) assume/require independence. Of course, regression requires a second variable, and we don't get that here, anyway. Just amathematical ordinals.

So for the purposes of seeing what your neighbors rent (or don't), tests like Moran's I (testing for general correlation) or Getis-Ord (testing for clustering of certain values, iirc) aren't really necessary. But they are necessary if you want to quantify or make strong claims about clusters.
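To make that concrete, here is a rough sketch of global Moran's I with a permutation test, in plain Python. It assumes binary adjacency weights (neighbors count as 1, everything else as 0) and a real-valued measure per ZIP, such as a rental share; neither is available from the NYT maps, and the six-ZIP "street" below is invented. The permutation step asks how often a random reshuffling of the same values across the same polygons looks at least as clustered as what we observed.

```python
import random

def morans_i(values, neighbors):
    """Global Moran's I with binary weights.

    neighbors maps each index to the list of indices adjacent to it.
    Undefined (division by zero) if every value is identical.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    cross = sum(dev[i] * dev[j] for i, js in neighbors.items() for j in js)
    w_total = sum(len(js) for js in neighbors.values())
    return (n / w_total) * cross / sum(d * d for d in dev)

def permutation_p_value(values, neighbors, n_perm=999, seed=0):
    """One-sided pseudo p-value for positive autocorrelation."""
    rng = random.Random(seed)
    observed = morans_i(values, neighbors)
    shuffled = list(values)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if morans_i(shuffled, neighbors) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Six hypothetical ZIPs in a row, each adjacent to the one before and after.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
clustered = [10, 10, 10, 0, 0, 0]    # similar values sit next to each other
alternating = [10, 0, 10, 0, 10, 0]  # neighbors are always unlike

print(morans_i(clustered, chain))    # positive: clustered
print(morans_i(alternating, chain))  # negative: dispersed
print(permutation_p_value(clustered, chain))
```

With only six polygons the permutation test can't say much, which is its own small lesson about sample size.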

Another way: how much of Tyler Perry's reach is random chance, how much is that any two neighboring polygons will likely have similar values, and how much is something "anthropological"? The maps as presented are unable to even start answering any of these questions.

Yet aren't the maps there precisely for us to try and answer these questions for fun? And this *promotes* literacy? By encouraging bad reading?

Finally, I didn't lash out at ordinal data enough in my original post, but after I thought about what I'd LIKE the maps to look like, I think I decided that the best, given ZIP code-level data, would be per-movie rental as a percentage of total households or Netflix households per ZIP. Which feels more useful to know: a) Slumdoggy was the third most popular movie in this ZIP, or b) 45% of Netflix households rented this movie (with the maximum movie being 50%, the mean being 25%, and the variance being .014 or whatever)?
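A toy illustration of why the ordinals hide so much, using entirely made-up counts for two hypothetical ZIPs. The two ZIPs get identical rankings, but in one the movies reach a big chunk of Netflix households and in the other a handful of rentals decide the whole order.

```python
from statistics import mean, pvariance

def ranks(rentals):
    """Ordinal rank per title (1 = most rented): all the maps expose."""
    ordered = sorted(rentals, key=rentals.get, reverse=True)
    return {title: i + 1 for i, title in enumerate(ordered)}

def rental_shares(rentals, netflix_households):
    """Share of a ZIP's Netflix households that rented each title."""
    return {title: count / netflix_households for title, count in rentals.items()}

def summarize(shares):
    """Maximum, mean, and population variance of the per-title shares."""
    values = list(shares.values())
    return max(values), mean(values), pvariance(values)

# Made-up rental counts and household totals, for illustration only.
zip_a = {"Movie X": 500, "Slumdog": 450, "Movie Y": 300}
zip_b = {"Movie X": 4, "Slumdog": 3, "Movie Y": 2}

print(ranks(zip_a) == ranks(zip_b))  # same ordering in both ZIPs
print(rental_shares(zip_a, 1000))
print(rental_shares(zip_b, 400))
print(summarize(rental_shares(zip_a, 1000)))
```

On a rank-colored map these two ZIPs would look the same, even though one difference is hundreds of households and the other is one or two.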

Moacir said...

As a helpful example of a map I liked, Very Large Array had a map yesterday of points advertised in Craigslist as "Williamsburg." For some reason, point-level data is already more appealing to me than polygon-level. But the claims were modest: This is what Williamsburg looks like, if you believe Craigslist.

It would've been interesting to see what that "Greater Billyburg" is ALSO called in Craigslist, since I doubt every apt rented in "East Williamsburg" is called that in Craigslist, but I could be wrong.

Still, all it shows is expanded boundaries and gives a sense of how much repetition gives the expanded boundary meaning--we can see and perhaps even count how many points out of the total fall outside of "official" Williamsburg, whatever that is.