Deleting Smokers – Part 2
Part 2 of this series looks at refining our script pseudo-code to get input data prepped for further analysis
Way back when the dinosaurs roamed the earth, we saw the first post in this series on developing a Python script for some analysis. With Python being the topic of the moment in the current GIS course, I’ve been encouraged to pick up where I left off and continue developing this script. You might recall that the aim was to take a collection of points (here being tobacco retailers) and test out the effect of reducing the number on tobacco retailers, of limiting availability. This would mean systematically removing retailer points and then evaluating average additional costs to consumers. This is not a straightforward thing to do at all, not least because we needed to have some objective way of removing retailers. We decided we might look at imposing a minimum distance between retailers, such as 500 m, and then using that criterion as a way to remove them. That’s the theory behind this and what I’ve been trying to figure out is how I can set the data up to do this. I’ll reinsert a figure used in the previous post to illustrate:
The key here is being able to set up what the authors refer to as a “proximity relationship”, meaning one relationship between two points and how far apart they are. A script would randomly select a proximity relationship and delete one of the two points (again, at random) and then continue iterating through all the points again until there were no points closer than a certain distance. Unfortunately, there’s no off-the-shelf tool that I’m aware of that will do this, so I’ve got to come up with some systematic way of doing this.
Initially, there were two main tools that came to mind.
- Near (and its tabular cousin Generate Near Table)
- OD Cost Matrix
Near does a fairly straightforward thing – take a vector layer and find the nearest feature in another layer. Usually, there are two separate layers, such as water wells and roads. In my case, I’m wanting to find the distances between features in the same layer. When this tool is run, new attributes get added to the input layer that add the ID of the nearest feature and its straight line distance away (there are other options as well, like bearing and angle). Generate Near Table does the same thing but puts the results into a table rather than a layer.
To give myself something to work with, I clipped out a series of points from the national layer of tobacco retailers, keeping it close to home so that I could more easily pick up any obvious locational errors – here’s my sandbox – 329 points:
The setup for the Near tool is at right – notice that the Input Features and the Near Features are the same layer. This allows me to find the distance to the nearest retailer for each input point.
In the output, the useful attributes are NEAR_FID (the object ID of the closest retailer) and NEAR_DIST (the distance to that retailer). When running this tool I set a search radius of 500 m, meaning that the searching stopped at that distance. That’s why there’s a -1 in record 5 – there were no retailers within 500 m of that point.
So this sort of gives me the proximity relationship between any two points – but, I’ve got two instances of each relationship. For instance, I’ve got an entry (record 1) that shows the closest retailer to OBJECTID 1 is FID 274. Further down the list, there will be its reciprocal entry: OBJECTID 274 and its nearest retailer, FID 1. This gets me part of the way there but will still require an extra bit of work, as I only want one proximity relationship per pair.
The same holds for the table that comes out of Generate Near Table.
I next looked at OD Cost Matrices. The “OD” stands for origin-destination. These layers come to us as part of Network Analysis, where we’re using linear, connected networks (such as roads, or fiber optic cables, or sewer pipes) and doing analysis based on how close features area along the network (think Google Maps and finding directions from Point A to Point B. Instead of using as-the-crow-flies distances, we’re using distances along, in this case, the road network. The nice thing about OD matrices is they explicitly calculate distances between points that are set up as origins and destinations.
For quick display the connecting lines are shown as straight but the calculated distances in the output are along the network
For this tool to work, I need to have a road network that has been set up to do network analysis – luckily, there’s one in J:\Data\NetworkAnalysisData called RoadsNA. Once added to the map, I can set up this tool (Analysis tab > Network Analysis) by importing Origins and Destinations (the same layer in this case) and then setting a few parameters:
I’ve set the mode to “Driving Distance” so it’s distance based rather than time based and have limited the number of destinations to 4 (just to keep things manageable). Here’s my result:
Each of the purple lines joins two of the retailer points. In the table you can also see that each input point has four records – one for each of the four destinations set in the tool, ranked from 1 (closest) to 4 (furthest away). In the first record, see how it’s found the distance to itself (scrolling to the right you would see the distance to be 0)? FID 274 is the closest, followed by 56 and then 277. This is arguably a more realistic output as it’s a real-world network distance.
But this layer has proven problematic to work with, especially when trying to deal with removing each pair’s reciprocal. If I sort these by distance, it becomes easy to see that the reciprocals line up one after the other:
Ah, if life were simple enough that all I need to do is delete every other record, I’d be a happy chappy. Nor is it quite so simple to just delete everything other than rank 2. On closer inspection, there are some inherent problems, mainly caused by this layer ranking the points. It’s a difficult one to fully explain, suffice it say that while a point may be the closest retailer to one, that other retailer may have another one closer but not close to the first. (Ed., clear as mud, thank you.) This is important, because with this dataset I’ve only got 329 points – worst case is I eyeball things and manually delete the right ones. But this is my sandbox dataset – the full dataset has close to 6,000 retailers and I DO NOT WANT TO HAVE TO GO THROUGH ALL THOSE ONE BY ONE. I’ll go insane (well, more insane). There must be some way to automate this.
So, where does this leave us? I’ve got three outputs that get me part of the way to where I need to be. But a bit more work is required. To be thorough, I decided that I would be best off writing a script that will do this next bit for me. The biggest issue appears to be how to remove a retailer pair’s reciprocal. I have found that one way to sort though stuff like this is LSD. Ha – not the LSD you might be thinking of, no, Long Slow Distance runs. I can’t tell you how many difficult problems I have found solutions for with LSD, often within the first few minutes of the trip, er run. Anyway, on today’s run I worked out in my head the rough pseudo-code I would need to do this.
Thinking about the OD Matrix results, with OriginID being the point in question and DestinationID being the ID of the nearest retailer, the proximity relationship is between those two values. If I swap them, those are the values for the reciprocal pair. So psuedo-code would look like this:
For each record in the Near table,
- Get the attribute values for OriginID and DestinationID
- Find the record with a Select by Attribute where OriginID = DestinationID AND DestinationID = OriginID .
- Delete that record.
That’s confusing, I know. The Select by Attribute finds the record where the OriginID and the DestinationID have been swapped – the reciprocal of the first proximity relationship. By deleting it I am left with only one of those to work with.
Pshew!
Still awake?
So far, this has all been about prepping my data for the real show – the bigger picture of deleting retailers so that there are none within 500 m of each other. That will be a bigger, more involved script which I still need to think more about. But this is a good start.
Many people use Python to automate tasks, things that you do over and over again, as a way of saving time. In this context, we’re using Python as the analysis tool, to make GIS do the things it wasn’t really designed to do, make it jump through some spatial hoops. It can be fun; it can also be frustrating. But it does help to keep the Alzheimer’s at bay.
In subsequent posts, we’ll delve deeper into this task. Hope you’ll come along for the journey.
C