Deleting Smokers – Closing the Loop(s)
The last of our series on developing a Python script for some smoking reduction analysis.
Long-term readers of the GIS Blog (Hi Mum!) may recall a series of posts on the development of a Python script for some analysis. As part of the smoking reduction analysis I’ve been involved with over the years, we wanted to look at the impacts on accessibility to tobacco from imposing some distance limits on tobacco retailers. In other words, how would setting a minimum distance between retailers affect how easy it is to buy tobacco? After a survey of the literature, we settled on three distances to evaluate: 450 m, 300 m and 150 m. The previous posts set up the background, tweaking some pseudo-code and developing one of the key data preparation scripts. In this post, we put this script to bed (or rather, get it out of bed and running!) and get some results.
So as a quick recap, we needed a script that would look at the 5,131 existing tobacco retailer points we’ve got in our existing database, figure out how close they all are to each other, then sequentially go through each one, identify which ones are within a given minimum distance of each other, randomly delete one of them, and keep doing this until none of the retailers is within that minimum distance. And then do it all over again 1,000 times, so we end up with 1,000 iterations of this – giving us an ensemble of possible configurations. The random deletion means that every one could be different, and we have a massive dataset to do some more general statistics on.
Well, I’m very pleased to say that we’ve brought this all to fruition, though with 100 iterations rather than 1,000. In this post I’ll try to highlight the key parts of this script without getting too bogged down in the details (Ed.: a tall order).
In one of those earlier posts we saw the first data preparation script, which finds the proximity relationships between pairs of points. With those, it’s then a matter of going through each pair and randomly deleting one of the two. The tricky bit is going through the whole dataset once, randomly deleting a point in each pair, but then doing it over and over again until no pairs remain within that set distance. This immediately suggests loops that continue until a certain condition is met. The Near tool has been critical in this script, as it not only allows us to calculate the straight-line distance to the nearest retailer, but also allows us to limit how far away to look. So if our minimum distance is 450 m, any retailer with no other retailer within 450 m will be assigned a NEARDIST value of -1. When all the retailers get a NEARDIST value of -1, that iteration is done and the script can start all over on a new sequence. So overall, we’ve got three main loops in this script:
- Inner loop: go through each proximity pair and delete one retailer at random
- Middle loop: continue inner loop until all retailers have a NEARDIST = -1
- Outer loop: run the middle loop 100 times
So it sort of looks something like this:
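The screenshot isn’t reproduced here, but the three-loop structure can be sketched in plain Python. This is a stand-in only: the function name `min_distance_thin` is mine, and simple coordinate tuples replace the ArcPy feature layers and Near tool described below.

```python
import random

def min_distance_thin(points, near_dist, n_iterations=100):
    """Sketch of the three-loop structure (pure Python, no ArcPy).

    points: list of (x, y) tuples; near_dist: minimum spacing between points.
    Returns one thinned copy of the points per iteration."""
    results = []
    for i in range(n_iterations):              # outer loop: one full run per iteration
        remaining = list(points)               # fresh copy of the base layer
        while True:                            # middle loop: until no close pairs remain
            pairs = [(a, b) for ai, a in enumerate(remaining)
                     for b in remaining[ai + 1:]
                     if (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2 < near_dist ** 2]
            if not pairs:                      # the "all NEARDIST == -1" condition
                break
            for a, b in pairs:                 # inner loop: delete one of each pair
                victim = random.choice((a, b))
                if victim in remaining:        # partner may already be gone
                    remaining.remove(victim)
        results.append(remaining)
    return results
```

Running it on three points where two sit 10 m apart and one sits 500 m away, with a 150 m minimum, always leaves two survivors – the isolated point plus one of the close pair, chosen at random.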
The middle loop controls the inner loop and tells it when to stop, while the outer loop runs the middle loop a set number of times. Let’s have a quick look at the individual loops that control this beast.
We start with some comments on the script to help others understand what’s going on:
Then some preliminaries: importing Python libraries, and setting up some variables (Note: this version is using a 150 m distance):
(I haven’t cleaned up all my comments, those lines starting with “#” – these are mostly early tries at making something work.) “nearDist” is the variable that holds the minimum distance between retailers, “fields” lists the main attribute fields some of the tools will be using and “startLayer” is used to give a base name to all the output layers.
The outer loop then runs the show – it’s fairly simple:
Here, a for loop is used to run the code within it a set number of times. “i” is a counter that gets incremented by 1 every time the loop runs. Counting starts at 0 with Python, and the loop stops when i reaches 101, so I actually get 101 iterations: i = 0 through 100.
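A quick sanity check on the counting, in plain Python:

```python
# range(101) yields i = 0, 1, ..., 100: the loop body runs 101 times.
iterations = list(range(101))
print(len(iterations), iterations[0], iterations[-1])  # 101 0 100
```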
I use a print statement just to let me check where in the process the script is, allowing me to keep an eye on progress.
“copyOutput” is just a wee bit of text that I can tack on to the end of each layer’s base name (Full150) so I know which iteration it came from. The syntax means that each layer will be given a name Full150_i (where “i” of course changes with each iteration). You’ll see the effect of this later.
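The naming trick is plain string concatenation. Assuming startLayer holds “Full150” (the exact syntax in the screenshot may differ), each iteration’s layer name comes out like this:

```python
startLayer = "Full150"
for i in range(3):
    copyOutput = startLayer + "_" + str(i)  # the "_i" suffix tags the iteration
    print(copyOutput)
# Full150_0
# Full150_1
# Full150_2
```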
CopyFeatures creates a copy of the original dataset as a starting point with the name Full150_i
Later on I’m going to be using Select by Attribute and, things being as they are, I need to create a Feature Layer to do this on. I’ve struggled for a long time to understand exactly what a Feature Layer is in this context. The best way I’ve found to understand it is that it’s like a layer on a map – normally, Select by Attribute can only be used on a layer that’s on a map, so this step essentially adds a layer to a map, except that it’s all virtual (Ed.: Aren’t we all?)
Next, the middle loop comes into play:
A simple but important line – it’s a while loop, meaning that it keeps running as long as the condition maxDist > 0.0 holds. As soon as that’s no longer true (i.e. maxDist = -1), the loop stops and returns control to the outer loop. What’s maxDist? We’ll see in more detail later, but for now, it’s the maximum distance as calculated by the Near tool, keeping in mind that we want things to stop when all the NEARDIST values = -1.
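The stopping behaviour can be mimicked without ArcPy. Here `sweep` is a made-up stand-in for one inner-loop pass (in the real script the Near tool recalculates the distances); the illustrative NEARDIST values shrink to -1 over two sweeps:

```python
# Sketch of the middle loop: keep sweeping until the (simulated) Near tool
# reports -1 for every feature, i.e. the maximum NEARDIST is -1.
near_dists = [120.0, 80.0, -1.0]     # illustrative NEARDIST values

def sweep(dists):
    """Stand-in for one inner-loop pass: each sweep resolves one close pair."""
    for i, d in enumerate(dists):
        if d > 0:
            dists[i] = -1.0          # pretend the deletion removed the neighbour
            break
    return dists

maxDist = max(near_dists)
sweeps = 0
while maxDist > 0.0:                 # the controlling condition from the script
    near_dists = sweep(near_dists)
    maxDist = max(near_dists)
    sweeps += 1
print(sweeps)  # 2
```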
From here we go into the inner loop, the guts of this script. In a nutshell, we iterate through each proximity pair (which has its own unique identifier), deleting one of the two at random. It’s quite long in here so I won’t show the whole thing, but it’s actually got a few sub-loops of its own.
Since retailers have been deleted whenever this loop runs, I’ve got to use an earlier script to create a unique ID (pairID) for each proximity pair at the beginning of every sweep. Since I’m making changes to the layer I first need to create an update cursor:
The cursor is primed to look at the fields in the current layer, set up at the beginning. The “paircounter” gives us a starting point for the pairIDs. This next bit assigns the unique ID to each proximity pair feature:
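Without ArcPy to hand, the cursor pass can be mimicked with a plain list of rows. The pairing logic here – mutual nearest neighbours share an ID – is my assumption about what the earlier script does, and the row values are invented:

```python
# rows: (OID, NEARFID) - each feature and the ID of its nearest neighbour.
rows = [(1, 2), (2, 1), (3, 4), (4, 3)]

paircounter = 0                           # starting point for the pairIDs
pair_ids = {}
for oid, near_fid in rows:                # the "update cursor" sweep
    if oid in pair_ids:
        continue                          # already labelled via its partner
    pair_ids[oid] = paircounter
    pair_ids[near_fid] = paircounter      # the partner gets the same PairID
    paircounter += 1                      # next pair gets the next ID
print(pair_ids)  # {1: 0, 2: 0, 3: 1, 4: 1}
```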
Note that the paircounter is incremented by 1 at the end of this loop, so next time around the pairID is one unit larger. The loop continues until every row has been gone through. With the PairIDs in place, another while loop allows us to iterate through each proximity pair. “dcount” gives us a place to start:
What’s with the 3712? This is the number of possible proximity pairs + 1 (determined from the base data layer before the script runs). It’s just a way to limit the number of times the loop runs, for efficiency. While dcount is less than 3712, the loop gets a set of values to choose from, 0 or 1 (the basis of our random deletion). “whereClause” is text that will be used in a Select by Attribute step – it translates to: select the records where PairID = dcount (again, dcount changes with each iteration of this loop). There’s a risk that a given value of dcount matches only one record, so to ensure I’m working with a genuine pair, this next little bit does a quick count of the selected records to make sure there are two. If so, it moves on to the next step (if not, it goes to the next value of dcount):
Now, we’re finally at the heart of this script – if there are two records in the selection, pick one at random and delete it:
The “choice” line randomly selects either 0 or 1 and then one or the other of the two selected records is deleted. “dcount” increases by one and we move to the next pair.
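This heart of the script can also be sketched without ArcPy. The list comprehension plays the role of the Select by Attribute step, the length check plays the role of the record count, and the row values (and the small `max_pairs` bound standing in for 3712) are invented for illustration:

```python
import random

# Stand-in for the inner loop's heart: rows are (PairID, OID) records,
# mimicking the feature layer after the PairIDs have been assigned.
rows = [(0, 101), (0, 102), (1, 103), (1, 104), (2, 105)]  # PairID 2 has one record

max_pairs = 3                # stand-in for the 3712 upper bound
dcount = 0
while dcount < max_pairs:
    selected = [r for r in rows if r[0] == dcount]   # the whereClause selection
    if len(selected) == 2:                           # only act on genuine pairs
        choice = random.choice((0, 1))               # pick 0 or 1 at random
        rows.remove(selected[choice])                # delete that record
    dcount += 1

print(len(rows))  # 3: one record deleted from each complete pair
```

PairIDs 0 and 1 each lose one record at random; the lone record with PairID 2 survives untouched, just as the count check in the real script intends.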
Pshew! Now it’s time to start digging ourselves out from the bottom of these loops. This one continues until all the NEARDIST values are -1. Once there, the Near tool runs again to update the distances to the nearest retailers:
Before we leave here, all the PairID values are reset to -1 so that on the next iteration they can be allocated a new PairID. Once all the NEARDIST values are -1, we return to the outer loop and start it all over again. When i has run through all 101 values, the script exits and we get a most satisfying message on the Python console:
Can you imagine doing this all manually? I can’t. Wouldn’t dream of it. Wouldn’t pay anyone to do it either.
A few notes. On my desktop machine, this took very close to a week of continuous running to fully execute. Once finished, a quick look at the Catalog shows the base layer and all the iterated output layers:
Full150 was the starting point with 5,131 retailers and all the other layers are the outputs from the script (remember that bit of script that added the underscore and the iteration count up at the beginning?). Around 1,820 retailers were deleted from each layer, but the locations of the ones remaining will differ from layer to layer due to the random deletions.
So. What a long process to get here. My part is (mostly) done, so one of my last steps is to get these results to my more statistically-minded colleagues (in the nicest possible way) for that end of the analysis. What would be useful for them is spreadsheets, so my last step is to convert all the layers’ attribute tables to Excel using the (aptly named) Table to Excel tool. Rather than sit in front of my machine and run this 101 times, I put a very quick ModelBuilder model together to do the hard (boring) work for me. At the last minute we realised that we needed the distances to the nearest retailer as well, so I simply added that in as a step (this is the 300 m version):
A nice feature here is the Iterate Feature Classes iterator. This lets me point at a geodatabase and it will run the subsequent tools on each layer present. Et voilà! All of my layers are now spreadsheets, which I share with my colleagues through OneDrive:
Nice. And that brings us to the end of this sordid tale. I’ve learned a lot along the way and would be the first to say that this script isn’t very elegant. I’m sure some of our MAC students would have some good suggestions on how to improve it. I’m open to suggestions!