A Smokin’ Script
This post covers the development of a Python script to help carry out some analysis, saving me hours of time in the process.
It’s been a good month for keeping the Alzheimer’s at bay here at GIS Central. An analysis project I’ve been working on has thrown up some interesting challenges which I’ll cover in the next few posts. You may recall an earlier missive where we talked about mapping the proximity of tobacco retailers to secondary schools. My colleagues at Otago were interested in doing some similar analysis around people trying to quit smoking but with an nice additional twist.
They had a list of home addresses of people who had been using Quitline to try and stop smoking (about 540 in total). So one question was around if ready access to tobacco around their home address had any influence on their success with quitting. That’s an easy enough thing to do after our previous project. So looking at zones around their home address, we could tally up the number of retailers within 0 – 500 m from home and 500 – 1000 m, just like we did previously with the secondary schools. Plug those numbers into a spreadsheet and we can look for correlations with success rates and exposure to tobacco retailers. For about 140 of those respondents, we also had their work address, so we could also do the same sort of analysis around their place of work. For an individual, those mapped results might look something like this (Note: these points are not from any of the study respondents):
Areas of darker gray are within 500 m of an address along the road network while the lighter areas are within 500 – 1000 m. All the yellowy points are tobacco retailers. This is done using Network Analyst tools and makes use of a dataset of roads that models the connnectivity between road segments. For the areas above (known as service areas to Network Analyst), I set distances moving outward from the points along the road network and ended up with polygons that encompass those areas. So now we can do some Spatial Joins and count how many retailers are in each of the zones.
Having both home and work address threw up an interesting additional factor – what about the exposure on their way to and from work? We can easily use GIS and a suitable road network to find routes between locations (just like what Google Maps does when you get directions) so why not here? It sounded like a pretty simple thing to do at first (as these things often do) and we could expect to see results such as this for the route:
The brown line is the shortest path route along the road network between the home and work points. Again using Network Analyst, I can load in the address points as “Stops” and it will find the shortest path between the two as a line. We do this by adding a new Route layer, setting the analysis parameters and clicking go. If I now buffer that route out to some distance, say 150 m, we create an area on either side of the route with which we can then do another spatial join and do retailer counts:
As mentioned earlier, to do this properly we need a road network (or any linear network) with the connectivity between road segments explicitly included. This is done by creating a Network Dataset from a collection of line features. This is a subtle point – any old set of lines won’t do. We have to take those lines and run the Create Network Dataset tool – this establishes the connectivity from one road section to another. If we’ve got the additional info we can identify certain road sections as one way, or set where certain kinds of turns are prohibited to try and better reflect reality. Once done, we can use the network dataset for finding routes or creating service areas (those zones around the points). Typically, with routes, we’ll find shortest distance route where the solver uses the length of each road segment in calculations, but if we have information on, say, speed limits, or the time it takes to traverse each segment, then we could find the quickest route. If we’re getting really tricky, we could attach a “scenic” value and find the most scenic route from points A to B (think, for instance, of getting to Geraldine. You could take State Highway 1, which is faster and shorter, or you could take the Inland Scenic Route which is longer but more, well, scenic) .
So then for this part of the analysis, it’s a pretty straightforward set of steps:
- Add the work and home addresses to the map;
- Add a new Route layer;
- Load the addresses into the Stops layer
- Solve the Route
- Buffer the route
- Rinse and repeat
The only complicating factor to deal with next is that I’ve got over 140 home-work combinations and the home and work addresses are in separate layers. It would take me roughly about five minutes to do a single home-work pair – that would mean on the order of 700 minutes (almost 12 hours) to do them all manually. I didn’t think that was a particularly good use of my time so I set out to see if I could speed the process up.
With many tools in ArcGIS, you can set them up to run in batch mode. For instance, if I want to do a set of clips, I could go to the Clip tool in ArcToolbox (Analysis Tools > Extract), right-click on the tool name and select Batch…
You can then set the tool up to run multiple times with different inputs and let it run in the background. So that’s a simple way to automate things.
Two problems arose when I started down this track – first, there was no batch option for the Route tool (ouch) and two, I had my home addresses (540 of them) and work addresses (140 of them) in two separate layers and needed to be sure that I could match up the right work address to the right home address (they share a unique identifying number called “UniqueID” which I can use to link them). This was actually the trickier part of the problem which quickly brought me to an impasse.
Enter Python. Though it seemed like a daunting task initially (well, it remains daunting, really), I decided to have a go with writing a script to do all the heavy lifting for me. Making the decision to invest time and energy into writing a script is always a bit fraught – in this case, I could easily spend close to those 11 hours writing this script so would it be worth it? In retrospect, it most definitely was, given that I ended up rerunning this analysis multiple times. From that perspective, the effort of writing the script was well worth it.
Noooooow…how do I make a script interesting…? I’ll show you some of the highlights of the script, the guts of it, so you can see the flow of things.
In addition to the sequence of steps outlines above, there was one important additional step. I needed to find a way to link the unique IDs of the two addresses together, but I had more home addresses than work addresses so rather than create another layer of just the home points that matched work points, I wanted the script to loop through all the work addresses, grab the UniqueID and use that to select the matching record in the home address layer. Then, add the address pairs as stops in a route layer and create a separate route for each pair, buffer the result and then move to the next work address record. Make sense?
First, some script necessities. The code below sets my working directory and scratch directory (where temporary files get created):
Next I set a few variables such as where the network dataset lives, and the filenames of my work and home address layers, as well as which field holds the UniqueID attribute:
Doing this makes it easier if I need to come back and change layers or fields – I only have to do it once using variables. Next I’ve got to work out a way to systematically move through each of the records in the work address layer.
There are a number of ways to make a script iterate through the records of a table. I chose to use a “while” loop. The loop starts on the first record in the work addressses table using a “searchcursor”. This is basically a place keeper that allows you to move around in a table. Here’s the code that creates the search cursor:
The searchcursor is called “rows” and gets inserted at the start of the first record in fc1 (the WorkAddressesTM feature class). The next bit of code gives me a way to easily refer to the current record:
From here on in, when I want to refer to the current record I just need to use “row”. Now we’re ready to get this script rolling by creating a while section in the code:
Roughly translated, this is the start of the loop. It will continue until there are no more rows (records) to move through, so when it reaches the end of the table, it will exit the while loop and carry on with the rest of the script. Nested inside the while loop I will put all the code that does the work of selecting the right records, creating the route and then buffering it. Next step is to grab the UniqueID of the selected record:
Translation: get the value of the attribute with the variable name of “field” for the current row and assign this to the variable “record” (I’ve already set the field name as the “UniqueID” above).
Next, I’ll want to use that value to select that record (I’m sure there’s a more elegant way to do this…). When coding tools that have multiple inputs, it’s often easier to create variables that represent those inputs. I’m gearing up to do a Selection by Attribute so as a next step I create a “where Clause” that will go into my query.
Translation: this is the gist of my query – select the record where UniqueID = the unique id I grabbed above. That value is a number in my table so the “str()” converts the number to a bit of text so I can make the selection. I’ll use this clause in a bit to select the current record.
Next I’ll create a new Route layer and tell it to use the “network” as its network dataset, give it a name that adds the UniqueID to “Route_”, uses the Length attribute to find the shorted path and assigns the result to the “routeLayer” variable (Note: anything that comes after a # is considered a comment and isn’t executed – documenting your code this way is good practice):
Next I want to add the Stops which means selecting the correct home-work pairs. In order to do selections with Python a normal feature class has to be converted to a feature layer. I’m a bit fuzzy on exactly how that’s different from a regular feature class but thems the rules. So the code below creates a new feature layer (called “fc1_lyr”) and then grabs the value of the UniqueID attribute and uses that to select that record. That record is then added as a Stop (lots of default values inside the quotation marks):
Now create a new feature layer for the home points and use the same where clause to select the matching home record and add it as a Stop (I’m not too concerned about which is the start and which is the end point for this analysis):
We’ve got everything set up so so we’ll run the Solve command to get the route:
So now I’ve got my route. Next I’ll buffer it to a distance of 150 m and name the output “Route_(UniqueID)_buffer” and assign it to the “bufferLayer” variable. Because of the way Route works, the UniqueID won’t get carried through into my new buffer layer so I’ll add a new field and copy the UniqueID into that field using the Field Calculator. This becomes crucial later on:
Right – so that’s everything I want done for each home-work pair. I’m ready to tell the script to move to the next record and do it all over again:
The While loop will run this bit over and over again until it reaches the end of the work address table. Just for the sake of completeness I’ll show you the whole While loop so you can see how it all fits together. With Python we used indents to organise the flow of the script – they’re quite important. You’ll notice a few (well, a lot of) extra bits of code which I’ll just leave as some Eleusinian mysteries:
I also use a lot of print statements so I can monitor how the script if progressing – this is very useful for debugging your scripts as well.
Great – but we’re not done yet I’m afraid. By the time we get to the end of the While loop, I will have 140 individual buffer layers. It would be more useful if all of these were joined together into one layer that I could use for my spatial joins (that means doing only one spatial join as opposed to 140 individual spatial joins), so my final step is to create a list of all the buffered layers and then use the Merge tool to bring them all into one layer. There’s an extra step in there (reverse) that just ensures they are done in a more convenient order:
When this code is run, messages will be sent out to a text window so I can monitor the progress. Below is a (short) video that shows the script in action for a set of three work addresses and ten home addresses. (Note: these are random points and do not show any of the study respondents addresses.) Work addresses are in green and home addresses are in blue, each labelled with their UniqueID. Retailers are in orange. As it runs you can see tools being executed and layers being created:
So hopefully it’s clear that routes only get created when there is a home-work pair with matching UniqueIDs. As a review, the script has moved through my tables systematically, joining the correct home-work address pairs, found the shortest routes, buffered each route and then merged them all together. When it was debugged and running nicely, the script took about five hours to run through all the points (and faster than I could have done it manually) which meant I could get on with doing other stuff. As I mentioned earlier, I ended up having to rerun this analysis four or five more times (for reasons I won’t go in to) so in the long run, the time spent writing this script payed off. I could have spent up to five full working days running this manually and would no doubt have made more than a few mistakes along the way so this made a huge difference for me.
For the next step in this analysis I took the work and home zones and all my buffered routes and did some spatial joins to tally up all the retailers in the different zones. I could have easily (well perhaps easily isn’t quite the right word) extended my script to do those additional steps but I didn’t think it was really worth my while.
So we’ve delved into a lot of the gory details of this analysis in this post, probably more than you would ever really want to know, but I hope it demonstrates the value of scripts and how they can often (though not always) save you time. For database administrators, scripts are invaluable for automating mindless tasks. For analysts, they can frequently save you time and errors when analyses need to be run multiple times. I wouldn’t say I’m the greatest Python coder – not by a very longshot – but knowing enough about scripting can be quite valuable.
Some of you are familiar with ModelBuilder, a built-in tool in ArcGIS that allows you to build up executable models by dragging and dropping tools and connecting them up. I was tempted to try this but in the past I found setting up iterations like this to be problematic. Maybe in my next spare moment I’ll see how easy it would be to set up something similar using that instead of writing a script.
Don’t hold your breath…