In Part 3 of a series on developing a Python script, some progress is detailed.  A short data-massaging script helps prepare the input data for the main script.

In this ongoing series about developing a Python script for some analysis, we first covered what the script needs to do and then also covered some data issues.  After several full on days of banging my head against a digital wall, I think I’ve made a bit of progress!  Recall that we’re looking at some analysis that looks at reducing the number of tobacco retailers based on a minimum distance criterion (i.e., there should be no tobacco retailers within a given distance of another retailer).  The trick is in systematically removing retailers in a defensible and logical way.

At the end of our last post, I had the input data down to three options for setting up the proximity relationship: a point layer of retailers that the Near tool has been run on (thus providing distances from each point to the nearest retailer), a similar output from the Generate Near Table tool that is a table rather than a point layer, and the output from an OD Cost Matrix.  To be honest, at the start of this I though the cost matrix would be the way to go, but after last week, I think it’s a bit more problematic than it’s worth.

So I’ve decided to focus on the point layer with the near distances in the table.  Some benefits of this are that I’m working directly with the points that will need to be deleted, cutting down on any extra to-ing an fro-ing between other layers or tables.  Also, the Near tool allows me to limit the distances that are calculated so I’ll easily be able to know which didn’t have a retailer within that minimum distance (I should be able to do this with the OD Cost Matrix but I haven’t been able to make that work yet).  One downside is that I’ll be working with as-the-crow-flies straight line distances rather than distances along the road network (which would be more realistic but at these distances might not make much difference).

So, here’s what my input data table looks like after running the Near tool:

Each point has an OBJECTID, a unique identifier that Pro uses to keep track of each point.  The NEAR_FID is the OBJECTID of the nearest point and NEAR_DIST is the straight line distance between the two.  So the point with OBJECTID = 1 is 143.898713 metres away from the point with NEAR_FID = 274, for example.  When running this tool, I set the search distance to 500 m, meaning that if it doesn’t find another point within that distance, it will return a value of -1 for both NEAR_FID and NEAR_DIST.  You can see two examples of that in the table above.

In the last post I was rattling on about reciprocals and needing to link them.  Here’s what I mean – if I sort the table by NEAR_DIST and scroll down past all the -1s, you can see that for every OBJECTID/NEAR_FID distance, there’s its opposite (reciprocal), where the OBJECTID and NEAR_FID are swapped:

I’ve highlighted one pair to illustrate – each point is listed separately but they have the same distance.  This is a good thing as this is the proximity relationship between these two points.  Later in the script I’ll need to delete one of these – but first I need to know which points are these de facto relationships.

(While we’re here, notice that there are some points that don’t have reciprocals – at first these were keeping me awake at night but I’m now a lot less worried about them.  More on these later.)

Okay – so while this allows me to know the proximity relationships, it’s a bit too complicated at this stage to work with these, so I’m wanting to do a bit of data massaging, for which I’ll use a wee script to identify those points that have reciprocals and assign them a unique id – I can then later use this in the main script as my proximity relationship identifier (did that statement make any sense at all?  (Ed. no, not really, but carry on…)

Naively, I thought I’ll just whip up this script in a few minutes and move on.  That’s when the banging me head up against a digital wall started – and boy, does my aura hurt.

Here’s the guts of what I wanted to script to do (psuedo-code):

  • With the points sorted by ascending OBJECTID, go through each point and pull out the NEAR_FID
  • Search though all the other points to find a point where the OBJECTID and the NEAR_FID are reversed (I could do this by finding another point where NEAR_DIST is the same but there is a very slim (veeeeery slim) chance that there might be more than one).
  • Give both of those points the same unique identifier
  • If NEAR_DIST = -1, leave the point alone
  • If there is no reciprocal, leave the point alone.

In theory this sound like it should be pretty straightforward.  To make this work, I’ll add a new attribute to my point layer, called PairID and initially populate it with values of 0.  Then I’ll do a quick Select by Attribute on records where NEAR_DIST = -1 and reset their PairID to -1 (I want to test that the script correctly leaves those records alone):

Now for the script.  I won’t go into excruciating detail here, just the main points (Ed. pshew!).   At the top of the script, I’ve set up a counter that allows me to iterate through all the records and a list that holds the names of the field in my attribute table:

Next, since I’m wanting to change attribute values, I’m going to use a thing called an UpdateCursor that’s specifically designed to work in loops and make changes to tables.  The layer I’m working with is called “points” inside the script – this next line creates the UpdateCursor:

I next need a  for loop, the guts of which are below:

Allow me to (try to) translate –  the “for” signals that I’m going to loop through each row (record) in the points table and do something.

The “if” line is a conditional statement; if the comparison that follows is true, do something, if it’s not, then don’t do anything.  Since Python starts counting at 0, “row[2]” is a way to access the attribute value in the third field, which happens to be PairID.  If that value is NOT equal to 0, do nothing, but if it is, execute the next line.

The next line, “row[2] = count” assigns the current value of count to the PairID and the line afterwards, “cursor.updateRow(row)” writes that value to the table.

Note the differences between what “==” and “=” mean – the former is a comparison, the latter assigns a value.  Anyway, I’ve just reset the value of PairID to make it the same as the value of the count variable.  (Why?  We’ll see why later.)

The next two lines assign the value of OBJECTID to a new variable called “oid” and the value of NEAR_ID to the new variable “near” for this record.

The “print()” statement is a useful way of keeping track of where the script is and if it’s working properly – they’re great for on-the-fly debugging and I tend to make extensive use of them.  This line just prints out the value of these new variables.

So all I’ve done so far is to start with the first record, reset its PairID to count and grabbed its OBJECTID and NEAR_FIDs.  I next need to find its reciprocal (if there is one).  I’m going to do this with with a Select by Attribute that looks for a record where OBJECTID = near and NEAR_FID = oid (by just swapping their values).  Good practice when using something like Select by Attribute is to write the text for the query on a separate line, as the “whereClause”.  It might look a little odd and I’ll show it again:

The ““””” (try that at home) is Python’s way of saying what’s inside the triple quotes is all text.  The use of {} instead of text is a way that I can pass in variable values so that the whereClause is actually different every time it loops.  the “.format(near,oid)” just means that for the first set of {}s insert the value of near and for the second use the value of oid.

These lines do the selection (there should only be one record that gets selected at most) and the resets the value of the PairID attribute to count – now both these records have the same PairID – thus their proximity relationship.

Next, , increments the value of count by 1.   lets me know for sure that this has happened so I can follow the progress.

This is the last line of the for loop so the script now goes back to the top of the loop and does all those steps to the next row, this time with count being equal to 2.  It will continue this until if row[2] == is no longer true.  Then it pops out of the loop and does the last few lines:

ensure that the selected record gets cleared and copies the resulting layer to a new layer, here called “TestPoints2018copy”  The last thing the script does is print out “Done” so I know it’s finished.

In the end, here’s the resulting table, sorted by PairID:

I struggle to see how I could do this manually without having to go through every single point – a mind-numbing prospect that would be quite prone to me making mistakes.  I would also struggle to see how I could do this with ModelBuilder.

To tidy things up, notice the PairID 5 has only one record.  Why is this?  Recall that the I limited the search distance that the Near tool used to 500 m.  There might be some cases (such as PairID = 5) where the nearest point is within 500 m BUT that other point has another point that is closer than PairID 5.  That make sense?  Probably not, so here’s a picture to illustrate:

The points are labelled with their OBJECTID.  Points 65 and 66 are in a proximity relationship – each is the closest point  to the other one.  The closest point to 7 is 65, but since 65 is physically closer to 66, point 7 doesn’t have a reciprocal.  This is okay, but I do need to think about how I’m going to handle this in the next stages of the process (I think I have a plan for that).

What this data-massaging allows me to do is to clearly model the proximity relationships.  As a next step, I’ll work out how to cycle through each of the PairIDs and randomly delete one of the two points, then rerun Near, then redo this whole thing until I end up with just -1s for the NEAR_ID.  Then I’ll know that there are no points within 500 m of another point.  That will (hopefully) be the subject of the next, thrilling instalment.

Reflecting on this script, it feels a bit clumsy to me – using both an UpdateCursor and Select by Attribute when I feel like I should just be able to do it with one UpdateCursor.  I haven’t found a way to add a new row to the already selected row yet – I’m sure many au fait scripters out there will look at this and scoff, so I’m open to suggestions on how to do it better.

The last thing I’ll say on this relates by banging my head against a digital wall – this was mainly around getting the format for the whereClause right.  I tried for several days, all sorts of iterations and nothing seemed to work.  Then I realised that it was probably because I was trying to run this script as a stand alone script, outside of Pro.  In theory, I could run this anytime, with or without Pro open, but the reality was that for this script, and one line in particular (the Select by Attribute one), it needed to be run from inside Pro.  Selection is a dynamic thing that only really exists inside a project, so while the script ran, it was never actually selecting anything until I copied and pasted the whole script into Pro (the Python window, to be exact), hit enter, and then it ran pretty flawlessly first time.  Oh, the humanity!

If anyone reading this is still awake, contact me and the first five people get a free Moro bar (Ed. you’re assuming there are more than five people reading this…let alone one…).

Now onto more scripting!

C