Some quick data amalgamation tasks are tackled with spatial joins, Near and some table joins.

Over the years we’ve spent time talking about the smoking reduction research I’ve been involved with. (Ed. Who’s this we, kemosabe?)  It’s had me doing geocoding (and this), Network Analysis, a lot of Python scripting and lots of little data amalgamation jobs here and there.  As we look to finish up some recent work, I found I needed to add some data from a range of sources to my existing database and thought it might be useful to cover some of that here.  Along the way, we get to see some of the more useful and fundamental vector spatial analysis tools.

First off, I’ve got a layer of points containing the location and attributes of 5,131 tobacco retailers across New Zealand circa 2018.  Regular readers (Ed. Both of them.) may recall that we’ve lately been looking at modelling the effect of reducing the number of retailers on accessibility to tobacco, and for some of the statistical analysis we need to know some of the details around levels of economic deprivation, whether a retailer is in an urban or a rural setting as well as distances to the nearest other retailer and secondary school.

We start with my layer of retailers:

In the interests of keeping things simple, I’ve stripped down my attribute table to just the basics:

UniqueID2018 can be used to link each feature to the full dataset if needed.

Data-wise, I can calculate the nearest distance to other retailers using a tool, so I’ve got what I need for that.  For the level of economic deprivation, I’m going to need to extra data.  Otago Uni is the source for the New Zealand Deprivation Index (DPI) so that’s my first stop.  We’re focused on 2018 data (or thereabouts) and here’s what’s available from their website:

Lots to choose from but I don’t see anything in spatial format though I do see plenty of Excel and Text files.  On close inspection there are different versions to consider – I could look at “Statistical Area 1” data or “Meshblock” data or “Statistical Area 2” data.  What the heck’s all that about?  With a bit of digging, Otago links their DPI to the regular censuses, which are themselves linked to areas of differing sizes.  Meshblocks are what used be the finest scale of data from the census that could be released publicly.  Since the 2013 census, Stats New Zealand has been promoting the use of “Statistical Areas” for mapping census data.  SA1 is the finest resolution scale aiming to have between 100 and 200 residents and no more than 500.  SA2 aims for 1,000 – 4,000 residents and are larger in size than SA1 areas.   Compared to previous censuses, SA1 is roughly the meshblock scale while SA2 is more like census area units.  These are nested areas meaning that the two different scales share the same boundaries where they are close.

Anyway, the decision was made that we wanted to know the DPI and urban/rural values at the SA1 level but also needed to know which SA2 area each retailer was in, so going back to the data above, I downloaded the NZDep2018 SA1 Excel spreadsheet- here’s what it looks like:

Decoding: in the spreadsheet there are 29,889 rows (SA1 areas).  Each SA1 area has a numerical code to uniquely identify it (that will be more important later).  The NZDep2018 column has the deprivation index value, which ranges from 1 (least deprived) to 10 (most deprived).  The blanks are for SA1s that have no one usually resident in them.  NZDep2018_Score is the raw value Otago used to define the DPI index value and URPopnSA1_2018 is the usually resident population for that SA1.  For continuity, information on the SA2 area that each SA1 is nested within is provided as a code and a name.

So, having this as a spreadsheet is handy, but what do I do with it spatially?  Very happily, all those SA1 areas have been mapped and are freely available from Stats NZ:

This layer is downloadable as a shapefile or in a geodatabase so after a bit of downloading and unzipping, this layer goes on my map:

Getting there – but while I’ve got the SA1s on the map, the DPI data is still sitting in the Excel spreadsheet, so I need a way to link them.  Table Joins to the rescue!  Notice anything common between the SA1 attribute table and the DPI spreadsheet?  They both contain the SA1 code though are called different things.  Using the values, I can join the data in the spreadsheet to the spatial layer and get the benefit of both. I do this by adding the spreadsheet to the map (not essentially, actually) and then right-clicking on the spatial layer and looking for Joins and Relates > Add Join and setting up the link:

When OK is clicked, the spreadsheet rows are joined to their corresponding SA1 record:

Nice – so in one fell swoop I’ve added the DPI data, the usually resident population and along for the ride came the SA2 details (which we needed anyway – happy coincidence (Ed. but was it really a coincidence?).  Since table joins like this are temporary, it’s a good idea to make this permanent by exporting this layer to a new layer (right-click on layer name > Data > Export Features).

Next we need some rural/urban data.  Happily, the good people at Stats NZ have already done the hard yards and that one, too, and the data are right there in the DataFinder, only a download away:

With these on the map, we’re just about ready to rumble.

Next up are the distance data.  Recall we need two distances for each retailer: distance to closest other retailer and distance to nearest secondary school.  The Near tool comes in handy for these as a way to calculate the straight-line distance between vector features.  We’ll first do the distance to the other retailers.

The way this tool works is that it adds two new attributes to an existing layer (unlike most tools which create a new output layer).  Here it is setup:

Notice that I’m using the same layer as the Input and the Near Features – we would normally use this tool to find the distance to some other layer, but here I’m interested in the other features in the same layer.  The two new fields that will be added to my will be NEAR_FID that holds the object ID of the other retailer and NEAR_DIST that will be the straight-line distance to that other retailer.  Here’s my table after I run the tool:

For the school distances, I’ve got a slight problem.  If I rerun the Near tool again with schools as the Near Features, it will simply overwrite the NEAR_FID and NEAR_DIST attributes already there but I need to have both.  Not to worry, I’ve got another option with this kind of proximity analysis – Generate Near Table.  This does the same thing, but the output is a standalone table rather than an attribute in my spatial layer.   The output looks like this:

Notice anything useful here?  The IN_FID is the ID from my retailers layer – I can use this to join this table to my spatial layer (these joins are super useful).  To avoid any conflicts with attribute names, before I do the join I add a new attribute called School_Dist and use Calculate Field to copy the NEAR_DIST value into it.  All right Mr DeMille, I’m ready for my close up.  After the join and an export, I’ve got both distances in my retailer layer.

Now, how do I add the DPI and urban/rural data to the retailers?  I’ve got all the data I need but they are spread across three layers of two different types, points and polygons.  How do I bring them all together in my point layer of retailers?

I certainly hope that all those ERST202 and 606 students are shouting the same thing at their screens: spatial joins, one the simplest but most powerful of vector spatial analysis tools.  With a spatial join, the polygon attributes that a point sits within are transferred into the point’s attribute table, thus bringing everything together.  I’ve got three options: Identity, Intersect and Union.  Union won’t work here because it only works on polygons.  Both Identity and Intersect will work but I’m going with Intersect because I can join both of my polygon layers to the points in one step, rather than two that would be needed with Identity.  So here’s Intersect,  primed and ready to go:

I’m specifying that my output will be Point though it should do that automatically (given the choice it will produce an output in the simplest type, here, points).  Et voila!  All the data together in one point layer.

Nice!  (I’ve turned off some of the attributes to make it easier to see the critical ones.)  Last job is to use the Table to Excel tool to export this to something my Otago colleagues are happy to work with:

Like many GIS analyses, all I’ve got to show for my efforts is a spreadsheet, but hopefully in this post you’ve gotten to see some of the blood and guts that goes into outputs like this.  I thought it might be worthwhile to go through this because there was a nice variety of (vector) tools needed to get what we needed: proximity analysis with Near and Generate Near Table, table joins, and spatial joins.  All up, with downloading data from Stats and the running of tools, this might have taken me about, oh, maybe and hour and a half?  There was a bit of thought that went into sequencing and matching the right data for the job, but it gives my colleagues something to work on and I can now sit back and relax…but not for long.  With GIS you’re never idle for long.  Really must get on to finishing that darned Python script