Think Globally, Extract Locally
We use Zonal Statistics as Table to extract raster data to polygons…but not without some frustration.
In our last instalment we covered extracting global data to a set of 15,831 points. I’ll show those points again below to jog our collective memories.
As we saw, it proved a challenge in ways that could occupy many, many of these posts (Ed. no thanks). I'll outline just one of them here, as a cautionary tale, the moral of which is, roughly: Analyst, know thy tools. Oh, and thy data.
With points, this was relatively straightforward – I could use a spatial join (Identity or Intersect) to combine the points with the vector polygons. For the raster data Extract Multi Values to Points worked well. With a bit of planning, I could do all this by sequentially adding the data from my global layers to the points. And this we did (though not without some swearing…).
It later became clear that there could be some advantages to doing this on a catchment scale rather than as points. In other words, all our points are in a river catchment so if we can instead summarise the data from the catchments associated with each point, it would get us a better model. In theory, this sounds reasonably straightforward; in practice, it threw up some issues. (Ed. Some? How about a LOT.)
Before going much further I need to say something about the catchments. After all, catchments come in all sorts of different sizes and shapes, right? We can talk about the catchment for the Waimakariri River and just as easily talk about the catchment for one of its tributaries, say the Cam River. One sits within the other and they can behave differently, depending on lots of factors (rainfall, slopes, land cover, etc.). To look at this on a global scale we've been using the HydroBASINS dataset. We've talked about this one recently (cue more colourful language) – this dataset has 12 levels of catchments, from continent-sized at level 1 to your more local-scale catchments at level 12. After spending a bit of time figuring out which was the appropriate level (not an insignificant task – I could tell you about it but it would put you right to sleep (Ed. if you're not already…)), we settled on level 10. And here's where the problems begin to arise.
With points, we can be pretty sure that a point will always be within one raster cell or one vector polygon, so extracting values is always (well, almost) unambiguous. With catchment areas, it becomes trickier because a catchment may cross over several polygons in my input data. With raster cells, depending on the resolution, a catchment may cross over many different raster cells. For example, in the image below, my level 10 catchment contains around 570 different raster cells from a slope layer:
So how do I best summarise the data within each catchment? As usual with lots of GIS questions – it depends. Here it mainly depends on what kind of data are being summarised. They could be categories (like different soil types or land covers), or measured values (like rainfall or runoff), or already averaged values (like population density). This means that each layer may need some different massaging – here the moral of the story is: Analyst, know thy data.
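To make "it depends" slightly more concrete, here's the rough rule of thumb I carry around, sketched as a wee lookup (my own heuristic, not gospel):

```python
# A rough rule of thumb for matching a zonal statistic to the data type;
# a personal heuristic, not a hard rule.
STAT_BY_DATA_TYPE = {
    "categories (soil type, land cover)": "MAJORITY",      # most common class in the zone
    "measured values (rainfall, runoff)": "SUM or MEAN",   # depends on the question
    "already-averaged values (pop. density)": "MEAN",      # averaging an average: take care
}
```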
Now I'll get to the tricky stuff by talking about one layer in particular. Given that E. coli can often be traced to the density of livestock, we're using several raster layers of animal counts for chickens, sheep, pigs, cattle, goats, buffaloes, ducks and horses, courtesy of a global dataset from FAO. These are raster layers with a resolution of 0.0833 decimal degrees, roughly 60 km² per cell (latitude-dependent) – chickens shown below:
The values in the raster are counts of the number of chickens per square kilometre. Oddly enough, the raw values are floating point (i.e. they have digits after the decimal point) rather than integers, like you would get by going out and counting the chickens you see, but that's a topic for another day.
Our first issue arises when I try to summarise these values within my level 10 catchments. Since I'm trying to summarise a raster layer with a vector polygon, my options are limited to two tools: Zonal Statistics and Zonal Statistics as Table. They both do the same thing – summarise raster values within zones with stats like mean, median, sum, max, min and a few others – but the former produces a new spatial output layer while the latter produces a new output table, which can then be joined to another layer. Which one you use depends on what your analysis needs. I've chosen Zonal Statistics as Table, mainly to keep the number of spatial layers I need to manage low (as in just one) and use table joins to add in the global data values. Here we go.
Here’s the tool set up to do one of my first livestock layers:
My Zone Data layer is the vector polygon layer of Level 10 Catchments – these are the areas within which I'll add up all the raster cell values from the Input Value Raster (my chickens). Notice that I could use either a raster or vector layer for this zone data. This being a raster analysis tool, I need to be aware that the first thing this tool will do is convert my polygons to raster before adding things up (stay tuned – there's another post in that statement – sorry). Here's a quick look at the L10 Catchments attribute table:
There are 941,012 catchments in this layer, covering the whole globe. I've already added the Strahler order, population density and area in km², and extracted the Olsen P mean and median values. Since it's a feature class in a geodatabase, I've automatically got Shape_Length and Shape_Area (in decimal degrees). To join the output table from this tool to the catchments layer, I need a unique identifier that's present in both. The only real candidate is OBJECTID, which is why you see it as the second parameter – the output table will include this field, so I can always join the table back to this layer.
The Input Value Raster is the layer I want to summarise. It's called "6_Ch_2015_Aw.tif" and it's full of chickens. ZS_L10_ChickAWT is the table that will be produced.
Next is the Statistics Type which I’ve set to Sum, so it will add up all the values of the cells within each catchment, hopefully giving me a rough estimate of the total number of chickens in each catchment.
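For anyone who'd rather script it, the same run looks something like this in Python – a minimal sketch assuming the layer and table names shown above ("L10_Catchments" is my guess at the layer name):

```python
# A minimal arcpy sketch of the tool run above; names are as in the
# screenshots where known, assumed otherwise.
import arcpy

arcpy.CheckOutExtension("Spatial")  # Zonal Statistics as Table needs Spatial Analyst

arcpy.sa.ZonalStatisticsAsTable(
    in_zone_data="L10_Catchments",       # vector polygon zones (assumed layer name)
    zone_field="OBJECTID",               # the unique ID carried into the output table
    in_value_raster="6_Ch_2015_Aw.tif",  # full of chickens
    out_table="ZS_L10_ChickAWT",         # the output table
    ignore_nodata="DATA",                # ignore NoData cells within a zone
    statistics_type="SUM",               # add up the cell values per catchment
)
```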
For future reference, other options here are:
Let’s look at the output:
SUM is what’s of interest in this table. OBJECTID_12 is the attribute I can use to link this table back to the catchments layer. Of note here is that this table has only 808,092 records. That means I’m missing over 132,000 catchments, around 14% of the total. Not happy with that. Have those chickens crossed the road? And why?
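A quick ID comparison shows exactly which catchments dropped out – a sketch, again assuming the names above:

```python
# Compare the zone IDs against the IDs that made it into the output table;
# layer/table names assumed as above.
import arcpy

zone_ids = {row[0] for row in arcpy.da.SearchCursor("L10_Catchments", ["OBJECTID"])}
stat_ids = {row[0] for row in arcpy.da.SearchCursor("ZS_L10_ChickAWT", ["OBJECTID_12"])}

missing = zone_ids - stat_ids
print(f"{len(missing):,} catchments have no zonal stats")  # ~132,000 in my case
```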
Before I move on I really need to do some quality assurance to be happy that this output is working as I expected it to, so that means making sure the SUM values are reasonable and figuring out where all my other catchments went. These are essential sanity checks to make sure all is tickity-boo.
First, the SUM values. We’ll start by joining the output table to the L10 Catchments layer – that’s a right click on the catchment layer name > Joins and Relates > Add Join:
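Scripted, that join looks something like this (a sketch – note AddJoin wants a layer or table view, not a bare feature class path):

```python
# The same join as an arcpy sketch; "L10_Catchments" must be a map layer
# (or one made with MakeFeatureLayer), names assumed as above.
import arcpy

arcpy.management.AddJoin(
    in_layer_or_view="L10_Catchments",
    in_field="OBJECTID",           # unique ID on the catchments
    join_table="ZS_L10_ChickAWT",  # output from Zonal Statistics as Table
    join_field="OBJECTID_12",      # the matching ID in the output table
    join_type="KEEP_ALL",          # keep unmatched catchments (they'll show <Null>)
)
```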
When run, the table values are added to my catchments layer:
Here I can see the first record has <Null> for SUM – but before digging into that, I'll check out the second record to give myself some confidence that the result makes sense. By selecting the record in the table and clicking Zoom To, I can see where it is:
We're at the top of the Red Sea, where the Suez Canal enters, and it's about 136 km² in area. Zooming in and turning on my chickens layer gives me a sense of how many raster cells the catchment covers:
Clearly this catchment covers parts of five chicken grid cells and one area where there appears to be no data (we can see through to the basemap). Working with raster data at this scale is hard, and we can't easily display grid cell boundaries – I'm judging this based on what the visible areas tell me about the size of each cell. By clicking on individual cells I can see the count in a pop-up window and roughly add up the values as a check:
Looking at the five cell values, I get a rough total of around 38,000, which doesn’t agree. But if I take into account the relative amount of each cell that’s covered by the catchment and adjust them accordingly, the numbers relate better – this is what the tool does, so I’m pretty happy here. I check several others to give myself more confidence. Now onto the Null values.
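For the curious, the back-of-the-envelope version of that check looks like this – the cell values and coverage fractions below are made up for illustration; only the arithmetic matches what I did:

```python
# Hypothetical cell values with the fraction of each cell inside the
# catchment; real numbers read off the pop-ups would go here.
cells = [
    (9_500.0, 0.30),
    (8_200.0, 0.55),
    (7_600.0, 0.90),
    (6_900.0, 0.20),
    (5_800.0, 0.10),
]

raw_sum = sum(value for value, _ in cells)                 # ignores coverage: 38,000
weighted_sum = sum(value * frac for value, frac in cells)  # accounts for it: ~16,160

print(f"raw: {raw_sum:,.0f}, area-adjusted: {weighted_sum:,.0f}")
```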
Zooming to the selected record, I can clearly see the problem:
The whole of this particular catchment falls outside the extent of my raster data. If I click on that area, the pop up window tells me the value there is “NoData”, so there’s simply nothing there, no data to extract. And I’ve got over 132,000 of these. What to do?
Realistically I think I’ve got two options:
- Accept that I’ll have null values in my output, or
- Assume that in areas where there are no data, the count is 0
I'm going to have to think about this…my second option means adding information that I have no evidence to support. Conferring with my colleague, we agreed that the first option was the most defensible. On to the remaining layers!
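For the record, had we gone the other way, option 2 could be done by filling NoData with zeros before running the zonal stats – a sketch only, since we decided against it (the output name is made up):

```python
# Fill NoData with 0 across the full extent of the catchments before
# running zonal stats - the option we rejected; output name illustrative.
import arcpy

arcpy.CheckOutExtension("Spatial")
arcpy.env.extent = arcpy.Describe("L10_Catchments").extent  # cover zones beyond the raster

chickens = arcpy.Raster("6_Ch_2015_Aw.tif")
filled = arcpy.sa.Con(arcpy.sa.IsNull(chickens), 0, chickens)  # NoData -> 0
filled.save("Ch_2015_Aw_filled.tif")
```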
Zonal Statistics has been a good way to summarise the raster data within my vector catchments, provided I exercise caution and do some extra mahi around having confidence in the outputs. Another key part of this analysis is summarising vector data within the catchments, a topic I’m going to have to leave for another day, so stay tuned.