Manually selecting data from an array by giving indices or ranges works if you want contiguous chunks of data, but sometimes you want to be able to grab arbitrary data from an array.
Let's explore the options for more advanced indexing of arrays.
import numpy as np
We'll use the following array in our examples. To make it easier to understand what's going on, I've set the digit after the decimal point to match the index (i.e. the value at index 3 is 10.3):
a = np.array([6.0, 2.1, 8.2, 10.3, 5.4, 1.5, 7.6, 3.7, 9.8])
Let's say that we want all the values in the array that are larger than 4.0. We could do this by manually finding all those indices which match and constructing a list of them to use as a selector:
large_indices = [0, 2, 3, 4, 6, 8]
a[large_indices]
Or, we can use a list of boolean values, where we set those elements we want to extract to True and those we want to be rid of to False:
mask = [True, False, True, True, True, False, True, False, True]
a[mask]
These lists of True and False values are referred to as boolean arrays.
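A boolean array and the equivalent list of indices are interchangeable ways of selecting the same elements, and you can convert from one to the other. A small sketch using np.nonzero, which returns the positions of the True elements:

np.nonzero(mask)[0]     # array([0, 2, 3, 4, 6, 8]) - the same indices as large_indices
a[np.nonzero(mask)[0]]  # selects the same elements as a[mask]

(np.nonzero returns a tuple of index arrays, one per dimension, which is why we take element [0] for our one-dimensional mask.)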
With a larger array it would be tedious to create this list by hand. Luckily, NumPy provides us with a way of constructing these automatically. If we want a boolean array matching the values which are greater than 4, we can use the same sort of syntax we used for multiplication, but with > instead:
a > 4
Or, diagrammatically (using ■ for True and □ for False):
a:      6.0   2.1   8.2   10.3   5.4   1.5   7.6   3.7   9.8
a > 4:   ■     □     ■      ■     ■     □     ■     □     ■
This mask can be saved to a variable and passed in as an index:
mask = a > 4
a[mask]
Or, in one line:
a[a > 4]
which can be read as "select from a the elements where a is greater than 4":
a:         6.0   2.1   8.2   10.3   5.4   1.5   7.6   3.7   9.8
a > 4:      ■     □     ■      ■     ■     □     ■     □     ■
a[a > 4]:  6.0   8.2  10.3    5.4   7.6   9.8
Just like at the beginning of the course when we set values in an array with:
a[4] = 99.4
We can also use the return value of any filter to set values. For example, if we want to set all values greater than 4 to 0, we can do:
a[a > 4] = 0
a
a (before):  6.0   2.1   8.2   10.3   5.4   1.5   7.6   3.7   9.8
a > 4:        ■     □     ■      ■     ■     □     ■     □     ■
a (after):   0.0   2.1   0.0    0.0   0.0   1.5   0.0   3.7   0.0
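Note that the assignment above changed a in place. If you want to keep the original data intact, you can apply the same filtered assignment to a copy instead; a minimal sketch, using a fresh copy of the original values:

original = np.array([6.0, 2.1, 8.2, 10.3, 5.4, 1.5, 7.6, 3.7, 9.8])
filtered = original.copy()  # work on a copy so the original stays untouched
filtered[filtered > 4] = 0  # filtered is now [0.0, 2.1, 0.0, 0.0, 0.0, 1.5, 0.0, 3.7, 0.0]
original                    # unchanged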
These techniques work with data of any dimensionality when setting values, but there are some issues to be aware of when extracting subsets from higher-dimensional arrays. For example, let's take a two-dimensional grid and select some values from it:
grid = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(grid)
grid[grid > 4]
grid:    grid > 4:    grid[grid > 4]:
1 2 3    □ □ □        5 6 7 8 9
4 5 6    □ ■ ■
7 8 9    ■ ■ ■
In this case, the result has lost the information about where the numbers came from: the dimensionality has been lost.
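You can see this by comparing shapes:

grid.shape            # (3, 3) - the original two-dimensional shape
grid[grid > 4].shape  # (5,)   - the result is flattened to one dimension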
NumPy hasn't much choice in this case: if it were to return the result with the same shape as the original array grid, what should it put in the spaces that we've filtered out?
[[? ? ?]
 [? 5 6]
 [7 8 9]]
You might think that it could fill those with 0, or perhaps -1, but any value it chose could very easily cause a problem with code that follows. NumPy doesn't take a stance on this, as doing so would be dangerous.
In your code, you know what you're doing with your data, so it's ok for you to decide on a case-by-case basis. If you decide that you want to keep the original shape but replace any filtered-out values with a 0, then you can use NumPy's where function. It takes three arguments:

1. a boolean array (the condition),
2. the value to use in places where the condition is True,
3. the value to use in places where the condition is False.

So, in the case where we want to replace any values less than or equal to 4 with 0, we can use:
np.where(grid > 4, grid, 0)
grid > 4:    grid:    np.where(grid > 4, grid, 0):
□ □ □        1 2 3    0 0 0
□ ■ ■        4 5 6    0 5 6
■ ■ ■        7 8 9    7 8 9
Note that this has not affected the original array:
grid
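The second and third arguments to where don't have to be an array and a scalar; they can be any two values (including other arrays) that broadcast to the shape of the condition. For example, a small sketch that keeps values greater than 4 and negates the rest:

np.where(grid > 4, grid, -grid)  # gives [[-1, -2, -3], [-4, 5, 6], [7, 8, 9]]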
One final way that you can reduce your data while keeping the dimensionality is to use masked arrays. These are useful in situations where you have missing data. The advantage of masked arrays is that operations like averaging are not affected by the cells that are masked out. The downside is that for much larger arrays they will use more memory and can be slower to operate on.
masked_grid = np.ma.masked_array(grid, grid <= 4)
print(masked_grid)
grid:    grid <= 4:    masked_grid:
1 2 3    ■ ■ ■         -- -- --
4 5 6    ■ □ □         -- 5  6
7 8 9    □ □ □          7  8  9
np.mean(masked_grid)
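This returns 7.0, the mean of only the five unmasked values. Compare it with the mean over the plain array, where every cell contributes:

np.mean(grid)  # 5.0 - all nine values are included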
The rain data set represents the prediction of rainfall in metres in an area and is based on data from ECMWF. It is two-dimensional, with axes of latitude and longitude.
with np.load("weather_data.npz") as weather:
    rain = weather["rain"]
    uk_mask = weather["uk"]
    irl_mask = weather["ireland"]
    spain_mask = weather["spain"]
- Explore the rain data set.
- Explore the uk_mask array, including its dtype and shape.
- Filter the rain data set to contain only those values from within the UK. Does [] indexing, np.where or masked arrays make the most sense for this task?
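If you'd like a starting point, here is one possible approach (a sketch only, not necessarily the best answer). It assumes uk_mask is a boolean array with the same shape as rain in which True marks grid cells inside the UK; check its dtype and shape first, and drop the ~ if the polarity turns out to be the other way round:

# Sketch only: assumes uk_mask is boolean, the same shape as rain,
# with True marking cells inside the UK
uk_rain = np.ma.masked_array(rain, ~uk_mask)  # mask out everything outside the UK
np.mean(uk_rain)                              # average predicted rainfall over the UK only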