Manually selecting data from an array by giving indices or ranges works if you want contiguous chunks of data, but sometimes you want to be able to grab arbitrary data from an array.
Let's explore the options for more advanced indexing of arrays.
import numpy as np
We'll use the following array in our examples. To make it easier to understand what's going on, I've set the digit after the decimal point to match the index (i.e. the value at index 3 is 10.3):
a = np.array([6.0, 2.1, 8.2, 10.3, 5.4, 1.5, 7.6, 3.7, 9.8])
Let's say that we want all the values in the array that are larger than 4.0. We could do this by manually finding all those indices which match and constructing a list of them to use as a selector:
large_indices = [0, 2, 3, 4, 6, 8]
a[large_indices]
Or, we can use a list of boolean values, where we set those elements we want to extract to True and those we want to be rid of to False:
mask = [True, False, True, True, True, False, True, False, True]
a[mask]
These lists of True and False values are referred to as boolean arrays.
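A boolean array and the equivalent list of indices are interchangeable ways of selecting the same elements, and you can convert from one to the other. A small sketch using np.nonzero, which returns the positions of the True elements:

np.nonzero(mask)[0]     # array([0, 2, 3, 4, 6, 8]) - the same indices as large_indices
a[np.nonzero(mask)[0]]  # selects the same elements as a[mask]

(np.nonzero returns a tuple of index arrays, one per dimension, which is why we take element [0] for our one-dimensional mask.)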
With a larger array it would be tedious to create this list by hand. Luckily, NumPy provides us with a way of constructing these automatically. If we want a boolean array matching the values which are greater than 4, we can use the same sort of syntax we used for multiplication, but with > instead:
a > 4
Or, diagrammatically (using ■ for True and □ for False):
a:      6.0   2.1   8.2   10.3   5.4   1.5   7.6   3.7   9.8
a > 4:   ■     □     ■      ■     ■     □     ■     □     ■
This mask can be saved to a variable and passed in as an index:
mask = a > 4
a[mask]
Or, in one line:
a[a > 4]
which can be read as "select from a the elements where a is greater than 4":
a:         6.0   2.1   8.2   10.3   5.4   1.5   7.6   3.7   9.8
a > 4:      ■     □     ■      ■     ■     □     ■     □     ■
a[a > 4]:  6.0   8.2  10.3    5.4   7.6   9.8
Just like at the beginning of the course when we set values in an array with:
a[4] = 99.4
We can also use the return value of any filter to set values. For example, if we want to set all values greater than 4 to 0, we can do:
a[a > 4] = 0
a
a (before):  6.0   2.1   8.2   10.3   5.4   1.5   7.6   3.7   9.8
a > 4:        ■     □     ■      ■     ■     □     ■     □     ■
a (after):   0.0   2.1   0.0    0.0   0.0   1.5   0.0   3.7   0.0
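Note that the assignment above changed a in place. If you want to keep the original data intact, you can apply the same filtered assignment to a copy instead; a minimal sketch, using a fresh copy of the original values:

original = np.array([6.0, 2.1, 8.2, 10.3, 5.4, 1.5, 7.6, 3.7, 9.8])
filtered = original.copy()  # work on a copy so the original stays untouched
filtered[filtered > 4] = 0  # filtered is now [0.0, 2.1, 0.0, 0.0, 0.0, 1.5, 0.0, 3.7, 0.0]
original                    # unchanged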
These techniques work with data of any dimensionality when setting values, but there are some issues to be aware of when extracting subsets from higher-dimensional arrays. For example, let's take a two-dimensional grid and select some values from it:
grid = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(grid)
grid[grid > 4]
grid:    grid > 4:    grid[grid > 4]:
1 2 3    □ □ □        5 6 7 8 9
4 5 6    □ ■ ■
7 8 9    ■ ■ ■
In this case, the result has lost the information about where the numbers came from: the dimensionality has been lost.
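You can see this by comparing shapes:

grid.shape            # (3, 3) - the original two-dimensional shape
grid[grid > 4].shape  # (5,)   - the result is flattened to one dimension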
NumPy hasn't much choice in this case: if it were to return the result with the same shape as the original array grid, what should it put in the spaces that we've filtered out?
[[? ? ?]
 [? 5 6]
 [7 8 9]]
You might think that it could fill those with 0, or perhaps -1, but any value it chose could very easily cause a problem with code that follows. NumPy doesn't take a stance on this, as doing so would be dangerous.
In your code, you know what you're doing with your data, so it's ok for you to decide on a case-by-case basis. If you decide that you want to keep the original shape but replace any filtered-out values with a 0, then you can use NumPy's where function. It takes three arguments:

1. a boolean array (the condition),
2. the value to use in places where the condition is True,
3. the value to use in places where the condition is False.

So, in the case where we want to replace any values less than or equal to 4 with 0, we can use:
np.where(grid > 4, grid, 0)
grid > 4:    grid:    np.where(grid > 4, grid, 0):
□ □ □        1 2 3    0 0 0
□ ■ ■        4 5 6    0 5 6
■ ■ ■        7 8 9    7 8 9
Note that this has not affected the original array:
grid
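The second and third arguments to where don't have to be an array and a scalar; they can be any two values (including other arrays) that broadcast to the shape of the condition. For example, a small sketch that keeps values greater than 4 and negates the rest:

np.where(grid > 4, grid, -grid)  # gives [[-1, -2, -3], [-4, 5, 6], [7, 8, 9]]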
One final way that you can reduce your data while keeping the dimensionality is to use masked arrays. These are useful in situations where you have missing data. The advantage of masked arrays is that operations like averaging are not affected by the cells that are masked out. The downside is that for much larger arrays they will use more memory and can be slower to operate on.
masked_grid = np.ma.masked_array(grid, grid <= 4)
print(masked_grid)
grid:    grid <= 4:    masked_grid:
1 2 3    ■ ■ ■         -- -- --
4 5 6    ■ □ □         -- 5  6
7 8 9    □ □ □          7  8  9
np.mean(masked_grid)
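This returns 7.0, the mean of only the five unmasked values. Compare it with the mean over the plain array, where every cell contributes:

np.mean(grid)  # 5.0 - all nine values are included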
The rain data set represents the prediction of rainfall in metres in an area and is based on data from ECMWF. It is two-dimensional, with axes of latitude and longitude.
with np.load("weather_data.npz") as weather:
    rain = weather["rain"]
    uk_mask = weather["uk"]
    irl_mask = weather["ireland"]
    spain_mask = weather["spain"]
- Explore the rain data set.
- Explore the uk_mask array, including its dtype and shape.
- Filter the rain data set to contain only those values from within the UK. Does [] indexing, np.where or masked arrays make the most sense for this task?
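If you'd like a starting point, here is one possible approach (a sketch only, not necessarily the best answer). It assumes uk_mask is a boolean array with the same shape as rain in which True marks grid cells inside the UK; check its dtype and shape first, and drop the ~ if the polarity turns out to be the other way round:

# Sketch only: assumes uk_mask is boolean, the same shape as rain,
# with True marking cells inside the UK
uk_rain = np.ma.masked_array(rain, ~uk_mask)  # mask out everything outside the UK
np.mean(uk_rain)                              # average predicted rainfall over the UK only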