Introduction to NumPy

Operations on NumPy arrays

One of the most powerful features of NumPy is its ability to manipulate entire arrays of numbers in one go.

Numerical operations

In Python, you can multiply a single number by another to get a new number:

In [1]:
single_number = 3.14
single_number * 2
Out[1]:
6.28

However, if you try to multiply a list by a number, the result may be surprising:

In [2]:
python_list = [3.14, 2.71, 1.18]
python_list * 2
Out[2]:
[3.14, 2.71, 1.18, 3.14, 2.71, 1.18]

This happens because Python lists are not restricted to holding numbers, nor must they hold one consistent type, so they have no special logic for the case where they contain only numbers. The only interpretation of * that is safe for every Python list is "repeat the list".
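As a small illustration of why repetition is the only safe reading, the sketch below multiplies a list holding mixed types, and then doubles the numbers in a list the plain-Python way. The names mixed_list, repeated, numbers and doubled are made up for this example.

```python
# A list can hold values of different types, so "multiply each element"
# would not make sense in general.
mixed_list = [3.14, "pi", None]

# Multiplying the list repeats its contents instead.
repeated = mixed_list * 2
print(repeated)  # [3.14, 'pi', None, 3.14, 'pi', None]

# To double each number in plain Python, we have to loop explicitly.
numbers = [3.14, 2.71, 1.18]
doubled = [n * 2 for n in numbers]
print(doubled)  # [6.28, 5.42, 2.36]
```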

NumPy, however, is designed to deal with numerical data and so interprets the request differently:

In [3]:
import numpy as np
In [4]:
numpy_array = np.array([3.14, 2.71, 1.18])
numpy_array * 2
Out[4]:
array([6.28, 5.42, 2.36])

[Figure: each element of the array (3.14, 2.71, 1.18) is multiplied by 2, giving (6.28, 5.42, 2.36)]

Here, each number has been multiplied by 2 individually.

You can apply any of the standard numerical operations to NumPy arrays, including *, +, /, - and **. You can also use comparison operations like ==, > and <=. If your array contains booleans (True/False) then you can also use the binary logical operators | ("or") and & ("and") as well as the unary logical operator ~ ("not").

In all of these cases, NumPy applies the operation to each element of the array individually and gives you back an array of the same size.
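The sketch below runs through one example of each kind of operation on the same three values used above. The variable names are chosen only for this illustration.

```python
import numpy as np

values = np.array([3.14, 2.71, 1.18])

# Arithmetic is applied element by element
shifted = values - 1
squared = values ** 2

# Comparisons give back an array of booleans
is_large = values > 2.5    # array([ True,  True, False])
is_small = values <= 3.0   # array([False,  True,  True])

# Boolean arrays can be combined with & ("and"), | ("or") and ~ ("not")
in_range = is_large & is_small  # array([False,  True, False])
either = is_large | is_small    # array([ True,  True,  True])
not_large = ~is_large           # array([False, False,  True])

print(in_range, either, not_large)
```

Every result has the same number of elements as the array it was computed from.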

One big benefit of this is an improvement in speed. To demonstrate this, let's try doubling all the values in a large list of 1 million values:

In [5]:
large_python_list = list(range(1_000_000))
large_numpy_array = np.arange(1_000_000)

In plain Python this could be done with a list comprehension:

In [6]:
%%timeit

[i*2 for i in large_python_list]
52.1 ms ± 435 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

But NumPy allows us to do:

In [7]:
%%timeit

large_numpy_array * 2
1.41 ms ± 21.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

You might see different results on your computer, but speedups of anything from 10 to 100 times are common on an example like this. There are plenty of operations which might see speedups of 1000 times or more.
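The %%timeit magic only works inside a notebook or IPython. If you are working in a plain script, a rough equivalent can be sketched with the standard library's timeit module. The number of repeats (5) is an arbitrary choice for this example, and the timings you see will differ from those above.

```python
import timeit

import numpy as np

large_python_list = list(range(1_000_000))
large_numpy_array = np.arange(1_000_000)

repeats = 5  # arbitrary choice for this sketch

# timeit.timeit returns the total time, in seconds, for all the repeats
list_time = timeit.timeit(
    lambda: [i * 2 for i in large_python_list], number=repeats
)
array_time = timeit.timeit(lambda: large_numpy_array * 2, number=repeats)

print(f"list comprehension: {list_time / repeats * 1000:.2f} ms per loop")
print(f"NumPy array:        {array_time / repeats * 1000:.2f} ms per loop")
```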

Practice

  • Try multiplying the array with different numbers
  • Subtract $3.04$ from each element of the array
  • Use a comparison operator to ask if each number is greater than $2.5$.

Functions

As well as simple numerical operations, you will often want to perform more complex operations on your data, for example taking the cosine of a number. We can do this in plain Python with the math module:

In [8]:
import math

math.cos(single_number)
Out[8]:
-0.9999987317275395

This works, but has the same problem as above in that it doesn't work as you want with a Python list:

In [9]:
math.cos(python_list)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [9], line 1
----> 1 math.cos(python_list)

TypeError: must be real number, not list

To help with this, NumPy provides a large number of operations via the numpy namespace. They work the same way as the Python functions for single numbers:

In [10]:
np.cos(single_number)
Out[10]:
-0.9999987317275395

But they also work with Python lists:

In [11]:
np.cos(python_list)
Out[11]:
array([-0.99999873, -0.90830067,  0.38092482])

You see here that even though we passed it a Python list, it has returned the result as a NumPy array. We can also pass in a NumPy array directly:

In [12]:
np.cos(numpy_array)
Out[12]:
array([-0.99999873, -0.90830067,  0.38092482])

[Figure: np.cos applied to each element of the array (3.14, 2.71, 1.18), giving (-0.999, -0.908, 0.381)]

There is a cost to passing in Python lists compared with using an array directly, as it has to convert it from one to the other. If you can, it's best to keep things as NumPy arrays throughout your computations.
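One way to follow this advice, sketched below, is to convert the list once at the start and reuse the resulting array. The function np.asarray is part of NumPy but is not used elsewhere in this section, and the variable names are made up for the example.

```python
import numpy as np

python_list = [3.14, 2.71, 1.18]

# Convert once, up front...
values = np.asarray(python_list)

# ...then every later call works directly on the array,
# with no repeated list-to-array conversion.
cosines = np.cos(values)
sines = np.sin(values)

# Sanity check: sin^2 + cos^2 should be (very close to) 1 for each element
total = sines ** 2 + cosines ** 2
print(total)
```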

Abstractions

The ability of NumPy functions to work on plain numbers as well as NumPy arrays allows us to write code which works for both single values and arrays of numbers. This avoids the need for type checks and makes our code more expressive.

Imagine we have a function, poly, as part of our code which does some maths on its input, e.g.

$$ \mathrm{poly}: a \mapsto 4a - a^4 $$
In [13]:
def poly(a):
    return a * 4 - a ** 4

We can of course call this function with a single number and it gives the result:

In [14]:
poly(single_number)
Out[14]:
-84.65171216000002

Staying in the world of pure Python, we can try to pass a list, but of course it doesn't work as lists cannot be raised to a power:

In [15]:
poly(python_list)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [15], line 1
----> 1 poly(python_list)

Cell In [13], line 2, in poly(a)
      1 def poly(a):
----> 2     return a * 4 - a ** 4

TypeError: unsupported operand type(s) for ** or pow(): 'list' and 'int'

If we pass an array it works since NumPy arrays can apply operations to their elements automatically:

In [16]:
poly(numpy_array)
Out[16]:
array([-84.65171216, -43.09580481,   2.78122224])

The key thing here is that one function can work on lots of different types of data. If you write your code to be able to deal with a single number, then NumPy will automatically make it able to do that same calculation to a whole load of numbers.

You do need to make sure that the code you write in your functions can work with NumPy arrays, though, so you should use the NumPy functions like np.cos rather than math.cos. For example, to compute:

$$ \mathrm{trig} : a \mapsto \sin(a) - \cos(a) $$

you should write:

In [17]:
def trig(a):
    return np.sin(a) - np.cos(a)

which works with single numbers:

In [18]:
trig(single_number)
Out[18]:
1.0015913846440263

and Python lists:

In [19]:
trig(python_list)
Out[19]:
array([1.00159138, 1.32661861, 0.54368119])

and, of course, NumPy arrays:

In [20]:
trig(numpy_array)
Out[20]:
array([1.00159138, 1.32661861, 0.54368119])
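For contrast, the sketch below shows what happens if the same calculation is written with the math module instead. The function trig_math is a hypothetical variant made up for this comparison. It works for a single number but raises a TypeError for an array with more than one element.

```python
import math

import numpy as np


def trig_math(a):
    # Hypothetical variant of trig that uses math instead of NumPy
    return math.sin(a) - math.cos(a)


# A single number is fine
single = trig_math(3.14)
print(single)

# An array is not: math functions expect one real number
try:
    trig_math(np.array([3.14, 2.71, 1.18]))
    worked = True
except TypeError as error:
    worked = False
    print(f"TypeError: {error}")
```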

Extra: Plotting arrays

As our arrays get longer and more complex, it's difficult to see what's going on by just looking at the numbers. Let's see how we can easily plot the data as a line graph. Let's make our data to be plotted:

In [21]:
# Numbers from 0 to 20. 100 of them.
x = np.linspace(0, 20, 100)

y = trig(x)

First, we need to import matplotlib, the de facto standard plotting tool for Python:

In [22]:
import matplotlib.pyplot as plt

Then, we need to make a place for the plotting to happen, which we do with the plt.subplots() function. This returns two things: a Figure (the whole page, which may contain multiple plots) and an Axes (the space in which we will plot).

We then draw on the axes with ax.plot and pass it the $y$ values:

In [23]:
fig, ax = plt.subplots()

ax.plot(y)
Out[23]:
[<matplotlib.lines.Line2D at 0x7ff2fac474f0>]

It has done the plot and the $y$ values are correct, but the $x$ axis has just been taken as the integer indexes of the array. If we want the $x$ axis to show our $x$ values then we can pass two arguments to plot:

In [24]:
fig, ax = plt.subplots()

ax.plot(x, y)
Out[24]:
[<matplotlib.lines.Line2D at 0x7ff2faadd6f0>]
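Put together as a standalone script, the steps above might look like the sketch below. The "Agg" backend line is an assumption so that the example also runs without a display, and the axis labels set with ax.set_xlabel and ax.set_ylabel are an addition that the cells above do not include.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: draw off-screen so this runs anywhere

import matplotlib.pyplot as plt
import numpy as np

# The same data as above: 100 numbers from 0 to 20
x = np.linspace(0, 20, 100)
y = np.sin(x) - np.cos(x)

fig, ax = plt.subplots()

# ax.plot returns a list with one Line2D object per line drawn
(line,) = ax.plot(x, y)

# Optional extras, not used in the cells above
ax.set_xlabel("x")
ax.set_ylabel("sin(x) - cos(x)")
```

In a notebook the figure is displayed automatically. In a script you would typically finish with fig.savefig or plt.show.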

If you have more complex data than this and want to plot multiple traces on top of each other with a legend and axis labels, then it's a sign that you might be better off using pandas for your analysis.

Exercise

There is a NumPy data file at the URL https://milliams.com/courses/intro_numpy/weather_data.npz which you should download into your current folder. You can do this either by clicking that link and downloading the file via your browser (make sure to copy it to the directory alongside your notebooks or scripts), or by running the following code in a Notebook cell:

import urllib.request
urllib.request.urlretrieve("https://milliams.com/courses/intro_numpy/weather_data.npz", "weather_data.npz")

You can then open the file (which contains multiple arrays; we just want the one called "rain_history" for now) with the following code, which will give you a NumPy array called data:

with np.load("weather_data.npz") as weather:
    data = weather["rain_history"]
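If you are curious how a single .npz file can hold several named arrays, the sketch below builds a small made-up file and reads it back. The file name example.npz and the array names first and second are invented for this illustration and have nothing to do with the weather data.

```python
import os
import tempfile

import numpy as np

# Work in a temporary folder so nothing is left behind
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.npz")

    # np.savez stores each keyword argument as a named array in one file
    np.savez(path, first=np.array([1.0, 2.0, 3.0]), second=np.arange(5))

    with np.load(path) as archive:
        names = archive.files     # the names of the arrays in the file
        first = archive["first"]  # read one array out by name

print(names)  # ['first', 'second']
print(first)  # [1. 2. 3.]
```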
  • (optional) Plot the data as a line graph
  • Find the mean average, $\mu$, value in the data
  • Find the standard deviation, $\sigma$, of the data
  • Write a function which returns the ratio between the standard deviation and the mean, $\dfrac{\sigma}{\mu}$
  • Check that your function works with both Python lists and the data array
