Load and Write Basic Data Files¶
Basic (text) file handling, NumPy, pandas, and datetime. To interactively read and execute the code blocks in this section, download pybum.ipynb and run it with JupyterLab on a local Python installation (see the Python (Installation) instructions).
Data can be stored in many different (text) file formats such as txt or csv files. Python provides the open(file) and write(...) functions to read and write data from and to nearly every text file format. There are packages such as csv (for csv files) that simplify handling specific file types. This section illustrates the use of the open(file) and write(...) functions, and introduces the pandas module with its capacity to import and export numeric data along with row and column headers.
Load (Open) Text File Data¶
The open command loads text files as file objects in Python. The syntax of the open command is:
open("filename", "mode")
where:

- filename is the file to open (e.g., "data.txt"); if the file is not in the script directory, the file name needs to be extended by the full directory (path) to the data file (e.g., "C:/experiment1/data.txt").
- mode defines the access type to the file and it can take the following values:
  - "r" - read only (default value if no mode value is provided); the file can neither be modified nor overwritten.
  - "rb" - read only in binary format; the binary format is advantageous if the file is not a text file but media such as pictures or videos.
  - "r+" - read and write.
  - "w" - write only; a new file is created if a file with the same name does not yet exist.
  - "wb" - write only in binary mode.
  - "w+" - create, write, and read.
  - "wb+" - write and read in binary mode.
  - "a" - append new data to a file; the write pointer is placed at the end of the file and a new file is created if a file with the provided file name does not yet exist.
  - "ab" - append new data in binary mode.
  - "a+" - both append (write at the end) and read.
  - "ab+" - append and read data in binary mode.
When "r"
or "w"
modes are used, the file pointer (i.e, the blinking cursor that you can see for example in Word documents) is place at the beginning of the file. For "a"
modes, the file pointer is placed at the end of the file.
It is good practice to read and write data from and to a file within a with statement to avoid file lock issues. For example, the following code block creates a new text file within a with statement:
with open("data/new.csv", mode="w+") as file:
file.write("And yet it moves.")
Read-only¶
Once the file object is created, we can parse the file and copy its data content to a desired Python data type (e.g., a list, tuple, or dictionary). Parsing the data uses for-loops (other loop types will also work) to iterate over lines and line entries. The lines represent strings, where the data columns can be separated with the built-in string function line_as_list = str().split("SEPARATOR"), where "SEPARATOR" can be "," (comma), ";" (semicolon), "\t" (tab), or any other sign. After reading all data from a file, use file_object.close() to avoid that the file remains locked by Python and cannot be opened by another program.
The following example opens a text file called purenumbers.txt (located in a subfolder called data) that contains float numbers between 0.0 and 10.0. The file has 17 data rows (e.g., for 17 experimental runs) and 4 data columns (e.g., for 4 measurements per experimental run), which are separated by a tab ("\t" separator). The code snippet uses the built-in function readlines() to parse the file lines, splits the lines using the "\t" separator, and loops over the line entries to append them to the list variable data_list if entry is numeric (verified with a try-except statement). data_list is a nested list that is initiated at the beginning of the script, and a sublist (nested list) is appended for every file line (row).
file_object = open("data/purenumbers.txt")  # read file with default "mode"="r"

data_list = []  # this will be a nested list with 17 sublists (rows) containing 4 entries (columns)

for line in file_object.readlines():
    line_as_list = line.split("\t")  # converts the line into a list using a tab (\t) separator
    data_list.append([])  # append an empty sublist for every file line (17 rows)
    for entry in line_as_list:
        try:
            # try to append the entry as floating point number to the last sublist, which is pointed at using [-1]
            data_list[-1].append(float(entry))
        except ValueError:
            # if entry is not numeric, append 0.0 to the sublist and print a warning message
            print("Warning: %s is not a number. Replacing value with 0.0." % str(entry))
            data_list[-1].append(0.0)

# verify that data_list contains the 17 rows (sublists) with the built-in list function __len__()
print("Number of rows: %d" % data_list.__len__())

# verify that the first sublist has four entries (number of columns)
print("Number of columns: %d" % data_list[0].__len__())

file_object.close()  # close file (otherwise it will be locked as long as Python is still running!) alternative: use a with-statement

print(data_list)  # print the data
Number of rows: 17
Number of columns: 4
[[2.59, 5.44, 4.06, 4.87], [4.43, 1.67, 1.26, 2.97], [4.04, 8.07, 2.8, 9.8], [2.25, 5.32, 0.04, 5.57], [6.26, 6.15, 5.98, 8.91], [7.93, 0.85, 5.88, 5.4], [4.72, 1.29, 4.18, 2.46], [7.03, 1.43, 5.53, 9.7], [5.2, 7.87, 1.44, 1.13], [3.18, 5.38, 3.6, 7.32], [5.37, 0.62, 5.29, 4.26], [3.48, 2.26, 3.11, 7.3], [1.36, 1.68, 3.38, 6.4], [1.68, 2.31, 9.29, 3.59], [1.33, 1.73, 3.98, 5.74], [2.38, 9.69, 0.06, 4.16], [9.3, 6.47, 9.14, 3.33]]
Tip
Recall the with statement from the earlier example: within a with statement, we do not have to call file.close().
Create and Write Files¶
A file is created with the "w"
… file open modes (see above) or with open(file_name, mode="a")
.
Tip
When mode='w'... is used, the provided file is opened with the pointer at position zero. Writing data will make the pointer overwrite any existing data at that position, which means that any existing data in the opened file will be overwritten. To avoid overwriting existing file data, use mode='a'... instead.
Imagine that the above-loaded data_list represents measurements in mm and we know that the precision of the measuring device was 1.0 mm. Thus, all data smaller than 1.0 are within the error margin, which we want to exclude from further analyses by overwriting such values with nan (not-a-number). For this purpose, we first create a new list variable new_data_list, where we append nan values if data_list[i][j] <= 1.0 and otherwise preserve the original numeric value of data_list.
With open("data/modifieddata.csv", mode="w+")
, we create a new csv (commaseparated values) file in the data subfolder. A forloop iterates on the sub_lists of new_data_list
and joins them with a commaseparator. In order to join the list elements of i
(i.e., the sublists) with ", ".join(list_of_strings)"
, all list entries first need to be converted to strings, which is achieved through the expression [str(e) for e in row]
. The "\n"
string needs to be concatenated at the end of every line to create a line break ("\n"
itself will not be visible in the file). The command new_file.write(new_line)
write the sublistconvertedtostring to the file "data/modifieddata.csv"
. Once again, new_file.close()
is needed to avoid that the new csv file is locked by Python.
# create a new list and overwrite all values <= 1.0 with nan
new_data_list = []
for i in data_list:
    new_data_list.append([])
    for j in i:
        if j <= 1.0:
            new_data_list[-1].append("nan")
        else:
            new_data_list[-1].append(j)
print(new_data_list)

# write the modified new_data_list to a new text file
new_file = open("data/modifieddata.csv", mode="w+")  # let's just use csv: Python does not care about the file ending (could also be file.wayne)
for row in new_data_list:
    new_line = ", ".join([str(e) for e in row]) + "\n"
    new_file.write(new_line)
new_file.close()
[[2.59, 5.44, 4.06, 4.87], [4.43, 1.67, 1.26, 2.97], [4.04, 8.07, 2.8, 9.8], [2.25, 5.32, 'nan', 5.57], [6.26, 6.15, 5.98, 8.91], [7.93, 'nan', 5.88, 5.4], [4.72, 1.29, 4.18, 2.46], [7.03, 1.43, 5.53, 9.7], [5.2, 7.87, 1.44, 1.13], [3.18, 5.38, 3.6, 7.32], [5.37, 'nan', 5.29, 4.26], [3.48, 2.26, 3.11, 7.3], [1.36, 1.68, 3.38, 6.4], [1.68, 2.31, 9.29, 3.59], [1.33, 1.73, 3.98, 5.74], [2.38, 9.69, 'nan', 4.16], [9.3, 6.47, 9.14, 3.33]]
Modify Existing Files¶
Existing text files can be opened and modified in either mode="r+" (pretending that information needs to be read before it is modified) or mode="a+". Recall that "r+" places the pointer at the beginning of the file and "a+" places the pointer at the end of the file. So if we want to modify existing lines, "r+" is the good choice, and if we want to append data at the end of the file, "a+" is the good choice (the + is only needed if the file should also be read). This section shows two examples: (1) modification of existing data in a file using "r+", and (2) appending data to an existing file using "a+".
First example - replace data in a file: In the previous example, we eliminated all measurements that were smaller than 1 mm because of the precision of the measurement device. However, we have retained all other values with two-digit accuracy - an accuracy that is not given. Consequently, all decimal places in the measurements must also be eliminated. To achieve this, we have to round all measured values with Python’s built-in rounding function round(number, ndigits) to zero decimal places (i.e., ndigits=0).
In this example, an IOError exception is raised when the file "data/modifieddata.csv" does not exist (or is locked by another program). An if statement ensures that rounding the data is only attempted if the file exists.
The overwriting procedure first reads all lines of the file into the lines variable. After reading the lines, the pointer is at the end of the file and file.seek(0) puts the pointer back to position 0 (i.e., to the beginning of the file). file.truncate() purges the file. Yes, the original file is blank for a moment and all file contents are only stored in the lines variable. Rounding the data happens within a for-loop that:

1. Splits the comma-separated line string (produces line_as_list).
2. Creates the temporary list _numeric_line_, where rounded, numeric values are stored (the variable is overwritten in every iteration).
3. Loops over the line entries (line_as_list), where an exception statement appends rounded (to zero digits) numeric values and appends "nan" when an entry is not numeric.
4. Writes the modified line to the "data/modifieddata.csv" csv file.

Finally, the csv file is closed with modified_file.close().
modified_file = None
try:
    modified_file = open("data/modifieddata.csv", mode="r+")  # re-open the above data file in read-write mode
except IOError:
    print("The file does not exist.")

if modified_file:
    # go here only if the file exists
    lines = modified_file.readlines()  # read lines -> pointer moves to file end
    modified_file.seek(0)  # return pointer to file beginning
    modified_file.truncate()  # clear file content
    for line in lines:
        line_as_list = line.split(", ")  # converts the line into a list using the comma separator
        _numeric_line_ = []
        for e in line_as_list:
            try:
                _numeric_line_.append(round(float(e), 0))  # try to convert line entry to float and round to 0 digits
            except ValueError:
                _numeric_line_.append(e)  # for nan values
        # write rounded values
        modified_file.write(", ".join([str(e) for e in _numeric_line_]) + "\n")
    print("Processed file.")
    modified_file.close()
Processed file.
Theoretically, the above code snippet can be rewritten as a function to modify any data in a file (see the sketch below). In addition, other threshold values or particular data ranges can be filtered using if-else statements.
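A minimal sketch of such a function could look as follows; the name apply_to_file and its arguments are illustrative (they are not part of the course code) and assume the comma-separated format produced above:
def apply_to_file(file_name, fun, separator=", "):
    # read a text file line by line, apply fun to every numeric entry, and write the result back
    with open(file_name, mode="r+") as a_file:
        lines = a_file.readlines()  # read all lines - the pointer moves to the file end
        a_file.seek(0)              # return the pointer to the file beginning
        a_file.truncate()           # clear the file contents
        for line in lines:
            new_entries = []
            for entry in line.split(separator):
                try:
                    new_entries.append(str(fun(float(entry))))
                except ValueError:
                    new_entries.append(entry.strip())  # keep non-numeric entries (e.g., nan)
            a_file.write(separator.join(new_entries) + "\n")

# usage example: round all values to zero digits, as in the snippet above
# apply_to_file("data/modifieddata.csv", lambda value: round(value, 0))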
Second example - append data to a file: By coincidence, you find on one of the measurement protocols that there is an 18th experimental run that is not in the electronic measurement data file due to a data transmission error. Now, we want to add the data to the above-produced csv file manually. Entering the data does not take much work because only 4 measurements were performed per experimental run and we can already manually apply the above filters ("nan" and rounding) in a list variable called forgotten_data.
This example uses the os module (recall Packages, Modules and Libraries) to verify that the data file exists with os.path.isfile() (the os.getcwd() statement is a gadget here). This example features another way of directly opening and writing to the data file using a with statement (i.e., a context manager). The with context is particularly useful for file handling because the file object only exists in the indented with block, which makes the tedious call to file.close() obsolete.

The essential part that writes the line to the data file is file_object.write(line), where line corresponds to the above-introduced ", ".join(list_of_strings) + "\n" string.
import os
print(os.getcwd())
forgotten_data = [4.0, 3.0, "nan", 8.0]
if os.path.isfile("data/modifieddata.csv"):
    with open("data/modifieddata.csv", mode="a") as file_object:
        file_object.write(", ".join([str(e) for e in forgotten_data]) + "\n")
    print("Data appended.")
else:
    print("The file does not exist.")
C:\Users\schwindt\jupyter\nblectures
Data appended.
Challenge

The code block ', '.join([str(e) for e in a_list]) + '\n' is a recurring expression in the above code snippets. What would a function look like that automatically generates this expression for lists of different data types?
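One possible (hypothetical) answer is a small helper function that converts any list into a separator-joined string with a trailing line break; the name list_to_line is only an example:
def list_to_line(a_list, separator=", "):
    # convert a list with arbitrary data types into one writable text line
    return separator.join([str(e) for e in a_list]) + "\n"

# usage example
print(list_to_line([1.0, "nan", 3], separator="; "))  # prints: 1.0; nan; 3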
NumPy¶
NumPy provides high-level mathematical functions for linear algebra, including operations on multi-dimensional arrays and matrices. The open-source NumPy (for Numerical Python) package is written in Python and C, and comes with comprehensive documentation (download the latest version on the developer’s web site or read the developer’s online tutorial).
Installation¶
NumPy can be installed within Anaconda (recall the instructions), and the developers recommend using a scientific Python distribution (Anaconda) with the SciPy Stack.

With the provided environment.yml (hypy) for Anaconda, NumPy is already installed (more information in the installation section). To install NumPy in another conda environment, open Anaconda Prompt (Start > type Anaconda Prompt) and type:
conda activate ENVIRONMENTNAME
conda install numpy
Usage¶
The NumPy module (package) is typically imported with import numpy as np. Array handling is the baseline of linear algebra and NumPy, where arrays represent nested data lists. To create a NumPy array, use np.array((values)), where values is a sequence of values.
Tip
This section provides insights into some basic functions provided with NumPy, but does not (and cannot) cover all NumPy functions and data types. Generally speaking, be sure that whatever mathematical operation you want to perform, NumPy offers a solution. Just check out the NumPy documentation, have a look at NumPy’s functions and methods overview, or use your favorite search engine with the search words numpy FUNCTION.
The rounded parentheses indicate that the value sequence represents a tuple that may contain lists for creating multi-dimensional arrays. Thus, to create a 2x3 array (with some arbitrary values), we can write:
import numpy as np
an_array = np.array(([2, 3, 1], [4, 5, 6]))
print(an_array)
[[2 3 1]
[4 5 6]]
NumPy arrays (data type: ndarray) have many built-in features, for example to output the array size:
print(type(an_array))
print("Array dimensions: " + str(an_array.shape))
print("Total number of array elements: " + str(an_array.size))
print("Number of array axes: " + str(an_array.ndim))
<class 'numpy.ndarray'>
Array dimensions: (2, 3)
Total number of array elements: 6
Number of array axes: 2
There are many types of np.array objects and many ways to create them:
print(np.array([(2, 3, 1), (4, 5, 6)])) # the same as an_array
print(np.array([[2, 3, 1], [4, 5, 6]], dtype=complex))
[[2 3 1]
[4 5 6]]
[[2.+0.j 3.+0.j 1.+0.j]
[4.+0.j 5.+0.j 6.+0.j]]
Arrays of zeros or ones, or empty arrays, can be created with integer or float data types. When creating such arrays, be aware of using tuples (i.e., sequences embraced by rounded parentheses) to define the array dimensions:
print(np.zeros((2,6)))
print(np.ones((2,6), dtype=np.float64)) # other dtypes: int16, np.int16, float, np.float32, np.complex64
print(np.empty((2,6)))
print(np.empty((2,6), dtype=np.int16))
[[0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0.]]
[[1. 1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1. 1.]]
[[1. 1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1. 1.]]
[[2 0 3 0 1 0]
[4 0 5 0 6 0]]
Data type sizes
NumPy data types have different sizes (in bytes): the more digits, the larger the variable size. For example, np.float64 has an itemsize of 8 bytes (64/8), while np.float32 has an itemsize of 4 bytes (32/8) only. Use ndarray.itemsize (e.g., an_array.itemsize) to find out the size of one array element in bytes. For analyses of large datasets, the data type becomes very important regarding computation speed and storage.
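For instance, a quick check of the itemsize attribute (a minimal sketch):
import numpy as np

# per-element memory footprint depends on the data type
print(np.ones((2, 6), dtype=np.float64).itemsize)  # 8 bytes per element (64/8)
print(np.ones((2, 6), dtype=np.float32).itemsize)  # 4 bytes per element (32/8)
print(np.ones((2, 6), dtype=np.int16).itemsize)    # 2 bytes per element (16/8)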
NumPy provides the arange(start, end, step-size) function to create numeric sequences. Such sequences are arrays (ndarray) that can then be reshaped (re-organized in columns and rows).
print("1D array:")
print(np.arange(0, 10, 2)) # 1D array
print("\n2D array:")
print(np.arange(0, 12, 2).reshape(2, 3)) # 2D array
print("\n3D array:")
print(np.arange(1, 13, 1).reshape(2, 2, 3)) # 3D array
print("\n1D Linspace (start, end, numberofelements):")
print(np.linspace(0, np.pi, 3))
1D array:
[0 2 4 6 8]
2D array:
[[ 0 2 4]
[ 6 8 10]]
3D array:
[[[ 1 2 3]
[ 4 5 6]]
[[ 7 8 9]
[10 11 12]]]
1D Linspace (start, end, number-of-elements):
[0. 1.57079633 3.14159265]
Random numbers can be generated with NumPy’s random number generator np.random and its .random(shape_tuple) function, where the tuple defines the shape of the output array.
rand_array = np.random.random((2,4))
print(rand_array)
[[0.0204844 0.91185321 0.00152947 0.79774412]
[0.45685876 0.65600015 0.55038482 0.03690686]]
Built-in array functions enable finding minimum or maximum values, or sums of arrays:
print("Sum of 12-element ones-array: " + str(np.ones((2,6)).sum()))
print("Minimum: " + str(an_array.min()))
print("Maximum: " + str(an_array.max()))
Sum of 12-element ones-array: 12.0
Minimum: 1
Maximum: 6
Color Arrays¶
Arrays may also contain color information, where colors represent a mix of the three base colors red, green, and blue. One color is defined as [red-value, green-value, blue-value], where a value of 0 means that a color tone is not present and 255 is its maximum value. When all color tone values are zero, there is no color, which corresponds to black; when all color tones are at their maximum (255), the color mix corresponds to white. This way, array elements can be lists of color tones and plotting such arrays produces images. The following example produces an array with 5 color-list elements, which could be plotted as a very basic image with 5 pixels (one black, red, green, blue, and white, respectively):
color_set = np.array([[0, 0, 0], # black
[255, 0, 0], # red
[0, 255, 0], # green
[0, 0, 255], # blue
[255, 255, 255]]) # white
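For example, a minimal sketch of how such a color array could be rendered as an image (this assumes that matplotlib is available and builds on the color_set array from above):
# imshow expects an (n-rows, n-columns, 3) array, so the (5, 3) color_set is reshaped to one pixel row
import matplotlib.pyplot as plt
plt.imshow(color_set.reshape(1, 5, 3).astype("uint8"))
plt.show()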
Array (Matrix) Operations¶
Array calculations (matrix operations) follow the rules of linear algebra:
A = np.random.random((2,4))
B = np.random.random((4,2))
print("Subtraction: " + str(A.transpose()  B))
print("Elementwise product: " + str(A.transpose() * B))
print("Matrix product (option 1): " + str(A @ B))
print("Matrix product (option 2): " + str(A.dot(B)))
Subtraction: [[ 0.1115262 0.48000352]
[ 0.1494478 0.31398052]
[ 0.38778125 0.18341211]
[ 0.0014262 0.5411479 ]]
Elementwise product: [[0.78023517 0.24540468]
[0.01212987 0.11177306]
[0.04761943 0.21543691]
[0.37867353 0.05979503]]
Matrix product (option 1): [[1.218658 1.0311045 ]
[0.73402296 0.63240968]]
Matrix product (option 2): [[1.218658 1.0311045 ]
[0.73402296 0.63240968]]
Further element-wise calculations include exponentiation (**), exponential and root functions (np.exp, np.sqrt), trigonometric functions (np.sin, np.cos, np.tan, etc.), and Boolean operators:
print("A to the power of 3: " + str(A**3))
print("Exponential: " + str(np.exp(A)))
print("Square root: " + str(np.sqrt(A)))
print("Sine of A times 3: " + str(np.sin(A) * 3))
print("Boolean where A is smaller than 0.3: " + str(A < 0.3))
A to the power of 3: [[0.83278802 0.00897507 0.11465192 0.23383377]
[0.02992313 0.14581371 0.18020002 0.25637808]]
Exponential: [[2.56210893 1.23098678 1.62548021 1.85165172]
[1.36404919 1.69272506 1.7591499 1.88753709]]
Square root: [[0.96996429 0.45586852 0.6969959 0.7849064 ]
[0.55718724 0.72549272 0.75155218 0.79704006]]
Sine of A times 3: [[2.42414331 0.61897047 1.40075657 1.73351616]
[0.91648323 1.50711546 1.60581841 1.78019148]]
Boolean where A is smaller than 0.3: [[False True False False]
[False False False False]]
Array Shape Manipulation¶
Sometimes it is necessary to stack a multi-dimensional array into a vector, or to recast the shape of an array. There are a couple of options to manipulate the shape of an array:
print("Flattened matrix A (into a vector):\n" + str(A.ravel()))
print("\nTranspose matrix A and append B:\n" + str(np.array([A.transpose(), B])))
print("\nTranspose matrix A and append B and cast into a (4x4) array:\n" + str(np.array([A.transpose(), B]).reshape(4,4)))
Flattened matrix A (into a vector):
[0.94083072 0.20781611 0.48580329 0.61607806 0.31045762 0.52633969
0.56483068 0.63527285]
Transpose matrix A and append B:
[[[0.94083072 0.31045762]
[0.20781611 0.52633969]
[0.48580329 0.56483068]
[0.61607806 0.63527285]]
[[0.82930452 0.79046114]
[0.05836831 0.21235917]
[0.09802204 0.38141858]
[0.61465186 0.09412495]]]
Transpose matrix A and append B and cast into a (4x4) array:
[[0.94083072 0.31045762 0.20781611 0.52633969]
[0.48580329 0.56483068 0.61607806 0.63527285]
[0.82930452 0.79046114 0.05836831 0.21235917]
[0.09802204 0.38141858 0.61465186 0.09412495]]
NumPy File Handling and np.nan¶
In the above examples on file handling, measurement data were loaded from text files, manipulated, and written to modified text files. The data manipulation involved the introduction of "nan" (not-a-number) values to mark measurements <1 mm, which were considered errors. Why didn’t we use zeros here? Well, zeros are numbers, too, and have a significant effect on data statistics (e.g., for calculating mean values). However, the "nan" string value caused difficulties in data handling, in particular regarding the consistency of function output. With the np.nan data type, NumPy provides a powerful alternative to the tedious "nan" string.
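For example, a brief illustration of how np.nan behaves in calculations (unlike the "nan" string, it is a float value and propagates through arithmetic):
import numpy as np

measurements = np.array([2.0, np.nan, 4.0])
print(measurements.mean())       # nan - np.nan propagates through standard functions
print(np.nanmean(measurements))  # 3.0 - nan-aware functions ignore np.nan entries
print(np.isnan(measurements))    # [False  True False] - element-wise nan test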
NumPy also has a text file load function called np.loadtxt(filename, *args, **kwargs), which imports text files as np.array objects of float values. The default float value type can be adapted with the optional keyword dtype. Other optional (keyword) arguments are (a usage sketch follows the list):

- delimiter=STR (e.g., delimiter=';'), where the default is None (any whitespace)
- usecols=TUPLE (e.g., usecols=(1, 3) extracts the 2nd and 4th column); a single integer value is also possible to read just one single column
- skiprows=INT (e.g., skiprows=2 skips the first two lines), where the default is 0
- more arguments are available and listed in the NumPy documentation
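For illustration, a minimal sketch that combines these keyword arguments with the data/modifieddata.csv file produced above (the resulting shape depends on the file contents):
# read only the 2nd and 4th column (usecols) and skip the first two rows (skiprows)
subset = np.loadtxt("data/modifieddata.csv", delimiter=",", usecols=(1, 3), skiprows=2)
print(subset.shape)  # (16, 2) with the 18-row file produced above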
The following example loads the csv file data/modifieddata.csv containing numeric and "nan" string values, where the "nan" strings are automatically converted to np.nan.
experiment_data = np.loadtxt("data/modifieddata.csv", delimiter=",")
print("This is the data 4th line (row): " + str(experiment_data[3, :]))
print("The data type of the 3rd (%s) entry is: " % str(experiment_data[3, 2]) + str(type(experiment_data[3, 2])))
This is the data 4th line (row): [ 2. 5. nan 6.]
The data type of the 3rd (nan) entry is: <class 'numpy.float64'>
In addition, or as an alternative, the function np.load() picks up data from file-like .npz, .npy, or pickled (saved Python objects) data sources (more information is available in the NumPy docs).
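As a brief sketch (the file name data/experiment.npy is only an example), an array can be dumped to and re-loaded from NumPy’s binary format with np.save and np.load:
np.save("data/experiment.npy", experiment_data)  # write a binary .npy file
reloaded = np.load("data/experiment.npy")
print(np.allclose(experiment_data, reloaded, equal_nan=True))  # True - nan entries are preserved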
Statistics¶
The above examples featured some array functions to assess basic array parameters such as the minimum and maximum. NumPy provides many more functions for array statistics, such as the mean, median, or standard deviation, including functions that account for np.nan values. The following example illustrates some statistical functions with the experimental data from the above examples. Note the usage of nanmean instead of mean and the statistics along array axes, where the optional keyword argument axis=0 corresponds to statistics along columns and axis=1 to statistics along rows of 2-dimensional arrays (the maximum axis number corresponds to the array dimension n minus 1, i.e., maximum axis=n-1).
print("Mean value (without nan): " + str(np.mean(experiment_data))) # no applicable result
print("Mean value with np.nan: " + str(np.nanmean(experiment_data)))
print("Mean value along axis 0 (columns): " + str(np.nanmean(experiment_data, axis=0)))
print("Mean value along axis 1 (rows): " + str(np.nanmean(experiment_data, axis=1)))
Mean value (without nan): nan
Mean value with np.nan: 4.626865671641791
Mean value along axis 0 (columns): [4.11111111 4.25 4.53333333 5.55555556]
Mean value along axis 1 (rows): [4.25 2.5 6.25 4.33333333 6.75 6.33333333
3. 6. 3.75 4.75 4.66666667 3.75
3. 4.25 3.25 5.33333333 6.75 5. ]
The following sections give a tabular overview of statistical functions in NumPy (source: NumPy v1.13 docs). The listed functions only represent fundamental statistical functions and NumPy provides many more options, which can be found using any search engine with NumPy and the desired function as search keywords.
Basic statistic functions

Function | Description
np.nanmin | Minimum of an array or along an axis, ignoring np.nan values.
np.nanmax | Maximum of an array or along an axis, ignoring np.nan values.
np.ptp | Range of values (max - min) along an axis.
np.percentile | q-th percentile of data along a specified axis.
np.nanpercentile | q-th percentile of data along a specified axis, ignoring np.nan values.
Mean (average), standard deviation, and variances

Function | Description
np.median | Median along an (optional) axis.
np.average | Weighted average along an (optional) axis.
np.mean | Arithmetic mean along an (optional) axis.
np.std | Standard deviation along an (optional) axis.
np.var | Variance along an (optional) axis.
np.nanmedian | Median along an (optional) axis, ignoring np.nan values.
np.nanmean | Arithmetic mean along an (optional) axis, ignoring np.nan values.
np.nanstd | Standard deviation along an (optional) axis, ignoring np.nan values.
Correlating data (arrays)

Function | Description
np.corrcoef | Pearson (product-moment) correlation coefficients.
np.correlate | Cross-correlation of two 1-dimensional sequences.
np.cov | Estimate of the covariance matrix, based on data and weights.
Generate and plot histograms

Function | Description
np.histogram | Histogram of a set of data.
np.histogram2d | Bi-dimensional histogram of two data samples.
np.histogramdd | Multi-dimensional histogram of some data.
np.bincount | Count of the number of occurrences of each value in an array of non-negative ints.
np.digitize | Indices of the bins to which each value in an input array belongs.
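For instance, a couple of the listed functions applied to the above-loaded experiment_data (a minimal sketch; the printed numbers depend on the loaded file):
print(np.nanmax(experiment_data))                     # overall maximum, ignoring np.nan
print(np.nanpercentile(experiment_data, 50, axis=0))  # column-wise median (50th percentile)
print(np.nanstd(experiment_data, axis=1))             # row-wise standard deviation, ignoring np.nan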
Can NumPy do MATLAB®?¶
Are you considering switching to Python after starting softly into programming with MATLAB®-like software? There are many reasons for enhancing data analyses with Python and here are some facilitators for previous MATLAB® users:

- MATLAB® matrices can be loaded with scipy.io.loadmat(matrix-file-name) and saved with scipy.io.savemat(...) (use import scipy); see the sketch below this list.
- NumPy’s np.array replaces MATLAB®’s matrix notation (even though there is the historic, deprecated NumPy data type np.matrix).
- Import many MATLAB® features from np.matlib (e.g., from numpy.matlib import rand, zeros, ones, empty, eye, or more generally import numpy.matlib as M).
- Find the NumPy equivalent of many MATLAB® functions in the NumPy documentation.
- To emulate MATLAB®’s plot functions, use the pylab package and import it as from pylab import *. ⚠ This overwrites all other (standard) definitions of the plot() function and array() objects, so this usage is deprecated. Read the plotting section for comprehensive plotting instructions with Python.
MATLAB® is a registered trademark of The MathWorks.
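For example, a hypothetical sketch of the first item (it assumes that scipy is installed and that a file such as data/matrix.mat with a variable A exists; both names are only placeholders):
import scipy.io
import numpy as np

mat_contents = scipy.io.loadmat("data/matrix.mat")       # returns a dict of the stored variables
matrix = np.asarray(mat_contents["A"])                   # assumes the .mat file stores a variable called A
scipy.io.savemat("data/matrix-copy.mat", {"A": matrix})  # write the matrix back to a new .mat file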
Exercise
Practice numpy and csv file handling in the Reservoir design exercise.
Pandas¶
pandas is a powerful module (package) for data analyses and manipulation with Python. It can handle NumPy arrays, and both packages jointly represent a powerful data processing engine. The power of pandas lies in processing data frames, data labeling (e.g., workbook-like column names), and flexible file handling functions (e.g., the built-in read_csv(csv-file) function). While NumPy arrays enable calculations with multi-dimensional arrays (beyond 2-dimensional tables) and low memory consumption, pandas DataFrame objects efficiently process and label tabular data with more than ~100,000 rows. Because of its labeling capacity, pandas also finds broad application in machine learning. In summary, pandas’ functionality builds on top of NumPy, and both packages are developed by the SciPy (Scientific computing tools for Python) community that also develops matplotlib (see the plotting section) and IPython (Jupyter’s Python kernel).
Installation¶
pandas’ developers recommend installing pandas with the SciPy stack in Anaconda, similar to the recommendations for installing NumPy. With the provided environment.yml (hypy) for Anaconda, pandas is already installed (more information in the Python (Installation) section). To install pandas in another conda environment, open Anaconda Prompt (Start > type Anaconda Prompt) and type:
conda activate ENVIRONMENTNAME
conda install pandas
Usage¶
pandas’ standard import alias is pd: import pandas as pd. The following sections provide an overview of basic pandas functions; many more functionalities are documented in the developer’s docs.
Data frames & Series¶
The below code block illustrates the creation of pandas data frames, the core object of pandas. Note the difference between a 1-dimensional series (corresponds to a one-column data frame) and a 2-dimensional data frame with row (index) and column names. By default, rows are numbered starting from 0 (unlike office software, which starts at row no. 1) and no column names are set. Column names can be initially defined as a list and later replaced with a dictionary that maps the initial list entries to new names.
import pandas as pd
print("A 1column pd.DataFrame:\n"+ str(pd.Series([3, 4, np.nan]))) # a simple pandas data frame with one column
row_names = np.arange(1, 4, 1)
wb_like_df = pd.DataFrame(np.random.randn(row_names.__len__(), 3),
index=row_names, columns=['A', 'B', 'C'])
print("\nThis is a workbooklike (row and column names) data frame:\n" + str(wb_like_df))
print("\nRename column names with dictionary:\n" + str(wb_like_df.rename(
columns={'A': 'Series 1', 'B': 'Series 2', 'C': 'Series 3'})))
print("\nTranspose the data frame:\n" + str(wb_like_df.T))
A 1column pd.DataFrame:
0 3.0
1 4.0
2 NaN
dtype: float64
This is a workbooklike (row and column names) data frame:
A B C
1 1.551689 0.425844 1.120399
2 0.472708 0.619897 1.491136
3 1.909126 0.273118 2.425986
Rename column names with dictionary:
Series 1 Series 2 Series 3
1 1.551689 0.425844 1.120399
2 0.472708 0.619897 1.491136
3 1.909126 0.273118 2.425986
Transpose the data frame:
1 2 3
A 1.551689 0.472708 1.909126
B 0.425844 0.619897 0.273118
C 1.120399 1.491136 2.425986
A pandas DataFrame object can also be created from a dictionary, where the dictionary keys define the column names and the dictionary items constitute the data of each column:
df = pd.DataFrame({'Flow depth': pd.Series(np.random.uniform(low=0.1, high=0.3, size=(4,)), dtype='float32'),
'Sediment': ["yes", "no", "yes", "no"],
'Flow regime': pd.Categorical(["fluvial", "fluvial", "supercritical", "critical"]),
'Water': "Always there"})
print("A dictionarybuilt data frame:\n" + str(df))
print("\nFrame data types:\n" + str(df.dtypes))
A dictionarybuilt data frame:
Flow depth Sediment Flow regime Water
0 0.210366 yes fluvial Always there
1 0.234890 no fluvial Always there
2 0.247299 yes supercritical Always there
3 0.164717 no critical Always there
Frame data types:
Flow depth float32
Sediment object
Flow regime category
Water object
dtype: object
Built-in attributes and methods of a pandas DataFrame enable easy access to the top and the bottom of a data frame and many more features (recall: use dir(df) or read the developer’s docs):
print("Head of the dictionarybased dataframe (first two rows):\n" + str(df.head(2)))
print("\nEnd (tail) of the dictionarybased dataframe (last row):\n" + str(df.tail(1)))
Head of the dictionarybased dataframe (first two rows):
Flow depth Sediment Flow regime Water
0 0.210366 yes fluvial Always there
1 0.234890 no fluvial Always there
End (tail) of the dictionarybased dataframe (last row):
Flow depth Sediment Flow regime Water
3 0.164717 no critical Always there
Example: Create a pandas.DataFrame (of Froude Numbers)¶
In hydraulics, the Froude number Fr characterizes the flow regime as “fluvial” (Fr<1), “critical” (Fr=1), and “supercritical” (Fr>1). The precision of measurement devices in physical flume experiments makes the exact determination of the critical moment a challenge and forces researchers to apply an interval around 1, rather than the exact value:

Fr | (0.00, 0.95) | (0.95, 1.00) | 1.00 | (1.00, 1.05) | (1.05, inf)
Flow | fluvial | nearby critical (slow) | critical | nearby critical (fast) | supercritical
pd.DataFrame( ... ) objects are a convenient form to classify and store flume experiment data:
Fr_dict = {0.925: "fluvial", 0.975: "nearby critical (slow)", 1.0: "critical", 1.025: "nearby critical (fast)", 1.075: "supercritical"}
Fr_measured = np.random.uniform(low=0.01, high=2.00, size=(10,))
Fr_classified = [Fr_dict[min(Fr_dict.keys(), key=lambda x: abs(x - m))] for m in Fr_measured]
obs_df = pd.DataFrame({"measured": Fr_measured, "flow regime": Fr_classified})
print(obs_df)
measured flow regime
0 1.462930 supercritical
1 0.430861 fluvial
2 0.998070 critical
3 0.131475 fluvial
4 0.174419 fluvial
5 0.455032 fluvial
6 1.235265 supercritical
7 0.519341 fluvial
8 0.659579 fluvial
9 0.978529 nearby critical (slow)
Append Data to a pandas.DataFrame¶
The at, loc, concat, and append methods of pandas provide direct options for inserting rows or columns into a pd.DataFrame. However, these built-in methods are approximately one order of magnitude slower than taking the detour via a dictionary. This applies especially to data frames with more than 10,000 elements. This means that the fastest method to insert a data set is:

1. Convert an existing pd.DataFrame object to a dictionary with pd.DataFrame.to_dict() (e.g., dict_of_df = df.to_dict()).
2. Update the dictionary with the new data:
   - Append rows with dict_of_df["existing-column-name"].update({"new-row-name": NEW_DATA}).
   - Append columns with dict_of_df.update({"new-column-name": {"existing-row-names": NEW_DATA}}) (one value per existing row).
3. Re-convert the dictionary to a pd.DataFrame with df = pd.DataFrame.from_dict(dict_of_df).
The following code block illustrates both adding a row and adding a column to an existing pandas data frame.
import random
# convert data frame to dictionary
dict_of_obs_df = obs_df.to_dict()
# append new row
new_row_index = max(dict_of_obs_df["measured"]) + 1
dict_of_obs_df["measured"].update({new_row_index: 0.996})
dict_of_obs_df["flow regime"].update({new_row_index: "nearby critical (slow)"})
# append column
dict_of_obs_df.update({"with sediment": {}})
for k in dict_of_obs_df["measured"].keys():
    dict_of_obs_df["with sediment"].update({k: bool(random.getrandbits(1))})
# rebuild data frame
obs_df = pd.DataFrame.from_dict(dict_of_obs_df)
print(obs_df.tail(3))
measured flow regime with sediment
8 0.659579 fluvial False
9 0.978529 nearby critical (slow) True
10 0.996000 nearby critical (slow) True
NumPy Arrays and pandas Data Frames¶
The major difference between a NumPy array and a pandas DataFrame is that a NumPy array has one single data type (dtype), while a pandas DataFrame can have different data types (one dtype per column). This is why a NumPy array can be seamlessly converted into a pandas DataFrame, while the opposite conversion can cause high computational cost. pandas comes with a built-in function to convert a pandas DataFrame into a NumPy array, where numeric data types are maintained where possible. If only one column of the pandas DataFrame is non-numeric, the conversion involves copying the objects, which causes high computational cost. Note that the pandas DataFrame index and column labels are lost in the conversion.
print(obs_df.to_numpy())
[[1.4629304199712003 'supercritical' True]
[0.4308609175144578 'fluvial' True]
[0.9980702542754516 'critical' False]
[0.13147505751224464 'fluvial' False]
[0.1744193402007127 'fluvial' False]
[0.45503162541640574 'fluvial' True]
[1.2352651606844145 'supercritical' True]
[0.5193413669752244 'fluvial' False]
[0.6595793984115736 'fluvial' False]
[0.9785285127361709 'nearby critical (slow)' True]
[0.996 'nearby critical (slow)' True]]
Access Data Frame Entries¶
Elements of data frames are accessible by row and column labels (df.loc[row-label, column-label]) or by row and column numbers (df.iloc[row-number, column-number]):
print("Label localization results in: " + str(df.loc[2, "Flow depth"]))
print("Same result with integer grid location: " + str(df.iloc[2, 0]))
Label localization results in: 0.2472994
Same result with integer grid location: 0.2472994
Reshape Data Frames¶
Single or multiple rows (index) and columns can be extracted from and combined into new or existing DataFrame
objects:
print(pd.DataFrame([df["Flow depth"], df["Sediment"]]))
0 1 2 3
Flow depth 0.210366 0.23489 0.247299 0.164717
Sediment yes no yes no
The df.stack() method pivots the columns of a data frame, which is a powerful tool to classify data that can take different dimensions (e.g., the volume and weight of 1 m³ of water - read more about the stack method).
print(df.stack()[0])
df.unstack() # unstack data frame
Flow depth 0.210366
Sediment yes
Flow regime fluvial
Water Always there
dtype: object
Flow depth 0 0.210366
1 0.23489
2 0.247299
3 0.164717
Sediment 0 yes
1 no
2 yes
3 no
Flow regime 0 fluvial
1 fluvial
2 supercritical
3 critical
Water 0 Always there
1 Always there
2 Always there
3 Always there
dtype: object
Big datasets often contain large amounts of data with many labels, of which we are often only interested in a small subset. Data frame subsets can be created with df.pivot(index, columns, **values) (pivot method):
print("Pivot table for \'Flow regime\':\n" + str(df.pivot(index="Sediment", columns="Flow depth")["Flow regime"]))
print("\nPivot table for \'Water\':\n" + str(df.pivot(index="Sediment", columns="Flow depth")["Water"]))
Pivot table for 'Flow regime':
Flow depth 0.164717 0.210366 0.234890 0.247299
Sediment
no critical NaN fluvial NaN
yes NaN fluvial NaN supercritical
Pivot table for 'Water':
Flow depth 0.164717 0.210366 0.234890 0.247299
Sediment
no Always there NaN Always there NaN
yes NaN Always there NaN Always there
In addition, df.pivot_table(index, columns, values, aggfunc) (pivot table function) enables inline office-like function application to one or more rows and/or columns.
print("\'mean\' for \'Flow depth\':\n" + str(df.pivot_table(index="Sediment", columns="Flow regime", values="Flow depth", aggfunc=np.mean)))
'mean' for 'Flow depth':
Flow regime critical fluvial supercritical
Sediment
no 0.164717 0.234890 NaN
yes NaN 0.210366 0.247299
Read more about reshaping and pivoting data frames in the developer’s docs.
File Handling (csv, Workbooks, and More)¶
pandas can read from and write to many data file types, which makes it extremely powerful in analyzing any data output. The following table summarizes the most important file types for numerical hydraulic, morphodynamic, and fluvial landscape analyses; more file type handlers can be found in the developer’s docs.
File type | pandas read | pandas write | Usage example
CSV | read_csv | to_csv | Reading from data loggers (e.g., discharge, flow depth)
Google BigQuery | read_gbq | to_gbq | Analyze social media
JSON | read_json | to_json | Manipulate BASEMENT model files
HTML | read_html | to_html | Process web site data
Python Pickle Format | read_pickle | to_pickle | Cache memory dump
SQL | read_sql | to_sql | Retrieve and write data to SQL databases
Workbooks (Excel / Open doc) | read_excel | to_excel | Interface with non-programmers (Open doc only works in read mode)
The following code block illustrates how the above-produced data/modifieddata.csv file can be loaded with new column names and saved to a workbook with pandas. pandas uses the xlsxwriter or openpyxl package to process workbooks, depending on which package is available in the Python environment.
measurement_data = pd.read_csv("data/modifieddata.csv", sep=",", header=None, names=["Test 1", "Test 2", "Test 3", "Test 4"])
print("Header of data/modifieddata.csv:\n" + str(measurement_data.head(3)))
measurement_data.to_excel("data/modifieddatawb.xlsx", sheet_name="2025-01-01 Tests")
Header of data/modifieddata.csv:
Test 1 Test 2 Test 3 Test 4
0 3.0 5.0 4.0 5.0
1 4.0 2.0 1.0 3.0
2 4.0 8.0 3.0 10.0
Note
pandas tries to convert all data into dtype=float, but as soon as there is only one text variable in a column, the entire column will be saved with a string data type in the workbook.
Alternatively, a pandas ExcelWriter object can be created to write multiple pd.DataFrame objects to a workbook, on one or more sheets. Here is an example, where the non-numeric "nan" strings are first replaced in measurement_data with np.nan to yield a purely numeric data frame in two steps (# (1) and # (2)):
measurement_data = measurement_data.replace("nan", np.nan, regex=True) # (1) replace "nan" with np.nan
measurement_data = measurement_data.apply(pd.to_numeric) # (2) convert all data to numeric
# write workbook with pd ExcelWriter object
with pd.ExcelWriter("data/modifieddatawbEW.xlsx") as writer:
    measurement_data.to_excel(writer, sheet_name="2025-01-01 Tests")
    df.to_excel(writer, sheet_name="pandas example")
Categorical Data¶
String variables that represent statistically relevant categories are the baseline for data classification and statistics. pandas provides a special data type (dtype="category") to facilitate statistical analyses.
In the above Froude number example, we used five classes to describe the flow regime as a function of the Froude number, which can serve as categories here. After an experiment, where no water was flowing in one test and the probe broke in the last test, we want to categorize our measurements to filter valid tests only:
flow_regimes = ["fluvial", "nearby critical (slow)", "critical", "nearby critical (fast)", "supercritical"]
observation_examples = ["fluvial", "dry", "critical", "nearby critical (slow)", "measurement error"]
Fr_cat = pd.Categorical(observation_examples, categories=flow_regimes, ordered=False)
print(pd.Series(Fr_cat))
0 fluvial
1 NaN
2 critical
3 nearby critical (slow)
4 NaN
dtype: category
Categories (5, object): ['fluvial', 'nearby critical (slow)', 'critical', 'nearby critical (fast)', 'supercritical']
Data Frame Statistics¶
pandas has efficient routines to perform workbook-like row or column sorting (e.g., df.sort_index() or df.sort_values(); a sorting sketch follows the summary table below) and enables the fast calculation of data frame statistics with df.describe(), where 25%, 50%, and 75% represent the 25th, 50th, and 75th percentiles:
measurement_data.describe()
       Test 1     Test 2     Test 3     Test 4
count  18.000000  16.000000  15.000000  18.000000
mean    4.111111   4.250000   4.533333   5.555556
std     2.298053   2.792848   2.386470   2.617188
min     1.000000   1.000000   1.000000   1.000000
25%     2.250000   2.000000   3.000000   4.000000
50%     4.000000   4.000000   4.000000   5.500000
75%     5.000000   6.000000   6.000000   7.000000
max     9.000000  10.000000   9.000000  10.000000
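As a brief sketch of the sorting methods mentioned above (applied to the measurement_data frame; the printed rows depend on the loaded data):
# sort by the values of one column (ascending=False yields descending order)
print(measurement_data.sort_values(by="Test 1", ascending=False).head(3))
# sort by the row index (ascending order by default)
print(measurement_data.sort_index().head(3))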
Statistical pandas data frame methods overlap with NumPy methods and include:

- df.abs() calculates absolute values
- df.cumprod() calculates the cumulative product
- df.cumsum() calculates the cumulative sum
- df.count() counts the number of non-null observations
- df.max() calculates the maximum value
- df.mean() calculates the mean (average)
- df.min() calculates the minimum value
- df.mode() calculates the mode
- df.prod() calculates the product
- df.std() calculates the standard deviation
- df.sum() calculates the sum
print("Mean:\n" + str(measurement_data.mean()))
print("Median:\n" + str(measurement_data.median()))
print("Standard deviation:\n" + str(measurement_data.std()))
Mean:
Test 1 4.111111
Test 2 4.250000
Test 3 4.533333
Test 4 5.555556
dtype: float64
Median:
Test 1 4.0
Test 2 4.0
Test 3 4.0
Test 4 5.5
dtype: float64
Standard deviation:
Test 1 2.298053
Test 2 2.792848
Test 3 2.386470
Test 4 2.617188
dtype: float64
pandas has many more built-in functionalities, for example to plot histograms or any data using the matplotlib library, and for machine learning. The following pages of the Python introduction on hydro-informatics occasionally make use of pandas and illustrate more functionalities.
Apply Custom (Own) Functions to Data Frames¶
pandas data frames have a built-in apply(fun) method that enables applying a custom function to (parts of) a pd.DataFrame object. We will use here the feet_to_meter function from the Functions page, which is available at the course repository in the fun/converter.py module (during lectures only; a standalone variant is sketched after the example output).
from fun.converter import feet_to_meter
# create data frame with random integers
df = pd.DataFrame({"Feets": np.random.randint(0, 100, size=6),
"Meters": np.ones(6) * np.nan})
# apply feet_to_meter to the Meters columns of the data frame
df["Meters"] = df["Feets"].apply(feet_to_meter)
print(df)
Feets Meters
0 59 17.9832
1 60 18.2880
2 85 25.9080
3 18 5.4864
4 20 6.0960
5 3 0.9144
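If the course repository (and with it fun.converter) is not available, an equivalent converter can be defined inline; this is only a sketch that assumes the conversion factor 1 ft = 0.3048 m:
def feet_to_meter(feet):
    # assumed conversion: 1 ft = 0.3048 m (mirrors the purpose of the course function)
    return feet * 0.3048

df["Meters"] = df["Feets"].apply(feet_to_meter)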
Dates and Time¶
pandas provides methods for calculations and labeling with date and time values via pd.Timestamp, which converts datetime-like strings into timestamps or creates timestamps from keyword arguments:
print(pd.Timestamp('2025-01-01T12'))
print(pd.Timestamp(year=2025, month=1, day=1, hour=12))
print(pd.Timestamp(2025, 1, 1, 12))
2025-01-01 12:00:00
2025-01-01 12:00:00
2025-01-01 12:00:00
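Beyond single timestamps, pandas can generate entire time series with pd.date_range, which is convenient for labeling pd.DataFrame rows (a minimal sketch; the column name discharge is only an example):
import pandas as pd
import numpy as np

dates = pd.date_range(start="2025-01-01", periods=4, freq="D")  # four daily timestamps
ts_df = pd.DataFrame({"discharge": np.random.random(4)}, index=dates)
print(ts_df)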
The expression pd.Timestamp(2025, 1, 1, 12) mimics the powerful datetime.datetime API (Application Programming Interface) of the datetime Python package, which provides sophisticated methods for handling time-dependent values. While pandas’ built-in timestamps are convenient for creating time series within pd.DataFrame objects and workbook-like tables, datetime is one of the best solutions for time-dependent calculations in Python. datetime is available by default (i.e., it does not need to be conda-installed) and is efficiently applicable, for example, to data that were collected over several years, including leap years. The datetime package comes with many attributes and methods, which are documented in detail in the Python docs.
The standard usage is:
import datetime as dt
start_date = dt.datetime(2024, 2, 25, 22, 30, 0)
end_date = dt.datetime(year=2024, month=3, day=2, hour=2, minute=15, second=30)
print("Datetime variables can be subtracted:\n" + str(end_date  start_date))
print("The result is a %s object." % type(end_date  start_date))
Datetime variables can be subtracted:
5 days, 3:45:30
The result is a <class 'datetime.timedelta'> object.
dt.timedelta
objects can also be separately defined:
time_diff = dt.timedelta(days=0, seconds=0, microseconds=0, milliseconds=0, minutes=0, hours=23, weeks=0)
act_time = start_date
print("Iterate from start to end date with stepsize=time_diff:")
while act_time <= end_date:
    print(act_time.strftime("%Y-%m(%h)-%d, %H:%M:%S"))
    act_time += time_diff
Iterate from start to end date with stepsize=time_diff:
2024-02(Feb)-25, 22:30:00
2024-02(Feb)-26, 21:30:00
2024-02(Feb)-27, 20:30:00
2024-02(Feb)-28, 19:30:00
2024-02(Feb)-29, 18:30:00
2024-03(Mar)-01, 17:30:00
That’s it for the introduction to data and file handling. There is much more to data processing than on this page and the next pages will occasionally feature more tools.
Exercise
Practice pandas and its csv file handling routines, as well as basic datetime handling in the Flood return period calculation exercise.