{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Files, NumPy & Pandas\n", "\n", "Basic (text) file handling, *NumPy*, *pandas*, and *DateTime*. For interactive reading and executing code blocks [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/hydro-informatics/jupyter-python-course/main) and find *b06-pybum.ipynb* or {ref}`install-python` locally along with {ref}`jupyter`.\n", "\n", "```{admonition} Watch this section in video format\n", ":class: tip, dropdown\n", "\n", "
Watch this section as a video on the @hydroinformatics channel on YouTube.
\n", "```\n", "\n", "# Load and Write Basic Data Files\n", "\n", "Data can be stored in many different (text) file formats such as *txt* or *csv* files. Python provides the `open(file)` and `write(...)` functions to read and write data from nearby every text file format. In addition, there are packages such as `csv` (for *csv* files), which simplify handling specific file types. The following sections illustrate the use of the `load(file)` and `write(...)` functions. The later shown *pandas* module provides more functions to import and export numeric data along with row and column headers.\n", "\n", "(open-modes)=\n", "## Load (Open) Text File Data \n", "\n", "The `open` command loads text files as file object in Python. The syntax of the `open` command is: " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```python\n", "open(\"file-name\", \"mode\")\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "where:\n", "\n", "* `file-name` is the file to open (e.g., `\"data.txt\"`); if the file is not in the script directory, the *filename* needs to be extended by the full directory (path) to the data file (e.g., `\"C:/experiment1/data.txt\"`).\n", "* `mode` defines the access type and it can take the following values:\n", " - `\"r\"` - read-only (default value if no `\"mode\"` value is provided); the file cannot be modified nor overwritten.\n", " - `\"rb\"` - read-only in binary format; the binary format is advantageous if the file is not a text file but media such as pictures or videos.\n", " - `\"r+\"` - read and write.\n", " - `\"w\"` - write-only; a new file is created if a file with the provided `file-name` does not yet exist.\n", " - `\"wb\"` - write-only in binary mode.\n", " - `\"w+\"` - create, write and read.\n", " - `\"wb+\"` - write and read in binary mode.\n", " - `\"a\"` - append new data to a file; the write-pointer is placed at the end of the file and a new file is created if a file with the provided `file name` does not yet exist.\n", " - `\"ab\"` - append new data in binary mode.\n", " - `\"a+\"` - both append (write at the end) and read.\n", " - `\"ab+\"` - append and read data in binary mode.\n", "\n", "When `\"r\"` or `\"w\"` modes are used, the file pointer (i.e, the blinking cursor that you can see, for example, in Word documents) is placed at the beginning of the file. For `\"a\"` modes, the file pointer is placed at the end of the file.\n", "\n", "It is good practice to read and write data from and to a file within a `with` statement to avoid file lock issues. For example, the following code block creates a new text file within a `with` statement:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "with open(\"data/new.csv\", mode=\"w+\") as file:\n", " file.write(\"And yet it moves.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(read)=\n", "## Read-only\n", "\n", "Once the file object is created, we can parse the file and copy the file data content to a desired Python {ref}`data type ` (e.g., a list, tuple or dictionary). Parsing the data works with {ref}`for-loops ` (other loop types will also work) to iterate on lines and line entries. The lines represent *strings* and data columns can be separated by using the built-in *string* function `line_as_list = str().split(\"SEPARATOR\")`, where `\"SEPARATOR\"` can be `\",\"` (comma), `\";\"` (semicolon), `\"\\t\"` (tab), or any other sign. 
After reading all data from a file, use `file_object.close()` to prevent Python from keeping the file locked, so that other programs can open it.\n", "\n", "The following example opens a text file called *pure-numbers.txt* ([download pure-numbers.txt](https://raw.githubusercontent.com/hydro-informatics/jupyter-python-course/main/data/pure-numbers.txt) into a local sub-folder called *data*) that contains *float* numbers between 0.0 and 10.0. The file has 17 data rows (e.g., for 17 experimental runs) and 4 data columns (e.g., for 4 measurements per experimental run), which are separated by a *TAB* (`\"\\t\"` separator). The below code block uses the built-in function `readlines()` to parse the file lines, splits the lines using the `\"\\t\"` separator, and loops over the line entries to append them to the list variable `data_list` only if `entry` is numeric (verified with the `try` - `except` statement). `data_list` is a nested list that is initiated at the beginning of the script and a sub-list (nested list) is appended for every file line (row)." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of rows: 17\n", "Number of columns: 4\n", "[[2.59, 5.44, 4.06, 4.87], [4.43, 1.67, 1.26, 2.97], [4.04, 8.07, 2.8, 9.8], [2.25, 5.32, 0.04, 5.57], [6.26, 6.15, 5.98, 8.91], [7.93, 0.85, 5.88, 5.4], [4.72, 1.29, 4.18, 2.46], [7.03, 1.43, 5.53, 9.7], [5.2, 7.87, 1.44, 1.13], [3.18, 5.38, 3.6, 7.32], [5.37, 0.62, 5.29, 4.26], [3.48, 2.26, 3.11, 7.3], [1.36, 1.68, 3.38, 6.4], [1.68, 2.31, 9.29, 3.59], [1.33, 1.73, 3.98, 5.74], [2.38, 9.69, 0.06, 4.16], [9.3, 6.47, 9.14, 3.33]]\n" ] } ], "source": [ "file_object = open(\"data/pure-numbers.txt\") # read file with default \"mode\"=\"r\"\n", "\n", "data_list = [] # this will be a nested list with 17 sub-lists (rows) containing 4 entries (columns)\n", "\n", "for line in file_object.readlines():\n", "    line_as_list = line.split(\"\\t\") # converts the line into a list using a tab (\\t) separator\n", "    data_list.append([]) # append an empty sub-list for every file line (17 rows)\n", "    for entry in line_as_list:\n", "        try:\n", "            # try to append the entry as floating point number to the last sub-list, which is pointed at using [-1]\n", "            data_list[-1].append(float(entry))\n", "        except ValueError:\n", "            # if entry is not numeric, skip it and print a warning message\n", "            print(\"Warning: %s is not a number. Skipping this entry.\" % str(entry))\n", "\n", "# verify that data_list contains the 17 rows (sub-lists) with the built-in list function __len__()\n", "print(\"Number of rows: %d\" % data_list.__len__())\n", "\n", "# verify that the first sub-list has four entries (number of columns)\n", "print(\"Number of columns: %d\" % data_list[0].__len__())\n", "\n", "file_object.close() # close file (otherwise it will be locked as long as Python is still running!) alternative: use a with statement\n", "print(data_list) # print the data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{tip}\n", "Recall the `with` statement from the above example. With the `with` statement, we do not have to write `file.close()`.\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(create)=\n", "## Create and Write Files \n", "\n", "A file is created with the `\"w\"` or `\"a\"` modes (e.g., `open(file_name, mode=\"a\")`).\n", "\n", "```{tip}\n", "When `mode=\"w\"`, the provided file is opened with the pointer at position zero. 
Writing then starts at position zero, so any existing data in the opened file will be overwritten. To avoid overwriting data in an existing file use `mode=\"a\"`.\n", "```\n", "\n", "Imagine that the above-loaded `data_list` represents measurements in *mm* and we know that the precision of the measuring device was 1.0 *mm*. Thus, all data smaller than 1.0 are within the device error margin, which we want to exclude from further analyses by overwriting such values with **nan** (**not-a-number**). For this purpose, we first create a new list variable `new_data_list`, where we append *nan* values if an entry of `data_list` is `<= 1.0` and otherwise preserve the original numeric value of `data_list`.\n", "With `open(\"data/modified-data.csv\", mode=\"w+\")`, we create a new *csv* (comma-separated values) file in the *data* sub-folder. A *for-loop* iterates on the sub-lists of `new_data_list` and joins them with a comma-separator. In order to join the list elements of `i` (i.e., the sub-lists) with `\", \".join(list_of_strings)`, all list entries first need to be converted to *strings*, which is achieved through the expression `[str(e) for e in row]`. The `\"\\n\"` *string* needs to be concatenated at the end of every line to create a line break (`\"\\n\"` itself will not be visible in the file). The command `new_file.write(new_line)` writes the sub-list-converted-to-string to the file `\"data/modified-data.csv\"`. Once again, `new_file.close()` is needed to prevent Python from locking the new *csv* file (alternatively: use a `with` statement)." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[2.59, 5.44, 4.06, 4.87], [4.43, 1.67, 1.26, 2.97], [4.04, 8.07, 2.8, 9.8], [2.25, 5.32, 'nan', 5.57], [6.26, 6.15, 5.98, 8.91], [7.93, 'nan', 5.88, 5.4], [4.72, 1.29, 4.18, 2.46], [7.03, 1.43, 5.53, 9.7], [5.2, 7.87, 1.44, 1.13], [3.18, 5.38, 3.6, 7.32], [5.37, 'nan', 5.29, 4.26], [3.48, 2.26, 3.11, 7.3], [1.36, 1.68, 3.38, 6.4], [1.68, 2.31, 9.29, 3.59], [1.33, 1.73, 3.98, 5.74], [2.38, 9.69, 'nan', 4.16], [9.3, 6.47, 9.14, 3.33]]\n" ] } ], "source": [ "# create a new list and overwrite all values <= 1.0 with nan\n", "new_data_list = []\n", "for i in data_list:\n", "    new_data_list.append([])\n", "    for j in i:\n", "        if j <= 1.0:\n", "            new_data_list[-1].append(\"nan\")\n", "        else:\n", "            new_data_list[-1].append(j)\n", "\n", "print(new_data_list)\n", "# write the modified new_data_list to a new text file\n", "new_file = open(\"data/modified-data.csv\", mode=\"w+\") # let's just use csv: Python does not care about the file ending (could also be file.wayne)\n", "for row in new_data_list:\n", "    new_line = \", \".join([str(e) for e in row]) + \"\\n\"\n", "    new_file.write(new_line)\n", "new_file.close()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Modify Existing Files\n", "\n", "Existing text files can be opened and modified in either `mode=\"r+\"` (assuming that information needs to be read before it is modified) or `mode=\"a+\"`. Recall that `\"r+\"` will place the pointer at the beginning of the file and `\"a+\"` will place the pointer at the end of the file. Thus, if we want to modify lines or entries of an existing file, `\"r+\"` is the right choice, and if we want to append data at the end of the file, `\"a\"` is the right choice (`+` is not strictly needed in the case of `\"a\"`). 
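The following quick check prints the initial pointer position for both modes with `file.tell()` (a sketch that assumes the above-created *data/modified-data.csv* exists):\n", "\n", "```python\n", "with open(\"data/modified-data.csv\", mode=\"r+\") as file:\n", "    print(file.tell())  # 0 - pointer at the beginning of the file\n", "with open(\"data/modified-data.csv\", mode=\"a\") as file:\n", "    print(file.tell())  # file size in bytes - pointer at the end of the file\n", "```\n", "\n", "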
This section shows two examples: (1) modification of existing data in a file using `\"r+\"`, and (2) appending data to an existing file using `\"a\"`.\n", "\n", "Example 1 - Replace data in an existing file with `\"r+\"`\n", ": In the previous code block, we eliminated all measurements that were smaller than 1 *mm* because of the precision of the measurement device. However, we have retained all other values with two-digit accuracy - an accuracy that the device does not provide. Consequently, all decimal places in the measurements must also be eliminated. To achieve this, we have to round all measured values with Python's built-in round function (`round(number, n-digits)`) to zero decimal places (i.e., `n-digits = 0`).\n", " In this example (featured in the below code block), an exception `IOError` is raised when the file `\"data/modified-data.csv\"` does not exist (or if it is locked by other software). An `if` statement ensures that rounding the data is only attempted if the file exists.\n", " The overwriting procedure first reads all lines of the file into the `lines` variable. After reading all lines, the pointer is at the end of the file, and `file.seek(0)` puts the pointer back to position 0 (i.e., at the beginning of the file). `file.truncate()` purges the file. Thus, the original file is blank for a moment and all file contents are stored in the `lines` variable. Rounding the data happens within a *for-loop* that:\n", "\n", " * Splits the comma-separated line *string* (produces `line_as_list`).\n", " * Creates the temporary list `_numeric_line_`, where rounded, numeric values are stored (the variable is overwritten in every iteration).\n", " * Loops over the line entries (`line_as_list`), where a `try` - `except` statement appends numeric values rounded to zero digits and keeps the `\"nan\"` *string* when an entry is not numeric.\n", " * Writes the modified line to the `\"data/modified-data.csv\"` *csv* file.\n", "\n", " Finally, the *csv* is closed with `modified_file.close()`." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Processed file.\n" ] } ], "source": [ "try:\n", "    modified_file = open(\"data/modified-data.csv\", mode=\"r+\") # re-open the above data file in read-write\n", "except IOError:\n", "    print(\"The file does not exist.\")\n", "\n", "if modified_file:\n", "    # go here only if the file exists\n", "    lines = modified_file.readlines() # read lines > pointer moves to file end\n", "    modified_file.seek(0) # return pointer to file beginning\n", "    modified_file.truncate() # clear file content\n", "    for line in lines:\n", "        line_as_list = line.split(\", \") # converts the line into a list using comma separator\n", "        _numeric_line_ = []\n", "        for e in line_as_list:\n", "            try:\n", "                _numeric_line_.append(round(float(e), 0)) # try to convert line entry to float and round to 0 digits\n", "            except ValueError:\n", "                _numeric_line_.append(e) # for nan values\n", "        # write rounded values\n", "        modified_file.write(\", \".join([str(e) for e in _numeric_line_]) + \"\\n\")\n", "    print(\"Processed file.\")\n", "    modified_file.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Theoretically, the above code snippet can be rewritten as a function to modify any data in a file. 
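Such a function could look like the following sketch, where the function name and the per-entry `modifier` argument are illustrative assumptions (not part of the original script):\n", "\n", "```python\n", "def modify_file(file_name, modifier):\n", "    # apply modifier to every comma-separated entry of a text file (in place)\n", "    with open(file_name, mode=\"r+\") as file:\n", "        lines = file.readlines()  # read lines > pointer moves to file end\n", "        file.seek(0)  # return pointer to file beginning\n", "        file.truncate()  # clear file content\n", "        for line in lines:\n", "            new_entries = []\n", "            for e in line.split(\", \"):\n", "                try:\n", "                    new_entries.append(modifier(e))\n", "                except ValueError:\n", "                    new_entries.append(e)  # keep the entry as-is if it cannot be converted\n", "            file.write(\", \".join([str(e) for e in new_entries]) + \"\\n\")\n", "\n", "# usage example: round all numeric entries to zero digits\n", "modify_file(\"data/modified-data.csv\", lambda e: round(float(e), 0))\n", "```\n", "\n", "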
In addition, other threshold values or particular data ranges can be filtered using `if` - `else` statements.\n", "\n", "Example 2 - Append data to an existing file with `\"a\"`\n", ": By coincidence, you find a hand-written measurement protocol that has data of an 18th experimental run, which is not in the electronic measurement data file due to a data transmission error. Now, you want to add the data to the above-produced *csv* file. Entering the data does not take much work, because only 4 measurements were performed per experimental run and the below code block contains the hand-written data in a list variable called `forgotten_data`.\n", " This example uses the `os` module (recall {ref}`sec-pypckg`) to verify if the data file exists with `os.path.isfile()` (the `os.getcwd()` statement is just a gadget here that prints the current working directory). The code block features the usage of a `with` statement (i.e., a `with` context manager, also referred to as a namespace). \n", "\n", " The essential part of the code that writes the line to the data file is `file.write(line)`, where `line` corresponds to the above-introduced `\", \".join(list-of-strings) + \"\\n\"` *string*." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "print(os.getcwd())\n", "forgotten_data = [4.0, 3.0, \"nan\", 8.0]\n", "\n", "if os.path.isfile(\"data/modified-data.csv\"):\n", "    with open(\"data/modified-data.csv\", mode=\"a\") as file_object:\n", "        file_object.write(\", \".join([str(e) for e in forgotten_data]) + \"\\n\")\n", "    print(\"Data appended.\")\n", "else:\n", "    print(\"The file does not exist.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{admonition} Challenge\n", "The expression `\", \".join([str(e) for e in a_list]) + '\\n'` recurs in many of the above code blocks. What would a function look like that automatically generates this expression for lists of different data types?\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(numpy)=\n", "# NumPy \n", "\n", "*NumPy* provides high-level mathematical functions for linear algebra including operations on multi-dimensional arrays and matrices. The open-source *NumPy* (for *Numerical Python*) library is written in Python and [C](https://en.wikipedia.org/wiki/C_(programming_language)), and comes with comprehensive documentation ([download the latest version on the developer's web site](https://numpy.org/doc/) or [read the developer's online tutorial](https://numpy.org/devdocs/user/quickstart.html)).\n", "\n", "```{admonition} Watch the NumPy section in video format\n", ":class: tip, dropdown\n", "\n", "
Watch this section as a video on the @hydroinformatics channel on YouTube.
\n", "```\n", "\n", "## Installation\n", "\n", "*NumPy* can be installed through *Anaconda* ({ref}`recall instructions `) and the developers recommend using a scientific Python distribution (*Anaconda*) with [*SciPy Stack*](https://www.scipy.org/install.html).\n", "\n", "The provided *Anaconda* [environment.yml (`flussenv`)](https://raw.githubusercontent.com/Ecohydraulics/flusstools-pckg/main/environment.yml) already includes *NumPy* (more information in the {ref}`installation ` section). Similarly, Linux users will have *NumPy* installed in a virtual environment (e.g., `vflussenv`) with *pip* (recall {ref}`pip-installing flusstools `). Otherwise, to install *NumPy* in any other *conda* environment, open *Anaconda Prompt* (*Start* > type *Anaconda Prompt*) and type:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```python\n", "conda activate ENVIRONMENT-NAME\n", "conda install numpy\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To pip-install *NumPy* in any other virtual environment tap:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```python\n", "pip install numpy\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Usage\n", "\n", "The *NumPy* library is typically imported with **`import numpy as np`**. Array handling is the foundation of *NumPy* and linear algebra, where arrays represent a kind of nested data lists. To create a *NumPy* array, use `np.array((values))`, where `values` is a sequences of values. \n", "\n", "```{tip}\n", "This section provides insights into basic *NumPy* functions and it does not (rather: cannot) cover all *NumPy* functions and data types. Generally speaking, be sure that whatever mathematical operation you want to perform, *NumPy* offers a solution. Check out the [*NumPy* documentation](https://numpy.org/devdocs/user/quickstart.html), [have a look at *NumPy*'s built-in functions and methods overview](https://numpy.org/devdocs/user/quickstart.html#functions-and-methods-overview), or use your favorite search engine with the search words **numpy** ***FUNCTION***.\n", "```\n", "\n", "The following code block shows very basic usage of *NumPy* (or: numpy) imported as `np` and the creation of a *2x3* numpy array. The rounded parentheses indicated that the value sequence of the `np.array` represents a tuple for creating a multi-dimensional array." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[2 3 1]\n", " [4 5 6]]\n" ] } ], "source": [ "import numpy as np\n", "an_array = np.array(([2, 3, 1], [4, 5, 6]))\n", "print(an_array)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*NumPy* arrays (data type: *ndarray*) have many built-in features, for example to output the array size:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Array dimensions: (2, 3)\n", "Total number of array elements: 6\n", "Number of array axes: 2\n" ] } ], "source": [ "print(type(an_array))\n", "print(\"Array dimensions: \" + str(an_array.shape))\n", "print(\"Total number of array elements: \" + str(an_array.size))\n", "print(\"Number of array axes: \" + str(an_array.ndim))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are many types of `np.array`s and many ways to create them:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[2 3 1]\n", " [4 5 6]]\n", "[[2.+0.j 3.+0.j 1.+0.j]\n", " [4.+0.j 5.+0.j 6.+0.j]]\n" ] } ], "source": [ "print(np.array([(2, 3, 1), (4, 5, 6)])) # the same as an_array\n", "print(np.array([[2, 3, 1], [4, 5, 6]], dtype=complex))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Arrays of zeros or ones or empty arrays can be created with *integer* or *float* data types. When creating such arrays, be aware of using tuples (i.e., sequences embraced with rounded parentheses) to define array dimensions:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[0. 0. 0. 0. 0. 0.]\n", " [0. 0. 0. 0. 0. 0.]]\n", "[[1. 1. 1. 1. 1. 1.]\n", " [1. 1. 1. 1. 1. 1.]]\n", "[[1. 1. 1. 1. 1. 1.]\n", " [1. 1. 1. 1. 1. 1.]]\n", "[[2 0 3 0 1 0]\n", " [4 0 5 0 6 0]]\n" ] } ], "source": [ "print(np.zeros((2,6)))\n", "print(np.ones((2,6), dtype=np.float64)) # other dtypes: int16, np.int16, float, np.float32, np.complex32\n", "print(np.empty((2,6)))\n", "print(np.empty((2,6), dtype=np.int16))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{admonition} Data type sizes\n", ":class: note\n", "*NumPy* data types have different sizes (in [bytes](https://en.wikipedia.org/wiki/Byte)) and the more digits, the larger the variable size. For example, `np.float64` has an item size of 8 bytes (64/8), while `np.float32` has an item size of 4 bytes (32/8) only. Use `ndarray.itemsize` (e.g., `an_array.itemsize`) to find out the size of an array in bytes. For analyses of large datasets, the data type gets very important regarding computation speed and storage.\n", "```\n", "\n", "*NumPy* provides the `arange(start, end, step-size)` function to create numeric sequences. Such sequences represent arrays (`ndarray`) that can later be reshaped (i.e., re-organized in columns and rows). " ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1D array:\n", "[0 2 4 6 8]\n", "\n", "2D array:\n", "[[ 0 2 4]\n", " [ 6 8 10]]\n", "\n", "3D array:\n", "[[[ 1 2 3]\n", " [ 4 5 6]]\n", "\n", " [[ 7 8 9]\n", " [10 11 12]]]\n", "\n", "1D Linspace (start, end, number-of-elements):\n", "[0. 
1.57079633 3.14159265]\n" ] } ], "source": [ "print(\"1D array:\")\n", "print(np.arange(0, 10, 2)) # 1D array\n", "print(\"\\n2D array:\")\n", "print(np.arange(0, 12, 2).reshape(2, 3)) # 2D array\n", "print(\"\\n3D array:\")\n", "print(np.arange(1, 13, 1).reshape(2, 2, 3)) # 3D array\n", "print(\"\\n1D Linspace (start, end, number-of-elements):\")\n", "print(np.linspace(0, np.pi, 3))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Random numbers can be generated with *NumPy*'s random number generator `np.random` and its `.random(shape_tuple)` function, where the tuple defines the dimensions of the output array." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[0.0204844 0.91185321 0.00152947 0.79774412]\n", " [0.45685876 0.65600015 0.55038482 0.03690686]]\n" ] } ], "source": [ "rand_array = np.random.random((2,4))\n", "print(rand_array)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Built-in array functions enable finding minimum or maximum values, or sums of arrays:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sum of 12-elements ones-array: 12.0\n", "Minimum: 1\n", "Maximum: 6\n" ] } ], "source": [ "print(\"Sum of 12-elements ones-array: \" + str(np.ones((2,6)).sum()))\n", "print(\"Minimum: \" + str(an_array.min()))\n", "print(\"Maximum: \" + str(an_array.max()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(colors)=\n", "## Color Arrays \n", "\n", "Arrays may also contain color information, where colors represent a mix of the three base colors red, green, and blue (**RGB**). Thus, one color can be defined as `[red-value, green-value, blue-value]`, and a value of `0` means that a color tone is not present, while `255` is its **maximum** value. There is no color when all color tone values are zero, which corresponds to *black*; when all color tones are maximum (255), the color mix corresponds to *white*. This way, array elements can be lists of color tones, and plotting such arrays produces images. 
The following example produces an array with 5 color-list elements, which could be plotted as a very basic image with 5 pixels (one black, one red, one green, one blue, and one white pixel, respectively):" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "color_set = np.array([[0, 0, 0], # black\n", "                      [255, 0, 0], # red\n", "                      [0, 255, 0], # green\n", "                      [0, 0, 255], # blue\n", "                      [255, 255, 255]]) # white" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(array-matrix-operations)=\n", "## Array (Matrix) Operations\n", "\n", "Array calculations (matrix operations) follow the rules of linear algebra:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Subtraction: [[ 0.1115262 -0.48000352]\n", " [ 0.1494478 0.31398052]\n", " [ 0.38778125 0.18341211]\n", " [ 0.0014262 0.5411479 ]]\n", "Element-wise product: [[0.78023517 0.24540468]\n", " [0.01212987 0.11177306]\n", " [0.04761943 0.21543691]\n", " [0.37867353 0.05979503]]\n", "Matrix product (option 1): [[1.218658 1.0311045 ]\n", " [0.73402296 0.63240968]]\n", "Matrix product (option 2): [[1.218658 1.0311045 ]\n", " [0.73402296 0.63240968]]\n" ] } ], "source": [ "A = np.random.random((2,4))\n", "B = np.random.random((4,2))\n", "print(\"Subtraction: \" + str(A.transpose() - B))\n", "print(\"Element-wise product: \" + str(A.transpose() * B))\n", "print(\"Matrix product (option 1): \" + str(A @ B))\n", "print(\"Matrix product (option 2): \" + str(A.dot(B)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Further element-wise calculations include exponential (`**`), trigonometric (`np.sin`, `np.cos`, `np.tan`, etc.), and boolean operators:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A to the power of 3: [[0.83278802 0.00897507 0.11465192 0.23383377]\n", " [0.02992313 0.14581371 0.18020002 0.25637808]]\n", "Exponential: [[2.56210893 1.23098678 1.62548021 1.85165172]\n", " [1.36404919 1.69272506 1.7591499 1.88753709]]\n", "Square root: [[0.96996429 0.45586852 0.6969959 0.7849064 ]\n", " [0.55718724 0.72549272 0.75155218 0.79704006]]\n", "Sine of A times 3: [[2.42414331 0.61897047 1.40075657 1.73351616]\n", " [0.91648323 1.50711546 1.60581841 1.78019148]]\n", "Boolean where A is smaller than 0.3: [[False True False False]\n", " [False False False False]]\n" ] } ], "source": [ "print(\"A to the power of 3: \" + str(A**3))\n", "print(\"Exponential: \" + str(np.exp(A)))\n", "print(\"Square root: \" + str(np.sqrt(A)))\n", "print(\"Sine of A times 3: \" + str(np.sin(A) * 3))\n", "print(\"Boolean where A is smaller than 0.3: \" + str(A < 0.3))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Array Shape Manipulation\n", "\n", "Sometimes it is necessary to flatten a multi-dimensional array into a vector or recast the shape of an array. 
Beyond the `reshape()` function, there are a couple of other options to manipulate the shape of an array:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Flattened matrix A (into a vector):\n", "[0.94083072 0.20781611 0.48580329 0.61607806 0.31045762 0.52633969\n", " 0.56483068 0.63527285]\n", "\n", "Transpose matrix A and append B:\n", "[[[0.94083072 0.31045762]\n", " [0.20781611 0.52633969]\n", " [0.48580329 0.56483068]\n", " [0.61607806 0.63527285]]\n", "\n", " [[0.82930452 0.79046114]\n", " [0.05836831 0.21235917]\n", " [0.09802204 0.38141858]\n", " [0.61465186 0.09412495]]]\n", "\n", "Transpose matrix A and append B and cast into a (4x4) array:\n", "[[0.94083072 0.31045762 0.20781611 0.52633969]\n", " [0.48580329 0.56483068 0.61607806 0.63527285]\n", " [0.82930452 0.79046114 0.05836831 0.21235917]\n", " [0.09802204 0.38141858 0.61465186 0.09412495]]\n" ] } ], "source": [ "print(\"Flattened matrix A (into a vector):\\n\" + str(A.ravel()))\n", "print(\"\\nTranspose matrix A and append B:\\n\" + str(np.array([A.transpose(), B])))\n", "print(\"\\nTranspose matrix A and append B and cast into a (4x4) array:\\n\" + str(np.array([A.transpose(), B]).reshape(4,4)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## *NumPy* File Handling and `np.nan`\n", "\n", "In the above examples on file handling, measurement data were loaded from text files, manipulated (modified), and (re-)written. The data manipulation involved the introduction of `\"nan\"` (*not-a-number*) values to replace measurements *<1 mm*, which were considered errors. Why didn't we use zeros here? Zeros are numbers, too, and have a significant effect on data statistics (e.g., for calculating mean values). However, the `\"nan\"` *string* value may cause difficulties in data handling, in particular regarding the consistency of function output. With the `np.nan` data type, *NumPy* provides a powerful alternative to the tedious `\"nan\"` *string*.\n", "\n", "*NumPy* also has a text file load function called `np.loadtxt(file-name, *args, **kwargs)`, which imports text files as `np.array`s of *float* values. The default *float* value type can be adapted with the optional keyword `dtype`. Other optional keyword arguments are:\n", "\n", "* `delimiter=STR` (e.g., `delimiter=';'`), where the default is `None` (i.e., any whitespace)\n", "* `usecols=TUPLE` (e.g., `usecols=(1, 3)` extracts the 2nd and 4th column), where a single *integer* value is also possible for reading just one column\n", "* `skiprows=INT` (e.g., `skiprows=2` skips the first two lines), where the default is `0`\n", "* more arguments are available and listed in the [*NumPy* documentation](https://numpy.org/doc/stable/reference/generated/numpy.loadtxt.html).\n", "\n", "The following example loads the above-created *csv* file *data/modified-data.csv* containing *integer* and `\"nan\"` *string* values, which are automatically converted to `np.nan`." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This is the 4th data line (row): [ 2. 5. 
nan 6.]\n", "The data type of the 3rd (nan) entry is: <class 'numpy.float64'>\n" ] } ], "source": [ "experiment_data = np.loadtxt(\"data/modified-data.csv\", delimiter=\",\")\n", "print(\"This is the 4th data line (row): \" + str(experiment_data[3, :]))\n", "print(\"The data type of the 3rd (%s) entry is: \" % str(experiment_data[3, 2]) + str(type(experiment_data[3, 2])))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In addition, or as an alternative, the function `np.load()` picks up data from file-like `.npz`, `.npy`, or pickled (saved Python objects) data sources (more information is available in the [*NumPy* docs](https://numpy.org/doc/stable/reference/generated/numpy.load.html#numpy.load)). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Statistics\n", "The above examples featured array functions to assess basic array statistics such as the minimum and maximum. *NumPy* provides many more functions for array statistics such as the mean, median, or standard deviation, including functions that account for `np.nan` values. The following example illustrates some of the statistical functions with the experimental data from the above examples. Note the usage of `nanmean` instead of `mean` and of statistics along array axes, where the optional keyword argument `axis=0` corresponds to statistics along columns and `axis=1` to statistics along rows in 2-dimensional arrays (the maximum axis number corresponds to the array dimensions *n* minus 1, i.e., maximum `axis=n-1`). " ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mean value (without nan): nan\n", "Mean value with np.nan: 4.626865671641791\n", "Mean value along axis 0 (columns): [4.11111111 4.25 4.53333333 5.55555556]\n", "Mean value along axis 1 (rows): [4.25 2.5 6.25 4.33333333 6.75 6.33333333\n", " 3. 6. 3.75 4.75 4.66666667 3.75\n", " 3. 4.25 3.25 5.33333333 6.75 5. ]\n" ] } ], "source": [ "print(\"Mean value (without nan): \" + str(np.mean(experiment_data))) # no applicable result\n", "print(\"Mean value with np.nan: \" + str(np.nanmean(experiment_data))) \n", "print(\"Mean value along axis 0 (columns): \" + str(np.nanmean(experiment_data, axis=0))) \n", "print(\"Mean value along axis 1 (rows): \" + str(np.nanmean(experiment_data, axis=1))) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following tables provide an overview of statistical functions in *NumPy* (source: *NumPy* v.1.13 docs). The listed functions only represent the baseline and *NumPy* provides many more options, which can be leveraged using any search engine with *NumPy* and the desired function as a search keyword." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "\n", "Basic statistic functions\n", "\n", "| Function | Description |\n", "|---------------------------------------|---------------------------------------------------------------------------------------------|\n", "| [`nanmin(a[, axis, out, keepdims])`](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.nanmin.html#numpy.nanmin) | Minimum of an array or along an axis, ignoring `np.nan`. |\n", "| [`nanmax(a[, axis, out, keepdims])`](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.nanmax.html#numpy.nanmax) | Maximum of an array or along an axis, ignoring `np.nan`. |\n", "| [`ptp(a[, axis, out])`](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.ptp.html#numpy.ptp) | Range of values (max - min) along an axis. 
|\n", "| [`percentile(a, q[, axis, out, ...])`](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.percentile.html#numpy.percentile) | q-th percentile of data along a specified axis. |\n", "| [`nanpercentile(a, q[, axis, out, ...])`](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.nanpercentile.html#numpy.nanpercentile) | q-th percentile of data along a specified axis, ignoring `np.nan`. |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "\n", "Mean (average), standard deviation, and variances\n", "\n", "| Function | Description |\n", "|---------------------------------------------------|-------------------------------------------------------------------------------|\n", "| [`median(a[, axis, out, overwrite_input, keepdims])`](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.median.html#numpy.median) | Median along an (optional) axis. |\n", "| [`average(a[, axis, weights, returned])`](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.average.html#numpy.average) | Weighted average along an (optional) axis. |\n", "| [`mean(a[, axis, dtype, out, keepdims])`](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.mean.html#numpy.mean) | Arithmetic mean along an (optional) axis. |\n", "| [`std(a[, axis, dtype, out, ddof, keepdims])`](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.std.html#numpy.std) | Standard deviation along an (optional) axis. |\n", "| [`var(a[, axis, dtype, out, ddof, keepdims])`](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.var.html#numpy.var) | Variance along an (optional) axis. |\n", "| [`nanmedian(a[, axis, out, overwrite_input, ...])`](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.nanmedian.html#numpy.nanmedian) | Median along an (optional) axis, ignoring `np.nan`. |\n", "| [`nanmean(a[, axis, dtype, out, keepdims])`](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.nanmean.html#numpy.nanmean) | Arithmetic mean along an (optional) axis, ignoring `np.nan`. |\n", "| [`nanstd(a[, axis, dtype, out, ddof, keepdims])`](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.nanstd.html#numpy.nanstd) | Standard deviation along an (optional) axis, while ignoring `np.nan`. |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "\n", "Correlating data (arrays)\n", "\n", "| Function | Description |\n", "|------------------------------------------------|---------------------------------------------------------|\n", "| [`corrcoef(x[, y, rowvar, bias, ddof])`](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.corrcoef.html#numpy.corrcoef) | Pearson (product-moment) correlation coefficients. |\n", "| [`correlate(a, v[, mode])`](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.correlate.html#numpy.correlate) | Cross-correlation of two 1-dimensional sequences. |\n", "| [`cov(m[, y, rowvar, bias, ddof, fweights, ...])`](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.cov.html#numpy.cov) | Estimate covariance matrix, based on data and weights. 
|" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "\n", "Generate and plot histograms\n", "\n", "| Function | Description |\n", "|---------------------------------------------------|----------------------------------------------------------------------------|\n", "| [`histogram(a[, bins, range, normed, weights, ...])`](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.histogram.html#numpy.histogram) | Histogram of a set of data. |\n", "| [`histogram2d(x, y[, bins, range, normed, weights])`](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.histogram2d.html#numpy.histogram2d) | Bi-dimensional histogram of two data samples. |\n", "| [`histogramdd(sample[, bins, range, normed, ...])`](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.histogramdd.html#numpy.histogramdd) | Multidimensional histogram of some data. |\n", "| [`bincount(x[, weights, minlength])`](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.bincount.html#numpy.bincount) | Count number of occurrences of each value in array of non-negative ints. |\n", "| [`digitize(x, bins[, right])`](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.digitize.html#numpy.digitize) | Indices of the bins to which each value in input array belongs. |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(numpy-matlab)=\n", "## Can *NumPy* do *MATLAB*®?\n", "\n", "Are you considering switching to Python after starting softly into programming with *MATLAB®*-like software? There are many reasons for enhancing data analyses with Python and here are some facilitators for previous *MATLAB®* users:\n", "\n", "* *MATLAB®* matrices can be loaded and saved with [`scipy.io.loadmat(matrix-file-name)`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.io.loadmat.html#scipy.io.loadmat) (use `import scipy`).\n", "* *NumPy*'s `np.array` replaces *MATLAB®*'s matrix notation (even though there is the historic, deprecated *NumPy* data type `np.matrix`).\n", "* Import many *MATLAB®* features from `np.matlib` (e.g., `from numpy.matlib import rand, zeros, ones, empty, eye)` or more generally `import numpy.matlib as M`).\n", "* Find the *NumPy* equivalent of many *MATLAB®* function in the [*NumPy* documentation](https://numpy.org/doc/stable/user/numpy-for-matlab-users.html#table-of-rough-matlab-numpy-equivalents).\n", "* To emulate *MATLAB®*'s plot functions use the `pylab` package and import it as `from pylab import *`.
⚠ This overwrites all other (standard) definitions of the `plot()` function and of `array()` objects, which is why this usage is deprecated. [Read the plotting section](pyplot) for comprehensive plotting instructions with *Python*.\n", "\n", "*MATLAB® is a registered trademark of The MathWorks.*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{admonition} Exercise\n", "Practice *numpy* and *csv* file handling in the {ref}`reservoir design exercise `.\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(pandas)=\n", "# Pandas\n", "\n", "*pandas* is a powerful library for data analysis and manipulation with Python. It can handle *NumPy* arrays, and both packages jointly represent a powerful data processing engine. The power of *pandas* lies in processing data frames, data labeling (e.g., workbook-like column names), and flexible file handling functions (e.g., the built-in `read_csv(csv-file)` function). While *NumPy* arrays enable calculations with multidimensional arrays (beyond 2-dimensional tables) and low memory consumption, *pandas* `DataFrame`s efficiently process and label tabular data with more than ~100,000 rows. Because of its labeling capacity, *pandas* also finds broad application in machine learning. In summary, *pandas*' functionality builds on top of *NumPy* and both libraries are maintained by the [*SciPy*](https://www.scipy.org/) (*Scientific computing tools for Python*) community that also produces `matplotlib` (see the {ref}`plotting section `) and *IPython* (*Jupyter*'s *Python* kernel).\n", "\n", "```{admonition} Watch the pandas section on YouTube\n", ":class: tip, dropdown\n", "\n", "
Watch this section as a video on the @hydroinformatics channel on YouTube.
\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Installation\n", "\n", "*pandas* can be installed through *Anaconda* ({ref}`recall instructions `) and the developers recommend using a scientific Python distribution (*Anaconda*) with [*SciPy Stack*](https://www.scipy.org/install.html).\n", "\n", "The provided *Anaconda* [environment.yml (`flussenv`)](https://raw.githubusercontent.com/Ecohydraulics/flusstools-pckg/main/environment.yml) already includes *pandas* (more information in the {ref}`installation ` section). Similarly, Linux users will have *pandas* installed in a virtual environment (e.g., `vflussenv`) with *pip* (recall {ref}`pip-installing flusstools `). Otherwise, to install *pandas* in any other *conda* environment, open *Anaconda Prompt* (*Start* > type *Anaconda Prompt*) and type:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```python\n", "conda activate ENVIRONMENT-NAME\n", "conda install pandas\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To pip-install *pandas* in any other virtual environment tap:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```python\n", "pip install pandas\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Usage\n", "\n", "*pandas* standard import alias is `pd`: `import pandas as pd`. The following sections provide an overview of basic *pandas* functions and many more features are documented in the [developer's docs](https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html).\n", "\n", "## Data Frames & Series\n", "The below code block illustrates one way to create a *pandas* data frame (`pd.DataFrame`), one of *pandas* core objects. Note the difference between a 1-dimensional series `pd.Series` (corresponds to a one-column data frame), and an n-dimensional data frame with **row (=index)** and column names. The default row names number rows starting from 0 (unlike Office software that starts at row no. 1), without column names. Column names can be initially defined as a {ref}`list ` and replaced with a {ref}`dictionary ` that maps the initial list entries to new names." 
] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A 1-column pd.DataFrame:\n", "0 3.0\n", "1 4.0\n", "2 NaN\n", "dtype: float64\n", "\n", "This is a workbook-like (row and column names) data frame:\n", " A B C\n", "1 1.551689 -0.425844 -1.120399\n", "2 -0.472708 -0.619897 -1.491136\n", "3 1.909126 0.273118 -2.425986\n", "\n", "Rename column names with dictionary:\n", " Series 1 Series 2 Series 3\n", "1 1.551689 -0.425844 -1.120399\n", "2 -0.472708 -0.619897 -1.491136\n", "3 1.909126 0.273118 -2.425986\n", "\n", "Transpose the data frame:\n", " 1 2 3\n", "A 1.551689 -0.472708 1.909126\n", "B -0.425844 -0.619897 0.273118\n", "C -1.120399 -1.491136 -2.425986\n" ] } ], "source": [ "import pandas as pd\n", "\n", "print(\"A 1-column pd.DataFrame:\\n\"+ str(pd.Series([3, 4, np.nan]))) # a simple pandas data frame with one column\n", "\n", "row_names = np.arange(1, 4, 1)\n", "wb_like_df = pd.DataFrame(np.random.randn(row_names.__len__(), 3), \n", " index=row_names, columns=['A', 'B', 'C'])\n", "print(\"\\nThis is a workbook-like (row and column names) data frame:\\n\" + str(wb_like_df))\n", "print(\"\\nRename column names with dictionary:\\n\" + str(wb_like_df.rename(\n", " columns={'A': 'Series 1', 'B': 'Series 2', 'C': 'Series 3'})))\n", "print(\"\\nTranspose the data frame:\\n\" + str(wb_like_df.T))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A *pandas* `DataFrame` object can also be created from a {ref}`dictionary `, where the dictionary keys define column names and the dictionary values constitute the data of every column:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A dictionary-built data frame:\n", " Flow depth Sediment Flow regime Water\n", "0 0.210366 yes fluvial Always there\n", "1 0.234890 no fluvial Always there\n", "2 0.247299 yes supercritical Always there\n", "3 0.164717 no critical Always there\n", "\n", "Frame data types:\n", "Flow depth float32\n", "Sediment object\n", "Flow regime category\n", "Water object\n", "dtype: object\n" ] } ], "source": [ "df = pd.DataFrame({'Flow depth': pd.Series(np.random.uniform(low=0.1, high=0.3, size=(4,)), dtype='float32'),\n", " 'Sediment': [\"yes\", \"no\", \"yes\", \"no\"],\n", " 'Flow regime': pd.Categorical([\"fluvial\", \"fluvial\", \"supercritical\", \"critical\"]),\n", " 'Water': \"Always there\"})\n", "print(\"A dictionary-built data frame:\\n\" + str(df))\n", "print(\"\\nFrame data types:\\n\" + str(df.dtypes))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Built-in attributes and methods of a *pandas* `DataFrame` enable easy access to the top (head) and the bottom of a data frame and many more object characteristics (recall: use `dir(dict_df)` or [read the developer's docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)):" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Head of the dictionary-based dataframe (first two rows):\n", " Flow depth Sediment Flow regime Water\n", "0 0.210366 yes fluvial Always there\n", "1 0.234890 no fluvial Always there\n", "\n", "End (tail) of the dictionary-based dataframe (last row):\n", " Flow depth Sediment Flow regime Water\n", "3 0.164717 no critical Always there\n" ] } ], "source": [ "print(\"Head of the dictionary-based dataframe (first two rows):\\n\" + 
str(df.head(2)))\n", "print(\"\\nEnd (tail) of the dictionary-based dataframe (last row):\\n\" + str(df.tail(1)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(exp-Froude)=\n", "## Example: Create a `pandas.DataFrame` of Froude Numbers\n", "\n", "In hydraulics, the {term}`Froude number` $Fr$ characterizes the flow regime as *\"fluvial\"* (*Fr<1*), *\"critical\"* (*Fr=1*), or *\"super-critical\"* (*Fr>1*). The precision of measurement devices in physical flume experiments makes the exact determination of the *critical* moment a challenge and forces researchers to apply an interval around 1, rather than the exact value of `1.0`:\n", "\n", "| ***Fr*** | [0.00, 0.95) | [0.95, 1.00) | 1.00 | (1.00, 1.05] | (1.05, inf) |\n", "|----------|--------------|------------------------|----------|------------------------|----------------|\n", "| *Flow* | fluvial | nearby critical (slow) | critical | nearby critical (fast) | super-critical |\n", "\n", "`pd.DataFrame( ... )` objects are a convenient format to classify and store flume experiment data:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " measured flow regime\n", "0 1.462930 super-critical\n", "1 0.430861 fluvial\n", "2 0.998070 critical\n", "3 0.131475 fluvial\n", "4 0.174419 fluvial\n", "5 0.455032 fluvial\n", "6 1.235265 super-critical\n", "7 0.519341 fluvial\n", "8 0.659579 fluvial\n", "9 0.978529 nearby critical (slow)\n" ] } ], "source": [ "Fr_dict = {0.925: \"fluvial\", 0.975: \"nearby critical (slow)\", 1.0: \"critical\", 1.025: \"nearby critical (fast)\", 1.075: \"super-critical\"}\n", "Fr_measured = np.random.uniform(low=0.01, high=2.00, size=(10,))\n", "Fr_classified = [Fr_dict[min(Fr_dict.keys(), key=lambda x:abs(x-m))] for m in Fr_measured]\n", "obs_df = pd.DataFrame({\"measured\": Fr_measured, \"flow regime\": Fr_classified})\n", "print(obs_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Append Data to a `pandas.DataFrame`\n", "The `at`, `loc`, `concat`, and `append` methods of *pandas* provide direct options for inserting rows or columns into a `pd.DataFrame`. However, these built-in methods are approximately one order of magnitude slower than taking the detour via a dictionary. This applies especially to data frames with more than 10,000 elements. This means that the fastest method to insert data into a `pd.DataFrame` is:\n", "\n", "1. Convert an existing `pd.DataFrame` object to a *dictionary* with `pd.DataFrame.to_dict()` (e.g., `dict_of_df = df.to_dict()`).\n", "1. Update the *dictionary* with the new data:\n", " * Append rows with `dict_of_df.update({\"existing-column-name\": {\"new-row-name\": NEW_DATA}})`\n", " * Append columns with `dict_of_df.update({\"new-column-name\": {\"existing-row-names\": NEW_DATA}})`, where `NEW_DATA` needs one entry per existing row\n", "1. Re-convert the *dictionary* to a `pd.DataFrame` with `df = pd.DataFrame.from_dict(dict_of_df)`\n", "\n", "The following code block illustrates both adding a row and a column to an existing *pandas* data frame." 
] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " measured flow regime with sediment\n", "8 0.659579 fluvial False\n", "9 0.978529 nearby critical (slow) True\n", "10 0.996000 nearby critical (slow) True\n" ] } ], "source": [ "import random\n", "\n", "# convert data frame to dictionary\n", "dict_of_obs_df = obs_df.to_dict()\n", "\n", "# append new row\n", "new_row_index = max(dict_of_obs_df[\"measured\"]) + 1\n", "dict_of_obs_df[\"measured\"].update({new_row_index: 0.996})\n", "dict_of_obs_df[\"flow regime\"].update({new_row_index: \"nearby critical (slow)\"})\n", "\n", "# append column\n", "dict_of_obs_df.update({\"with sediment\": {}})\n", "for k in dict_of_obs_df[\"measured\"].keys():\n", " dict_of_obs_df[\"with sediment\"].update({k: bool(random.getrandbits(1))})\n", "\n", "# re-build data frame\n", "obs_df = pd.DataFrame.from_dict(dict_of_obs_df)\n", "print(obs_df.tail(3))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## *NumPy* Arrays and *pandas* Data Frames\n", " \n", "One major difference between a *NumPy* `array` and a *pandas* `DataFrame` is that *NumPy* arrays can only have one single data type (`dtype`), while a *pandas* `DataFrame` can have different data types (one `dtype` per column). This is why a *NumPy* `array` can be seamlessly converted to a *pandas* `DataFrame`, but the opposite conversion can cause high computational cost: *pandas* comes with a built-in function to convert a *pandas* `DataFrame` into a *NumPy* `array`, where numeric variables are maintained where possible. If one column of the *pandas* `DataFrame` is non-numeric, the conversion involves copying the object, which then causes high computational costs. Note that the *index* and *column* labels of a *pandas* `DataFrame` are lost in the conversion from `pd.DataFrame` to `np.ndarray`." 
] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[1.4629304199712003 'super-critical' True]\n", " [0.4308609175144578 'fluvial' True]\n", " [0.9980702542754516 'critical' False]\n", " [0.13147505751224464 'fluvial' False]\n", " [0.1744193402007127 'fluvial' False]\n", " [0.45503162541640574 'fluvial' True]\n", " [1.2352651606844145 'super-critical' True]\n", " [0.5193413669752244 'fluvial' False]\n", " [0.6595793984115736 'fluvial' False]\n", " [0.9785285127361709 'nearby critical (slow)' True]\n", " [0.996 'nearby critical (slow)' True]]\n" ] } ], "source": [ "print(obs_df.to_numpy())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Access Data Frames Entries\n", "\n", "Elements of data frames are accessible by the column and row label (`df.loc[index=row, column-label]`) or number (`df.iloc`):" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Label localization results in: 0.2472994\n", "Same result with integer grid location: 0.2472994\n" ] } ], "source": [ "print(\"Label localization results in: \" + str(df.loc[2, \"Flow depth\"]))\n", "print(\"Same result with integer grid location: \" + str(df.iloc[2, 0]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(pd-reshape)=\n", "## Reshape Data Frames\n", "\n", "Single or multiple rows (indices) and columns can be extracted from and combined into new or existing `DataFrame` objects:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0 1 2 3\n", "Flow depth 0.210366 0.23489 0.247299 0.164717\n", "Sediment yes no yes no\n" ] } ], "source": [ "print(pd.DataFrame([df[\"Flow depth\"], df[\"Sediment\"]]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `df.stack()` method pivots the columns of a data frame, which is a powerful tool to classify data that can take different dimensions (e.g., the volume and weight of 1 m3 water - read more about the [stack method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.stack.html#pandas.DataFrame.stack)). " ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Flow depth 0.210366\n", "Sediment yes\n", "Flow regime fluvial\n", "Water Always there\n", "dtype: object\n" ] }, { "data": { "text/plain": [ "Flow depth 0 0.210366\n", " 1 0.23489\n", " 2 0.247299\n", " 3 0.164717\n", "Sediment 0 yes\n", " 1 no\n", " 2 yes\n", " 3 no\n", "Flow regime 0 fluvial\n", " 1 fluvial\n", " 2 supercritical\n", " 3 critical\n", "Water 0 Always there\n", " 1 Always there\n", " 2 Always there\n", " 3 Always there\n", "dtype: object" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(df.stack()[0])\n", "df.unstack() # unstack data frame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Big datasets often contain large amounts of data with many labels, but we are often only interested in a small subset of the data. 
To this end, data frame subsets can be created with `df.pivot(index, columns, values)` ([Pivot method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pivot.html#pandas.DataFrame.pivot)):" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Pivot table for 'Flow regime':\n", "Flow depth 0.164717 0.210366 0.234890 0.247299\n", "Sediment \n", "no critical NaN fluvial NaN\n", "yes NaN fluvial NaN supercritical\n", "\n", "Pivot table for 'Water':\n", "Flow depth 0.164717 0.210366 0.234890 0.247299\n", "Sediment \n", "no Always there NaN Always there NaN\n", "yes NaN Always there NaN Always there\n" ] } ], "source": [ "print(\"Pivot table for \\'Flow regime\\':\\n\" + str(df.pivot(index=\"Sediment\", columns=\"Flow depth\")[\"Flow regime\"]))\n", "print(\"\\nPivot table for \\'Water\\':\\n\" + str(df.pivot(index=\"Sediment\", columns=\"Flow depth\")[\"Water\"]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In addition, `df.pivot_table(index, columns, values, aggfunc)` ([Pivot table function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pivot_table.html#pandas.DataFrame.pivot_table)) enables inline Office-like function application to one or more rows and/or columns." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "'mean' for 'Flow depth':\n", "Flow regime critical fluvial supercritical\n", "Sediment \n", "no 0.164717 0.234890 NaN\n", "yes NaN 0.210366 0.247299\n" ] } ], "source": [ "print(\"\\'mean\\' for \\'Flow depth\\':\\n\" + str(df.pivot_table(index=\"Sediment\", columns=\"Flow regime\", values=\"Flow depth\", aggfunc=np.mean)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Read more about reshaping and pivoting data frames in the [developer's docs](https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(pd-files)=\n", "## File Handling (*csv*, Workbooks, and More) \n", "\n", "*pandas* can read from and write to many data file types, which makes it extremely powerful for analyzing any data. 
The following table summarizes the most important file types for numerical hydraulic, morphodynamic, and fluvial landscape analyses; more file type handlers are listed in the [developer's docs](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html).\n", "\n", "| File type | *pandas* read | *pandas* write | Usage example |\n", "|---|---|---|---|\n", "| CSV | [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-read-csv-table) | [`to_csv`](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-store-in-csv) | Reading from data loggers (e.g., discharge, flow depth) |\n", "| Google BigQuery | [`read_gbq`](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-bigquery) | [`to_gbq`](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-bigquery) | Analyze social media |\n", "| JSON | [`read_json`](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-json-reader) | [`to_json`](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-json-writer) | Manipulate {ref}`chpt-basement` model files |\n", "| HTML | [`read_html`](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-read-html) | [`to_html`](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-html) | Process web site data |\n", "| [HDF5 Format](https://support.hdfgroup.org/HDF5/doc1.6/UG/08_TheFile.html) | [`read_hdf`](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-hdf5) | [`to_hdf`](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-hdf5) | Analyze {ref}`chpt-basement` or [*HEC-RAS*](https://www.mdpi.com/2073-4441/10/10/1382) output files |\n", "| Python Pickle Format | [`read_pickle`](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-pickle) | [`to_pickle`](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-pickle) | Cache memory dump |\n", "| SQL | [`read_sql`](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-sql) | [`to_sql`](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-sql) | Retrieve and write data to SQL databases |\n", "| Workbooks (Excel / OpenDocument) | [`read_excel`](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-excel-reader) | [`to_excel`](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-excel-writer) | Interface with non-programmers (OpenDocument files can only be read) |\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following code block illustrates how the above-produced *data/modified-data.csv* file can be loaded with custom column names and saved to a workbook with *pandas*. *pandas* uses [openpyxl](https://openpyxl.readthedocs.io) by default, but the engine depends on the workbook file type (e.g., `.ods`, `.xls`, and `.xlsb` files build on other packages - [read more about the `engine` keyword](https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html))."
] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Header of data/modified-data.csv:\n", " Test 1 Test 2 Test 3 Test 4\n", "0 3.0 5.0 4.0 5.0\n", "1 4.0 2.0 1.0 3.0\n", "2 4.0 8.0 3.0 10.0\n" ] } ], "source": [ "measurement_data = pd.read_csv(\"data/modified-data.csv\", sep=\",\", header=None, names=[\"Test 1\", \"Test 2\", \"Test 3\", \"Test 4\"])\n", "print(\"Header of data/modified-data.csv:\\n\" + str(measurement_data.head(3)))\n", "measurement_data.to_excel(\"data/modified-data-wb.xlsx\", sheet_name=\"2025-01-01 Tests\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{figure} ../img/py-pandas-xlsx-out.png\n", ":alt: python pandas excel output\n", ":name: py-pandas-xlsx-out\n", "\n", "The xlsx output file produced with pandas.\n", "```\n", "\n", "```{note}\n", "*pandas* tries to convert all data into `dtype=float`, but as soon as there is only one text variable in a column, the entire column will be saved as *string*-data type in a workbook.\n", "```\n", "\n", "A *pandas* `ExcelWriter` object can be created to write multiple `pd.DataFrame` objects to a workbook, on one or more sheets. Here is an example, where the non-numeric `\"nan\"` strings are replaced in `measurement_data` with `np.nan` to yield a purely numeric data frame in two steps (`# (1)` and `# (2)`):" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "measurement_data = measurement_data.replace(\"nan\", np.nan, regex=True) # (1) replace \"nan\" with np.nan\n", "measurement_data = measurement_data.apply(pd.to_numeric) # (2) convert all data to numeric\n", "\n", "# write workbook with pd ExcelWriter object\n", "with pd.ExcelWriter(\"data/modified-data-wb-EW.xlsx\") as writer:\n", " measurement_data.to_excel(writer, sheet_name=\"2025-01-01 Tests\")\n", " df.to_excel(writer, sheet_name=\"pandas example\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{figure} ../img/py-pandas-xlsx-out2.png\n", ":alt: python pandas excel xlsx with nan\n", ":name: py-pandas-xlsx-out2\n", "\n", "The xlsx file with nan strings.\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Categorical Data\n", "\n", "*string* variables that represent statistically relevant categories are the baseline for data classification and statistics. *pandas* provides the special data type of `dtype=\"category\"` to facilitate statistical analyses.\n", "\n", "In the above {ref}`Froude-number example `, we used five categories to classify the flow regime as a function of the *Froude number*, which can serve as categories. 
Categories are useful, for instance, when no water was flowing or when a sensor broke during an experiment, and we want to classify our measurements to filter for valid tests only:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 fluvial\n", "1 NaN\n", "2 critical\n", "3 nearby critical (slow)\n", "4 NaN\n", "dtype: category\n", "Categories (5, object): ['fluvial', 'nearby critical (slow)', 'critical', 'nearby critical (fast)', 'super-critical']\n" ] } ], "source": [ "flow_regimes = [\"fluvial\", \"nearby critical (slow)\", \"critical\", \"nearby critical (fast)\", \"super-critical\"]\n", "observation_examples = [\"fluvial\", \"dry\", \"critical\", \"nearby critical (slow)\", \"measurement error\"]\n", "# observations that do not match a defined category become NaN\n", "Fr_cat = pd.Categorical(observation_examples, categories=flow_regimes, ordered=False)\n", "print(pd.Series(Fr_cat))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Frame Statistics\n", "\n", "*pandas* has efficient routines to perform workbook-like row or column sorting (e.g., [`df.sort_index()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_index.html) or [`df.sort_values()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html)), and enables the fast calculation of data frame statistics with `df.describe()`, where the 25%, 50%, and 75% rows represent the 25th, 50th (median), and 75th percentiles:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": {
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Test 1Test 2Test 3Test 4
count18.00000016.00000015.00000018.000000
mean4.1111114.2500004.5333335.555556
std2.2980532.7928482.3864702.617188
min1.0000001.0000001.0000001.000000
25%2.2500002.0000003.0000004.000000
50%4.0000004.0000004.0000005.500000
75%5.0000006.0000006.0000007.000000
max9.00000010.0000009.00000010.000000
\n", "
" ], "text/plain": [ " Test 1 Test 2 Test 3 Test 4\n", "count 18.000000 16.000000 15.000000 18.000000\n", "mean 4.111111 4.250000 4.533333 5.555556\n", "std 2.298053 2.792848 2.386470 2.617188\n", "min 1.000000 1.000000 1.000000 1.000000\n", "25% 2.250000 2.000000 3.000000 4.000000\n", "50% 4.000000 4.000000 4.000000 5.500000\n", "75% 5.000000 6.000000 6.000000 7.000000\n", "max 9.000000 10.000000 9.000000 10.000000" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "measurement_data.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Statistical *pandas* data frame methods overlap with *NumPy* methods and include:\n", "\n", "* `df.abs()` calculates asbolute values\n", "* `df.cumprod()` calculates the cumulative product\n", "* `df.cumsum()` calculates the cumulative sum\n", "* `df.count()` counts the number of non-null observations\n", "* `df.max()` calculates the maximum value\n", "* `df.mean()` calculates the mean (average)\n", "* `df.min()` calculates the minimum value\n", "* `df.mode()` calculates the mode\n", "* `df.prod()` calculates the product\n", "* `df.std()` calculates tthe standard deviation\n", "* `df.sum()` calculates the sum\n", "\n" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mean:\n", "Test 1 4.111111\n", "Test 2 4.250000\n", "Test 3 4.533333\n", "Test 4 5.555556\n", "dtype: float64\n", "Median:\n", "Test 1 4.0\n", "Test 2 4.0\n", "Test 3 4.0\n", "Test 4 5.5\n", "dtype: float64\n", "Standard deviation:\n", "Test 1 2.298053\n", "Test 2 2.792848\n", "Test 3 2.386470\n", "Test 4 2.617188\n", "dtype: float64\n" ] } ], "source": [ "print(\"Mean:\\n\" + str(measurement_data.mean()))\n", "print(\"Median:\\n\" + str(measurement_data.median()))\n", "print(\"Standard deviation:\\n\" + str(measurement_data.std()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{tip}\n", "*pandas* has many more built-in functionalities, for example, to plot histograms or any data using the `matplotlib` library, and machine learning.\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Apply Custom (Own) Functions to Data Frames\n", "\n", "*pandas* data frames have a built-in `apply(fun)` method that enables applying a custom function to (parts of) a `pd.DataFrame` object. The following code block borrows from the `feet_to_meter` function from the {ref}`functions ` chapter ([download converter.py](https://raw.githubusercontent.com/hydro-informatics/jupyter-python-course/main/fun/converter.py)). The *pandas* docs provide more information about the [pandas.apply](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html) method." 
] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Feets Meters\n", "0 59 17.9832\n", "1 60 18.2880\n", "2 85 25.9080\n", "3 18 5.4864\n", "4 20 6.0960\n", "5 3 0.9144\n" ] } ], "source": [ "from fun.converter import feet_to_meter\n", "\n", "# create data frame with random integers\n", "df = pd.DataFrame({\"Feet\": np.random.randint(0, 100, size=6),\n", " \"Meters\": np.ones(6) * np.nan})\n", "\n", "# apply feet_to_meter to the Meters columns of the data frame\n", "df[\"Meters\"] = df[\"Feet\"].apply(feet_to_meter)\n", "\n", "print(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Dates and Time\n", "\n", "*pandas* involves methods for calculations and labeling with date and time values through [`pd.Timestamp`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.html), which converts date-time-like strings into timestamps or creates timestamps from keyword arguments:" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2025-01-01 12:00:00\n", "2025-01-01 12:00:00\n", "2025-01-01 12:00:00\n" ] } ], "source": [ "print(pd.Timestamp('2025-01-01T12'))\n", "print(pd.Timestamp(year=2025, month=1, day=1, hour=12))\n", "print(pd.Timestamp(2025, 1, 1, 12))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The expression `pd.Timestamp(2025, 1, 1, 12)` mimics the powerful `datetime.datetime` *API* (Application Programming Interface) of the `datetime` Python library, which provides sophisticated methods for handling time-dependent values. While *pandas*' built-in timestamps are convenient for creating time series within `pd.DataFrame` objects and workbook-like tables, `datetime` is one of the best solutions for time-dependent calculations in Python. `datetime` is available by default (i.e., it must not be *conda* or *pip*-installed) and is efficiently applicable, for example, to data that were collected over several years including leap years. 
The `datetime` package comes with many attributes and methods, which are documented in detail in the [Python docs](https://docs.python.org/3/library/datetime.html).\n", "\n", "The standard usage is:" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Datetime variables can be subtracted:\n", "5 days, 3:45:30\n", "The result is a <class 'datetime.timedelta'> object.\n" ] } ], "source": [ "import datetime as dt\n", "start_date = dt.datetime(2024, 2, 25, 22, 30, 0)\n", "end_date = dt.datetime(year=2024, month=3, day=2, hour=2, minute=15, second=30)\n", "print(\"Datetime variables can be subtracted:\\n\" + str(end_date - start_date))\n", "print(\"The result is a %s object.\" % type(end_date - start_date))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`dt.timedelta` objects can also be defined separately:" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Iterate from start to end date with stepsize=time_diff:\n", "2024-02(Feb)-25, 22:30:00\n", "2024-02(Feb)-26, 21:30:00\n", "2024-02(Feb)-27, 20:30:00\n", "2024-02(Feb)-28, 19:30:00\n", "2024-02(Feb)-29, 18:30:00\n", "2024-03(Mar)-01, 17:30:00\n" ] } ], "source": [ "time_diff = dt.timedelta(days=0, seconds=0, microseconds=0, milliseconds=0, minutes=0, hours=23, weeks=0)\n", "act_time = start_date\n", "print(\"Iterate from start to end date with stepsize=time_diff:\")\n", "while act_time <= end_date:\n", " print(act_time.strftime(\"%Y-%m(%b)-%d, %H:%M:%S\"))\n", " act_time += time_diff" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That is all for the introduction to data and file handling. There is much more to data processing than shown in this chapter, and other chapters of this eBook will occasionally feature more tools." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{admonition} Exercise\n", "Practice *pandas* and its *csv* file handling routines, as well as basic date-time handling in the {ref}`flood return period calculation exercise `.\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Learning Success Check-up\n", "\n", "Take the [learning success test for this Jupyter notebook](https://forms.gle/V2EiNFEiszL38Aov8).\n", "\n", "````{admonition} Unfold QR Code\n", ":class: tip, dropdown\n", "\n", "```{image} ../img/qr-codes/gle-pynum.png\n", "```\n", "````" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 4 }