
Getting started with Python and Jupyter Notebooks

One of the hallmarks of modern society is that sensors, computers and data storage keep getting cheaper, resulting in exponential growth in the amount of data being collected. Making effective use of this wealth of data requires knowing how to work with a variety of data formats, analyze large and diverse datasets, and present results in a way that increases the value of the analysis. Jupyter Notebook is a powerful tool for interactively developing analyses and presenting the results. Notebooks integrate code and code outputs with visualizations, narrative text, mathematical equations and other media into a single sharable document. As part of this class, we will first use Jupyter Notebooks to explore more environmental data than we could with paper and pencil, or even Excel spreadsheets. Then, in a second step, we will take the notebooks further and learn how to use their power to solve design problems.

Getting Started

For this class, you can access Jupyter notebooks either by using Purdue Research Computing resources including the scholar computer cluster, or by installing Python 3 and Jupyter Notebooks on your own system.  Your notebook files are platform independent, so you can run them on either system, and share them with your classmates.  Jupyter notebooks can handle multiple programming languages, but we will be focusing on Python 3.

Starting Jupyter Notebooks on RCAC Computer Clusters

Purdue University Research Computing (rcac.purdue.edu) is a sub-group of ITaP that focuses on facilitating research through building physical infrastructure, providing expert staff, and providing educational opportunities to the Purdue community.  As a registered member of this class, you have been granted access to the scholar cluster for the semester.  This gives you access to Purdue's research computing clusters for the purpose of educating you on how to use these types of computing resources to solve problems.  

  1. Open your favorite browser and direct it to https://notebook.scholar.rcac.purdue.edu/.  If you want to use notebooks on a cluster other than Scholar, then replace "scholar" in the address with the name of the cluster system, for example "brown".
  2. This should take you to the Jupyter Notebook login page shown below.
  3. Enter your Purdue career ID and ",push". Note that this page does require two-factor authentication.
  4. Once you have logged into the system, you will see the Jupyter Notebook "Files" tab.  This will show you the files that are in your RCAC home drive.  This is not the same as your ECN or ITaP home drives, and will probably be pretty empty if you have not worked with RCAC resources previously.
  5. Using Jupyter's "Files" tab you can navigate around your system drives and folders. You should also create a new directory for your notebook files, for example "JupyterNotebooks", by clicking on the "New" button and selecting Folder. Then click on the box to the left of the newly created folder and select "Rename" from the top menu.
  6. On your local computer, you should mount your RCAC home drive so that you can copy files to the cluster systems using your operating system's file manager. You can find instructions on how to mount your RCAC home drive on Windows and macOS systems at https://www.rcac.purdue.edu/knowledge/scholar/storage/transfer/cifs. Note that if you are using a personal computer and you do not log into that computer using your Purdue career ID and password, you will need to "Connect using different credentials" on Windows. If you are off campus, you may also need to connect first to Purdue's VPN using the instructions at https://www.itap.purdue.edu/connections/vpn/.
  7. Once your RCAC home drive is mounted, you should be able to navigate to the directory that you created in Step 5.

Installing Anaconda and Jupyter Notebook on a Local System

These instructions apply to installation on a desktop computer in a Purdue computer lab or on a personal computer.

Jupyter Notebook is included as part of the Anaconda installation package for Python. Anaconda includes the Python programming language as well as over 1,500 Python data science packages. The following steps will help guide you through the process of installing and starting the official Anaconda distribution.

  1. On anaconda.com go to the Products → Anaconda Distribution page and download the Python 3.x version (not the Python 2.x version). Make sure that you pick the version for your current operating system. Python and Anaconda can be installed on Windows, macOS, and Linux.

  2. Go to your download folder and open the file you just downloaded. 

  3. Choose to install the package as "Just Me" for Purdue lab computers (does not require administrator privileges). If installing on your personal computer you can choose to install for "All Users" but will then have to provide the Administrator account information.

    Note

    Installing for "All Users" can result in some permission issues, since the code will be installed in directories restricted to Administrator control, rather than your personal directories. If you are not comfortable with installing and maintaining open source software on your own, then you should stick with the "Just Me" option.

  4. Install to your home drive, typically the disk mounted as "U:" on ECN systems or the "W:" drive on ITaP systems. I suggest "U:\Anaconda" or something similar. Note that if you install on the "C:" drive, the software will only be available when logged into that specific computer, which can be a problem with computers in shared labs.

  5. Once the installation process is complete, you will likely need to make sure that Anaconda Navigator is fully up to date by completing the following tasks:

    1. On a Windows system
      1. Start Anaconda Navigator.
      2. Push the button to Launch "CMD.exe Prompt" or "PowerShell Prompt" (at least one will be present in the dashboard).
      3. From the terminal prompt, type "conda update --all".
      4. The system will think for a little while as it assesses which of your tools need to be updated; when it is done, it will provide a list (which could be quite long) and ask "Proceed ([y]/n)?".
      5. Answer by typing "y" followed by the Enter key.
      6. Anaconda will then proceed to download and install multiple packages. The length of this process depends on the number of packages that need to be updated, but it works best when you leave the computer plugged in and running (the update will resume if the computer is put to sleep, but nothing is installed while it sleeps).
    2. On a macOS system
      1. Open a terminal or iTerm (installed from iterm2.com).
      2. From the terminal prompt, type "conda update --all".
      3. The system will think for a little while as it assesses which of your tools need to be updated; when it is done, it will provide a list (which could be quite long) and ask "Proceed ([y]/n)?".
      4. Answer by typing "y" followed by the Enter key.
      5. Anaconda will then proceed to download and install multiple packages. The length of this process depends on the number of packages that need to be updated, but it works best when you leave the computer plugged in and running (the update will resume if the computer is put to sleep, but nothing is installed while it sleeps).
    3. When the terminal prompt returns and the message indicates that all updates have been completed, close the terminal window and close Anaconda Navigator.
  6. Restart Anaconda Navigator.

  7. You should now be able to find and launch "Jupyter Notebook" from the Navigator dashboard. During the initial start-up process, it will ask you to set the default browser to use with Jupyter Notebooks. Most browsers should work, so I suggest you set it to the browser you use regularly for other activities.

  8. Once Jupyter Notebook has started, you should see a page such as the one shown below, but with files from your local home drive.

  9. Using Jupyter's "Files" tab you can navigate around your system drives and folders. You should also create a new directory for your notebook files, for example "JupyterNotebooks", by clicking on the "New" button and selecting Folder. Then click on the box to the left of the newly created folder and select "Rename" from the top menu.
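
If you want to confirm that the installation worked, open a new notebook (New → Python 3) and run a quick check like the minimal sketch below in a code cell; it should print version numbers without errors. NumPy and pandas are used here only because they ship with Anaconda and appear later in this class.

# Quick sanity check of the installation: run this in a new notebook cell.
import sys
import numpy
import pandas

print(sys.version)          # Python version used by the kernel
print(numpy.__version__)    # NumPy version bundled with Anaconda
print(pandas.__version__)   # pandas version bundled with Anaconda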

Creating Your First Notebook

In this section, we're going to learn to run and save notebooks, familiarize ourselves with their structure, and understand the interface. We'll become familiar with some core terminology that will steer you toward a practical understanding of how to use Jupyter Notebooks by yourself and set us up for the next section, which steps through an example data analysis and brings everything we learn here to life.

Source for this example: https://www.dataquest.io/blog/jupyter-notebook-tutorial/

Running Jupyter

Running Jupyter will open a new tab in your default web browser that should look something like the following screenshot.

[Screenshot: the Jupyter Notebook dashboard (control panel)]

This isn't a notebook just yet, but don't panic! There's not much to it. This is the Notebook Dashboard, specifically designed for managing your Jupyter Notebooks. Think of it as the launch pad for exploring, editing and creating your notebooks.

Be aware that the dashboard will give you access only to the files and sub-folders contained within Jupyter's start-up directory; however, the start-up directory can be changed. It is also possible to start the dashboard on any system via the command prompt (or terminal on Unix systems) by entering the command jupyter notebook; in this case, the current working directory will be the start-up directory.

If you are running Jupyter from a local installation, you may notice that the URL for the dashboard is something like http://localhost:8888/tree. Localhost is not a website, but indicates that the content is being served from your local machine: your own computer. Jupyter's Notebooks and dashboard are web apps, and Jupyter starts up a local Python server to serve these apps to your web browser, making it essentially platform independent and opening the door to easier sharing on the web. Note that if you are running on localhost, then your files and notebooks will only be available on the current machine. If you are running on notebook.scholar.rcac.purdue.edu, then your files and notebooks are available from anywhere that you can log into Purdue's network.
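
If you are curious about this local server, a minimal sketch using the classic notebook package lists the running servers and the directories they serve. Treat the module location as an assumption about your installed version; newer Jupyter Server releases organize this differently.

# List the locally running notebook servers (classic Jupyter Notebook package).
from notebook import notebookapp

for server in notebookapp.list_running_servers():
    print(server['url'], '->', server['notebook_dir'])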

The dashboard's interface is mostly self-explanatory --- though we will come back to it briefly later. So what are we waiting for? Browse to the folder in which you would like to create your first notebook, click the "New" drop-down button in the top-right and select "Python 3".

[Screenshot: the "New" notebook drop-down menu]

Hey presto, here we are! Your first Jupyter Notebook will open in a new tab --- each notebook uses its own tab because you can open multiple notebooks simultaneously. If you switch back to the dashboard, you will see the new file Untitled.ipynb, and you should see some green text that tells you your notebook is running.

What is an ipynb File?

It will be useful to understand what this file really is. Each .ipynb file is a text file that describes the contents of your notebook in a format called JSON. Each cell and its contents, including image attachments that have been converted into strings of text, is listed therein along with some metadata. You can edit this yourself --- if you know what you are doing! --- by selecting "Edit > Edit Notebook Metadata" from the menu bar in the notebook.
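
Because the file is just JSON, you can also peek inside it from Python itself. Here is a minimal sketch, assuming a notebook saved as Untitled.ipynb in the current directory:

# Inspect a notebook file as plain JSON using only the standard library.
import json

with open('Untitled.ipynb') as f:
    nb = json.load(f)

print(nb.keys())   # typically: cells, metadata, nbformat, nbformat_minor
for cell in nb['cells']:
    # each cell records its type and its source text (code cells also store outputs)
    print(cell['cell_type'], repr(''.join(cell['source']))[:60])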

You can also view the contents of your notebook files by selecting "Edit" from the controls on the dashboard, but the keyword here is "can"; there's no reason other than curiosity to do so unless you really know what you are doing.

The Notebook Interface

Now that you have an open notebook in front of you, its interface will hopefully not look entirely alien; after all, Jupyter is essentially just an advanced word processor. Why not take a look around? Check out the menus to get a feel for it; in particular, take a few moments to scroll down the list of commands in the command palette, which is the small button with the keyboard icon (or Ctrl + Shift + P).

[Screenshot: a new, empty Jupyter Notebook]

There are two fairly prominent terms that you should notice, which are probably new to you: cells and kernels are key both to understanding Jupyter and to what makes it more than just a word processor. Fortunately, these concepts are not difficult to understand.

  • A kernel is a "computational engine" that executes the code contained in a notebook document.
  • A cell is a container for text to be displayed in the notebook or code to be executed by the notebook's kernel.

Cells

We'll return to kernels a little later, but first let's come to grips with cells. Cells form the body of a notebook. In the screenshot of a new notebook in the section above, that box with the green outline is an empty cell. There are two main cell types that we will cover:

  • A code cell contains code to be executed in the kernel and displays its output below.
  • A Markdown cell contains text formatted using Markdown and displays its output in-place when it is run.

The first cell in a new notebook is always a code cell. Let's test it out with a classic hello world example. Type print('Hello World!') into the cell and click the Run button in the toolbar above, or press Ctrl + Enter. The result should look like this:

print('Hello World!')

Hello World!

When you ran the cell, its output will have been displayed below and the label to its left will have changed from In [ ] to In [1]. The output of a code cell also forms part of the document, which is why you can see it in this article. You can always tell the difference between code and Markdown cells because code cells have that label on the left and Markdown cells do not.

The "In" part of the label is simply short for "Input," while the label number indicates when the cell was executed on the kernel --- in this case the cell was executed first. Run the cell again and the label will change to In [2] because now the cell was the second to be run on the kernel. It will become clearer why this is so useful later on when we take a closer look at kernels.

From the menu bar, click Insert and select Insert Cell Below to create a new code cell underneath your first and try out the following code to see what happens. Do you notice anything different?

import time
time.sleep(3)

This cell doesn't produce any output, but it does take three seconds to execute. Notice how Jupyter signifies that the cell is currently running by changing its label to In [*].

In general, the output of a cell comes from any text data specifically printed during the cell's execution, as well as the value of the last line in the cell, be it a lone variable, a function call, or something else. For example:

def say_hello(recipient):
    return 'Hello, {}!'.format(recipient)
say_hello('Tim')

'Hello, Tim!'

You'll find yourself using this almost constantly in your own projects, and we'll see more of it later on.

Keyboard Shortcuts

One final thing you may have observed when running your cells is that their border turned blue, whereas it was green while you were editing. There is always one "active" cell highlighted with a border whose color denotes its current mode, where green means "edit mode" and blue is "command mode."

So far we have seen how to run a cell with Ctrl + Enter, but there are plenty more shortcuts available. Keyboard shortcuts are a very popular aspect of the Jupyter environment because they facilitate a speedy cell-based workflow. Many of these are actions you can carry out on the active cell when it's in command mode.

Below, you'll find a list of some of Jupyter's keyboard shortcuts. You're not expected to pick them up immediately, but the list should give you a good idea of what's possible.

  • Toggle between edit and command mode with Esc and Enter, respectively.
  • Once in command mode:
    • Scroll up and down your cells with your Up and Down keys.
    • Press A or B to insert a new cell above or below the active cell.
    • M will transform the active cell to a Markdown cell.
    • Y will set the active cell to a code cell.
    • D + D (D twice) will delete the active cell.
    • Z will undo cell deletion.
    • Hold Shift and press Up or Down to select multiple cells at once.
      • With multiple cells selected, Shift + M will merge your selection.
  • Ctrl + Shift + -, in edit mode, will split the active cell at the cursor.
  • You can also click and Shift + Click in the margin to the left of your cells to select them.

Go ahead and try these out in your own notebook. Once you've had a play, create a new Markdown cell and we'll learn how to format the text in our notebooks.

Markdown

Markdown is a lightweight, easy to learn markup language for formatting plain text. Its syntax has a one-to-one correspondence with HTML tags, so some prior knowledge here would be helpful but is definitely not a prerequisite. Let's cover the basics with a quick example.

# This is a level 1 heading

## This is a level 2 heading

This is some plain text that forms a paragraph.

Add emphasis via **bold** and __bold__, or *italic* and _italic_.

Paragraphs must be separated by an empty line.

* Sometimes we want to include lists.

  * Which can be indented.

1. Lists can also be numbered.

2. For ordered lists.

7. The number at the start of the line does not matter, it will be rendered in order.

[It is possible to include hyperlinks](https://www.example.com), where [TEXT](LINK).

Inline code uses single backticks: `foo()`, and code blocks use triple backticks:

```
This is a block of code
```

Or can be indented by 4 spaces:

    foo()

And finally, adding images is easy: ![Alt text](https://www.example.com/image.jpg)

When attaching images, you have three options:

  1. Use a URL to an image on the web.

  2. Use a local URL to an image that you will be keeping alongside your notebook, such as in the same git repo.

  3. Add the image as an attachment via "Edit > Insert Image"; this will convert the image into a string and store it inside your notebook .ipynb file.

Note that the last option will make your .ipynb file much larger!

There is plenty more detail to Markdown, especially around hyperlinking, and it's also possible to simply include plain HTML. Once you find yourself pushing the limits of the basics above, you can refer to the official guide from the creator, John Gruber, on his website.
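
Markdown can also be generated from a code cell using IPython's display tools, which is handy when you want formatted text built from computed values. A minimal sketch (the variable name here is made up for illustration):

# Render Markdown from a code cell with IPython's display utilities.
from IPython.display import Markdown, display

return_period = 50  # hypothetical value computed earlier in the notebook
display(Markdown(f"The **{return_period}-year** storm estimate is reported below."))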

Kernels

Behind every notebook runs a kernel. When you run a code cell, that code is executed within the kernel and any output is returned back to the cell to be displayed. The kernel's state persists over time and between cells --- it pertains to the document as a whole and not individual cells.

For example, if you import libraries or declare variables in one cell, they will be available in another. In this way, you can think of a notebook document as being somewhat comparable to a script file, except that it is multimedia. Let's try this out to get a feel for it. First, we'll import a Python package and define a function.

import numpy as np
def square(x):
    return x * x

Once we've executed the cell above, we can reference np and square in any other cell.

x = np.random.randint(1, 10)
y = square(x)
print('%d squared is %d' % (x, y))

1 squared is 1

This will work regardless of the order of the cells in your notebook. You can try it yourself; let's print out our variables again.

print('Is %d squared %d?' % (x, y))

Is 1 squared 1?

No surprises here! But now let's change y.

y = 10

What do you think will happen if we run the cell containing our print statement again? We will get the output Is 1 squared 10?!

Most of the time, the flow in your notebook will be top-to-bottom, but it's common to go back to make changes. In this case, the order of execution stated to the left of each cell, such as In [6], will let you know whether any of your cells have stale output. And if you ever wish to reset things, there are several incredibly useful options from the Kernel menu:

  • Restart: restarts the kernel, thus clearing all the variables etc that were defined.
  • Restart & Clear Output: same as above but will also wipe the output displayed below your code cells.
  • Restart & Run All: same as above but will also run all your cells in order from first to last.

If your kernel is ever stuck on a computation and you wish to stop it, you can choose the Interrupt option.

Choosing a Kernel

You may have noticed that Jupyter gives you the option to change kernel, and in fact there are many different options to choose from. Back when you created a new notebook from the dashboard by selecting a Python version, you were actually choosing which kernel to use.

Not only are there kernels for different versions of Python, but also for over 100 languages including Java, C, and even Fortran. Data scientists may be particularly interested in the kernels for R and Julia, as well as both imatlab and the Calysto MATLAB Kernel for Matlab. The SoS kernel provides multi-language support within a single notebook. Each kernel has its own installation instructions, which will likely require you to run some commands on your computer.
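
If you want to see which kernels are already installed on your system, one option (assuming the jupyter_client package that ships with Anaconda's Jupyter installation) is:

# List the kernel specs installed on this system.
from jupyter_client.kernelspec import KernelSpecManager

# Maps kernel names (e.g. "python3") to the directories holding their kernel.json files.
for name, path in KernelSpecManager().find_kernel_specs().items():
    print(name, '->', path)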

A First Demonstration of Jupyter Notebooks

This section assumes that you have started Jupyter Notebooks either on an RCAC cluster system or after successfully installing Anaconda on your own system, both topics covered in previous sections of this Wiki page. For this section, you will work with an existing Jupyter notebook file that solves a homework problem from Soil and Water Conservation Engineering Chapter 2 to estimate the 5, 50 and 100 year storm events for West Lafayette, IN. You will then modify the notebook to work with a new data file.

Opening an existing Jupyter notebook

  1. Download the attached Jupyter notebook file (Homework_2018_2-5.ipynb) and an ASCII text data file (IN_12-9427_ams.txt), and save them into the folder that you just created.
    1. You can do this by saving both files directly to that folder, or
    2. By saving both files to the local computer and using the "Upload" button in the upper right of the Jupyter Files tab to select and upload each file to the correct directory.
  2. Your Jupyter notebook Files tab should now look something like this:
  3. From the Jupyter notebook Files tab, click on the Homework_2018_2-5.ipynb. This will open the notebook file inside a new browser tab, like this:
  4. This notebook page is a mixture of rich text, including equations, and Python code. It is designed to open the annual maximum data file for the West Lafayette, Indiana station used for the Chapter 2 homework. For that assignment, you probably downloaded the file, worked with it in Excel, and provided answers in a mixture of Excel printouts and handwritten assignment pages. Notebook software such as Jupyter has been developed over many years by people working to take what they like about the paper notebooks long used to record scientific discoveries and meld it with the powerful graphics and programming environments now available to any computer user.
  5. If you scroll down through this notebook, you will see the problem statement from the homework assignment, equations from the textbook, and figures and tables representing the work required to solve this problem.
  6. You may also find that some of the text formatting is not correct, or that numbers are "undefined" or otherwise incorrect. This is because you imported the notebook file, and have started a Python kernel to run the file, but you have not actually run the notebook in its current location so the results have not been updated.
  7. To run the entire notebook, go to Cell → Run All from the menu at the top of the notebook page.
  8. When Markdown cells are run, the formatting code is applied and the resulting rich text is displayed. When Code cells are run, the label in the code prompt (e.g., In [ ]) changes to "*" while the code runs, then to a number when the code has finished. If the code outputs something, then the old output will disappear when the code starts running and be replaced with new output when the code is done.
  9. When the full notebook has been run, read through it. This notebook solves problems from the Week 2 homework, where a statistical distribution was fit to precipitation extremes and then used to estimate the 5, 50 and 100-year return events (a minimal code sketch of this kind of frequency analysis follows this list). Check the solutions from this document against your own homework.
  10. Note that there are several places in the Jupyter notebook file where data is printed to the screen, or visualized in figures. Looking at your data is important, and something that notebooks make easier than even an Excel spreadsheet.
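
To give a feel for the kind of calculation the notebook performs, here is a minimal, hypothetical sketch of a return-period estimate using scipy. The homework notebook may fit a different distribution or use different variable names; annual_max below is a made-up stand-in for the annual maximum precipitation series read from the data file.

# Hypothetical return-period sketch; the actual notebook may differ.
import numpy as np
from scipy import stats

# Stand-in for the annual maximum series; the notebook reads these values from the data file.
annual_max = np.array([2.1, 3.4, 2.8, 4.0, 3.1, 2.6, 3.9, 2.4])

loc, scale = stats.gumbel_r.fit(annual_max)   # fit a Gumbel (extreme value type I) distribution
for T in (5, 50, 100):                        # return periods in years
    p = 1 - 1 / T                             # non-exceedance probability of the T-year event
    print(T, 'year event:', round(stats.gumbel_r.ppf(p, loc, scale), 2))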

Modifying an existing Jupyter notebook to process new data

One great thing about Jupyter notebooks is that they can be reused to complete similar analyses on different data sets. To demonstrate that, let's modify the notebook from the previous section to process precipitation data from a different location.

  1. From the tab where you have Homework_2018_2-5.ipynb open, click on File → Make a Copy.

  2. This will open a new tab, named something like "Homework_2018_2-5-Copy1".

  3. Double click on the document name to the right of the Jupyter logo (see circled name below). This will open a new window where you can edit the document name. Rename the document to "Homework_2018_2-5-Bloomington". Click on the Save icon.

  4. Now scroll down under the Solution heading in the document, and use the embedded link to go to the Precipitation Frequency Data Server. Click on Indiana, and then use the pull down menu to select the station "BLOOMINGTON (12-0784)". Scroll down, and click on the "Supplementary Information" tab. Scroll down to the VII. Time series data heading. Right click on the link next to the annual maximum series file, and choose "Save link as ..." (or the equivalent, depending on which browser you are using). Save the file to your Downloads folder - it should appear as IN_12-0784_ams.txt.

  5. Now return to the Jupyter Files tab - it should still be open in a different browser tab. Use the Upload button to import the Bloomington precipitation annual maximum series.

  6. Now return to the Bloomington copy of the notebook file. Scroll down to where the MaxTimeSeriesFile variable is defined, and change the original entry to match the new file you just downloaded.

  7. Once the change is saved, select Kernel → Restart & Run All from the Jupyter menu. This will restart the kernel, clearing anything left in memory from the original run, and then rerun the notebook using data from the Bloomington, Indiana file.

  8. What did you get for the 5, 50 and 100 year storm precipitation totals in Bloomington? Does it look something like this?

  9. Scroll back through the notebook. Can you work out why your calculations are resulting in NaN (Not a Number) values?

  10. This is where it helps to include visualizations of the data in the notebook. Looking at the two plots of the data, it is quite clear that one of the precipitation values is a negative number:

  11. Go back to the Files tab, and click on the file you just downloaded to open it in yet another browser tab. Can you find the problem value in the data file? While we could edit the file to remove the offending value, we can also handle it within the notebook, which will result in more robust code for future applications.

  12. Scroll through the notebook until you find the section on Importing the Data. At the end of this section, there is a Code cell with the statement MaxTimeSeriesDF.head(). This outputs the first 5 rows of the dataframe named MaxTimeSeriesDF. Edit the line by clicking on the cell and changing "head" to "describe". Then run the cell, which will change the output from the first five lines to a table of summary statistics. You can see from the table that the minimum precipitation value is -9.99. This is not a valid precipitation measurement (what is another name for negative precipitation?), so we should modify the notebook code to remove it from our dataset prior to analysis.

  13. Add a new cell above the one you just edited. The cell type should be set for Code. In the cell enter the following:

    MaxTimeSeriesDF = MaxTimeSeriesDF.loc[MaxTimeSeriesDF.Prec >= 0] # remove rows with negative precipitation
    NumObs = len(MaxTimeSeriesDF['Prec'])
    

    The first line uses the .loc[] method defined for pandas dataframes to access a subset of the full dataframe. Here the subset is defined as the rows where the precipitation value is greater than or equal to zero (MaxTimeSeriesDF.Prec >= 0), and only those rows are written back into the MaxTimeSeriesDF variable. The effect is that the row with missing precipitation data is removed. The second line resets the NumObs variable set earlier in the notebook when the data file was first read. If NumObs is not reset, the code will crash later when it finds that the number of precipitation values differs from NumObs. (Another common way to handle the missing-value code is sketched at the end of this page.)

  14. Run the new cell and the one that describes the contents of the dataframe, and you will find that the -9.99 precipitation value has been removed from the data table. Also, the number of rows (count) has gone from 105 to 104 for both Prec and Year.

  15. Finish running the notebook. You should now have values for the 5, 50 and 100 year storm precipitation totals. How do your values compare with those from West Lafayette?
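
As an aside, another common way to handle a missing-value code such as -9.99 is to convert it to NaN and drop the affected rows. Here is a minimal sketch with a small stand-in dataframe; in the notebook, MaxTimeSeriesDF is read from the annual maximum series text file.

import numpy as np
import pandas as pd

# Small stand-in dataframe; in the notebook, MaxTimeSeriesDF comes from the data file.
MaxTimeSeriesDF = pd.DataFrame({'Year': [2000, 2001, 2002],
                                'Prec': [2.31, -9.99, 3.05]})

# Treat the sentinel value -9.99 as missing, then drop those rows.
MaxTimeSeriesDF = MaxTimeSeriesDF.replace(-9.99, np.nan).dropna(subset=['Prec'])
NumObs = len(MaxTimeSeriesDF['Prec'])  # reset the observation count, as in step 13
print(MaxTimeSeriesDF)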