Python Resources
This page provides links and information useful for learning Python and using Python for graphical and data analysis.
Reference and Training Materials
Python Books
There are far too many books about Python to list here, many of them very good but directed toward different learning styles and levels. This list includes books that I use, or that students in my class have found particularly useful.
- Downey, A., Think Python: How to Think Like a Computer Scientist. This is my preferred book for teaching Python and programming. It is available from Green Tea Press. Be sure to get the 2nd edition, which is for Python 3.x.
- Lutz, M., Python Pocket Reference. A good reference source, though it does not provide much in the way of instruction. Available for download from IT eBooks.
- Beazley, D. M., Python Essential Reference. A good comprehensive reference book with some training materials. Useful if you need a hard copy of Python documentation instead of relying on web pages.
On-Line Reference Materials and Tutorials
Python can be installed on almost all platforms, and there are plenty of guides available on-line to help. I have provided some guidance on preferred methods to install Python in the document Setting up Computers Outside of Class.
Of particular concern is making sure to get the modules designed specifically for scientific data handling and analysis. Many of these are now included in the standard Python package downloaded from python.org. Others, however, may require that you install them yourself. This is where the method of installation makes a big difference, since it can be a real pain to find, download and install each module and its supporting software - not to mention keeping them updated once installed. This is what package managers were developed to do.
Linux systems are typically installed with a package manager as the operating system is really a collection of tools. The package manager is therefore another tool used for upgrading installed packages and adding new ones. RCAC now uses the module system to manage the many different "standard" software installations they need to make available in support of the Purdue community. The setup.exe script that installs Cygwin, the collection of Linux tools for Windows, also serves as a useful package manager, but only for Cygwin specific packages. On Mac OS X systems, I have found MacPorts to be very useful for installing and maintaining software.
To get started directly on ITaP Research Computing's cluster systems, follow these instructions:
- Getting started with Python on Purdue Cluster Computing, or
- Getting started with Python and Jupyter Notebooks
Basic Tutorials
- An introductory tutorial - this tutorial webpage includes an embedded Python interpreter, so will work even if you have not installed Python yet.
- An on-line Python Tutorial at TutorialsPoint - this one is pretty complete, covering a lot of the language features.
- Google's Python Class - an on-line course for learning Python, includes video lectures and practice material
Formatting Strings
- Tutorial Point - This is a very in-depth presentation of strings and all of their functionality in Python.
- These links are much more focused on formatting Python strings using the .format() method:
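As a quick sketch of the .format() method discussed in the links above (the variable names and values here are invented for illustration):

```python
# Hypothetical values, for illustration only
station = "WLAF"
temp_c = 23.456

# Positional fields with a format spec: {1:.1f} rounds to one decimal place
msg = "Station {0}: {1:.1f} C".format(station, temp_c)

# Named fields work too, and are easier to read in long templates
msg_named = "Station {name}: {temp:.1f} C".format(name=station, temp=temp_c)
```

Both expressions produce the same string, so named fields cost nothing extra and make templates self-documenting.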
Working with Files
- Introduction to working with files in Python
- Using the CSV module to read ASCII text files
- Tutorial: How to Easily Read Files in Python (Text, CSV, JSON)
- Using NumPy tools to read ASCII files
- Pandas IO tools (text, CSV, HDF5, ...)
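A minimal sketch of reading ASCII text with the standard csv module; an in-memory string stands in for a file on disk, and the column names are invented for illustration:

```python
import csv
import io

# A small ASCII table; io.StringIO stands in for an open text file
text = "date,precip_mm\n2020-01-01,5.2\n2020-01-02,0.0\n"

# DictReader keys each row by the header line
rows = list(csv.DictReader(io.StringIO(text)))

# Values come back as strings, so convert before doing arithmetic
total_precip = sum(float(row["precip_mm"]) for row in rows)
```

For a real file, replace io.StringIO(text) with open("myfile.csv", newline="").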
Python for Data Science
Python is a general-purpose programming language, but it is customizable through the import of specialized modules. Anaconda is a distribution of the Python and R programming languages for scientific computing that aims to simplify package management and deployment. For data science applications, you will want to make use of Anaconda, either on the Purdue RCAC cluster computers or through installation on your personal computer. Refer to the Getting Started page for information on installing or accessing Python and Anaconda.
- The Python Data Science Handbook is an excellent place to get started with Python and the concepts of data science using Python and Jupyter notebooks. The entire book and associated Jupyter notebook pages can be downloaded from this link.
- NumPy is the default numerical methods package for Python. It introduces array objects and tools for efficient calculation and manipulation of large data sets and matrices (linear algebra). It is the base package for many other tools (e.g., SciPy and pandas).
- Here is a link to the on-line documentation.
- Here is a link to the official NumPy user manual.
- An introductory tutorial for NumPy from W3 Schools
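A minimal sketch of the array object and the vectorized calculations NumPy provides:

```python
import numpy as np

# Arrays support elementwise arithmetic with no explicit Python loops
a = np.arange(5)          # 0, 1, 2, 3, 4
b = 2.0 * a + 1.0         # 1., 3., 5., 7., 9.

# Reductions summarize large data sets efficiently
b_mean = b.mean()         # 5.0
b_max = b.max()           # 9.0
```

The same expressions work unchanged whether the array holds five values or five million, which is the main reason NumPy underlies most of the scientific Python stack.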
- SciPy is a collection of fundamental algorithms for scientific computing in Python.
- Pandas is a fast, powerful, flexible, and easy-to-use open source data analysis and manipulation tool. Data frames (and the one-dimensional Series) are fast and efficient objects for data manipulation, similar to NumPy arrays but with the native ability to work with mixed data types.
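A minimal sketch of a DataFrame holding mixed data types in one object; the column names and values are invented for illustration:

```python
import pandas as pd

# Each column can have its own dtype: strings, floats, booleans
df = pd.DataFrame({
    "site": ["A", "B", "C"],
    "flow_cms": [1.2, 3.4, 2.2],
    "gauged": [True, False, True],
})

# Boolean filtering and column selection in one expression
gauged_mean = df.loc[df["gauged"], "flow_cms"].mean()
```

A NumPy array would force all three columns to a single common type; the DataFrame keeps each column's type intact while still supporting fast vectorized operations.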
Time Series Analysis
- Working with datetime data in pandas - currently one of the best methods for working with datetime data in Python. You can develop single-line commands for converting daily data to monthly or annual values, and a lot more.
- Working with datetime data in numpy
- Working with datetime in Python - covers fundamentals, but really if you have a choice use pandas.
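The single-line daily-to-monthly conversion mentioned above can be sketched with the pandas resample() method; the data here are synthetic:

```python
import pandas as pd

# Synthetic daily series: the value 1.0 on every day of Jan-Feb 2021
idx = pd.date_range("2021-01-01", "2021-02-28", freq="D")
daily = pd.Series(1.0, index=idx)

# One line converts daily data to monthly totals ("MS" labels by month start)
monthly = daily.resample("MS").sum()
```

Swapping .sum() for .mean(), .max(), or a custom aggregation handles most daily-to-monthly or daily-to-annual conversions in the same one-line pattern.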
Graphical Tools
- Using matplotlib to create MATLAB-like 2D and 3D plots
- Matplotlib tutorials is the source for all officially supported tutorials and documentation.
- Gallery - images of many types of figures generated using matplotlib, and the source code used to make them.
- The mpl_toolkits module provides extensions to matplotlib; of particular interest is the Basemap module.
- Link to the mpl_toolkits main page.
- The mpl_toolkits.Basemap tutorial - includes information on adding a basemap (political boundaries and coastlines) to plots, including shapefiles on figures, etc.
- Generic Mapping Tools (GMT) interfaces
- The PyGMT interface for the Generic Mapping Tools (GMT), for all publication-quality graphing needs.
- NCAR Command Language (NCL) interfaces
- PyNGL is a Python interface to the high quality 2D scientific visualizations in the NCAR Command Language (NCL).
- Making better figures
Geospatial Tools
- Working with Python in ArcGIS
- PyGMT interface for the Generic Mapping Tools
- PyQGIS scripting language for QGIS (Quantum GIS)
- The mpl_toolkits.Basemap tutorial - includes information on adding a basemap (political boundaries and coastlines) to plots, including shapefiles on figures, etc.
Debugging Python Programs
- The pdb module is the default debugger, works a bit like the C-program debugger dbx.
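A sketch of how pdb is typically used; the function below is hypothetical, and the set_trace() call is commented out so the code runs non-interactively:

```python
import pdb  # standard library debugger

def running_mean(values):
    total = 0.0
    for v in values:
        # Uncomment the next line to drop into the debugger here,
        # then use commands like p total (print), n (next), c (continue):
        # pdb.set_trace()
        total += v
    return total / len(values)

result = running_mean([1.0, 2.0, 3.0])

# An entire script can also be run under the debugger from the shell:
#   python -m pdb myscript.py
```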
Installing Specialized Modules using the Conda Environment
While Anaconda automatically installs many of the Python modules needed for the analysis of scientific data, you may still find that there are modules you want to use that are not included. These may be very specialized, not publicly available, too new, too old, or still too developmental for general release. The conda environment makes creating customized Python environments for your work fairly easy to do, and something that does not require administrator privileges.
The methods presented here should work with any Anaconda installation, but they require access to a command line terminal. For RCAC cluster computers, log in and load the module for the base Anaconda version you want to modify. For Anaconda Navigator, start the "Qt Console" from the dashboard. You can also create custom environments from the "Environments" tab, but there you pick from the packages that Anaconda knows about and are less able to install non-standard packages.
For experienced users, refer to the Conda cheat-sheet for help on specific commands.
Create a new conda environment
- Use the command:
conda create --name <environment name> <list of modules...>
- Here <environment name> is replaced with the desired name of the environment. Keep it short and memorable, and consider including the version of Python used to create the environment in the name. Also avoid spaces and most other special characters.
- For example, I use "py311-data-analysis" for my Python 3.11 data analysis environment. When I recently created a new environment with the same modules but for Python 3.7, the new name "py37-data-analysis" fits right in and makes it clear which version of Python I am using.
- I also have an environment called "py311-pyDEM", which again tells me the version of Python used but also that it was specific for running the pyDEM module, a very large and specialized module that I do not want included in my standard data analysis environment.
- The <list of modules...> should be specific. If you do not include a list, then conda rebuilds all of the current version of Anaconda. This is disk-space intensive, and should generally be avoided when working on the cluster resources, where your home drive space is limited. Instead, focus your selection on the big packages that you know you will use.
- For my data-analysis environments this list typically includes: numpy, scipy, pandas, matplotlib, and seaborn. Note that packages will check for their own requirements, and those will be installed as part of the process. So, for example, pandas requires numpy, so simply saying I want pandas will also get me numpy. However, by explicitly indicating what packages I really want, I do not have to trust my memory that I will get all of the packages I want.
- If you think that you will or may need the ability to add or update packages later, then you may want to install the Python package manager (the "pip" module) with this conda creation statement. If pip is not installed, you cannot install packages administered through pip and not through conda. The pip module can be installed later using conda install (see next section), but if you know you are going to use it, now is a good time to install it.
# this command creates my default data analysis environment
conda create --name py311-data-analysis numpy scipy pandas matplotlib seaborn
# this command adds pip so that I can add or update packages in the environment later
conda create --name py311-data-analysis numpy scipy pandas matplotlib seaborn pip
- Once the command is entered, conda will evaluate what packages are required for installation and ask for your approval. Once you have approved the package selection, conda will begin downloading, building and installing the packages into the environment directory. Once this process is done, you will be able to activate the environment at any time.
Installing a new package into an existing conda environment
If you discover that you need a package not installed within an existing conda environment, there are two methods available to install new packages.
-
You can use the conda install method to install packages that are available with the currently installed conda package. On the cluster systems, you need to load the anaconda version that matches the base Python version (e.g., py310 or py311) of the environment you want to modify. Then you can use the command:
# this is the general format of the conda install command to add a package to an existing environment
conda install --name <environment name> <package name> ... <additional package names>
# this command adds pip to my default data analysis environment
conda install --name py311-data-analysis pip
The command will evaluate your currently installed environment, indicate which packages need to be updated, and show what must be installed along with the requested package to make it work. Agree to install the list of packages, and conda will run until installation is complete.
-
If you are trying to install a non-standard package or one that is otherwise not available through conda install, you will need to use the Python package manager (pip). Complete pip instructions are available at https://packaging.python.org/tutorials/installing-packages/.
Note
For clarification: pip installs Python packages within any Python installation; conda installs any package within conda environments. They appear to have the same functionality only because they both will install Python packages inside a conda environment.
The conda program can also install R, Orange, and other software distributions inside of a conda environment.
The pip package manager can install Python modules in any environment and can access Python modules that have been published on the Python Package Index (PyPI), but not released to conda. Both pip and PyPI are governed and supported by the Python Packaging Authority (PyPA).
Anaconda and conda are maintained by Anaconda, PyData and others forming an international community of users and developers of data analysis tools based on Python, R and other packages.
Activating a conda environment
- The following commands will activate an existing conda environment. This will change the command line prompt to reflect the new environment.
# this is the standard activation command
conda activate <environment name>
# this is suggested by conda when you finish creating a new environment, and also appears to work
source activate <environment name>
# this activates my environment called "py311-data-analysis"
conda activate py311-data-analysis
If you are trying to run conda on an RCAC cluster system, you may get a message similar to "CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'." This will not appear immediately after you create a new environment, but will appear when you log out and back in, or switch to a new terminal window. The error is an artifact of the module load system used by Research Computing. The error message will include a suggested command similar to:
$ echo ". /<module path>/conda.sh" >> ~/.bashrc
If you type in this command, it will add to or edit the Bash properties file .bashrc for your account. DO NOT do this! Each version of anaconda/python maintained by Research Computing through the module system has a different path to its conda.sh script, so hard-coding one version's path in .bashrc will cause problems with the others.
Therefore, to get conda to work in your current terminal, type the command given to you in quotes by conda at the command prompt. Specifically, type ". /<module path>/conda.sh" (without the surrounding quotes) at the prompt, substituting the module path shown in the error message.
If you are doing this a lot, you might consider creating a Bash alias to run the setup command before activating your environment. For example, check out this page: https://linuxize.com/post/how-to-create-bash-aliases/.
Deactivating a conda environment
- The following commands will deactivate the current conda environment. This will return the command line prompt to its original state.
# this is the standard deactivation command
conda deactivate
# this older form also appears to work
source deactivate
- This will return you to your original Bash environment with whatever version of Anaconda you have loaded.
Listing available conda environments
- One more command that I find useful will list the currently available conda environments:
# this command lists available conda environments
conda info --envs
- This will provide a list of all of the conda environments available to you (that is, those that you have created).
- These are all installed in your home directory under .conda/envs, and are available from any of the RCAC computer clusters (since they all share your home drive).
Note
These environments are also visible to you if you run Jupyter notebooks on the clusters (e.g., notebook.scholar.rcac.purdue.edu).
- All environments will appear in the pull-down menu for New documents (along with Bash, Python [default], and R).
- A special note: you cannot edit these environments through the notebook interface unless you create a shortcut to the full environment in the main environments directory.
- For example, my Python 3.6 data analysis environment is installed down a version-specific directory, /home/cherkaue/.conda/envs/cent7/5.1.0-py36/py36-data-analysis, which is specified when I list my conda environments.
- For Jupyter notebooks to modify it, it needs to appear in /home/cherkaue/.conda/envs. I do not want to create yet another copy of the installed modules, so instead I can create a soft link that fools the notebook into looking in the correct location for the installed modules. For my specific example, I first change into the directory /home/cherkaue/.conda/envs and then use the command "ln -s cent7/5.1.0-py36/py36-data-analysis" to create a link in the current directory called "py36-data-analysis".
- Two instances of the same environment will likely appear in the Jupyter notebook New pull-down menu. Use the environment linked to .conda/envs under the Conda tab to make additional edits to the environment within Jupyter notebooks.