Getting started on the Assignment - Using Python's Built-in File Methods

Warning

This help page is deprecated as of Spring 2025 due to changes in the assignment instructions. Those changes were implemented to further improve the transition from Assignments 01 & 02 to 03, which has continued to be a big step for many students. Revision of this guidance document may occur after testing of the new assignment document, if it is decided that additional guidance is still helpful.

This document is still posted because some students may find it useful. The goals of the assignment have not changed, but the new order of steps and the additional functions in the template are not reflected in this document.

This page provides additional help for getting started on the third assignment for ABE 65100 Environmental Informatics. It assumes that you have accepted the assignment and made a personal clone of the assignment repository and that you have used the template to create the bare-bones solution file.

Read the assignment carefully. There are a lot of parts to this assignment, but if you have completed the assigned lectures and reading materials you know the basics of everything that you have to do. The key to being successful with this assignment is to follow the "divide and conquer" approach to software design. Break the overall complex task of completing the assignment into smaller steps that you know how to do, test those smaller steps so that you know that they work, and then move to the next task. This guide will help you get started with that process.

Step 1 - Dealing with the Input File

The assignment requires that you process a directory full of input files, reformat their data and write the results to a new output file. That is a lot of steps, and you are unlikely to be successful if you try to tackle them all at the same time. Trust me, I know from personal experience: try to write the code all at once to be fast, and you will spend more time debugging the overly complex final code than you would have spent if you had designed and tested it as described here.

What is the first, relatively simple task that you could implement in your code?

How about starting the process_data_file() function by opening the file? Not all of the files, just one of the input files. If the function opens and handles one file, then you can write code later to make it process all of the required input files.

The resulting code would look something like this:

def process_data_file( inFile, outFile ):
    fin = open( inFile, 'r' )
    print( "File opened!" )

if __name__ == '__main__':
    process_data_file( "datasets/ill-soilmoist-data-001.txt", "datasets/ill-soilmoist-data-Merge.csv" )
    print( "Function call complete!")

Here the function process_data_file() is called from the main section of the code with two file names. The second filename is the required name for the output file, taken from the assignment README.md file. The first filename is for one of the input files. Something to consider is that the README.md file does not give these filenames exactly, instead describing the filename format.

If you look inside the repository, you will find that there are no data files in the main folder, just the README.md, the template, your assignment solution file and perhaps a .gitignore file, depending on how you are viewing the folder contents and whether or not that method hides special files (it will be visible in GitHub, but not by default in Windows Explorer or Mac OS Finder). The repository also includes a folder called datasets; open that folder and you will find the input files that have been provided with the assignment. You may or may not see the full filename at this point. Again, GitHub will not hide anything, while both Explorer and Finder try to be helpful and declutter your life, so they typically default to not showing filename extensions (the last part of the filename after a '.') for file formats that they recognize. Instead they may indicate that the file type is "text" or "TXT" or something similar. Just because your operating system is hiding the file extension does not mean it does not exist, so the input file still has to be defined with the ".txt" extension.

Note

Here are some helpful website instructions to help you view file extensions on all files within a folder or on your system:

By sending a single file to the function process_data_file(), you can build and debug the basic function before you have to worry about how to read through all of the required input files.

The function receives both the input and output file names, and then opens the input file with the Python open() function. This function returns a file object (which I will sometimes call a pointer to a file, since that is the C language nomenclature that I learned first). The file object is stored in a new variable called "fin", which is local to the function.

Then since it is annoying to have a program run and not do anything, I have added print statements after the program opens the file, and again after the main code returns from the function call. If both messages print, I know that the program finished without errors.

So you should now run the program and make sure that it works.

Did you get an error?

Note

Then look back at the code shared above and what you typed into the solution file template. Watch for warnings and errors to the left of the code, where the Spyder IDE is trying to warn you about potential problems (the variable outFile has not been used yet, so do not delete it; just leave the warning message for now).

Now that the first part of the code is working, let's think about how we can catch an error that occurs when the code tries to open the input file.

Let's make use of Python's built-in error and exception handling system, where Python lets you "try" a code segment and catch the exceptions. This is introduced in Think Python Chapter 14: Files, part of the reading assignment for this module. It is a powerful method for handling exceptions raised during program execution, and it functions much like the if-then conditional statements you have already learned.

Note

If you want to learn more about the try-except command structure in Python check out this nice tutorial by W3Schools.

Using the try-except process, modify the existing code to catch IOErrors with the file, and raise a SystemExit exception so that the code will end gracefully. When using the try-except process to handle errors, it is useful to print a message for the user, so they know what happened and can respond.

Here is what my code looks like:

def process_data_file( inFile, outFile ):
    try:
        fin = open( inFile, 'r' )
    except IOError:
        print("ERROR: Unable to open the input file {}.".format(inFile))
        raise SystemExit
    print( "File opened!" )

if __name__ == '__main__':
    process_data_file( "datasets/ill-soilmoist-data-001.txt", "datasets/ill-soilmoist-data-Merge.csv" )
    print( "Function call complete!")

Run the code and it should work just like the first time, because the input file still exists and can be opened.

Next add a typo to the name of the input file. What happens? Hopefully you got your error message and the program exited without an error.

Note that this error catches any problems with opening the input file, so technically it is checking both that the file exists and that it can be opened.
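To see that a single except clause covers more than one kind of failure, here is a minimal sketch. The helper name try_open() and the demo file names are made up for illustration, and the demo file is created in a temporary folder so that the sketch runs anywhere. It reports which exception, if any, was caught.

```python
import os
import tempfile

def try_open(fileName):
    """Return 'opened' on success, otherwise the name of the exception caught."""
    try:
        fin = open(fileName, 'r')
    except IOError as err:   # in Python 3, IOError is another name for OSError
        return type(err).__name__
    fin.close()
    return "opened"

# Hypothetical stand-in file, created here only so that the sketch is runnable.
demoDir = tempfile.mkdtemp()
demoFile = os.path.join(demoDir, "demo-input.txt")
fout = open(demoFile, 'w')
fout.write("placeholder\n")
fout.close()

resultGood = try_open(demoFile)                                  # file exists
resultTypo = try_open(os.path.join(demoDir, "demo-inptu.txt"))   # typo in name
resultFolder = try_open(demoDir)                                 # a folder, not a file

print(resultGood, resultTypo, resultFolder)
```

The misspelled name is caught as a missing file, while the folder exists but cannot be opened as a file, so it is caught as a different exception whose exact name depends on the operating system.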

Note that it is useful to add a comment at the start of this block of code to indicate that the code is going to open the input file and catch any errors! Document your code as you go along to save time and effort later.

Step 2 - Reading from the Input File

A good next step is to read the contents of the input file. That code can follow the code segment that opens the input file, and it will look a lot like the sample code from the Introduction to working with files in Python wiki page that was completed in preparation for this assignment. Remember that the file is already open from Step 1, so you simply need to read in the contents of the file using file methods such as readlines() on the existing file object, "fin".

It's a good idea to add more print statements to echo data read from the file back to the screen so that you know that it worked. I also suggest changing which input file you are using at least once to make sure that your code will in fact work with all of them.
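As a rough sketch of what reading and echoing might look like (not the assignment solution), the example below opens a file, reads it with readlines(), and prints a short summary. The function name read_data_file() and the file contents are assumptions made up for this example; the real input files have their own format.

```python
import os
import tempfile

def read_data_file(inFile):
    """Open inFile, read every line, and echo a short summary to the screen."""
    try:
        fin = open(inFile, 'r')
    except IOError:
        print("ERROR: Unable to open the input file {}.".format(inFile))
        raise SystemExit
    lines = fin.readlines()   # list of strings, one per line, each ending in '\n'
    fin.close()
    print("Read {} lines from {}".format(len(lines), inFile))
    for line in lines[:3]:    # echo only the first few lines
        print(repr(line))     # repr() makes the hidden '\n' characters visible
    return lines

# Hypothetical stand-in file with made-up contents, so the sketch runs anywhere.
demoDir = tempfile.mkdtemp()
demoFile = os.path.join(demoDir, "demo-input.txt")
fout = open(demoFile, 'w')
fout.write("header line\n1 0.25\n2 0.31\n")
fout.close()

demoLines = read_data_file(demoFile)
```

Printing with repr() is a design choice for debugging: it shows the end-of-line characters that a plain print() would hide.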

Note

You can also use the console to try out shorter pieces of code to see if they work as expected before committing them to the assignment solution file. For example, to try out reading the contents of the file, start by defining the variable inFile in the console, then copy and run the code from within the process_data_file() function (DO NOT include the "def" statement or indent the code).

Why the code from within the function and not the function?

Because running the function will open the file and save the file object to the variable "fin", but that variable is local to the function. Once the function ends, the variable "fin" is deleted from computer memory (and the file is closed, even though we did not specifically include that file method call).
The only way to work with the variable "fin" is to use it within the function or return it to the calling code. Running only the code from within the function means that the variable "fin" is defined in the console, so you can interact with it from the console and try out the next part of your code.
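The scope rule described above can be checked with a small sketch. The two function names and the demo file are hypothetical: one function keeps "fin" to itself, and the other returns the file object to the calling code.

```python
import os
import tempfile

def open_only(inFile):
    fin = open(inFile, 'r')   # fin exists only while this function is running
    print("File opened!")

def open_and_return(inFile):
    fin = open(inFile, 'r')
    return fin                # hand the file object back to the calling code

# Hypothetical stand-in file, created only so that the sketch is runnable.
demoDir = tempfile.mkdtemp()
demoFile = os.path.join(demoDir, "demo-input.txt")
tmp = open(demoFile, 'w')
tmp.write("first line\nsecond line\n")
tmp.close()

open_only(demoFile)
try:
    fin                       # the local variable did not survive the call
    finIsVisible = True
except NameError:
    finIsVisible = False
print("Is fin visible after open_only()?", finIsVisible)

handle = open_and_return(demoFile)   # the returned object can be used here
firstLine = handle.readline()
handle.close()
print(repr(firstLine))
```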

Step 3 - Dealing with the Output File

In order to open the output file you need similar code to what you did with the input file, but you cannot just rely on the try-except method since you not only have to catch errors with opening the file, but you also have to know if the output file exists.

How do you check if a file exists?

Again from Think Python Chapter 14: Files, you have been introduced to the os (operating system) module in Python. The same Python code can be run on all types of systems (Windows, Mac OS X, Chrome OS, Linux, etc.), as long as the Python interpreter can be installed. Python does this by having a standard set of function calls for operating system specific commands (typically related to file (I/O) handling) so that the Python code looks the same, but in the background the Python interpreter is making sure that actual commands match the local computer's operating system.

In any case, the function os.path.exists(NAME) will check to see if NAME exists on the local file system. This would, however, return "True" if there was a directory (or folder) that matched NAME. Since we want to check whether a file called NAME exists, we might be better off using the os.path.isfile() function. Fortunately there are many such functions in the os.path module so that your conditional check can be customized as needed.
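The difference between these checks is easy to try out. The sketch below uses a temporary folder and made-up file names (assumptions for illustration only) and compares os.path.exists(), os.path.isfile() and os.path.isdir() for a folder, a file, and a name that does not exist.

```python
import os
import tempfile

# Hypothetical names: a folder, a file inside it, and a file that is not there.
demoDir = tempfile.mkdtemp()
demoFile = os.path.join(demoDir, "demo-output.csv")
missing = os.path.join(demoDir, "not-there.csv")

fout = open(demoFile, 'w')
fout.write("placeholder\n")
fout.close()

checks = {}
for name in (demoDir, demoFile, missing):
    # each entry stores the answers as (exists, isfile, isdir)
    checks[name] = (os.path.exists(name), os.path.isfile(name), os.path.isdir(name))
    print(name, checks[name])
```

The folder passes the exists() check but fails isfile(), which is why isfile() is the safer test when the name is supposed to be a data file.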

So a conditional statement to check if the file exists will look something like this:

# note: this check requires "import os" at the top of the solution file
if os.path.isfile(outFile):
    print("The output file already exists!")
    # open the output file for appending more data
    # make sure that you do not write the header in this case
else:
    print("The output file does not exist, must create a new one!")
    # open a brand new output file for writing
    # and do not forget to write the header to the new file!
# and now the output file is open and ready for data to be written to it!

In this case, I did not provide a lot of details, but instead wrote the required conditional and added comments that tell me what I need to do next. Note that the comments are indented correctly so that I know where the code needs to go. These comments are also kind of informal, so they will not make great in-line comments later but they are placeholders for where those comments should go later. Because I have print statements in each branch of the conditional, this code will run and I can test that the conditional check works.

Next, I already wrote code to open the input file and catch any exceptions, so let's copy that code in under the two comments about opening the output file. Make sure that it is indented correctly, then edit the statements to replace inFile with outFile, fin with fout, and the 'r'ead mode with either 'a'ppend or 'w'rite depending on where the code is in the conditional.

Add print statements to make sure that the code is doing what you think it should. If you want, you can also add an fout.write() statement into the code under the comment about being ready to write data, and have it write a simple string to the file. Run the code and check that the print statements are as expected. Then open the output file to look for your message.
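Put together, the conditional and the copied try-except blocks might look something like the sketch below. This is one possible arrangement under stated assumptions, not the required solution: the function name open_output_file(), the returned isNew flag, the placeholder header text and the demo file name are all made up for illustration.

```python
import os
import tempfile

def open_output_file(outFile):
    """Open outFile for appending if it exists, otherwise create it.

    Returns the file object and a flag that is True when the file is new,
    so the calling code knows whether a header still has to be written.
    """
    if os.path.isfile(outFile):
        print("The output file already exists!")
        try:
            fout = open(outFile, 'a')   # append: keep what is already there
        except IOError:
            print("ERROR: Unable to open the output file {}.".format(outFile))
            raise SystemExit
        isNew = False
    else:
        print("The output file does not exist, must create a new one!")
        try:
            fout = open(outFile, 'w')   # write: start a brand new file
        except IOError:
            print("ERROR: Unable to open the output file {}.".format(outFile))
            raise SystemExit
        isNew = True
    return fout, isNew

# Hypothetical output file in a temporary folder, so the sketch runs anywhere.
demoDir = tempfile.mkdtemp()
demoOut = os.path.join(demoDir, "demo-Merge.csv")

results = []
for run in (1, 2):                      # two runs: first creates, second appends
    fout, isNew = open_output_file(demoOut)
    if isNew:
        fout.write("placeholder header\n")
    fout.write("test string from run {}\n".format(run))
    fout.close()
    results.append(isNew)

fin = open(demoOut, 'r')
contents = fin.readlines()
fin.close()
print(contents)
```

Running the open twice shows both branches of the conditional: the header appears once, and the second run only adds its test string.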

Note

Be careful to use an appropriate tool to open and view the contents of your plain text data files.

Graphical word processing programs such as Word will try to make your plain text look nice, so they use different fonts including those where character widths are scaled, which means that columns may not line up as expected (e.g., i's and 1's are narrower than m's and w's). Such programs will also try to interpret what you meant with the end-of-line characters, and not what you actually have, so you may or may not be able to see problems with those hidden characters in your files. They may also try to clean up what they deem to be excessive white space at the start or end of lines.
Spreadsheet programs such as Numbers or Excel will ask to ingest the values in your text files. If allowed to do so, you will not be able to see if the delimiters are used correctly in your file, since they will have been interpreted. These programs will also try to be helpful with white space and end of line characters.

The best tools for looking at your data files in their native form are:

Windows - Notepad (installed by default) and Notepad++ (a free download) are excellent tools. Note that WordPad is better than Word but will still try to be helpful and clean up the display, so not all file problems will be visible.
Mac OS X - TextEdit is installed and works reasonably well, but I prefer Aquamacs (which can be installed using this link), a Mac OS X native version of the emacs editor, which is the best editor in Linux (and Mac OS X is Unix-based, much like Linux).

One final word of warning - NEVER, EVER save your data files from a graphical word processing program; they will never be the same. Also be very careful if saving your data file from a spreadsheet - these handle the various delimiters just fine, but you will need to carefully select the type of text file to write (DOS, Windows, Linux, Mac, etc.) to get a file that is directly usable by your Python program. You will be able to deal with this when you are more comfortable with Python and file formats, but for now just stick to reading and writing text data files with Python - it will make sure to minimize the funny business.

Step 4 - Writing to the Output File

Now your code structure should be set to write data to the output file. I suggest starting with writing the output file header. Write the code, run the code, check the output file. This should again be similar to the code snippets from the Introduction to working with files in Python wiki page.

Is it working? If not, then work on debugging your code. Is it giving an error or not working the way you expected? Errors that produce messages tend to be easier to identify and fix, but sometimes you have to look at the code statements that occurred before the one that threw the error. Semantic errors, those where the code runs but does not do what is expected, are harder to find and are typically related to logic problems. If you did the early rounds of testing, then your semantic errors are most likely limited to the code you just wrote. Perhaps you indented that code incorrectly so it is in the wrong part of the larger sequence of conditional statements?

Next, write the data for a single file to the output file. Again, check the results carefully and do not move on until you have it working.

Step 5 - Finish the main part of the code

Now that you think the process_data_file() function is working for a single file, it is time to build the main part of the code.

I suggest that you start by checking if the output file exists and deleting it if it does. This code goes within the main code block, but before the first call to the process_data_file() function. Run the code. It should delete any previous version of the output file and create a new version that only contains data from a single input file. If you have been testing the function during development, the output file might be quite long when you get to this step, since the function is supposed to append to any existing file, which means that the file got longer with every test run. Now the output file should always be the same length, and that length should be equal to the length of the input file being used. Some simple print statements that check the various file lengths will help you diagnose problems and let you know when this part of the code works.
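A minimal sketch of the delete-if-present check is shown below, using os.remove() to delete the file. The helper count_lines(), the demo file name and its leftover contents are assumptions made up for illustration; in the assignment the check would go in the main code block before the first process_data_file() call.

```python
import os
import tempfile

def count_lines(fileName):
    """Return the number of lines in a text file."""
    fin = open(fileName, 'r')
    nLines = len(fin.readlines())
    fin.close()
    return nLines

# Hypothetical leftover output file from earlier test runs.
demoDir = tempfile.mkdtemp()
demoOut = os.path.join(demoDir, "demo-Merge.csv")
fout = open(demoOut, 'w')
fout.write("old line\nold line\nold line\n")
fout.close()
leftoverLength = count_lines(demoOut)
print("Leftover output file has {} lines".format(leftoverLength))

# The check itself: remove any previous version before processing starts.
if os.path.isfile(demoOut):
    os.remove(demoOut)
    print("Deleted the previous output file.")

existsAfterDelete = os.path.isfile(demoOut)
print("Output file still exists?", existsAfterDelete)
```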

Next, copy the process_data_file() call statement so that there are two calls to the function. Each can use the same or a different input file. Run the code and make sure that the new output file is the same length as the two input files minus the second set of three header lines - only one copy of the header should be present. The length of the file should also be the same every time you run the code.
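The length check described above can be written as a few lines of arithmetic. In the sketch below, copy_data_file() is only a stand-in for process_data_file(): it copies lines without any reformatting and assumes the output header is the same three lines as the input header, which may not match the real assignment. The demo files and their contents are also made up.

```python
import os
import tempfile

HEADER_LINES = 3   # from the text: each input file starts with three header lines

def count_lines(fileName):
    fin = open(fileName, 'r')
    nLines = len(fin.readlines())
    fin.close()
    return nLines

def copy_data_file(inFile, outFile):
    """Stand-in for process_data_file(): copy lines, keeping one header only."""
    fin = open(inFile, 'r')
    lines = fin.readlines()
    fin.close()
    if os.path.isfile(outFile):
        fout = open(outFile, 'a')
        lines = lines[HEADER_LINES:]   # header is already in the output file
    else:
        fout = open(outFile, 'w')
    fout.writelines(lines)
    fout.close()

# Two hypothetical input files with 4 and 6 data lines after the header.
demoDir = tempfile.mkdtemp()
inputs = []
for i, nData in enumerate((4, 6), start=1):
    name = os.path.join(demoDir, "demo-{:03d}.txt".format(i))
    fout = open(name, 'w')
    fout.write("h1\nh2\nh3\n")
    for row in range(nData):
        fout.write("{} {}\n".format(i, row))
    fout.close()
    inputs.append(name)

demoOut = os.path.join(demoDir, "demo-Merge.csv")
for name in inputs:
    copy_data_file(name, demoOut)

expected = count_lines(inputs[0]) + count_lines(inputs[1]) - HEADER_LINES
actual = count_lines(demoOut)
print("Expected {} lines, found {}".format(expected, actual))
```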

When that works, you can delete the second function call, and put the first inside of a loop statement, a for-loop perhaps. Feed the for-loop a list of the possible input files. This could be a simple list that you type in (if you think it is not going to change often), or you could make Python create the list, perhaps by using the Python os.listdir() function or the glob module, which introduces Linux style file matching. Note that the glob module uses Linux file and path conventions, which are not the same as those used by Windows (for example, there are no drive letters such as A:, C:, D:), but Windows still understands since Python does all of the interpreting based on the local operating system. We will talk about Linux file conventions soon in the in-person class. Distance learners are directed to tutorials about the glob command instead.
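Both ways of building the file list can be sketched as follows. The folder and file names are hypothetical stand-ins created in a temporary folder; in the assignment the pattern would point at the datasets folder instead. Note that the "*.txt" pattern skips files with other extensions, which matters if the .csv output file lives in the same folder as the inputs.

```python
import glob
import os
import tempfile

# Hypothetical folder with two input files and one unrelated file.
demoDir = tempfile.mkdtemp()
for name in ("demo-002.txt", "demo-001.txt", "notes.md"):
    fout = open(os.path.join(demoDir, name), 'w')
    fout.write("placeholder\n")
    fout.close()

# Option 1: glob matches a wildcard pattern and returns full paths.
# sorted() is used because the order of the returned list is not guaranteed.
globFiles = sorted(glob.glob(os.path.join(demoDir, "*.txt")))

# Option 2: os.listdir() returns bare names, so filter and join them yourself.
listFiles = sorted(
    os.path.join(demoDir, name)
    for name in os.listdir(demoDir)
    if name.endswith(".txt")
)

for inFile in globFiles:
    # in the assignment, this is where process_data_file() would be called
    print("Would process:", inFile)
```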

Finish testing your code - do a little stress testing: How does your code work when asked to open a non-existent file? What if you change the name of the output file?

When you are satisfied that it works, clean up the extraneous print statements that you added to debug the code. Update your documentation (header and in-line comments). Commit any last changes to GitHub and submit to Gradescope.