
Introduction to working with files in Python

Fundamental File Types

Plain Text Files

Everything on a computer is at its core a binary number, since computers do everything with bits that represent 0 and 1. To have a file that is "plain text", meaning human readable with minimal intervention, binary values must be mapped to specific characters. This mapping is typically called a character encoding. Almost all text is now encoded using Unicode, but there are still older files and documentation that refer to older standards, especially ASCII, which was dominant in the United States (and therefore throughout much of the computing world) for decades. The most common Unicode encoding, UTF-8, is backwards compatible with ASCII, so many people will still refer to "ASCII encoding" even when a file is actually UTF-8.
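In Python, this mapping is visible whenever a string is converted to bytes and back. A quick illustration (the strings here are just examples):

```python
text = "Hello"
data = text.encode("utf-8")    # map characters to bytes using UTF-8
print(data)                    # b'Hello' -- ASCII characters encode to the same bytes
print(data.decode("utf-8"))    # map the bytes back to a string

# A non-ASCII character needs more than one byte in UTF-8
print("é".encode("utf-8"))     # b'\xc3\xa9'
```

Because the first 128 code points match ASCII, encoding a pure-ASCII string with UTF-8 produces exactly the bytes an ASCII file would contain.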

Here are some additional details on plain text encoding systems:

American Standard Code for Information Interchange (ASCII)

  • An ASCII text file is one in which each byte represents one character according to the ASCII code.
  • ASCII files are human readable and are sometimes called plain text files though in reality they are binary files with a standard interpretation.
  • For an example of the ASCII table definition, refer to this version of the ASCII table, which illustrates how binary numbers are assigned to standard English language characters.
  • ASCII files are relatively simple to structure and do not require special tools to read so they are commonly used for data storage.
  • Data corruption is often easier to identify and recover from since the file should be readable using many common text editing and viewing tools.
  • ASCII files have low entropy - information stored in an ASCII file typically occupies more storage than is strictly necessary.
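Python's built-in ord() and chr() functions expose this number-to-character mapping directly, so you can inspect ASCII code assignments without consulting the table:

```python
print(ord("A"))             # 65 -- the ASCII code assigned to uppercase A
print(chr(65))              # 'A' -- the character assigned to code 65
print(ord("a") - ord("A"))  # 32 -- upper- and lowercase letters differ by a fixed offset
```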

Unicode

The Unicode Consortium is the standards body for the internationalization of software and services. Before the Unicode standard was developed, there were many different systems, called character encodings (such as ASCII), for assigning these numbers. These earlier character encodings were limited and did not cover characters for all the world's languages. Even for a single language like English, no single encoding covered all the letters, punctuation, and technical symbols in common use. Languages with very large character sets, such as Japanese, were a challenge to support with these earlier encoding standards.

Early character encodings also conflicted with one another. That is, two encodings could use the same number for two different characters, or use different numbers for the same character. Any given computer might have to support many different encodings, and when data was passed between computers using different encodings, the risk of data corruption or errors increased.

The two primary Unicode encodings you are likely to work with are UTF-8 and UTF-16:

  • UTF-8 (Unicode Transformation Format – 8-bit). UTF-8 is capable of encoding all 1,112,064 valid Unicode scalar values using a variable-width encoding of one to four one-byte (8-bit) code units. It was designed for backward compatibility with ASCII: the first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as ASCII, so that a UTF-8-encoded file using only those characters is identical to an ASCII file. Most software designed for any extended ASCII can read and write UTF-8 (including on Microsoft Windows) and this results in fewer internationalization issues than any alternative text encoding.

  • UTF-16 (16-bit Unicode Transformation Format). UTF-16 encoding is variable-length, as code points are encoded with one or two 16-bit code units, and it is also capable of encoding all Unicode scalar values. UTF-16 is used by the Windows API, and therefore by many programming environments such as the Java programming language and JavaScript/ECMAScript. It is also sometimes used for plain text and word-processing data files. UTF-16 is the only encoding still allowed on the web that is incompatible with 8-bit ASCII, but its adoption has been very limited and it is considered less secure than UTF-8 for web applications.
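The difference between the two encodings is easy to see by encoding the same text both ways and counting bytes (a small sketch; "utf-16-le" is used here to omit the byte-order mark):

```python
text = "A"  # a plain ASCII character
print(len(text.encode("utf-8")))      # 1 byte -- identical to ASCII
print(len(text.encode("utf-16-le")))  # 2 bytes -- one 16-bit code unit

text = "€"  # a character outside the ASCII range
print(len(text.encode("utf-8")))      # 3 bytes
print(len(text.encode("utf-16-le")))  # 2 bytes -- still one 16-bit code unit
```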

Binary

  • A binary file is computer readable but not human readable.
  • Binary formats are used for executable programs and numeric data, whereas text formats are used for textual data.
  • Binary files are typically more compact than ASCII files, as information occupies storage closer to what it truly requires.
  • Binary files are the only way to store the exact values from a computer program, as computers use Base 2 math, which does not translate to the Base 10 representation in an ASCII file without some loss of precision.

Mixed Formats

Many files contain a combination of binary and text formats. Such files are usually considered to be binary. For example, files that have been formatted with a word processor may encode written characters using a plain text format (UTF-8 or UTF-16 now), but may encode additional information such as formatting instructions and embedded images using binary coding. Such mixed files should be treated as binary to preserve the original data when, for example, transmitting the files between computer systems using FTP or other file transfer protocols.

Working with Plain Text Files

Types of plain text files

There are three common table formats for the transfer of data using plain text files: <TAB> delimited, comma delimited, and fixed width files. The first two have standard extensions (.txt and .csv), but the extension can be anything. Using a standard extension just helps operating systems and programs guess what the file type should be; for instance, Microsoft Excel will generally open .txt and .csv files automatically, while it will have to ask about the file format for other extensions. Examples of each of these file formats are provided below:

TAB separated files

Many programs can also parse lines using white space, which includes TABs as well as spaces and end-of-line characters. However, parsing the example data file using white space will result in station names being split at the space. Using TABs allows spaces to be included in data fields, so that lines are correctly parsed into the desired columns.

1    Bondville    BVL    40.05    88.22    213
2    Dixon Springs-Bare    DXB    37.45    88.67    165
3    Brownstown    BRW    38.95    88.95    177 
4    Orr Center (Perry)    ORR    39.80    90.83    206 
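The effect described above is easy to demonstrate: splitting one of these lines on TABs keeps "Dixon Springs-Bare" intact, while splitting on generic white space breaks it apart.

```python
line = "2\tDixon Springs-Bare\tDXB\t37.45\t88.67\t165"

print(line.split("\t"))  # 6 fields; the station name stays together
print(line.split())      # 7 fields; the name is split at the space
```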

Comma separated files

Commas are also commonly used to demarcate the columns in an ASCII data file. This allows fields containing any kind of white space to be preserved, but commas cannot be included in data fields.

1,Bondville,BVL,40.05,88.22,213
2,Dixon Springs-Bare,DXB,37.45,88.67,165
3,Brownstown,BRW,38.95,88.95,177 
4,Orr Center (Perry),ORR,39.80,90.83,206 

Fixed width files

Something of a hold-over from the early days of computer input and output, this is a standard format for use with FORTRAN programs, especially older ones. It is, however, a useful file format: because the size of every line is known in advance, the number of lines and how to skip through the file can be determined fairly easily. It can also be very difficult to read, and it does not forgive accidental changes in field width, as the start and end of each field of data is set by its starting and ending column. Thus data runs together, but the station number is always in the first two columns, the station name in the next 20 columns, and the station ID in the next 3 columns.

 1           BondvilleBVL40.0588.22213
 2  Dixon Springs-BareDXB37.4588.67165
 3          BrownstownBRW38.9588.95177
 4  Orr Center (Perry)ORR39.8090.83206

Warning

Why doesn't my Python script read this file correctly?

The ASCII/UTF-8 standard allows ASCII-only text files to be freely exchanged between and used on different operating systems. However, newlines (end of line indicators) are not implemented the same way on all operating systems. This is a left-over from the early days of operating system development. There are three methods used to mark the end of a line in an ASCII file:

  1. A carriage return (CR) character ('\r')
  2. A line feed (LF) character ('\n')
  3. A CR/LF pair ('\r\n')

Files created on systems running MS DOS and Windows (built on the MS DOS core) use method 3. Files created on systems running Linux/Unix use method 2. Files created on the original Apple MacOS use method 1, but newer files use method 2 since MacOS X is a variant of Unix.
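The three conventions can be seen directly in Python, and note that the string method .rstrip() removes any of them, since it strips all trailing whitespace (a small illustration):

```python
# The same text terminated with each of the three newline conventions
for raw in ("data\r", "data\n", "data\r\n"):
    print(repr(raw), "->", repr(raw.rstrip()))
# every variant strips down to 'data'
```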

Most modern languages, including Python, provide facilities for dealing with these differences. In Python the assistance comes from the os module, which defines a variable called linesep, set to whatever the newline sequence is on the current operating system; this makes adding newlines easy. The string method .rstrip() removes all trailing whitespace, including both '\r' and '\n', so it handles any of the three conventions. The simplest way to stay sane, as far as newlines are concerned, is: always use rstrip() to remove newlines from lines read from a file, and always add os.linesep to strings being written to a file.

The biggest headache comes when you move between operating systems. If you open an ASCII/UTF-8 file created on a Linux/UNIX system on a Windows system using a program looking for MS-DOS file formats (e.g., the original Notepad), it may display the file as a single line, since the file does not contain the CR/LF code combination. When opening an ASCII file created in Windows on a Linux machine, it may look correct, or it may show an unrecognized control character (typically '^M') at the end of each line; either way there is an unexpected character at the end of each line. By default, Python programs will assume the linesep of the operating system running the interpreter. This means that if you switch between operating systems (or share a data file created on a different OS) you will have to deal with the problem.

To have your program check for a problem in advance, you can have it check the end of the line versus the contents of os.linesep. If they are not the same, print an error message and quit the program.
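One way to sketch such a check is to read the first line in binary mode, so that no newline translation takes place, and compare its ending with os.linesep. The function name here is made up for illustration, and it returns a value rather than quitting so the caller can decide what to do:

```python
import os

def line_ending_matches(path, expected=None):
    """Return True if the first line of `path` ends with the expected newline."""
    if expected is None:
        expected = os.linesep          # default: this operating system's convention
    with open(path, "rb") as f:        # binary mode: no newline translation
        first = f.readline()
    return first.endswith(expected.encode())

# Example: create a Unix-style (LF) file, then test it against both conventions
with open("check_demo.txt", "wb") as f:
    f.write(b"1\tBondville\tBVL\n")

print(line_ending_matches("check_demo.txt", "\n"))    # True
print(line_ending_matches("check_demo.txt", "\r\n"))  # False
```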

That still leaves the problem of what to do if the file is incorrectly formatted. You can handle the unexpected line ending within your script rather than producing an error, or you can change the formatting of the file before running the script. Fortunately, Linux/Unix systems typically come with a pair of commands, dos2unix and unix2dos, that will convert any file between the two formats. These will not harm files that were already in the correct format, and can be used in shell scripts to automate the conversion of any number of files.
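If those commands are not available, the same conversion is easy to sketch in Python by working in binary mode (the file names here are placeholders; a small CRLF sample file is created first so the example is self-contained):

```python
# Create a small CRLF (DOS-style) file to convert
with open("sample_dos.txt", "wb") as f:
    f.write(b"line one\r\nline two\r\n")

# dos2unix-style conversion: replace CR/LF pairs with bare LF
with open("sample_dos.txt", "rb") as fin:
    data = fin.read()
with open("sample_unix.txt", "wb") as fout:
    fout.write(data.replace(b"\r\n", b"\n"))
```

Like dos2unix itself, this does not harm a file that already uses LF endings, since no CR/LF pairs will be found.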

Reading and Writing ASCII files using Python

Here I introduce basic file input/output (I/O) methods for ASCII files, and work through some examples.  There are other modules available (some presented under Modules for Reading and Writing Other File Formats) that make it easier to handle exceptions or special formatting considerations, but I recommend that you start here.

Reading an ASCII file in Python

>>> fin = open( "DemoPlainTextFile-TabDelimited.txt", "r" )
  • The open command accepts two arguments:
    • the first is the file name
    • the second tells Python what to do with it -- its "mode"
      • "r" = read (assumed)
      • "w" = write (existing file will be overwritten)
      • "a" = append new information to an existing file.
  • The variable fin is now a file pointer (in C terminology) or a file object (in Python terminology), so if you print its contents it will look different from an ordinary variable
>>> print ( fin )
<open file 'DemoPlainTextFile-TabDelimited.txt', mode 'r' at 0x33ca0>

Note

The hexadecimal number following the "0x" in this statement is the memory address of the file object, not a position within the file. You do not need to know this number, but you should recognize that it can change each time you open the file, since the object may be placed at a new location in memory. It does not change as you read through the file; the current read position is tracked internally by the file object.

  • Now that the file is open, we need to read its contents. It is, of course, most useful if the contents are stored in a variable, though you can also dump the contents directly to the screen.
>>> Var1 = fin.read(45)
>>> print ( Var1 )
1       Bondville       BVL     40.05   88.22   213
2       Dixon Sprin
>>> fin.seek(0)
>>> Var2 = fin.readline()
>>> print ( Var2 )
1       Bondville       BVL     40.05   88.22   213
>>> Var3 = fin.readlines()
>>> print ( Var3 )
['2\tDixon Springs-Bare\tDXB\t37.45\t88.67\t165\n', '3\tBrownstown\tBRW\t38.95\t88.95\t177 \n', '4\tOrr Center (Perry)\tORR\t39.80\t90.83\t206 \n', '5\tDe Kalb\tDEK\t41.85\t88.85\t265 \n', '6\tMonmouth\tMON\t40.92\t90.73\t229 \n', '8\tPeoria\tICC\t40.70\t89.52\t207 \n', '9\tSpringfield\tLLC\t39.52\t89.62\t177 \n', '10\tBelleville\tFRM\t38.52\t89.88\t133 \n']
>>>
  • Python file objects include multiple methods which can be used to work with and modify the file object. The methods used in the above example include:
    • The fin.read() method reads the specified number of bytes (in this case 45) irrespective of what is in the file. When the contents of Var1 are printed you can see that it contains the first line and part of the second line of the file.
    • The seek method with an argument of zero returns the file pointer (fin) to the start of the file.
    • The readline() method reads the file until it reaches an end-of-line or end-of-file marker and returns a string. When the contents of the variable are printed, it contains the first full line of the file.
    • The readlines() method reads the file until it encounters the end-of-file marker and returns a list of strings, where each string represents one line in the file (as marked by end-of-line markers). The contents of this variable include all but the first line of the file; this is because we did not rewind the file using the seek method, so the file pointer started where the previous readline() call left it.
  • The readlines() method tends to read files most efficiently, but it does not allow for control of the read process within the program; that is instead left to post-processing of each string.
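One alternative worth knowing: a Python file object is itself iterable, yielding one line per loop iteration, which avoids holding the whole file in memory at once. A sketch using a small reconstructed sample of the demo file:

```python
# Reconstruct a two-line sample of the tab-delimited demo file
with open("demo_tab.txt", "w") as f:
    f.write("1\tBondville\tBVL\t40.05\t88.22\t213\n")
    f.write("2\tDixon Springs-Bare\tDXB\t37.45\t88.67\t165\n")

rows = []
with open("demo_tab.txt", "r") as fin:
    for line in fin:                       # the file object yields one line at a time
        rows.append(line.rstrip().split("\t"))

print(rows[1][1])  # Dixon Springs-Bare
```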

Working with the contents of an ASCII file

>>> fin.close() # close the file now if you are continuing with the demo and already have it opened
>>>
>>> fin = open( "DemoPlainTextFile-TabDelimited.txt", 'r' ) # open input file
>>> lines = fin.readlines() # read all lines in file
>>> fin.close() # close the file
>>> Data = [0]*len(lines) # create a list of 0s of equal size to the file that was read
>>> for lidx in range(len(lines)):
...     Data[lidx] = lines[lidx].split("\t") # split the line based on the separation flag
...
>>> print ( Data )
[['1', 'Bondville', 'BVL', '40.05', '88.22', '213\n'], ['2', 'Dixon Springs-Bare', 'DXB', '37.45', '88.67', '165\n'], ['3', 'Brownstown', 'BRW', '38.95', '88.95', '177 \n'], ['4', 'Orr Center (Perry)', 'ORR', '39.80', '90.83', '206 \n'], ['5', 'De Kalb', 'DEK', '41.85', '88.85', '265 \n'], ['6', 'Monmouth', 'MON', '40.92', '90.73', '229 \n'], ['8', 'Peoria', 'ICC', '40.70', '89.52', '207 \n'], ['9', 'Springfield', 'LLC', '39.52', '89.62', '177 \n'], ['10', 'Belleville', 'FRM', '38.52', '89.88', '133 \n']]
  • First close the file if it is still open from the previous section.
  • Next open the file again.
  • Read the entire contents of the file into a list called "lines"; each element in the list contains a complete line from the file.
  • Next create a list of zeros, called Data, the same length as the number of lines read from the file. You could simply create an empty list and append to it as you process each line, but declaring the full list at the start of the process is faster - this will become more apparent as we start working with larger data files.
  • Next each line is "split" into separate variables based on the presence of TABs ("\t"). Change the character in the split statement to a comma and the same program will process a comma separated file.
  • Values are stored in the Data structure, so by the end it is a list of lists (e.g., Data[line][field]) where the contents of each line is preserved and all fields in that line have been parsed into a list of strings.

  • To really make use of this data, however, we should convert the strings into appropriate data types. This is particularly true of the values, since in their current form they cannot be used in mathematical functions.

>>> fin = open( "DemoPlainTextFile-TabDelimited.txt", "r" )
>>> lines = fin.readlines()
>>> fin.close()
>>> Data = [0]*len(lines)
>>> for lidx in range(len(lines)):
...     Data[lidx] = lines[lidx].strip().split("\t")
...     Data[lidx][0] = int(Data[lidx][0])
...     Data[lidx][5] = int(Data[lidx][5])
...     Data[lidx][3:5] = map(float,Data[lidx][3:5])
...
>>> print ( Data )
[[1, 'Bondville', 'BVL', 40.049999999999997, 88.219999999999999, 213], [2, 'Dixon Springs-Bare', 'DXB', 37.450000000000003, 88.670000000000002, 165], [3, 'Brownstown', 'BRW', 38.950000000000003, 88.950000000000003, 177], [4, 'Orr Center (Perry)', 'ORR', 39.799999999999997, 90.829999999999998, 206], [5, 'De Kalb', 'DEK', 41.850000000000001, 88.849999999999994, 265], [6, 'Monmouth', 'MON', 40.920000000000002, 90.730000000000004, 229], [8, 'Peoria', 'ICC', 40.700000000000003, 89.519999999999996, 207], [9, 'Springfield', 'LLC', 39.520000000000003, 89.620000000000005, 177], [10, 'Belleville', 'FRM', 38.520000000000003, 89.879999999999995, 133]]
  • This version of the program converts the first and sixth fields into integers using the int() function, applied independently to each variable.
  • It uses the map() function to apply the function float(), which converts ASCII strings to floating point values, to the fourth and fifth fields in a single command.
  • The result of this program is a Data structure that includes integer values for station number and elevation, floating point values for latitude and longitude and strings for the station name and station ID.

  • Reading a fixed width file requires a different approach: the widths of each field must be defined as part of the program. The program below will read the same data fields out of the file DemoPlainTextFile-FixedWidth.fw.

>>> fin = open( "DemoPlainTextFile-FixedWidth.fw", "r" )
>>> lines = fin.readlines() # read all lines in the file into a list
>>> lines = [ i.rstrip() for i in lines ] # strip extra white space from the right side of all lines
>>> fin.close() # close the file, so fin is no longer defined
>>> Data = [0]*len(lines) # create a list of zeros
>>> for lidx in range(len(lines)):
...     # process each line of data from the input file
...     Data[lidx] = [0]*6 # create 1-D array for line
...     Data[lidx][0] = int(lines[lidx][:2])
...     Data[lidx][1] = lines[lidx][2:22].strip()
...     Data[lidx][2] = lines[lidx][22:25]
...     Data[lidx][3] = float(lines[lidx][25:30])
...     Data[lidx][4] = float(lines[lidx][30:35])
...     Data[lidx][5] = int(lines[lidx][35:])
...
>>> print ( Data )
[[1, 'Bondville', 'BVL', 40.049999999999997, 88.219999999999999, 213], [2, 'Dixon Springs-Bare', 'DXB', 37.450000000000003, 88.670000000000002, 165], [3, 'Brownstown', 'BRW', 38.950000000000003, 88.950000000000003, 177], [4, 'Orr Center (Perry)', 'ORR', 39.799999999999997, 90.829999999999998, 206], [5, 'De Kalb', 'DEK', 41.850000000000001, 88.849999999999994, 265], [6, 'Monmouth', 'MON', 40.920000000000002, 90.730000000000004, 229], [8, 'Peoria', 'ICC', 40.700000000000003, 89.519999999999996, 207], [9, 'Springfield', 'LLC', 39.520000000000003, 89.620000000000005, 177], [10, 'Belleville', 'FRM', 38.520000000000003, 89.879999999999995, 133]]

Writing to an ASCII File using Python

  • Opening a file in "w"rite or "a"ppend mode results in a file pointer that functions as an output rather than an input.
  • Unlike reading from files, there are only two standard methods to write to a file
    • write(<string>) which will write the string to the file via the opened file pointer (e.g., fout.write("Hello") if the file pointer "fout" was opened using mode "w" or "a")
    • writelines(<list>) which will write a sequence of lines to the file.
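A short sketch of writing parsed station data back out as a tab-delimited file (the output file name is made up; note that numeric fields must be converted back to strings before writing, since write() only accepts strings):

```python
Data = [[1, "Bondville", "BVL", 40.05, 88.22, 213],
        [2, "Dixon Springs-Bare", "DXB", 37.45, 88.67, 165]]

fout = open("StationsOut.txt", "w")
for record in Data:
    # join the fields with TABs and terminate the line
    fout.write("\t".join(str(field) for field in record) + "\n")
fout.close()
```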

Working with Binary Files

The key difference between text files and binary files is that text files are composed of bytes of binary data where each byte represents a character (mapped using an encoding such as the ASCII table), and historically the end of the file was marked by a special character, known generically as the end of file, or eof. Binary files contain arbitrary binary data, so no specific byte value can be reserved to identify the end of the file; they have to be opened in a special binary mode so that the import process is not stopped at the first appearance of an eof character. Additionally, there are no standard identifiers of the end of a line or even of the boundaries between neighboring pieces of data. This means it is impossible to correctly read a binary data file without understanding its structure.

Why work with binary files?

  • Binary is native to the computer system. Computers operate with bits (Base 2 math, 0-1, on/off), where every 8 bits form a byte, and most computers now operate with 64-bit representations of floating point numbers.
  • Humans traditionally work with Base 10 math (0-9), so representation of binary numbers in a traditional Base 10 system results in significant truncation of floating point numbers. While the errors associated with this can be minimized, only binary files are capable of storing the exact number used by the computer.
  • Binary files can also be significantly more compressed than traditional ASCII files.
    • ASCII characters such as "a", "A", "5" or "%" are each represented using a single one-byte number from the ASCII table.
    • A number like "58.9543" uses a byte for each character, so in this case 7 bytes are used to represent this floating point number. In binary form, this number will always be the size of a floating point number in memory (typically 4 bytes) and that will include the full precision of the number as maintained in computer memory. The ASCII number is most likely an approximation of the full value, and it takes up more memory.
  • Most image, database and higher order scientific data formats are binary or mixed ASCII-binary formats.
  • Compressed files are binary, even if in their uncompressed form they are ASCII.
  • Since you cannot see the contents of a binary file in the same way you can for an ASCII file, you must know more about the file contents before you can start working with it. Thus Metadata (data about data) is very important when using binary formats.

Data Representation and Storage

The information we use in our programs must all be converted into sequences of bytes (8 bits) or words (16 bits). Strings, such as those stored in the ASCII files above, were commonly mapped using the ASCII table. With the increasing need to represent non-English characters, a new standard, Unicode, was produced, with code points ranging from 0 through 1,114,111 to capture the required characters for all written languages. One Unicode encoding, UTF-8, corresponds closely to the earlier ASCII coding standard, so a valid ASCII file can still be read using the newer standard (the opposite is not guaranteed). Python will work with ASCII/UTF-8 files by default, but it can handle other encodings if you tell it which encoding the file or program uses, so that it will not be confused.

Numbers also require special handling. Outside of the integer values from 0 to 255, which are easily stored in a single byte, methods must be developed for storing larger values, floating point numbers, and negative values. There have been many such standards, and as computer architecture changes (moving from 8-bit, to 16-bit, to 32-bit, to 64-bit CPUs) new standards must be developed and the community must come to agreement on them to keep data portable between systems.

What you should take away from this discussion is that when reading (or writing) a binary file, the raw patterns of bits must be interpreted into the correct data type within our program. It is perfectly possible to interpret a stream of bytes that were originally written as a character string as a set of floating point numbers. Of course the original meaning will have been lost, but the bit patterns could represent either. So when we read binary data it is extremely important that we convert it into the correct data type.
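The struct module (used again below) makes this easy to demonstrate: the same four bytes read back as a different type give a completely different, but equally "valid", value.

```python
import struct

raw = struct.pack("<f", 1.0)        # the 4 bytes of single-precision 1.0
print(raw)                          # b'\x00\x00\x80?'
print(struct.unpack("<I", raw)[0])  # 1065353216 -- the same bits read as an unsigned int
```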

The table below provides a list of some of the available types of binary data:

Bits   Name                                  Bytes   Value Range
8      Unsigned Byte or Unsigned Character   1       0 to 255
8      Signed Byte or Signed Character       1       -128 to 127
16     Unsigned Short Integer                2       0 to 65535
16     Signed Short Integer                  2       -32768 to 32767
32     Real or Float                         4       -3.40282e+38 to 3.40282e+38

Binary data types can be signed or unsigned. For signed variables, the first bit represents the sign (+/-) of the value, so the magnitude that can be stored is halved. For unsigned values, every bit contributes to the magnitude, so the range of positive values is doubled, but negative numbers cannot be represented.
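This is easy to verify with the struct module: the same single byte, 0xFF, is 255 as an unsigned character but -1 as a signed one (a minimal sketch):

```python
import struct

raw = b"\xff"                      # one byte with all bits set
print(struct.unpack("B", raw)[0])  # 255 -- unsigned interpretation
print(struct.unpack("b", raw)[0])  # -1  -- signed interpretation
```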

Reading and Writing a Binary File with Python

  • Each "line" (record) in a binary file is a specified number of bytes long; most binary formats do not bother with end-of-line characters, since they increase the size of the file without significant benefit.
  • Each variable or field within the line will also have a specific length.
  • In order to read and parse a binary file correctly you must know both of these lengths.
  • The following code will read a meteorological file, called data_45.1875_-95.0625.BIG.bin, that contains an unknown number of lines, where each line consists of four variables. The first, precipitation, is stored as an Unsigned Short Integer (2 bytes); the other three variables (maximum air temperature, minimum air temperature, and wind speed) are all stored as Signed Short Integers (2 bytes each).
>>> import struct
>>> f = open("data_45.1875_-95.0625.BIG.bin", "rb" )
>>> Bdata = f.read(8)
>>> Bcnt = 10
>>> while Bdata and Bcnt > 0:
...     # process each line of the data file
...     Adata = struct.unpack("@Hhhh", Bdata)
...     print ( Adata[0]/40., Adata[1]/100., Adata[2]/100., Adata[3]/100. )
...     Bdata = f.read(8)
...     Bcnt = Bcnt - 1
...
64.0 304.61 -112.7 -294.4
473.6 194.54 -153.64 -156.15
748.85 84.47 -87.08 10.27
1056.025 130.56 -76.82 194.58
6.4 -215.04 130.51 99.86
0.0 -94.72 135.63 -63.98
0.0 -89.59 -48.66 53.78
0.0 -46.08 -174.1 309.77
0.0 -69.14 -220.25 -299.5
0.0 202.18 92.08 51.22
>>> f.close()

Note

Note that the file mode has now been set to "rb". The "r" still stands for read mode, while the addition of the "b" indicates that the file is binary rather than the default text mode. This matters because Python will not assume that values read from a binary file should be mapped through a character encoding to produce a string.

  • Now that the file has been opened as a binary file, the value provided to the read() method refers specifically to the number of bytes that should be read (a byte is equivalent to a character, so the difference is not immediately apparent, but see what happens if you print the variable Bdata). Since each line of the file contains 4 two-byte short integer values, the read() method is asked to read 8 bytes at a time.
  • Now the binary data must be interpreted, which is where the struct.unpack() function comes into play. The unpack() call tells Python how to convert the bytes in Bdata into variable types that it understands. The string given to unpack() must define the byte order (more on this later) and the format of each set of bytes. Adata will be a tuple containing the variables extracted from Bdata using the provided formats.
    • The "H" indicates that the first pair of bytes will be of type unsigned short integer.
    • The three "h"s indicate that the next three items are all pairs of bytes of type signed short integer.
  • The read() method returns an empty value once the end of the file is reached; since an empty value is treated as false, this is what would eventually end the while loop above (the Bcnt counter simply limits the demonstration to ten records).

Simple File Compression

There are many types of file compression, but one of the simplest is to convert floating point numbers into integers by applying a multiplication factor and then truncating the value at the decimal point. The sample file uses such a compression method, which is explained here.

  • Using short integer values with multipliers is a simple way of compressing files.
    • Every value in this file is stored using 2 bytes rather than 4 bytes, so file size is automatically half what it would be if data was stored as float values.
    • Care must be paid to selecting both the data type and the multiplier to preserve as much data as possible.
    • Precipitation is stored using an unsigned number, since negative precipitation, unlike negative air temperatures, is not possible. Wind speeds can be negative if they are also being used to convey direction, e.g., N-S wind speed and E-W wind speed, where the sense of direction is conveyed by the sign.
    • Temperatures are stored as signed short integers with a multiplier of 100, so from the previous table the values of temperature that can be stored range from -327.68 to 327.67. This easily contains all physically realistic temperatures at the Earth's surface in degrees F or degrees C. As most air temperatures are collected with a precision of 0.1 degrees F/C, this packing method also preserves the original value to its full precision.
    • Wind speed (m/s) used the same format as Temperature, so the range of wind speeds and precision of the wind speeds that can be stored also exceeds the range of realistic values.
    • Precipitation values were stored with a multiplier of 40, so the range of values that can be stored is from 0 to 1638 mm. Maximum daily precipitation does not have a physical limit the way temperature does, and it is measured in tenths of mm (or hundredths of inches). This compression preserves almost all precipitation measurements fully; however, some very large storm events may be truncated, so care must be used before applying this method to compress a new data set.
  • Note that the temperatures that were extracted from the meteorological file do not appear reasonable. This brings us to a topic of concern when using binary files: the endianness of binary values.
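The packing scheme described above can be sketched in a few lines. The temperature value here is made up, and ">h" writes a big-endian signed short as in the sample file:

```python
import struct

tmax = -12.34                                       # a made-up air temperature (deg C)
packed = struct.pack(">h", int(round(tmax * 100)))  # scale by 100, store in 2 bytes
restored = struct.unpack(">h", packed)[0] / 100.0   # reverse: unpack, then unscale
print(len(packed), restored)  # 2 -12.34
```

The round trip preserves the value exactly because the original had no more than two decimal places, which is the whole point of choosing the multiplier to match the precision of the measurements.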

Endianness

  • One thing that must be considered for a binary file, but not for an ASCII file, is the endianness of the data -- or more accurately, of the machine the file was created on and of the one we are reading it with.
  • Endianness generally refers to sequencing methods used in a one-dimensional system (such as writing on computer memory).
    • big-endian (big units first).
    • little-endian (little units first).
  • Endianness is also referred to as "byte order" or "byte sex."
  • When a sequence of small units is used to form a larger ordinal value, convention must establish the order in which those smaller units are placed.
    • English and French -- written left to right
    • Arabic and Hebrew -- written right to left
    • For left to right systems, decimal numbers are big-endian
      • 2958 starts with thousands (2) and ends with ones (8)
  • Computers experience no advantage of one method over the other, but little-endian (adopted by Intel) dominates
  • Networks typically use big-endian so that addresses can be interpreted as they are parsed (US → IN → Lafayette).
    • This is similar to a phone number: the area or country code is entered first, next comes the three digit region prefix (in the US), followed by a more generic 4 digit code.
    • In this way the phone system can be refining your connection before you have even completed the number
  • This occurs at the level of the byte, so single byte data (e.g. ASCII text) is typically unaffected.
  • In the example above about reading data from a binary file, the demonstration code was run on a Solaris UNIX system.
  • If you ran the same demonstration from a Linux or Windows system, chances are you got different values out of the file, probably even values that look like real temperature data.
  • The original file was created on a Linux system with an Intel CPU, so it was created in a little-endian environment.
  • The CPUs for Pasture and Danpatch (the ABE department's Linux web and data servers) are big-endian, so when the script tried to interpret the file it reversed the byte order, resulting in unreasonable numbers.
  • The Python code that was provided used an "@" at the start of the unpack statement, which tells Python to use the system default for interpreting endianness or byte order. Most of the systems in use at Purdue are now based on Intel or similar CPUs and therefore use little-endian byte-order.  The file you downloaded was created with big-endian byte order, which means that the values printed to the screen were read using the wrong protocol.  Replacing the "@" in the script with a ">" will force Python to interpret the file as big-endian no matter what the local system architecture is. 
  • Retry the sample script after replacing the "@" with "<" for little-endian and then with ">" for big-endian and see what happens to the output.  Which of these outputs seems the most likely to be environmental data, where the columns are: daily precipitation (mm), maximum daily air temperature (deg C), minimum daily air temperature (deg C), and average daily wind speed (m/s)?
  • The file data_45.1875_-95.0625.LITTLE.bin was created on a little-endian system; which version of the sample script produces reasonable output for it?

Writing a Binary File with Python

  • Opening a file in "w"rite or "a"ppend "b"inary mode ("wb" or "ab") results in a file pointer that functions as an output rather than an input, and will write to a file without interpreting the contents of the binary data.
  • This code essentially reverses the previous script, except that this time it opens both an input and an output file.  Since the binary file that will be written is not human readable, it is better to write it directly to an output file than to the screen.
  • The file data_45.1875_-95.0625.asc is provided as the correctly unpacked version of the binary data file used previously.
>>> import struct
>>> fin = open( "data_45.1875_-95.0625.asc", "r" )
>>> fout = open( "data_45.1875_-95.0625.bin", "wb" )
>>> Adata = fin.readlines() # read all lines from the file
>>> Adata = [ i.rstrip() for i in Adata ] # strip whitespace from the right side of the lines
>>> fin.close() # all done with it, so close it
>>> for CurrLine in Adata:
...     # process all lines of the file
...     LineData = CurrLine.split() # split string using white space
...     LineData[0] = int(float(LineData[0])*40.) # multiply precip by 40 for storage
...     LineData[1] = int(float(LineData[1])*100.) # multiply Tmax by 100 for storage
...     LineData[2] = int(float(LineData[2])*100.) # multiply Tmin by 100 for storage
...     LineData[3] = int(float(LineData[3])*100.) # multiply wind speed by 100 for storage
...     Bdata = struct.pack("@Hhhh", LineData[0], LineData[1], LineData[2], LineData[3] ) # pack data into short int's using system endian setting.
...     fout.write(Bdata) # write current line to the output file
...
>>> fout.close()
  • The process of writing a binary file is the reverse of the reading function.  Data in various Python formats must be packed into a structure using the struct.pack command, and then the write() function can be called directly to put the output into the file.
  • The close() command flushes the file buffer and closes the file cleanly.  The computer uses a buffer to reduce the time spent reading from and writing to the disk, which is slower than accessing memory.  This means that you may find an empty or incomplete file if you try to view the file while the program is still running, which is still true while the interpreter is waiting for your next command.
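The whole pack/write/read/unpack cycle can be checked in memory with io.BytesIO, which behaves like a file opened in "wb"/"rb" mode, so no disk file is needed. The record values below are invented, and ">" (big-endian) is chosen explicitly here rather than the "@" system default used in the script above:

```python
import io
import struct

# precip (mm), Tmax (deg C), Tmin (deg C), wind speed (m/s); made-up values
record = (12.3, 10.15, -0.52, 3.47)

fout = io.BytesIO()                       # in-memory stand-in for the .bin file
packed = struct.pack(">Hhhh",
                     int(round(record[0] * 40)),    # precip x 40, unsigned
                     int(round(record[1] * 100)),   # Tmax x 100
                     int(round(record[2] * 100)),   # Tmin x 100
                     int(round(record[3] * 100)))   # wind speed x 100
fout.write(packed)

fout.seek(0)                              # "reopen" the buffer for reading
p, tx, tn, ws = struct.unpack(">Hhhh", fout.read(8))
print(p / 40.0, tx / 100.0, tn / 100.0, ws / 100.0)
```

Because the same ">" prefix is used for both pack and unpack, the original values come back out; mixing "<" and ">" reproduces the endianness garbage described earlier.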

Modules for Reading and Writing Other File Formats

Python has the ability to read and write to many file types.  Many of the operations described above can be facilitated by importing a module that includes read and write functions specifically designed for a particular file format.  Examples of some of the modules available for Python are included in the table below:

File Format: Description (Web Link)

  • Plain Text (ASCII/UTF): The csv module.  A module included with the standard Python distribution to simplify the reading and writing of CSV and other delimited file types.  Its greatest advantage over what was described previously is that it is more robust when handling ASCII files developed on different systems. LINK
  • Plain Text (ASCII/UTF): The NumPy module.  NumPy is the fundamental package for scientific computing with Python, which adds significant tools including a highly efficient N-dimensional array object.  As part of the package it includes powerful functions for handling ASCII data types.  It is also often required for handling common scientific data formats (see netCDF and HDF below). LINK
  • gzip: Compression of files using the gzip standard.  Includes file methods for opening, reading, and writing standard gzip compressed files from within Python scripts. Example Script. Example Zipped Tab Delimited File. LINK
  • netCDF 4, netCDF 3, HDF5: The netcdf4-python module can read and write files in both the new netCDF 4 and the old netCDF 3 format, and can create files that are readable by HDF5 clients. LINK
  • HDF5: The h5py package is a Pythonic interface to the HDF5 binary data format. LINK
  • MATLAB files, IDL files, Matrix Market files, WAV sound files, Arff files, and netCDF: The SciPy module.  The SciPy library provides many user-friendly and efficient numerical routines, such as routines for numerical integration and optimization, that have been found useful for scientific analysis.  As part of this package it includes an input/output module that can read many common scientific data formats. LINK
  • Plain Text, MS Excel, HDF5, SQL, many others: The pandas module has the ability to read many scientific data formats. LINK
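As a small sketch of the first entry above, the standard-library csv module handles quoting that a plain string split() would mangle. The data here is a made-up in-memory string (read via io.StringIO) standing in for a downloaded file:

```python
import csv
import io

# A tiny CSV with a quoted field containing a comma (invented values)
text = 'date,station,precip_mm\n"2001-06-14","West Lafayette, IN",12.3\n'

rows = list(csv.reader(io.StringIO(text)))
header, data = rows[0], rows[1:]
print(header)        # ['date', 'station', 'precip_mm']
print(data[0][1])    # West Lafayette, IN  (the quoted comma is preserved)
```

Splitting the same line with split(",") would break the station name into two fields; the csv reader interprets the quotes correctly, which is exactly the kind of robustness the module provides over the manual approach described earlier.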