5.3. Files

5.3.1. Persistence

So far, we have learned how to write programs and communicate our intentions to the Central Processing Unit using conditionals, functions, and iterations. We have learned how to create and use data structures like lists and strings in the Main Memory. The CPU and memory are where our software works and runs. It is where all of the “thinking” happens.

But if you recall from our hardware discussions, once the power is turned off, anything stored in either the CPU or main memory is erased. So up until now, our programs have only produced temporary results.

Hardware architecture, including secondary memory

In this chapter, we start to work with Secondary Memory, like hard drives, where files are stored. Secondary memory is not erased when the power is turned off, and it has much more capacity than the main memory. If we want to keep our data long-term or work with large datasets, we need to learn how to access and store data in files.

Example code in this section will treat the data in the box below as a file named atotc_opening2.txt. As before, this contains opening lines from A Tale of Two Cities (a few more this time). You can edit the file below, and its changed contents will be used in any active code blocks.

Data file: atotc_opening2.txt
It was the best of times,
it was the worst of times,
it was the age of wisdom,
it was the age of foolishness,
it was the epoch of belief,
it was the epoch of incredulity,
it was the season of Light,
it was the season of Darkness,
it was the spring of hope,
it was the winter of despair,
we had everything before us,
we had nothing before us,
we were all going direct to Heaven,
we were all going direct the other way--
in short, the period was so far like the present period, that some of
its noisiest authorities insisted on its being received, for good or for
evil, in the superlative degree of comparison only.

There were a king with a large jaw and a queen with a plain face, on the
throne of England; there were a king with a large jaw and a queen with a
fair face, on the throne of France. In both countries it was clearer than
crystal to the lords of the State preserves of loaves and fishes, that
things in general were settled for ever.

It was the year of Our Lord one thousand seven hundred and seventy-five.
Spiritual revelations were conceded to England at that favoured period, as
at this. Mrs. Southcott had recently attained her five-and-twentieth blessed
birthday, of whom a prophetic private in the Life Guards had heralded the
sublime appearance by announcing that arrangements were made for the
swallowing up of London and Westminster. Even the Cock-lane ghost had been
laid only a round dozen of years, after rapping out its messages, as the
spirits of this very year last past (supernaturally deficient in
originality) rapped out theirs. Mere messages in the earthly order of events
had lately come to the English Crown and People, from a congress of British
subjects in America: which, strange to relate, have proved more important to
the human race than any communications yet received through any of the
chickens of the Cock-lane brood.

5.3.2. Text Files and Lines

A text file can be thought of as a sequence of lines, much like a Python string can be thought of as a sequence of characters. For example, atotc_opening2.txt above contains 37 lines.

To break the file into lines, there is a special character that represents the “end of the line” called the newline character.

In Python, we represent the newline character as \n in string constants. (That’s a “backslash,” because it is leaning backwards.) Even though this looks like two characters, it is understood by Python as a single character.

If we print a string variable containing the newline character, we can see that it prints the string on two lines, moving to the beginning of the next line when the newline character is reached:

You can also see that the length of the string X\nY is three characters because the newline character is a single character.
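The behavior described above can be checked with a short sketch. The variable name stuff is only illustrative.

```python
# A three-character string: 'X', a newline, and 'Y'
stuff = 'X\nY'

print(stuff)       # X and Y appear on separate lines
print(len(stuff))  # 3, because \n counts as a single character
```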

So when we look at the lines in a file, we need to imagine that there is a special invisible character called the newline at the end of each line that marks the end of the line.

Note

The backslash \ character is an escape character, meaning it is used to indicate the start of a special character in a string. But what if you want to include an actual backslash followed by an n in a string? You can escape the backslash itself: "\\n" is a string with two characters: a backslash and an n.
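A small sketch of the difference between the newline character and an escaped backslash followed by n. The variable names are illustrative only.

```python
one_char = "\n"    # a single newline character
two_chars = "\\n"  # a backslash followed by the letter n

print(len(one_char))   # 1
print(len(two_chars))  # 2
print(two_chars)       # displays a backslash and an n, with no line break between them
```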

5.3.3. Opening Files

When we want to read or write a file in a program, we first must open the file. When you open a file, you are asking the operating system to find the file by name, make sure the file exists, and prepare it to be read from or written to.

To open a file, we can use the open() function [[full documentation](https://docs.python.org/3/library/functions.html#open)]. In its simplest form, it takes one argument: a string containing the name of the file to open. In this example, we open the file (from above) atotc_opening2.txt:
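A minimal sketch of such a call. So that it runs outside this book, it first creates a small stand-in file named atotc_sample.txt (a hypothetical name) holding a few lines from atotc_opening2.txt. The setup step and the variable name fhand are assumptions made for illustration.

```python
from pathlib import Path

# Assumption: create a tiny stand-in file (hypothetical name) so the sketch
# runs on its own; in the book, atotc_opening2.txt already exists.
Path('atotc_sample.txt').write_text(
    'It was the best of times,\n'
    'it was the worst of times,\n'
    'it was the age of wisdom,\n'
    'we had everything before us,\n'
)

# open() takes the file name as a string and returns a file object
fhand = open('atotc_sample.txt')
print(fhand)   # shows a description of the file object, not the file's text
fhand.close()  # closing files is covered later in this section
```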

If the call to open() is successful, it returns a file object. The file object is not the actual data contained in the file, but instead it has a “handle” that it can use to access the data. You can use the object by calling its methods via dot notation, just like with other objects. You are given a file object if the requested file exists and you have the proper permissions to read the file.

A file object with file handle

If the file does not exist, open() will fail with a traceback and you will not get a file object to access the contents of the file:
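The failure can be observed with a sketch like the one below. It assumes no file with the hypothetical name no_such_file_12345.txt exists in the current folder. The try/except wrapper is an addition here, used only so the sketch finishes cleanly; without it, Python would stop and print the traceback.

```python
# Assumption: no file with this hypothetical name exists in the current folder.
opened = True
try:
    fhand = open('no_such_file_12345.txt')
except FileNotFoundError as err:
    # Without the try/except, the program would stop here with a traceback.
    opened = False
    print('Could not open the file:', err)

print(opened)  # False: no file object was returned
```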

5.3.4. Reading Files

While the file object does not contain the data for the file, it is quite easy to construct a for loop to read through and count each of the lines in a file:
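A sketch of such a line-counting loop. It runs against a small stand-in file, atotc_sample.txt (a hypothetical name), created in the first step from a few lines of atotc_opening2.txt, so the count it prints is for the stand-in rather than the full file.

```python
from pathlib import Path

# Assumption: tiny stand-in file (hypothetical name) so the sketch runs on its own.
Path('atotc_sample.txt').write_text(
    'It was the best of times,\n'
    'it was the worst of times,\n'
    'it was the age of wisdom,\n'
    'we had everything before us,\n'
)

fhand = open('atotc_sample.txt')
count = 0
for line in fhand:     # each iteration provides the next line of the file
    count = count + 1
fhand.close()

print('Line Count:', count)  # 4 for the stand-in file
```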

Note

When run in this book's in-browser interpreter, the line-counting code reports 36 lines for atotc_opening2.txt, despite the file having 37. This appears to be a bug in the Python interpreter used to run code in the browser: it skips the final line. You can see this by adding print(line) inside the for loop and comparing the output to the file data above.

The code will work correctly in any “normal” Python interpreter.

We can use the file object as the sequence in a for loop, and each element in the sequence will be another line from the file. This may feel a bit odd, but file objects have all kinds of functionality built in to them, and acting as a sequence for a for loop is just one of the things they can do.

The for loop above counts the number of lines in the file and prints the count. The rough translation of the for loop into English is, “for each line in the file represented by the file object, add one to the count variable.”

When the file is read using a for loop in this manner, Python takes care of splitting the data in the file into separate lines using the newline character. Python reads each line through the newline and includes the newline as the last character in the line variable for each iteration of the for loop.

Because the for loop reads the data one line at a time, it can efficiently read and count the lines in very large files without running out of main memory to store the data. The above program can count the lines in any size file using very little memory since each line is read, counted, and then discarded.

If you know the file is relatively small compared to the size of your main memory, you can read the whole file into one string using the read() method of the file object.

In this example, the entire contents (all 1,862 characters) of the file atotc_opening2.txt are read directly into the variable contents. We use string slicing to print out the first 20 characters of the string data stored in contents.
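A sketch of reading a whole file with read() and slicing the result. It uses a small stand-in file, atotc_sample.txt (a hypothetical name), so the character count it prints is far smaller than the one for the full atotc_opening2.txt.

```python
from pathlib import Path

# Assumption: tiny stand-in file (hypothetical name) so the sketch runs on its own.
Path('atotc_sample.txt').write_text(
    'It was the best of times,\n'
    'it was the worst of times,\n'
    'it was the age of wisdom,\n'
    'we had everything before us,\n'
)

fhand = open('atotc_sample.txt')
contents = fhand.read()  # the entire file as one string
fhand.close()

print(len(contents))     # number of characters, newlines included
print(contents[:20])     # the first 20 characters
```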

When the file is read in this manner, all the characters including all of the lines and newline characters are one big string in the variable contents. Using the read() method like this will not work well for really large files (bigger than 100 megabytes, perhaps), because they might not fit in the computer’s memory. For such large files, it will be better to process the file line by line in a loop (which doesn’t read the entire file all at once) or to use more sophisticated tools.

5.3.5. Closing Files

When a program is done using a file, it should close the file using the close() method of the file object. This will release resources in the computer and make sure everything is cleaned up correctly.

Using a file object after it has been closed will not work:
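A sketch of what happens, using a hypothetical stand-in file. The try/except wrapper is an addition so the sketch runs to the end; without it, the read() call would stop the program with a ValueError traceback.

```python
from pathlib import Path

# Assumption: tiny stand-in file (hypothetical name) so the sketch runs on its own.
Path('atotc_sample.txt').write_text('It was the best of times,\n')

fhand = open('atotc_sample.txt')
fhand.close()

failed = False
try:
    fhand.read()  # the file is already closed, so this raises ValueError
except ValueError as err:
    failed = True
    print('Error:', err)
```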

To make sure a file is always closed and cleaned up, it is safest to use the with syntax:
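A minimal sketch of the with syntax applied to the line-counting loop, again using a hypothetical stand-in file created in the first step.

```python
from pathlib import Path

# Assumption: tiny stand-in file (hypothetical name) so the sketch runs on its own.
Path('atotc_sample.txt').write_text(
    'It was the best of times,\n'
    'it was the worst of times,\n'
    'it was the age of wisdom,\n'
    'we had everything before us,\n'
)

with open('atotc_sample.txt') as fhand:
    count = 0
    for line in fhand:
        count = count + 1

# No explicit close() call is needed: the file was closed when the body ended.
print(fhand.closed)          # True
print('Line Count:', count)  # 4 for the stand-in file
```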

Whenever the body of the with statement (the indented lines below it) exits, for any reason, the file object created by the open() call will automatically be closed.

Syntax Pattern

A with statement has the form:

with <expression> as <var>:
    <body>

When Python interprets this syntax, it evaluates the expression and stores the result in the variable <var>. It then executes the body. Upon leaving the body for any reason (an error, a return statement, or just reaching the end), the object stored in <var> will automatically be closed.

[Technically, we are omitting some details here, and there is more involved than described. The documentation provides the full details. The description above is sufficient for using the syntax, though, and additional details would just complicate matters at this point.]

We will use the with syntax in the rest of the examples here, though manually opening and closing a file (with open() and .close()) would work as well.

5.3.6. Searching Through a File

When you are searching through data in a file, it is a very common pattern to read through a file, ignoring most of the lines and only processing lines which meet a particular condition. We can combine the pattern for reading a file with string methods to build simple search mechanisms.

For example, if we wanted to read a file and only print out lines which started with the prefix 'it', we could use the string method startswith() to select only those lines with the desired prefix:
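A sketch of that search loop. It runs on a small stand-in file, atotc_sample.txt (a hypothetical name), so it prints fewer lines than the full atotc_opening2.txt would. The list matches is an addition that records which lines were selected.

```python
from pathlib import Path

# Assumption: tiny stand-in file (hypothetical name) so the sketch runs on its own.
Path('atotc_sample.txt').write_text(
    'It was the best of times,\n'
    'it was the worst of times,\n'
    'it was the age of wisdom,\n'
    'we had everything before us,\n'
)

matches = []
with open('atotc_sample.txt') as fhand:
    for line in fhand:
        if line.startswith('it'):  # case-sensitive, so 'It was...' is skipped
            print(line)
            matches.append(line)
```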

When this program runs on atotc_opening2.txt, we get the following output:

it was the worst of times,

it was the age of wisdom,

it was the age of foolishness,

it was the epoch of belief,

it was the epoch of incredulity,

it was the season of Light,

it was the season of Darkness,

it was the spring of hope,

it was the winter of despair,

its noisiest authorities insisted on its being received, for good or for

The output looks correct since the only lines we are seeing are those which start with 'it', but why are we seeing the extra blank lines? This is due to invisible newline characters. Each of the lines in the file ends with a newline, so the line variable will as well. The print() function prints the string in the variable line, including its newline character, and then print() adds its own newline, resulting in the double spacing effect we see.

We could use string slicing to print all but the last character, but a simpler approach is to use the rstrip() method, which strips whitespace (including newline characters) from the right side of a string:
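A sketch of the same search with rstrip() applied to each line before printing, using a hypothetical stand-in file. The list matches is an addition that records the stripped lines.

```python
from pathlib import Path

# Assumption: tiny stand-in file (hypothetical name) so the sketch runs on its own.
Path('atotc_sample.txt').write_text(
    'It was the best of times,\n'
    'it was the worst of times,\n'
    'it was the age of wisdom,\n'
    'we had everything before us,\n'
)

matches = []
with open('atotc_sample.txt') as fhand:
    for line in fhand:
        line = line.rstrip()       # drop the trailing newline
        if line.startswith('it'):
            print(line)            # no blank line after each match now
            matches.append(line)
```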

As your file processing programs get more complicated, you may want to structure your search loops using continue. The continue keyword skips the rest of a loop body and goes to the next iteration. When processing a file in a loop, you can use continue in cases where you don’t want to process a line. The basic idea of the search loop is that you are looking for “interesting” lines and effectively skipping “uninteresting” lines. And then when we find an interesting line, we do something with that line.

We can structure the loop to follow the pattern of skipping uninteresting lines as follows:
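A sketch of the skip-the-uninteresting-lines pattern, using a hypothetical stand-in file. The list matches is an addition that records the lines that were processed.

```python
from pathlib import Path

# Assumption: tiny stand-in file (hypothetical name) so the sketch runs on its own.
Path('atotc_sample.txt').write_text(
    'It was the best of times,\n'
    'it was the worst of times,\n'
    'it was the age of wisdom,\n'
    'we had everything before us,\n'
)

matches = []
with open('atotc_sample.txt') as fhand:
    for line in fhand:
        line = line.rstrip()
        # Skip "uninteresting" lines
        if not line.startswith('it'):
            continue
        # Process the "interesting" line
        print(line)
        matches.append(line)
```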

The output of the program is the same. In English, the uninteresting lines are those which do not start with 'it', which we skip using continue. For the “interesting” lines (i.e., those that start with 'it') we perform the processing on those lines.

We can use the find() string method to find lines where a search string is anywhere in the line. Since find() looks for an occurrence of a string within another string and either returns the position of the string or -1 if the string was not found, we can write the following loop to show lines which contain the string 'for':
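A sketch of such a loop, using a hypothetical stand-in file. Note that find() matches 'for' anywhere in the line, including inside a longer word such as "before". The list matches is an addition that records the selected lines.

```python
from pathlib import Path

# Assumption: tiny stand-in file (hypothetical name) so the sketch runs on its own.
Path('atotc_sample.txt').write_text(
    'It was the best of times,\n'
    'it was the worst of times,\n'
    'it was the age of wisdom,\n'
    'we had everything before us,\n'
)

matches = []
with open('atotc_sample.txt') as fhand:
    for line in fhand:
        line = line.rstrip()
        if line.find('for') == -1:  # -1 means 'for' does not occur in the line
            continue
        print(line)
        matches.append(line)
```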

5.3.7. Writing Files

To write data into a new file or to overwrite an old file, we open the file with a mode value 'w' (for ‘w’rite mode) as the second argument to the open() function call:
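A minimal sketch of opening a file in write mode. The file name output.txt matches the one used in this book; the variable name fout is illustrative. Running it creates output.txt in the current folder, or empties it if it already exists.

```python
# The second argument 'w' asks for write mode
fout = open('output.txt', 'w')
print(fout)   # a description of the file object, including its mode
fout.close()
```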

If the file doesn’t exist, a new one is created and opened. If the file already exists, opening it in write mode deletes the old data and starts fresh, so be careful!

Once the file is opened we can use the write() method of the file object to put data into the file. The write() method writes characters into the file and then returns the number of characters written, though the return value is rarely used or important.

The file object keeps track of where it is, so if you call write() again, it will add the new data to the end of the file.

We must make sure to manage the ends of lines as we write to the file by explicitly inserting the newline character when we want to end a line. The print() function automatically appends a newline, but the write() method does not add the newline automatically. If you write strings into a file without adding newline characters, they will all end up as one long line, which is probably not what you want.
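A sketch showing write() calls with and without a trailing newline. The strings written are made up for illustration, and the file is read back at the end only to show what was stored.

```python
with open('output.txt', 'w') as fout:
    written = fout.write('first line\n')  # returns the number of characters written
    fout.write('second ')                 # no newline, so the next write continues this line
    fout.write('line\n')

with open('output.txt') as fin:
    contents = fin.read()

print(written)         # 11
print(repr(contents))  # 'first line\nsecond line\n'
```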

Note

Both of the above code examples write to a file. When run in this book, the new file will show up as a text box labeled output.txt. The second example will write data into the file text box created by the first example. Scroll up if it has gone off the page.

Closing files is especially important after writing data into them. Data might not be physically written to the secondary memory until close() is called, and it remains in danger of being lost if the computer loses power.

Again, using the with syntax ensures the file is closed automatically. Otherwise, be sure to add a call to the close() method when the program is done writing to the file.

5.3.8. Debugging

When you are reading and writing files, you might run into problems with whitespace. These errors can be hard to debug because spaces, tabs (written in string constants as \t), and newlines are normally invisible:

The built-in function repr() can help by explicitly showing you the ‘invisible’ characters in your file. repr() takes any object as an argument and returns a string representation of the object. If we pass in a string, repr() returns that string with ‘invisible’ characters shown as backslash sequences:
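A short sketch comparing plain printing with repr(). The string s is made up for illustration.

```python
s = '1 2\t 3\n 4'

print(s)        # the tab and newline show up only as spacing and a line break
print(repr(s))  # '1 2\t 3\n 4' with the escapes visible
```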

This can be helpful when debugging.

If you are running code on different computers, one problem you might run into is that different systems use different characters to indicate the end of a line. Some systems use a newline, represented as \n. Others use a return character, represented as \r. Some use both. For now, you do not need to worry about this, but it is important to keep in mind that your code may function differently on different operating systems or computers because of these slight variations.

For most systems, there are applications to convert files from one format to another. You can find them (and read more about this issue that you wouldn’t think should be so complex) at en.wikipedia.org/wiki/Newline. Or, perhaps, you might write the code to do the conversion yourself.
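One possible sketch of such a conversion, working on a string rather than a file. The function name normalize_newlines and the sample string are hypothetical, and this is only one way to approach the problem.

```python
def normalize_newlines(text):
    # Replace the two-character ending first, then any remaining lone returns.
    return text.replace('\r\n', '\n').replace('\r', '\n')

mixed = 'one\r\ntwo\rthree\n'  # three different line endings in one string
print(repr(normalize_newlines(mixed)))  # 'one\ntwo\nthree\n'
```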