5.3. Files
5.3.1. Persistence
So far, we have learned how to write programs and communicate our intentions to the Central Processing Unit using conditionals, functions, and iterations. We have learned how to create and use data structures like lists and strings in the Main Memory. The CPU and memory are where our software works and runs. It is where all of the “thinking” happens.
But if you recall from our hardware discussions, once the power is turned off, anything stored in either the CPU or main memory is erased. So up until now, our programs have only produced temporary results.
In this chapter, we start to work with Secondary Memory, like hard drives, where files are stored. Secondary memory is not erased when the power is turned off, and it has much more capacity than the main memory. If we want to keep our data long-term or work with large datasets, we need to learn to access and store data in files.
Example code in this section will treat the data in the box below as a file named atotc_opening2.txt. As before, this contains opening lines from A Tale of Two Cities (a few more this time). You can edit the file below, and its changed contents will be used in any active code blocks.
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way-- in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only. There were a king with a large jaw and a queen with a plain face, on the throne of England; there were a king with a large jaw and a queen with a fair face, on the throne of France. In both countries it was clearer than crystal to the lords of the State preserves of loaves and fishes, that things in general were settled for ever. It was the year of Our Lord one thousand seven hundred and seventy-five. Spiritual revelations were conceded to England at that favoured period, as at this. Mrs. Southcott had recently attained her five-and-twentieth blessed birthday, of whom a prophetic private in the Life Guards had heralded the sublime appearance by announcing that arrangements were made for the swallowing up of London and Westminster. Even the Cock-lane ghost had been laid only a round dozen of years, after rapping out its messages, as the spirits of this very year last past (supernaturally deficient in originality) rapped out theirs. Mere messages in the earthly order of events had lately come to the English Crown and People, from a congress of British subjects in America: which, strange to relate, have proved more important to the human race than any communications yet received through any of the chickens of the Cock-lane brood.
5.3.2. Text Files and Lines
A text file can be thought of as a sequence of lines, much like a Python string can be thought of as a sequence of characters. For example, atotc_opening2.txt above contains 37 lines.
To break the file into lines, there is a special character that represents the “end of the line” called the newline character.
In Python, we represent the newline character as \n in string constants. (That’s a “backslash,” because it is leaning backwards.) Even though this looks like two characters, it is understood by Python as a single character.
If we print a string variable containing the newline character, we can see that it prints the string on two lines, moving to the beginning of the next line when the newline character is reached:
stuff = 'Hello\nWorld!'
print(stuff)
stuff = 'X\nY'
print(stuff)
print(len(stuff))
You can also see that the length of the string X\nY is three characters because the newline character is a single character.
So when we look at the lines in a file, we need to imagine that there is a special invisible character called the newline at the end of each line that marks the end of the line.
Note
The backslash \ character is an escape character, meaning it is used to indicate the start of a special character in a string. But what if you want to include an actual backslash followed by an n in a string? You can escape the backslash itself: "\\n" is a string with two characters: a backslash and an n.
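As a quick check of the difference between the two strings, the sketch below compares their lengths. The variable names are chosen only for this illustration.

```python
newline = "\n"        # one character: the newline
backslash_n = "\\n"   # two characters: a backslash and an n

print(len(newline))       # 1
print(len(backslash_n))   # 2
print(backslash_n)        # prints a backslash followed by an n
```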
5.3.3. Opening Files
When we want to read or write a file in a program, we first must open the file. When you open a file, you are asking the operating system to find the file by name, make sure the file exists, and prepare it to be read from or written to.
To open a file, we can use the open() function [[full documentation](https://docs.python.org/3/library/functions.html#open)]. In its simplest form, it takes one argument: a string containing the name of the file to open. In this example, we open the file (from above) atotc_opening2.txt:
file = open('atotc_opening2.txt')
print(file)
If the call to open() is successful, it returns a file object. The file object is not the actual data contained in the file, but instead it has a “handle” that it can use to access the data. You can use the object by calling its methods via dot notation, just like with other objects. You are given a file object if the requested file exists and you have the proper permissions to read the file.
If the file does not exist, open() will fail with a traceback and you will not get a file object to access the contents of the file:
file = open('stuff.txt')
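If a program should keep running when a file is missing, one possible approach is to catch the error. The sketch below assumes standard Python, where a missing file raises FileNotFoundError; it uses the try/except syntax, which this section does not otherwise rely on, and the helper name can_open is made up for this illustration.

```python
def can_open(filename):
    """Return True if filename can be opened for reading, False if it is missing."""
    try:
        file = open(filename)
    except FileNotFoundError:
        # open() raised an error instead of returning a file object
        print('Could not find', filename)
        return False
    file.close()
    return True

print(can_open('stuff.txt'))
```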
5.3.4. Reading Files
While the file object does not contain the data for the file, it is quite easy to construct a for loop to read through and count each of the lines in a file:
file = open('atotc_opening2.txt')
count = 0
for line in file:
    count = count + 1
print('Line Count:', count)
Note
The code above reports 36 lines, despite the file having 37. This appears to be a bug in the Python interpreter used to run code in the browser. It skips the final line for some reason. You can see this by adding print(line) inside the for loop and comparing the output to the file data above.
The code will work correctly in any “normal” Python interpreter.
We can use the file object as the sequence in a for loop, and each element in the sequence will be another line from the file. This may feel a bit odd, but file objects have all kinds of functionality built in to them, and acting as a sequence for a for loop is just one of the things they can do.
The for loop above counts the number of lines in the file and prints the count. The rough translation of the for loop into English is, “for each line in the file represented by the file object, add one to the count variable.”
When the file is read using a for loop in this manner, Python takes care of splitting the data in the file into separate lines using the newline character. Python reads each line through the newline and includes the newline as the last character in the line variable for each iteration of the for loop.
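To see the trailing newline for yourself, the sketch below collects each line into a list and prints the list, which displays the newline characters explicitly. It first creates a small file to read; the file name sample_lines.txt and its contents are assumptions made for this illustration, and opening a file with 'w' to write it is covered later in this section.

```python
# Create a small three-line file to read back (hypothetical name and contents).
out = open('sample_lines.txt', 'w')
out.write('first\nsecond\nthird\n')
out.close()

lines = []
file = open('sample_lines.txt')
for line in file:
    # Each line still ends with its newline character
    lines.append(line)
file.close()

print(lines)   # ['first\n', 'second\n', 'third\n']
```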
Because the for loop reads the data one line at a time, it can efficiently read and count the lines in very large files without running out of main memory to store the data. The above program can count the lines in any size file using very little memory since each line is read, counted, and then discarded.
If you know the file is relatively small compared to the size of your main memory, you can read the whole file into one string using the read() method of the file object.
file = open('atotc_opening2.txt')
contents = file.read()
print(len(contents))
print(contents[:20])
In this example, the entire contents (all 1,862 characters) of the file atotc_opening2.txt are read directly into the variable contents. We use string slicing to print out the first 20 characters of the string data stored in contents.
When the file is read in this manner, all the characters including all of the lines and newline characters are one big string in the variable contents.
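Because the newlines are ordinary characters inside that one big string, ordinary string methods apply to them. The sketch below is one way to check this, using a small file whose name (sample_read.txt) and contents are made up for the illustration; count() and splitlines() are standard string methods.

```python
# Create a small file to read back (hypothetical name and contents).
out = open('sample_read.txt', 'w')
out.write('one\ntwo\nthree\n')
out.close()

file = open('sample_read.txt')
contents = file.read()
file.close()

print(len(contents))           # 14: the letters plus three newline characters
print(contents.count('\n'))    # 3
print(contents.splitlines())   # ['one', 'two', 'three']
```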
Using the read() method like this will not work well for really large files (bigger than 100 megabytes, perhaps), because they might not fit in the computer’s memory. For such large files, it will be better to process the file line by line in a loop (which doesn’t read the entire file all at once) or to use more sophisticated tools.
5.3.5. Closing Files
When a program is done using a file, it should close the file using the close() method of the file object. This will release resources in the computer and make sure everything is cleaned up correctly.
Using a file object after it has been closed will not work:
# open a file
file = open('atotc_opening2.txt')
contents1 = file.read()
print(len(contents1))
# close the file
file.close()
# attempt to read from the same file object
contents2 = file.read()
print(len(contents2))
To make sure a file is always closed and cleaned up, it is safest to use the with syntax:
# open a file, and automatically close it when the with block exits
with open('atotc_opening2.txt') as file:
    contents1 = file.read()
    print(len(contents1))
Whenever the body of the with statement (the indented lines below it) exits, for any reason, the file object created by the open() call will automatically be closed.
Syntax Pattern
A with statement has the form:
with <expression> as <var>:
    <body>
When Python interprets this syntax, it evaluates the expression and stores the result in the variable <var>. It then executes the body. Upon leaving the body for any reason (an error, a return statement, or just reaching the end), the object stored in <var> will automatically be closed.
[Technically, we are omitting some details here, and there is more involved than described. The documentation provides the full details. The description above is sufficient for using the syntax, though, and additional details would just complicate matters at this point.]
We will use the with syntax in the rest of the examples here, though manually opening and closing a file (with open() and .close()) would work as well.
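One way to confirm that with really does close the file is to inspect the closed attribute of the file object. This attribute is part of standard Python but is not otherwise used in this section, and it may not be available in the in-browser interpreter. The file name closed_demo.txt is an assumption for this sketch, and the second half uses try/except only to let the program continue after a deliberate error.

```python
with open('closed_demo.txt', 'w') as file:
    file.write('hello\n')
    inside = file.closed     # still open inside the body

outside = file.closed        # closed once the body has exited

# The file is also closed when the body exits because of an error.
try:
    with open('closed_demo.txt') as file2:
        number = int(file2.read())   # 'hello\n' is not a number, so this raises ValueError
except ValueError:
    pass
after_error = file2.closed

print(inside, outside, after_error)  # False True True
```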
5.3.6. Searching Through a File
When you are searching through data in a file, it is a very common pattern to read through a file, ignoring most of the lines and only processing lines which meet a particular condition. We can combine the pattern for reading a file with string methods to build simple search mechanisms.
For example, if we wanted to read a file and only print out lines which started with the prefix 'it', we could use the string method startswith() to select only those lines with the desired prefix:
with open('atotc_opening2.txt') as file:
    for line in file:
        if line.startswith('it'):
            print(line)
When this program runs, we get the following output:
it was the worst of times,

it was the age of wisdom,

it was the age of foolishness,

it was the epoch of belief,

it was the epoch of incredulity,

it was the season of Light,

it was the season of Darkness,

it was the spring of hope,

it was the winter of despair,

its noisiest authorities insisted on its being received, for good or for

The output looks correct since the only lines we are seeing are those which start with 'it', but why are we seeing the extra blank lines? This is due to invisible newline characters. Each of the lines in the file ends with a newline, so the line variable will as well. The print() function prints the string in the variable line, including its newline character, and then print() adds its own newline, resulting in the double spacing effect we see.
We could use string slicing to print all but the last character, but a simpler approach is to use the rstrip() method, which strips whitespace (including newline characters) from the right side of a string:
with open('atotc_opening2.txt') as file:
    for line in file:
        line = line.rstrip()
        if line.startswith('it'):
            print(line)
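The two approaches are not quite equivalent, which the sketch below tries to illustrate on hand-written strings rather than on the file. The sample strings are made up, repr() (described in the Debugging section) is used to make trailing whitespace visible, and passing '\n' to rstrip() is a standard option that limits stripping to newline characters.

```python
line = 'it was the age of wisdom,  \n'   # two trailing spaces, then a newline

print(repr(line[:-1]))          # 'it was the age of wisdom,  '  (spaces kept)
print(repr(line.rstrip()))      # 'it was the age of wisdom,'    (spaces removed too)
print(repr(line.rstrip('\n')))  # 'it was the age of wisdom,  '  (only the newline removed)

# Slicing is risky on a final line that has no newline: it removes a real character.
last_line = 'the winter of despair'
print(repr(last_line[:-1]))     # 'the winter of despai'
print(repr(last_line.rstrip())) # 'the winter of despair'
```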
As your file processing programs get more complicated, you may want to structure your search loops using continue. The continue keyword skips the rest of a loop body and goes to the next iteration. When processing a file in a loop, you can use continue in cases where you don’t want to process a line. The basic idea of the search loop is that you are looking for “interesting” lines and effectively skipping “uninteresting” lines. And then when we find an interesting line, we do something with that line.
We can structure the loop to follow the pattern of skipping uninteresting lines as follows:
with open('atotc_opening2.txt') as file:
    for line in file:
        line = line.rstrip()
        # Skip 'uninteresting lines'
        if not line.startswith('it'):
            continue
        # If we get here, the line wasn't skipped,
        # so we can process our 'interesting' line:
        print(line)
The output of the program is the same. In English, the uninteresting lines are those which do not start with 'it', which we skip using continue. The “interesting” lines (i.e., those that start with 'it') are the ones we go on to process.
We can use the find() string method to find lines where a search string is anywhere in the line. Since find() looks for an occurrence of a string within another string and either returns the position of the string or -1 if the string was not found, we can write the following loop to show lines which contain the string 'for':
with open('atotc_opening2.txt') as file:
    for line in file:
        line = line.rstrip()
        # Skip 'uninteresting lines'
        if line.find('for') == -1:
            continue
        # If we get here, the line wasn't skipped,
        # so we can process our 'interesting' line:
        print(line)
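The same test can also be written with Python's in operator, which some readers may find easier to read than comparing find() to -1. The sketch below is a self-contained variation on the loop above; the file name search_demo.txt and its three lines are assumptions for the illustration. Note that either form matches 'for' anywhere in the line, including inside a longer word such as "before".

```python
# Build a small sample file (hypothetical name and contents).
with open('search_demo.txt', 'w') as file:
    file.write('we had everything before us,\n')
    file.write('it was the age of wisdom,\n')
    file.write('for good or for evil,\n')

matches = []
with open('search_demo.txt') as file:
    for line in file:
        line = line.rstrip()
        # Skip lines that do not contain 'for' anywhere
        if 'for' not in line:
            continue
        matches.append(line)

print(matches)   # ['we had everything before us,', 'for good or for evil,']
```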
5.3.7. Writing Files
To write data into a new file or to overwrite an old file, we open the file with a mode value 'w' (for ‘w’rite mode) as the second argument to the open() function call:
with open('output.txt', 'w') as file:
    print(file)
If the file doesn’t exist, a new one is created and opened. If the file already exists, opening it in write mode deletes the old data and starts fresh, so be careful!
Once the file is opened we can use the write() method of the file object to put data into the file. The write() method writes characters into the file and then returns the number of characters written, though the return value is rarely used or important.
with open('output.txt', 'w') as file:
    line1 = "This here's the wattle,\n"
    file.write(line1)
The file object keeps track of where it is, so if you call write() again, it will add the new data to the end of the file.
We must make sure to manage the ends of lines as we write to the file by explicitly inserting the newline character when we want to end a line. The print() function automatically appends a newline, but the write() method does not add the newline automatically. If you write strings into a file without adding newline characters, they will all end up as one long line, which is probably not what you want.
with open('output.txt', 'w') as file:
    line1 = "This here's the wattle,\n"
    line2 = 'the emblem of our land.\n'
    file.write(line1)
    file.write(line2)
Note
Both of the above code examples write to a file. When run in this book, the new file will show up as a text box labeled output.txt. The second example will write data into the file text box created by the first example. Scroll up if it has gone off the page.
Closing files is especially important after writing data into them. Data might not be physically written to the secondary memory until close() is called, and it remains in danger of being lost if the computer loses power. Again, using the with syntax ensures the file is closed automatically. Otherwise, be sure to add a call to the close() method when the program is done writing to the file.
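A simple way to convince yourself that the data reached the file is to open it again for reading once the writing with block has finished. The sketch below does this with a separate file name (wattle_check.txt, an assumption chosen so it does not disturb output.txt) and also keeps the values returned by write() to show that they are character counts.

```python
line1 = "This here's the wattle,\n"
line2 = 'the emblem of our land.\n'

with open('wattle_check.txt', 'w') as file:
    count1 = file.write(line1)   # number of characters written
    count2 = file.write(line2)

# The file was closed when the with block ended, so the data is safe to read back.
with open('wattle_check.txt') as file:
    contents = file.read()

print(count1 == len(line1))        # True
print(contents == line1 + line2)   # True
```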
5.3.8. Debugging
When you are reading and writing files, you might run into problems with whitespace. These errors can be hard to debug because spaces, tabs (written in string constants as \t), and newlines are normally invisible:
s = '1 2\t 3\n 4'
print(s)
The built-in function repr() can help by explicitly showing you the ‘invisible’ characters in your file. repr() takes any object as an argument and returns a string representation of the object. If we pass in a string, repr() returns that string with ‘invisible’ characters shown as backslash sequences:
s = '1 2\t 3\n 4'
print(repr(s))
This can be helpful when debugging.
If you are running code on different computers, one problem you might run into is that different systems use different characters to indicate the end of a line. Some systems use a newline, represented \n. Others use a return character, represented \r. Some use both. For now, you do not need to worry about this, but it is important to keep in mind that your code may function differently on different operating systems or computers because of these slight variations.
For most systems, there are applications to convert files from one format to another. You can find them (and read more about this issue that you wouldn’t think should be so complex) at en.wikipedia.org/wiki/Newline. Or, perhaps, you might write the code to do the conversion yourself.
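As a starting point for writing the conversion yourself, here is a minimal sketch that works on a string rather than directly on a file. The function name normalize_newlines and the sample string are made up for this illustration, and the sketch only covers the \r\n and \r conventions mentioned above.

```python
def normalize_newlines(text):
    """Return text with '\r\n' and lone '\r' line endings converted to '\n'."""
    # Replace the two-character ending first, so '\r\n' does not become two newlines
    text = text.replace('\r\n', '\n')
    text = text.replace('\r', '\n')
    return text

mixed = 'one\r\ntwo\rthree\n'
print(repr(normalize_newlines(mixed)))   # 'one\ntwo\nthree\n'
```

The order of the two replace() calls matters: converting lone \r characters first would turn each \r\n into a pair of newlines.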