5.1. Strings

We’ve been using strings from the beginning, and so you already have a general idea of what they are and know how to do a few things with them. Here, we’ll go into more detail and show you more ways you can use and manipulate strings.

5.1.1. A String is a Sequence

The first detail to add is that a string is a sequence. And where have we seen “sequence” before? Earlier, we saw that a sequence can be given to a for loop, and the loop will iterate over each of the elements in the given sequence.

If a for loop accepts any sequence, and if a string is a sequence, then something like the following should work:
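A minimal sketch of such a loop, assuming a string variable named fruit holding 'banana' (both the name and the value are illustrative choices):

```python
# 'banana' is an illustrative value; any string works here.
fruit = 'banana'

# A string can be given to a for loop just like any other sequence.
for char in fruit:
    print(char)
```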

Try it out. What can we learn from that?

We see that it prints out each individual character in the string. That is, each time the for loop gets a new item from the string, it gets a single character. Therefore, we can see that a string is a sequence of characters.

There are many other tools we can use with sequences, and we’ll go over several below. Most of them apply to any sequence, not just strings. Recall that lists are sequences as well, so most of the following tools and patterns work with lists just like they do with strings. We’ll focus on strings here and get to lists shortly.

5.1.1.1. Indexing

In addition to iterating over a sequence in a for loop, there are other things we can do with sequences. One of the most important is called indexing. Indexing is a tool that lets us get a single element out of a sequence. In the case of a string, it lets us get a single character out of the string.

To perform indexing, we use the [] or “bracket” operator with an integer:
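A short sketch of indexing, assuming the variable names fruit and letter used in the discussion that follows:

```python
fruit = 'banana'

# The bracket operator with an integer selects one character.
letter = fruit[1]
print(letter)  # a
```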

The second statement extracts the character at index position 1 from the fruit variable and assigns it to the letter variable. The expression in the brackets is called an index. The index indicates which character in the sequence you want (hence the name).

But why did it print ‘a’ and not ‘b’? For most people, the first letter of ‘banana’ is ‘b,’ not ‘a.’ But in Python and most other programming languages, an index is an offset from the beginning of the string, and the offset of the first letter is zero.

So ‘b’ is the letter “at index [or position] 0” of ‘banana,’ ‘a’ is the letter “at index 1,” and ‘n’ is the letter “at index 2.”

Figure: String Indexes

You can use any expression, including variables and operators, as an index, but the value of the index has to be an integer. Otherwise you get:

>>> letter = fruit[1.5]
TypeError: string indices must be integers

5.1.1.2. Using len() with Strings

Recall the len() built-in function. We can now see that it always returns the number of elements in a sequence. If the sequence we give it is a string, we get back the number of characters in the string.

To get the last letter of a string, you might be tempted to try something like this:
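A sketch of that tempting but incorrect attempt, assuming fruit holds 'banana'. The try/except wrapper is an addition here so that the example runs to completion and shows the error message instead of stopping the program:

```python
fruit = 'banana'
length = len(fruit)  # 6

message = ''
try:
    # Index 6 is one past the last valid index, so this fails.
    last = fruit[length]
except IndexError as err:
    message = str(err)
    print('IndexError:', message)
```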

The reason for the IndexError is that there is no letter in “banana” with the index 6. Since we started counting at zero, the six letters are numbered 0 to 5. To get the last character, you have to subtract 1 from length:
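A corrected sketch, again assuming fruit holds 'banana' and that the length is stored in a variable named length:

```python
fruit = 'banana'
length = len(fruit)

# The last valid index is always one less than the length.
last = fruit[length - 1]
print(last)  # a
```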

Alternatively, you can use negative indices, which count backward from the end of the string. The expression fruit[-1] yields the last letter, fruit[-2] yields the second to last, and so on.
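For instance, assuming fruit holds 'banana' as before:

```python
fruit = 'banana'

# Negative indices count backward from the end of the string.
print(fruit[-1])  # a
print(fruit[-2])  # n
```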

5.1.1.3. Traversal Through a String with a Loop

A lot of computations involve processing a string one character at a time. Often they start at the beginning, select each character in turn, do something to it, and continue until the end. This pattern of processing is called a traversal. We’ve seen above that we can accomplish this with a for loop, using a string as its sequence. Another way to write a traversal is with a while loop:
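A minimal sketch of such a while loop, assuming the variable names fruit and index used in the explanation that follows:

```python
fruit = 'banana'
index = 0

# Keep going as long as index is still a valid index of fruit.
while index < len(fruit):
    letter = fruit[index]
    print(letter)
    index = index + 1
```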

This loop traverses the string and displays each letter on a line by itself. The loop condition is index < len(fruit), which can be considered to be saying, “As long as index is still a valid index of fruit” because all valid indexes are less than the length of the string. So when index is equal to the length of the string, the condition is false, and the loop stops executing.

With each value for index counting up from 0, the body of the loop uses indexing to get the character at that index from the string, and it prints it out.

Check your understanding

Write a while loop that starts at the last character in the string and works its way backwards to the first character, printing each letter on a separate line, so that the string appears in reverse order.

5.1.1.4. Slicing

If we want a portion of a string, rather than a single character, we can use slicing. A segment of a string is called a slice. Selecting a slice is similar to selecting a character:
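A short sketch, using an illustrative string stored in a variable named s:

```python
s = 'Monty Python'

# Characters from index 0 up to, but not including, index 5.
print(s[0:5])   # Monty

# Characters from index 6 up to, but not including, index 12.
print(s[6:12])  # Python
```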

To perform slicing, place a : inside the [] brackets with an index written before and after it. The operator returns the portion of the string from the first index up to but not including the second index.

If you omit the first index (before the colon), the slice starts at the beginning of the string. If you omit the second index, the slice goes to the end of the string:
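For example, assuming fruit holds 'banana':

```python
fruit = 'banana'

# No first index: start at the beginning of the string.
print(fruit[:3])  # ban

# No second index: go to the end of the string.
print(fruit[3:])  # ana
```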

5.1.1.5. Strings are Immutable

It is tempting to use the indexing operator on the left side of an assignment, with the intention of changing a character in a string. For example:

>>> greeting = 'Hello, world!'
>>> greeting[0] = 'J'
TypeError: 'str' object does not support item assignment

The “object” in this case is the string and the “item” is the character you tried to assign. For now, an object is the same thing as a value, but we will refine that definition later. An item is one of the values in a sequence.

The reason for the error is that strings are immutable, which means you can’t change an existing string. The best you can do is create a new string that is a variation on the original:
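A minimal sketch of building such a variation, assuming the greeting variable from the example above and a hypothetical name new_greeting for the result:

```python
greeting = 'Hello, world!'

# Build a new string from a new first letter and a slice of the old one.
new_greeting = 'J' + greeting[1:]

print(new_greeting)  # Jello, world!
print(greeting)      # Hello, world!
```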

This example concatenates a new first letter onto a slice of greeting. It has no effect on the original string.

5.1.1.6. Looping and Counting

The following program counts the number of times the letter ‘a’ appears in a string:
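A sketch of such a program, assuming the string is stored in a variable named word and the running total in count:

```python
word = 'banana'
count = 0

# Look at each character and add 1 whenever it is an 'a'.
for letter in word:
    if letter == 'a':
        count = count + 1

print(count)  # 3
```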

This program demonstrates another pattern of computation called a counter. The variable count is initialized to 0 and then incremented each time an “a” is found. When the loop exits, count contains the result: the total number of a’s. We used this pattern back in the word count example program.

5.1.1.7. The in Operator

The word in is a Boolean operator that takes two strings and returns True if the first appears as a substring in the second:

>>> 'a' in 'banana'
True
>>> 'seed' in 'banana'
False

The in operator is commonly used in conditionals, as demonstrated in the following example:
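One possible sketch, where the variable names word and found and the printed message are illustrative assumptions:

```python
word = 'banana'
found = False

# The condition is True because 'nan' appears inside 'banana'.
if 'nan' in word:
    found = True
    print('Found it!')
```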

5.1.2. String Comparison

The comparison operators work on strings. To see if two strings are equal:
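A minimal sketch, assuming a variable named word; the printed message is an illustrative choice:

```python
word = 'banana'
is_banana = (word == 'banana')

# == is True only when both strings contain exactly the same characters.
if is_banana:
    print('All right, bananas.')
```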

Other comparison operations are useful for putting words in alphabetical order:
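A sketch of such a comparison. The value of word is assigned directly here as a stand-in for user input, and the message variable is a hypothetical name used for illustration:

```python
# Stand-in for user input, e.g. word = input('Enter a word: ')
word = 'Pineapple'

if word < 'banana':
    message = 'Your word, ' + word + ', comes before banana.'
elif word > 'banana':
    message = 'Your word, ' + word + ', comes after banana.'
else:
    message = 'All right, bananas.'

print(message)
```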

Python does not handle uppercase and lowercase letters the same way that people do. All the uppercase letters come before all the lowercase letters, so if you enter “Pineapple,” for example:

Your word, Pineapple, comes before banana.

A common way to address this problem is to convert strings to a standard format, such as all lowercase, before performing the comparison. The next section includes a way to do that.

5.1.3. String Objects and Methods

Strings in Python can do a lot more than just hold a sequence of characters. Strings are an example of Python objects.

Definition

An object contains both data and methods, which are functions that are built into the object and can modify or perform operations on it.

As another way of putting it, objects “know things” and “can do things”:

  • Objects “know things”: an object holds data.

  • Objects “can do things”: an object contains code (the methods).

In the case of a string object, the object’s data is the characters of the string itself. And there are a few ways to learn about what methods (code) it contains.

Python has a built-in function called dir() that lists the methods available in an object, and another called type() that shows the type of an object. The output below is abbreviated: dir() also lists special names that begin and end with double underscores.

>>> stuff = 'Hello world'
>>> type(stuff)
<class 'str'>
>>> dir(stuff)
['capitalize', 'casefold', 'center', 'count', 'encode',
'endswith', 'expandtabs', 'find', 'format', 'format_map',
'index', 'isalnum', 'isalpha', 'isdecimal', 'isdigit',
'isidentifier', 'islower', 'isnumeric', 'isprintable',
'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower',
'lstrip', 'maketrans', 'partition', 'replace', 'rfind',
'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip',
'split', 'splitlines', 'startswith', 'strip', 'swapcase',
'title', 'translate', 'upper', 'zfill']

While the dir() function lists the methods, a better source of documentation for string methods is the official Python documentation: https://docs.python.org/3/library/stdtypes.html#string-methods.

Note

The official Python documentation uses a syntax that might be confusing. For example, in find(sub[, start[, end]]), the brackets indicate optional arguments. So sub is required, but start is optional, and if you include start, then end is optional.

Like any other function, a method is executed by calling it. Calling a method is similar to calling a function (it takes arguments and can return a value), but to access a method within an object, we use dot notation, just as we do when accessing functions within modules.

For example, the method upper() takes a string and returns a new string with all uppercase letters:
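A short sketch, assuming the string is stored in a variable named word and the result in a hypothetical variable new_word:

```python
word = 'banana'

# upper() returns a new string; word itself is unchanged.
new_word = word.upper()

print(new_word)  # BANANA
print(word)      # banana
```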

This form of dot notation specifies the name of the method, upper(), and the name of the string to apply the method to, word. The parentheses are empty because this method takes no arguments.

The find() string method searches for the position of one string within another:
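A minimal sketch, assuming word holds 'banana':

```python
word = 'banana'

# find() returns the index where the first match begins.
index = word.find('a')
print(index)  # 1
```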

In this example, we invoke find() on word and pass in the string we are looking for as an argument.

The find() method can optionally take a second argument: the index where it should start searching.
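A sketch of a few such calls, assuming word holds 'banana'; the search strings are illustrative:

```python
word = 'banana'

# With no start index, the search begins at index 0.
print(word.find('na'))      # 2

# Starting at index 3 skips the first 'na' and finds the second.
print(word.find('na', 3))   # 4

# 'nan' begins at index 2, so a search starting at index 4 misses it.
print(word.find('nan', 4))  # -1
```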

The final call to find() there returns -1 to indicate the search string was not found. The string 'nan' is present in 'banana', but the second argument started the search at index 4, beyond where 'nan' starts.

One common task is to remove white space (spaces, tabs, or newlines) from the beginning and end of a string using the strip() method:
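For example, with an illustrative string stored in line and the result in a hypothetical variable stripped:

```python
line = '  Here we go  '

# strip() returns a new string without leading or trailing white space.
stripped = line.strip()
print(stripped)  # Here we go
```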

Some methods such as startswith() return Boolean values.
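A short sketch, assuming line holds an illustrative sentence:

```python
line = 'Have a nice day'

# startswith() answers a yes-or-no question, so it returns True or False.
print(line.startswith('Have'))  # True
print(line.startswith('nice'))  # False
```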

Note that startswith() requires case to match, so sometimes we take a line and map it all to lowercase before we do any checking using the lower() method.

>>> line = 'Have a nice day'
>>> line.startswith('h')
False
>>> line.lower()
'have a nice day'
>>> line.lower().startswith('h')
True

Check your understanding

There is a string method called count() that counts the occurrence of one string within another. Read about this method in Python string method documentation and write a short program that uses count() to count the number of times the letter ‘a’ occurs in a string the user types in.

5.1.4. Parsing Strings

Often, we want to look into a string and find a substring. For example, if we were presented a series of lines formatted as follows:

From stephen.marquard@uct.ac.za   Sat Jan  5 09:14:16 2008

and we wanted to pull out only the second half of the address (i.e., uct.ac.za) from each line, we could do this by using the find() method and string slicing.

First, we will find the position of the at-sign in the string. Then we will find the position of the first space after the at-sign. And then we will use string slicing to extract the portion of the string which we are looking for.
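The steps above can be sketched as follows. The variable names data, atpos, sppos, and host are assumptions chosen for illustration:

```python
data = 'From stephen.marquard@uct.ac.za   Sat Jan  5 09:14:16 2008'

# Step 1: position of the at-sign.
atpos = data.find('@')
print(atpos)  # 21

# Step 2: position of the first space at or after the at-sign.
sppos = data.find(' ', atpos)
print(sppos)  # 31

# Step 3: slice from one beyond the at-sign up to the space.
host = data[atpos + 1:sppos]
print(host)   # uct.ac.za
```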

We use the optional arguments for the find() method that allow us to specify the position in the string where we want find() to start searching. When we slice, we extract the characters from “one beyond the at-sign” and up to but not including the index of the next space character.

The documentation for the find method describes the optional arguments.

5.1.5. String Formatting

The format() string method is one of the most commonly used. It allows us to construct strings, replacing parts of the strings with the data stored in variables or calculated in expressions. Let’s look at an example:
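A minimal sketch, where the variable names and values are illustrative assumptions:

```python
name = 'Ada'
age = 36

# Each {} is replaced, in order, by the arguments given to format().
message = 'My name is {} and I am {} years old.'.format(name, age)
print(message)  # My name is Ada and I am 36 years old.
```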

The syntax might look a little strange. It is calling the format() method via dot notation, but instead of writing a string variable to the left of the dot, we wrote a string literal. This is commonly how format() is called.

The string literal (from which the format() method is called) is known as a format string. The format string should contain one or more placeholders, written as {} (known as “curly braces”). Each argument given to the format() method then replaces the corresponding placeholder, in order.

Each placeholder can contain information inside the {} curly braces that specifies how the value included there should be formatted. There are many options for controlling how the string is formatted, but the more commonly-used options control how floating point values are printed and allow for aligning values in columns. The following example demonstrates both.
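A sketch that prints a small table of numbers and their square roots. The choice of data and the names numbers, lines, and root are illustrative assumptions:

```python
numbers = [1, 5, 10, 50]
lines = []

for n in numbers:
    root = n ** 0.5
    # First value: right-aligned in 5 characters.
    # Second value: floating point with 3 digits after the decimal point.
    lines.append("{:>5}  {:5.3f}".format(n, root))

for line in lines:
    print(line)
```

Every line of the output has the same length, so the two columns line up.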

In the format string here, "{:>5}  {:5.3f}", the first placeholder is {:>5}. The : starts the formatting options, the > makes the value “right-aligned,” and the 5 sets the minimum number of characters the value occupies. So the value takes up at least 5 characters, and it is placed on the right-hand side of that space. The second placeholder, {:5.3f}, again uses at least 5 characters, and the .3f formats the value as a floating point number with 3 digits after the decimal point, regardless of the value itself. By default, strings are left-aligned and numbers are right-aligned. Try changing some of the values in the placeholders to see how it affects the formatting.

There are many more options for controlling what is included in the string and how it is formatted. You can see the full set of options in the “Format String Syntax” documentation, and the examples provided can help you learn about additional features.

Using string formatting is often easier than building strings by concatenating different pieces and provides more control than including multiple arguments in a plain print() statement. For example:
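A sketch comparing the three approaches, with illustrative variable names and values:

```python
name = 'Ada'
score = 91.456

# Concatenation: non-string values must be converted with str().
by_concat = 'Student ' + name + ' scored ' + str(score) + ' points.'
print(by_concat)

# Multiple print() arguments: a space is inserted between each one,
# and there is no control over how the number is displayed.
print('Student', name, 'scored', score, 'points.')

# format(): one readable template, with control over the number's format.
by_format = 'Student {} scored {:.1f} points.'.format(name, score)
print(by_format)
```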