4.3. Common Comprehension Patterns for a Single Sequence

List comprehensions can replace for loops in many (most) situations. In the following sections, we will highlight some useful techniques for describing a new list using comprehensions.

4.3.1. Combine information in tuples

We start with useful pattern for list comprehensions involves combining information into a tuple. For example, lets compute the cube-root of all the numbers between 1 and 10 and use a tuple to store both the original number and the value of the cube-root.

In [1]: cube_root = lambda n: n**(1/3)

In [2]: L = [(val, cube_root(val)) for val in range(1,10)]

In [3]: L
Out[3]: 
[(1, 1.0),
 (2, 1.2599210498948732),
 (3, 1.4422495703074083),
 (4, 1.5874010519681994),
 (5, 1.7099759466766968),
 (6, 1.8171205928321397),
 (7, 1.912931182772389),
 (8, 2.0),
 (9, 2.080083823051904)]

When applying a list comprehension to a list of tuples, we can save each of the values to a separate variable by providing a comma-separated sequence of variables between for and in, as illustrated below.

In [4]: cube_root_less_than_2 = [val for val, cube_root in L if cube_root < 2]

In [5]: cube_root_less_than_2
Out[5]: [1, 2, 3, 4, 5, 6, 7]

This approach will work for any list of tuples regardless of the length of the tuples, provided that the number of variables matches the length of the tuples.

4.3.2. Use built-in helper functions

The functions enumerate and zip both exhibit the pattern from the last section. The enumerate function can be used to return both the index and value of each element of a sequence.

In [6]: L = [1,2,3,4,5,6]

In [7]: pairs = [(ind, val) for ind, val in enumerate(L)]

In [8]: pairs
Out[8]: [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6)]

This can be useful if we are describing a transformation that involves both the value and the position of the value in the sequence. For example, suppose we want to add 3 to the first 3 values of L.

In [9]: new_list = [val + 3 if ind < 3 else val for ind, val in enumerate(L)]

In [10]: new_list
Out[10]: [4, 5, 6, 4, 5, 6]

Without enumerate, we would have had to focus on the index and used the indexing operator to access the value.

In [11]: new_list = [L[ind] + 3 if ind <= 3 else L[ind] for ind in range(len(L))]

In [12]: new_list
Out[12]: [4, 5, 6, 7, 5, 6]

Clearly using enumerate led to a simpler and easier to read construction. Be sure to use this function whenever you need both the index and value of a sequence.

Check your Understanding

    rec-5-43: What will be printed by executing the following code block?

    L = list(enumerate(range(3)))
    print(L)
    
  • (A) [(1, 0), (2, 1), (3, 2)]
  • Remember that Python starts counting at 0! So does enumerate.
  • (B) [(0, 1), (1, 2), (2, 3)]
  • Remember that Python starts counting at 0! So does range.
  • (C) [(0, 0), (1, 1), (2, 2)]
  • (D) [(0, 1, 2), (0, 1, 2)]
  • Enumerate pairs the index with the original value.

    rec-5-44: What will be printed by executing the following code block?

    L = list(enumerate(range(1,4)))
    print(L)
    
  • (A) [(1, 0), (2, 1), (3, 2)]
  • Enumerate returns pairs with the index in the first entry and the value in the second.
  • (B) [(0, 1), (1, 2), (2, 3)]
  • (C) [(0, 0), (1, 1), (2, 2)]
  • This range function starts at 1 and goes up to (but not including) 4.
  • (D) [(0, 1, 2), (1, 2, 3)]
  • Enumerate pairs the index with the original value.

The zip function combines two sequences into one sequence of pairs.

In [13]: L = [1, 2, 3, 4]

In [14]: M = ["a", "b", "c"]

In [15]: new_list = [(Lval, Mval) for Lval, Mval in zip(L,M)]

In [16]: new_list
Out[16]: [(1, 'a'), (2, 'b'), (3, 'c')]

Notice that the length of the new list is the same as the shorter list. One example of an application of zip comes from probability. Suppose that \(X\) and \(Y\) are random variables that are the results of rolling fair 6 an 20 sided die, respectively. We wish to simulate the distribution of the sum of these two values. We can accomplish this by generating a number of trials for each the dice separately, then computing the sum using a list comprehension and zip.

In [17]: from random import randint

In [18]: N_trials = 10

In [19]: six_sided = [randint(1,6) for i in range(N_trials)]

In [20]: six_sided
Out[20]: [1, 5, 2, 1, 1, 2, 5, 2, 2, 2]

In [21]: twenty_sided = [randint(1,20) for i in range(N_trials)]

In [22]: twenty_sided
Out[22]: [5, 17, 10, 3, 11, 6, 18, 7, 2, 3]

In [23]: sums = [r6 + r20 for r6, r20 in zip(six_sided, twenty_sided)]

In [24]: sums
Out[24]: [6, 22, 12, 4, 12, 8, 23, 9, 4, 5]

To generalize this process, we use lambda expressions to create general functions for creating each sequence such that the value of the number of trials N can be adjusted.

In [25]: six_sided = lambda N: [randint(1,6) for i in range(N)]

In [26]: twenty_sided = lambda N: [randint(1,20) for i in range(N)]

In [27]: sums = lambda N: [r6 + r20 for r6, r20 in zip(six_sided(N), twenty_sided(N))]

In [28]: mean = lambda L: sum(L)/len(L)

In [29]: mean(sums(1000000))
Out[29]: 14.004244

Above we illustrate this refactoring of the original code and use the newly constructed functions to simulate the average of 1 million rolls. As both six_sided and twenty_sided are now functions, the value of N need to be passed along in the definition of sums.

This is another example of how lambda expressions can be used to transform a specific example into a more general solution. This is done by identifying the variable(s) we would like to change and adding them as formal parameters to a lambda expression. This is as simple as appending the lambda N: to the front of our expressions and changing some variable references to function calls.

Check Your Understanding

    rec-5-45: What will be printed by executing the following code block?

    L = list(zip(range(3), range(1,4)))
    print(L)
    
  • (A) [(1, 0), (2, 1), (3, 2)]
  • Zip preserves the order of the original arguments. In this case the values from range(3) will preceed the values from range(1,4).
  • (B) [(0, 1), (1, 2), (2, 3)]
  • (C) [(0, 0), (1, 1), (2, 2)]
  • The second range function starts at 1 and goes up to (but not including) 4.
  • (D) [(0, 1, 2), (1, 2, 3)]
  • Zip combines the sequences in pairs based on index (1st with 1st, 2nd with 2nd, ...)

    rec-5-46: What will be printed by executing the following code block?

    L = list(zip(range(3), range(4)))
    print(L)
    
  • (A) [(1, 0), (2, 1), (3, 2)]
  • The first range starts at 0 and counts up to 2.
  • (B) [(0, 1), (1, 2), (2, 3)]
  • The second range starts at 0 and counts up to 3.
  • (C) [(0, 0), (1, 1), (2, 2)]
  • If given sequences of different length, zip will stop at the end of the shorted sequence.
  • (D) Error, you can't zip lists of different length.
  • If given sequences of different length, zip will stop at the end of the shorted sequence.

Finally, we highlight the reversed function, which allows us to iterate through a sequence from back to front.

In [30]: L = [1,2,3,4,5,6]

In [31]: new_list = [i for i in reversed(L)]

In [32]: new_list
Out[32]: [6, 5, 4, 3, 2, 1]

4.3.3. Use built-in functions to reduce a list to a value

There are a number of built-in Python functions that help us reduce a list to a value, including sum, len, max, and min. Remember to use these functions along with a list comprehension to describe a computation on a sequence of values.

For example, suppose that we want to compute the sum of squares for a small set of numbers. Let’s give this a try using the regular definition, shown below.

\[SS = \sum_{i=1}^n (y_i - \bar{y})^2\]

For each value in the list we must subtract the mean and the square this difference. Finally, we add up all of these values. We will create a function for computing the mean and then another for computing the sum of squares.

In [33]: mean = lambda L: sum(L)/len(L)

In [34]: ss = lambda L: sum([(i - mean(L))**2 for i in L])

In [35]: my_list = [1,2,3,4,5]

In [36]: mean(my_list)
Out[36]: 3.0

In [37]: ss(my_list)
Out[37]: 10.0

On a modern computer, the above code is reasonably fast for fairly small lists. Next, we use the IPython %timeit magic to time our function on various size lists.

In [31]: %timeit ss(range(10**2))
   ....: %timeit ss(range(10**3))
   ....: %timeit ss(range(10**4))
   ....:
1000 loops, best of 3: 666 us per loop
10 loops, best of 3: 53.2 ms per loop
1 loop, best of 3: 5.97 s per loop

Notice that the ss function is costs about 100 times more (in processing time) each we multiple the length of the list by 10. This hints at a complexity of \(O(n^2)\). Consider the complexity of these functions. The mean function must visit each element of the list and is \(O(n)\). The inefficiency of the ss function results from calling the mean function for each value in the list!. Thus the time complexity of ss is \(n*O(n) = O(n*n) = O(n^2)\). We can fix this issue by using the usual simplification shown below.

\[SS = \sum_{i=1}^n{y_i^2} - \frac{\left(\sum_{i=1}^n y_i\right)^2}{n}\]
In [38]: ss = lambda L: sum([i**2 for i in L]) - sum(L)**2/len(L)

In [39]: my_list = [1,2,3,4,5]

In [40]: ss(my_list)
Out[40]: 10.0

Now we refactor the code to clear up the meaning of each part.

In [41]: sum_square_values = lambda L: sum([i**2 for i in L])

In [42]: ss = lambda L: sum_square_values(L) - (sum(L))**2/len(L)

In [43]: my_list = [1,2,3,4,5]

In [44]: ss(my_list)
Out[44]: 10.0

Each of the component functions, len, sum_values and sum_square_values, is \(O(n)\) (they each visit a list element exactly once and perform a \(O(1)\) computation). Therefore, this implementation is \(O(n)\), which is illustrated below by the 10 fold increase in computation time when increasing the length of a list by a factor of 10.

In [31]: %timeit ss(range(10**2))
   ....: %timeit ss(range(10**3))
   ....: %timeit ss(range(10**4))

10000 loops, best of 3: 42.5 us per loop
1000 loops, best of 3: 406 us per loop
100 loops, best of 3: 4.11 ms per loop

Beware of Premature Optimization!

A well-known computer scientist, Donald Knuth, once said

Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.

It is important to think about the efficiency of your program, but follow Knuth’s advice and worry about it only after you have demonstrated it is slow. Instead, follow another pieces of advice from Knuth.

Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.

That is, you should focus on describing what your program does in such a way that another programmer can understand just by reading your code.

Check Your Understanding

    rec-5-47: What will be printed when executing the following code?

    val = sum([x for x in range(6) if x % 3 == 0])
    print(val)
    
  • (A) 21
  • The range will not include 6 and will only include values divisible by 3 (x % 3 == 0).
  • (B) 15
  • The comprehension will only include values divisible by 3 (x % 3 == 0).
  • (C) 3
  • (D) 9
  • The range will not include 6

4.3.4. Use a key function with max and min to process tuples

The built-in reduction function max and min have an optional keyword parameter that accepts a key function. This can be used to search a list of lists, specifying the entry by which we wish to compare values.

In [45]: help(max)
Help on built-in function max in module builtins:

max(...)
    max(iterable, *[, default=obj, key=func]) -> value
    max(arg1, arg2, *args, *[, key=func]) -> value
    
    With a single iterable argument, return its biggest item. The
    default keyword-only argument specifies an object to return if
    the provided iterable is empty.
    With two or more arguments, return the largest argument.

For example, consider the hours table in the following code.

In [46]: hours = [["Alice", 43],
   ....:            ["Bob", 37],
   ....:            ["Fred", 15]]
   ....: 

Suppose that we want to determine the name of the employee that worked the most hours last week. In this case, we want to compare the rows based on the 2nd entry (index 1). This is accomplished by defining a key function that takes a row and returns the 2nd element. Then the max function will apply this function before comparing the rows.

In [47]: get_hours = lambda row: row[1]

In [48]: max_row = max(hours, key=get_hours)

In [49]: max_row
Out[49]: ['Alice', 43]

In [50]: get_name = lambda row: row[0]

In [51]: employee_max_hours = get_name(max_row)

In [52]: employee_max_hours
Out[52]: 'Alice'

This process is generalized by combining the expressions for max_row and employee_max_hours into a lambda expression.

In [53]: get_name = lambda row: row[0]

In [54]: get_hours = lambda row: row[1]

In [55]: employee_max_hours = lambda hours_table: get_name(max(hours_table, key=get_hours))

In [56]: employee_max_hours(hours)
Out[56]: 'Alice'

A similar technique can be used to returns a sorted table using the sorted function, which also accepts a key function.

In [57]: help(sorted)
Help on built-in function sorted in module builtins:

sorted(iterable, key=None, reverse=False)
    Return a new list containing all items from the iterable in ascending order.
    
    A custom key function can be supplied to customise the sort order, and the
    reverse flag can be set to request the result in descending order.

We can sort the table by hours worked (defaults is smallest-to-largest),

In [58]: sorted(hours, key = get_hours)
Out[58]: [['Fred', 15], ['Bob', 37], ['Alice', 43]]

by hours worked sorted from largest to smallest using reverse=True,

In [59]: sorted(hours, key = get_hours, reverse=True)
Out[59]: [['Alice', 43], ['Bob', 37], ['Fred', 15]]

or put the list in alphabetical order by creating a new key function that returns a lower-case name.

In [60]: lower_names = lambda row: get_name(row).lower()

In [61]: sorted(hours, key=lower_names)
Out[61]: [['Alice', 43], ['Bob', 37], ['Fred', 15]]

    rec-5-48: What will be printed when executing the following code?

    get_second = lambda row: row[1]
    table = [list(range(i, i+3)) for i in range(4)]
    val = max(table, key = get_second)
    print(val)
    
  • (A) [0, 1, 2]
  • This would be the minimum row.
  • (B) [2, 3, 4, 5]
  • This is the largest column, not largest row
  • (C) [3, 4, 5]
  • This is the row with the largest 2nd entry.
  • (D) Error, the table isn't square.
  • the max function can process any table, as long as the key function works for all rows (even if all the rows are not the same length)

4.3.5. Use any and all for Boolean questions about a list

There are two more functions that help use reduce a list to a value. The Python built-in functions any and all are useful when asking Boolean questions about a list. The function any takes a sequence as input and returns True if any of the elements in the sequence evaluate to True. On the other hand, when all is applied to a sequence, it only evaluates to True if all of the elements of said sequence evaluate to True.

Suppose that we want to know if any or all of the elements of a list are even. We can write a lambda functions to perform each task as follows.

In [62]: all_even = lambda L: all([ i % 2 == 0 for i in L])

In [63]: any_even = lambda L: any([ i % 2 == 0 for i in L])

In [64]: my_list = [1,2,3,4,5]

In [65]: any_even(my_list)
Out[65]: True

In [66]: all_even(my_list)
Out[66]: False

Notice that this pattern involves

  1. Using a Boolean expression in the output expression of the comprehension.
  2. Applying any or all to the list of Boolean values.

Let’s refactor this code to clean it up, namely by introducing an is_even function.

In [67]: is_even = lambda n: n % 2 == 0

In [68]: all_even = lambda L: all([is_even(i) for i in L])

In [69]: any_even = lambda L: any([is_even(i) for i in L])

In [70]: my_list = [1,2,3,4,5]

In [71]: any_even(my_list)
Out[71]: True

In [72]: all_even(my_list)
Out[72]: False

Check Your Understanding

    rec-5-49: What will be printed when executing the following code?

    val = any([n % 4 == 0 for n in range(4)])
    print(val)
    
  • (A) True
  • 0 is divisible by 4 (and any other int).
  • (B) False
  • 0 is divisible by 4 (and any other int).

    rec-5-50: What will be printed when executing the following code?

    val = all([n % 4 == 0 for n in range(4)])
    print(val)
    
  • (A) True
  • 1 is not divisible by 4.
  • (B) False
  • 1 is not divisible by 4.

4.3.6. Use a comprehension to apply many functions to a value

Suppose that we want to apply many functions to the same value. We can accomplish this using a list of functions and a list comprehension. In the first example, we will compute a number of statistics on the same list.

In [73]: mean = lambda L: sum(L)/len(L)

In [74]: funcs = [mean, max, min]

In [75]: L = [1,2,3,4,5]

In [76]: stats = [stat_func(L) for stat_func in funcs]

In [77]: stats
Out[77]: [3.0, 5, 1]

In the previous example, each function is pulled out of the funcs list and applied to the list in sequence. We can use a similar approach, combined with the any function, to check if a given value is a number of some sort, be it int, real or complex.

In [78]: is_type_funcs = [lambda n: type(n) is int,
   ....:                  lambda n: type(n) is float,
   ....:                  lambda n: type(n) is complex]
   ....: 

In [79]: is_number = lambda n: any([is_type_func(n) for is_type_func in is_type_funcs])

In [80]: is_number(5)
Out[80]: True

In [81]: is_number(2.3)
Out[81]: True

In [82]: is_number("a")
Out[82]: False

One thing to note about the last example is that we used to anonymous nature of a lambda expression to define each of the functions for checking a type inside the list definition without assigning each to a separate name.

    rec-5-51: What will be printed when executing the following code?

    identity = lambda x: x
    sqr = lambda x: x**2
    cube = lambda x: x**3
    vals = [f(3) for f in (identity, sqr, cube)]
    print(val)
    
  • (A) [3, 9, 27]
  • This comprehension will apply call function in order with an argument of 3.
  • (B) [9, 3, 27]
  • This comprehension will apply call function in same order as given in the list.
  • (C) Error, you can not iterate over a list of functions.
  • Functions are objects and can be iterated over like any other Python value.

4.3.7. Filter and count using len

You can use the if clause along with the length function to count the number of elements that satisfy a condition. In a previous example, we simulated the distribution of the sum of rolls on a six and twenty sided die. Suppose that we wish to estimate the frequency of the result of the sum being prime. First, we will use filtering and len to answer the question.

In [83]: six_sided = lambda N: [randint(1,6) for i in range(N)]

In [84]: twenty_sided = lambda N: [randint(1,20) for i in range(N)]

In [85]: sum_rolls = lambda N: [r6 + r20 for r6, r20 in zip(six_sided(N), twenty_sided(N))]

In [86]: primes = [2, 3, 5, 7, 11, 13, 17, 19, 23]

In [87]: num_primes = len([i for i in sum_rolls(10000) if i in primes])

In [88]: prop_primes = num_primes/10000

In [89]: prop_primes
Out[89]: 0.343

We get the number of primes in the simulation by filtering with if i in prime and then using len to count the number of results.

Another approach to the last example is to map results to 0 or 1 based on the condition, then using sum to answer the question.

In [90]: primes_rolls = [1 if i in primes else 0 for i in sum_rolls(10000)]

In [91]: prop_primes = sum(primes_rolls)/10000

In [92]: prop_primes
Out[92]: 0.3329

We can compare the efficiency of these methods using the IPython %timeit magic after abstracting each approach with a lambda expression.

In [93]: with_zero_one = lambda N: sum([1 if i in primes else 0 for i in sum_rolls(N)])/N

In [94]: %timeit with_zero_one(10000)
10 loops, best of 3: 32.9 ms per loop
In [95]: with_len= lambda N: len([i for i in sum_rolls(N) if i in primes])/N

In [96]: %timeit with_len(10000)
10 loops, best of 3: 32.8 ms per loop

It is not surprising that these methods have nearly the same efficiency, as they both have to visit each entry once and must count values.

Question

Can you prove that each of these approaches is \(O(n)\)?

Check Your Understanding

Next Section - 4.4. Common Comprehension Patterns for Two Sequences