7.7. Lazy Iteration

We have already seen some Python expressions that exhibit lazy evaluation, including range, map and filter.

In [1]: range(4)
Out[1]: range(0, 4)

In [2]: map(lambda x: x**2, range(5))
Out[2]: <map at 0x10fd07780>

In [3]: filter(lambda x: x % 2 == 1, range(20))
Out[3]: <filter at 0x10fd07940>

Each of these objects will hold off evaluation as long as possible, only working with the top most element of a sequence at anyone time. Furthermore, compositions of lazy sequences are lazy.

In [4]: from toolz import pipe

In [5]: from toolz.curried import map, filter

In [6]: seq = pipe(range(4),
   ...:            map(lambda x: x**2),
   ...:            filter(lambda x: x % 2 == 1))
   ...: 

In [7]: seq
Out[7]: <filter at 0x10fd07f28>

Composing lazy operations can allow us to work with very large (even infinite) data structures without taxing system memory. Provided we don’t include any eager evaluation, we are only limited by processing time.

As you may have already discovered, lazy sequences can be tricky to work with. First, they can’t be manipulated in the same ways as a list or tuple. For example, we can’t get elements or use the len function on a map.

In [8]: m = map(lambda x: x**2, range(5))

In [9]: m[1]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-9-f3cd3fd3a242> in <module>()
----> 1 m[1]

TypeError: 'map' object is not subscriptable
In [10]: m = map(lambda x: x**2, range(5))

In [11]: len(m)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-11-31c0f2e6f20c> in <module>()
----> 1 len(m)

TypeError: object of type 'map' has no len()

Each of these restrictions arises from the fact that we have no idea of how long a lazy structure is and determining its length would get in the way of the primary advantage of a lazy structure: they don’t need to be held in memory. The primary purpose of a lazy sequence is to reduce memory usage and for this reason they are never held in memory all at once and can only processed once. Consequently, there is no way to determine the length of the sequence or get the 157th item without losing some or all of the sequence.

In this section, we will show you other methods for constructing your own lazy constructs and discuss the advantage of chaining together lazy expressions.

7.7.1. Generator Expressions

The core lazy type in Python is the generator. The easiest way to construct a lazy expression is using the generator expression, which has the same syntax as a list comprehension, except that it is surrounded by parentheses.

The generator expression
The syntax for the generator function is the same as that of a list comprehension, except being delimited by parentheses. The resulting object will hold off computing each item until evaluation is forced.

For a first example, let’s write the generator expression of squaring all of the odd values in a sequence.

In [12]: g = (item**2 for item in range(5) if item % 2 == 1)

In [13]: g
Out[13]: <generator object <genexpr> at 0x10fb96360>

Notice that evaluating the generator doesn’t produce the sequence. We can force the evaluation of the next item in the sequence using the next function.

In [14]: next(g)
Out[14]: 1

In [15]: next(g)
Out[15]: 9

In [16]: next(g)
---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
<ipython-input-16-5f315c5de15b> in <module>()
----> 1 next(g)

StopIteration: 

By design, the generator will throw an exception when it has completed the sequence, which can be caught by a try-except block to determine the end of the sequence. In practice, this exception does not surface, as we process generators in loops, comprehensions, or other lazy constructs like map and filter. Each of these techniques catches the termination exception behind the scenes so that our code can successfully complete.

Note that a generator is a single-use sequence and must be reinitialized before attempting to iterate a second time.

In [17]: g = (item**2 for item in range(5) if item % 2 == 1)

In [18]: g
Out[18]: <generator object <genexpr> at 0x10f9edca8>

In [19]: x = list(g)

In [20]: x
Out[20]: [1, 9]

In [21]: y = list(g)

In [22]: y
Out[22]: []

In [23]: g = (item**2 for item in range(5) if item % 2 == 1)

In [24]: z = list(g)

In [25]: z
Out[25]: [1, 9]

Consequently, we need to process sequences in one pass. In particular, if we want to reduce the sequence to a number of statistics, we will need to compute all these statistics simultaneously. (See the section on simultaneous reduction for more details.)

7.7.2. The advantage of lazy sequence evaluation

The primary advantage of using generator expressions to process sequences comes from the fact that chaining together generators preserves the lazy evaluation. For example, suppose that we want to compute the distance to some point \((a, b)\) for a series of points. One way to approach this problem is to attack it in a series of small operations.

In [26]: a, b = (3, 5)

In [27]: points = zip(range(2),range(2))

In [28]: dist_along_axes = ((i - a, j - b) for i, j in points)

In [29]: sum_sqr_dist = ( d1**2 + d2**2 for d1, d2 in dist_along_axes)

In [30]: dist_to_ab = ( sqr_dist**(0.5) for sqr_dist in sum_sqr_dist)

In [31]: next(dist_to_ab)
Out[31]: 5.830951894845301

In [32]: next(dist_to_ab)
Out[32]: 4.47213595499958

In [33]: next(dist_to_ab)
---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
<ipython-input-33-4bcfb66d83b1> in <module>()
----> 1 next(dist_to_ab)

StopIteration: 

Under casual inspection, it will appear that we have iterated over the points 4 times, one for each comprehension. The key feature to note is that each call to next gives the complete solution for that point without forcing the computation for the next point. This is true regardless of how many generators we chain together! In reality each point is completely processed at once, as each sequence is lazy and waits to complete each calculation until the last moment. Consequently, the chain of generators will each compute the next value and pass it along in sequence. This is a very powerful and memory efficient approach to composing operations on sequences, because we only need to hold one point in memory at a time.

Contrast this with a solution that uses list comprehensions. Each comprehension is eager to complete its operation and this will result in the entire sequence being processed and held in memory before we can start computation on the next sequence. Furthermore, we will need to keep the most recently processed list in memory until the next step in the chain is complete.

In [34]: a, b = (3, 5)

In [35]: points = [(i, j) for i, j  in [(0, 0), (1, 1), (2, 2)]]

In [36]: points
Out[36]: [(0, 0), (1, 1), (2, 2)]

In [37]: dist_along_axes = [(i - a, j - b) for i, j in points]

In [38]: dist_along_axes
Out[38]: [(-3, -5), (-2, -4), (-1, -3)]

In [39]: sum_sqr_dist = [d1**2 + d2**2 for d1, d2 in dist_along_axes]

In [40]: sum_sqr_dist
Out[40]: [34, 20, 10]

In [41]: dist_to_ab = [sqr_dist**(0.5) for sqr_dist in sum_sqr_dist]

In [42]: dist_to_ab
Out[42]: [5.830951894845301, 4.47213595499958, 3.1622776601683795]

In contrast to the solution that used generators, the last batch of code illustrates that each comprehension has completed it’s computation before the next operation. Consequently each sequence must be held in memory simultaneously. If the data is very large, this can lead to problems with memory usage and disk access. In fact, “Big Data” problems typically involve more data than can be held in the memory of any one machine. Using lazy evaluation gives our first solution to this problem by allowing us to process the data incrementally without ever needing all of it in memory at one time.

7.7.3. More Complicated Generator Expressions

All of the tricks that you learned about list comprehensions will work with generator expressions as well, be it nesting list comprehensions or including multiple sequence expression.

In [43]: g = (i*j for i in range(5) for j in range(i,5))

In [44]: g
Out[44]: <generator object <genexpr> at 0x10fb75728>

In [45]: list(g)
Out[45]: [0, 0, 0, 0, 0, 1, 2, 3, 4, 4, 6, 8, 9, 12, 16]

The only caveat is that nested generators are very lazy and can be difficult to force to completion. In this next example, this is accomplished by mapping the list conversion function unto each embedded lazy sequence.

In [46]: g = ((i*j for j in range(5)) for i in range(5))

In [47]: table = list(g)

In [48]: table
Out[48]: 
[<generator object <genexpr>.<genexpr> at 0x10f9edd58>,
 <generator object <genexpr>.<genexpr> at 0x10f9f41a8>,
 <generator object <genexpr>.<genexpr> at 0x10f9f4570>,
 <generator object <genexpr>.<genexpr> at 0x10f9f4620>,
 <generator object <genexpr>.<genexpr> at 0x10f9f4f68>]

In [49]: mapped_table = list(map(list, table))

In [50]: mapped_table
Out[50]: 
[[0, 4, 8, 12, 16],
 [0, 4, 8, 12, 16],
 [0, 4, 8, 12, 16],
 [0, 4, 8, 12, 16],
 [0, 4, 8, 12, 16]]

In [51]: L = [[i*j for j in range(5)] for i in range(5)]

In [52]: L
Out[52]: 
[[0, 0, 0, 0, 0],
 [0, 1, 2, 3, 4],
 [0, 2, 4, 6, 8],
 [0, 3, 6, 9, 12],
 [0, 4, 8, 12, 16]]

That answer is wrong! (or at least unexpected) In the last attempt, I messed up by completing the outer generator without completing the inner loops. Therefore the value of i was stuck at the last generated value. The correct way to resolve this nested structure is to process all the sequences simultaneously as follows.

In [53]: g = ((i*j for j in range(5)) for i in range(5))

In [54]: mapped_table = list(map(list, g))

In [55]: mapped_table
Out[55]: 
[[0, 0, 0, 0, 0],
 [0, 1, 2, 3, 4],
 [0, 2, 4, 6, 8],
 [0, 3, 6, 9, 12],
 [0, 4, 8, 12, 16]]

Main Takeaway: Processing nested, single-use lazy structures can be tricky and return unexpected results depending on the order of execution.

7.7.4. Generator Functions

A more robust method of creating generators is using the yield and/or yield from statements in a function definition.

7.7.4.1. The yield statement

The yield statement is used to yield the next value in the lazy sequence and tells Python that our function is really a generator.

In [56]: def f(x):
   ....:     """ a function """
   ....:     return x**2
   ....: 

In [57]: def g(x):
   ....:     """ A generator """
   ....:     yield x**2
   ....: 

When we call a regular function like f, the value is result is immediately returned.

In [58]: type(f)
Out[58]: function

In [59]: y = f(2)

In [60]: y
Out[60]: 4

On the other hand, a call to a generator will yield a lazy generator object that requires a call to next before it will return a value.

In [61]: type(g)
Out[61]: function

In [62]: y = g(2)

In [63]: y
Out[63]: <generator object g at 0x10fc39a40>

In [64]: next(y)
Out[64]: 4

Generators can include any number of yield statements and will halt execution at each statement, waiting for a request for the next value.

In [65]: def many_yields(x):
   ....:     """ a generator with multiple yield statements """
   ....:     y = x**2
   ....:     yield y
   ....:     z = y + 2
   ....:     yield z
   ....:     yield "Awesome"
   ....: 
In [66]: g = many_yields(2)

In [67]: next(g)
Out[67]: 4

In [68]: next(g)
Out[68]: 6

In [69]: next(g)
Out[69]: 'Awesome'

In [70]: next(g)
---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
<ipython-input-70-5f315c5de15b> in <module>()
----> 1 next(g)

StopIteration: 

Notice that the generator captures and maintains the start of the body in between calls to next. For example, many_yields kept the value of y so that it could be later used to compute z. As with generator expressions, any next calls beyond the last yield will throw an exception.

Finally, note that we can use comprehensions (list or generator) to process a generator. In this case, Python takes care of catching the StopIteration exception.

In [71]: [x for x in many_yields(2)]
Out[71]: [4, 6, 'Awesome']

7.7.4.2. The yield from statement

Sometimes we want to yield values from some other sequence. In this case, we can use the yield from statement, which automates the process of iterating through the sequence, yielding one item at a time in a lazy fashion.

In [72]: def g(L):
   ....:     yield from L
   ....: 
In [73]: y = g([1,2,3])

In [74]: y
Out[74]: <generator object g at 0x10f9c3a98>

In [75]: type(y)
Out[75]: generator

In [76]: next(y)
Out[76]: 1

In [77]: next(y)
Out[77]: 2

In [78]: next(y)
Out[78]: 3

You can combine any combination of yield and yield from statements.

In [79]: def g(L):
   ....:     yield "Wait for it"
   ....:     yield "Here it comes..."
   ....:     yield "...any second now..."
   ....:     yield from L
   ....:     yield "So there you go"
   ....:     yield "Enjoy!"
   ....: 
In [80]: [x for x in g([1,2,3])]
Out[80]: 
['Wait for it',
 'Here it comes...',
 '...any second now...',
 1,
 2,
 3,
 'So there you go',
 'Enjoy!']

Note

Generator expressions also can used a return statement, but this immediately throws a StopIteration exception. The main purpose of this statement in a generator is to control when a generated sequence halts.

7.7.5. Coroutines

Generators are a form of one-way lazy evaluation. Once created they only return items and can no longer receive values. Python has another lazy structure that can be used to pass values back and forth in a lazy way: a coroutine.

The keyword yield is again used to create a coroutine, but this time using it in an assignment-like statement. This assignment statement signifies both the yielding of the next value and acquisition of the next value.

In [81]: def coroutine(s):
   ....:     s = s.lower()
   ....:     s = yield s
   ....:     s = s.lower()
   ....:     s = yield s
   ....:     s = 3*s.upper()
   ....:     s = yield s
   ....: 

As with generators, values are pulled from a coroutine using next. A coroutine comes with a method send that is used to pass it the next value.

In [82]: co = coroutine("Hi ")

In [83]: next(co)
Out[83]: 'hi '

In [84]: co.send("there ")
Out[84]: 'there '

In [85]: co.send("Bob!")
Out[85]: 'BOB!BOB!BOB!'
Next Section - 7.8. Working with Large Files