5.2. Sets¶

The set is a useful associative Python data structure for collecting unique values and checking membership. If you find yourself building a data structure that will only be used to check if items are in the group, the set is the correct data structure for the task. Sets are based on mathematical sets and come with methods for unions, intersections, etc. Consequently, they are also useful for comparing the membership of two groups.

5.2.1. Using a `set` to count unique values¶

Sets can be useful for determining the number of unique values in a collection. Suppose that we are interested in determining how many unique words there are in the Zen of Python

In [1]: zen_no_punc = '''
   ...: The Zen of Python by Tim Peters
   ...: Beautiful is better than ugly
   ...: Explicit is better than implicit
   ...: Simple is better than complex
   ...: Complex is better than complicated
   ...: Flat is better than nested
   ...: Sparse is better than dense
   ...: Readability counts
   ...: Special cases arent special enough to break the rules
   ...: Although practicality beats purity
   ...: Errors should never pass silently
   ...: Unless explicitly silenced
   ...: In the face of ambiguity refuse the temptation to guess
   ...: There should be one and preferably only one obvious way to do it
   ...: Although that way may not be obvious at first unless youre Dutch
   ...: Now is better than never
   ...: Although never is often better than right now
   ...: If the implementation is hard to explain its a bad idea
   ...: If the implementation is easy to explain it may be a good idea
   ...: Namespaces are one honking great idea  lets do more of those'''
   ...: 

The easiest way to construct a set is using the set conversion function on a list of values.

In [2]: words_zen = zen_no_punc.lower().split()

In [3]: unique_words_zen = set(words_zen)

In [4]: unique_words_zen
Out[4]: 
{'a',
 'although',
 'ambiguity',
 'and',
 'are',
 'arent',
 'at',
 'bad',
 'be',
 'beats',
 'beautiful',
 'better',
 'break',
 'by',
 'cases',
 'complex',
 'complicated',
 'counts',
 'dense',
 'do',
 'dutch',
 'easy',
 'enough',
 'errors',
 'explain',
 'explicit',
 'explicitly',
 'face',
 'first',
 'flat',
 'good',
 'great',
 'guess',
 'hard',
 'honking',
 'idea',
 'if',
 'implementation',
 'implicit',
 'in',
 'is',
 'it',
 'its',
 'lets',
 'may',
 'more',
 'namespaces',
 'nested',
 'never',
 'not',
 'now',
 'obvious',
 'of',
 'often',
 'one',
 'only',
 'pass',
 'peters',
 'practicality',
 'preferably',
 'purity',
 'python',
 'readability',
 'refuse',
 'right',
 'rules',
 'should',
 'silenced',
 'silently',
 'simple',
 'sparse',
 'special',
 'temptation',
 'than',
 'that',
 'the',
 'there',
 'those',
 'tim',
 'to',
 'ugly',
 'unless',
 'way',
 'youre',
 'zen'}

In [5]: len(unique_words_zen)
Out[5]: 85

We can now use in to determine if a given word is present in the set.

In [6]: 'the' in unique_words_zen
Out[6]: True

In [7]: 'mississippi' in unique_words_zen
Out[7]: False

While this looks no different that performing the same operations on a list, sets have a much more efficient item look-up (as discussed in a later section).

Caution

Similar to the requirement for dictionary keys, entries in a set must be immutable, hashable items. If you need to make a set of lists or dictionaries, you will need to convert them to the immutable alternative from pyrsistent.

5.2.2. Comparing Membership with Sets¶

As sets are modeled after sets from mathematics, they are also useful in comparing the membership of two groups. To illustrate this process, we will create a set containing the unique words from the Gettysburg Address.

In [8]: gettysburg_address = """Fourscore and seven years ago our fathers
   ...: brought forth, on this continent, a new nation, conceived in liberty,
   ...: and dedicated to the proposition that all men are created equal. Now we
   ...: are engaged in a great civil war, testing whether that nation, or any
   ...: nation so conceived, and so dedicated, can long endure. We are met on a
   ...: great battle-field of that war. We have come to dedicate a portion of
   ...: that field, as a final resting-place for those who here gave their
   ...: lives, that that nation might live. It is altogether fitting and proper
   ...: that we should do this. But, in a larger sense, we cannot dedicate, we
   ...: cannot consecrate—we cannot hallow—this ground. The brave men, living
   ...: and dead, who struggled here, have consecrated it far above our poor
   ...: power to add or detract. The world will little note, nor long remember
   ...: what we say here, but it can never forget what they did here. It is for
   ...: us the living, rather, to be dedicated here to the unfinished work which
   ...: they who fought here have thus far so nobly advanced.  It is rather for
   ...: us to be here dedicated to the great task remaining before us—that from
   ...: these honored dead we take increased devotion to that cause for which
   ...: they here gave the last full measure of devotion—that we here highly
   ...: resolve that these dead shall not have died in vain—that this nation,
   ...: under God, shall have a new birth of freedom, and that government of the
   ...: people, by the people, for the people, shall not perish from the
   ...: earth."""
   ...: 

In [9]: words_gettysburg = gettysburg_address.lower().split()

In [10]: unique_words_getty = set(words_gettysburg)

Sets come with a number of useful methods that will be familiar to anyone that has worked with sets in mathematics.

In [11]: [attr for attr in dir(unique_words_zen) if not attr.startswith('__')]
Out[11]: 
['add',
 'clear',
 'copy',
 'difference',
 'difference_update',
 'discard',
 'intersection',
 'intersection_update',
 'isdisjoint',
 'issubset',
 'issuperset',
 'pop',
 'remove',
 'symmetric_difference',
 'symmetric_difference_update',
 'union',
 'update']

The intersection method of a set can be used to construct another set that contains all elements the two sets have in common. The union method can be used to create the set of words that are in one or both original sets.

In [12]: in_common = unique_words_zen.intersection(unique_words_getty)

In [13]: in_common
Out[13]: 
{'a',
 'and',
 'are',
 'be',
 'by',
 'do',
 'great',
 'in',
 'is',
 'it',
 'never',
 'not',
 'now',
 'of',
 'should',
 'that',
 'the',
 'those',
 'to'}

In [14]: len(in_common)
Out[14]: 19

In [15]: union = unique_words_zen.union(unique_words_getty)

In [16]: len(union)
Out[16]: 218

Notice that each method is called on one set using the other set as an argument for the method. This is the general pattern for using the set methods to compare sets

We can check if one group is contained in another using issubset and issuperset.

In [17]: in_common.issubset(unique_words_zen)
Out[17]: True

In [18]: union.issuperset(in_common)
Out[18]: True

Suppose that you want to get the words in one text but not the other. This can be accomplished using the difference method.

In [19]: zen_only = unique_words_zen.difference(unique_words_getty)

In [20]: len(unique_words_zen) - len(zen_only)
Out[20]: 19

Similarly, the method `` symmetric_difference`` is used to find all words that are in one set but not both. (This is equivalent to union.difference(intersection))

In [21]: sym_diff = unique_words_zen.symmetric_difference(unique_words_getty)

In [22]: len(sym_diff)
Out[22]: 199

Note

Sets also have methods for mutation, such as add, pop, remove and difference_update that are not covered here.

Check your understanding

Q-14: What value will the following code return?

fear = "the only thing we have to fear is fear itself"
words = fear.split()
len(set([w for w in words])

(A) 9
The word "fear" will only be counted once.
(B) 10
The word "fear" will only be counted once. A value is either in a set or not in a set, there is no repetition of values.

Q-15: What value will the following code return?

fear = "the only thing we have to fear is fear itself"
fear_words = set([w for w in fear.split()])
success = "success consists of going from failure to failure without loss of enthusiasm"
success_words = set([w for w in success.split()])
len(success_words.intersection(fear_words))

(A) 22
The intersection would represent the words that are in common. This answer gives total number of words including repeats ('fear', 'of', 'failure').
(B) 19
The intersection would represent the words that are in common. This answer gives total number of unique words in the union of the sets.
(C) 1
These two quotes (by FDR and Winsten Churchhill, respectively) have 1 word in common ('to').

Q-16: What value will the following code return?

fear = "the only thing we have to fear is fear itself"
fear_words = set([w for w in fear.split()])
success = "success consists of going from failure to failure without loss of enthusiasm"
success_words = set([w for w in success.split()])
len(success_words.union(fear_words))

(A) 22
The union would represent the words unique words in the combined strings. This answer gives total number of words including repeats ('fear', 'of', 'failure').
(B) 19
This answer gives total number of unique words in the union of the sets.
(C) 0
The union would represent the words unique words in the combined strings. This answer given the size of the intersection, as these two quotes (by FDR and Winsten Churchhill, respectively) have no words in common.

5.2.3. Set Operations¶

Python sets can also be compared with a number of operations that are equivalent to the methods shown above.

# issubset with <=
In [23]: in_common <= unique_words_zen
Out[23]: True

# issuperset with >=
In [24]: union >= in_common
Out[24]: True

# union with |
In [25]: len(unique_words_zen | unique_words_getty)
Out[25]: 218

# intersection with &
In [26]: len(unique_words_zen & unique_words_getty)
Out[26]: 19

# difference with -
In [27]: zen_only = unique_words_zen - unique_words_getty

5.2.4. Persistent Sets¶

As with other Python data structures, the pyrsistent module provides an immutable and persistent implementation of a set. This can be used to build up a set without mutation with efficient operations that return new sets. In particular, we add and remove items from sets.

In [28]: from pyrsistent import s

In [29]: s1 = s(1,2,3,3,4,4)

In [30]: s1
Out[30]: pset([1, 2, 3, 4])

In [31]: s2 = s1.add(0)

In [32]: s1
Out[32]: pset([1, 2, 3, 4])

In [33]: s2
Out[33]: pset([0, 1, 2, 3, 4])

In [34]: s1 is s2
Out[34]: False

In [35]: s3 = s2.remove(4)

In [36]: s2
Out[36]: pset([0, 1, 2, 3, 4])

In [37]: s3
Out[37]: pset([0, 1, 2, 3])

In [38]: s2 is s3
Out[38]: False

These persistent sets come with all of the expected methods,

In [39]: pset_methods = set([attr for attr in dir(s1) if not attr.startswith('__')])

In [40]: python_set_methods = set([attr for attr in dir(set([])) if not attr.startswith('__')])

# The intersection of python set methods and pset methods
In [41]: pset_methods & python_set_methods
Out[41]: 
{'add',
 'copy',
 'difference',
 'discard',
 'intersection',
 'isdisjoint',
 'issubset',
 'issuperset',
 'remove',
 'symmetric_difference',
 'union',
 'update'}

and all of the operations that work on python sets also work on Psets.

In [42]: s2 - s1
Out[42]: pset([0])

In [43]: s3 & s1
Out[43]: pset([1, 2, 3])

In [44]: s1 <= s2
Out[44]: True

Next Section - 5.3. Dictionary and Set Comprehensions

5.2. Sets¶

5.2.1. Using a set to count unique values¶

5.2.2. Comparing Membership with Sets¶

5.2.3. Set Operations¶

5.2.4. Persistent Sets¶

5.2.1. Using a `set` to count unique values¶