Most Python developers know the language's basic datatypes and its most common containers (list, tuple, dict and set).
However, few know that a deque is the right tool for implementing a queue, that defaultdict would make their code cleaner, and that a namedtuple can be slightly more efficient than writing a new class.
This talk reviews the data structures in the "collections" module of Python's standard library (namedtuple, deque, Counter, defaultdict and OrderedDict) and compares them with the built-in basic datatypes.
2. Today we are going to talk about the
(unknown) collections module
And also about built-in containers
Welcome!
{ "event": "PyCon ES 2013", "author": "Pablo Enfedaque", "twitter": "pablitoev56" }
3. “This module implements
specialized container datatypes
providing alternatives to Python’s
general purpose built-in containers,
dict, list, set, and tuple”
The collections module
4. Let’s start with Python’s most used
container
10. OPERATION                  AVERAGE                      AMORTIZED WORST
    Check item 'b' in s1       O(1)                         O(n)
    Union s1 | s2              O(len(s1) + len(s2))         -
    Intersection s1 & s2       O(min(len(s1), len(s2)))     O(len(s1) * len(s2))
    Difference s1 - s2         O(len(s1))                   -
    Symmetric diff s1 ^ s2     O(len(s1))                   O(len(s1) * len(s2))
> Implementation very similar to dicts (hash map)
> Also has in-place modification methods (their average cost depends on s2)
set performance
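As a rough illustration of the operations in the table, the sketch below applies each operator to two small sets and then uses the in-place variants. The names `s1`, `s2` and `s3` and the sample values are made up for this example.

```python
s1 = {"a", "b", "c"}
s2 = {"b", "c", "d"}

# Operators from the table: each one builds a new set
union = s1 | s2            # items in either set
intersection = s1 & s2     # items in both sets
difference = s1 - s2       # items only in s1
symmetric = s1 ^ s2        # items in exactly one of the sets

# In-place variants mutate the left operand instead of building a new set
s3 = set(s1)               # copy, so s1 stays untouched
s3 |= s2                   # same effect as s3.update(s2)
s3 -= {"a"}                # same effect as s3.difference_update({"a"})

print(union, intersection, difference, symmetric, s3)
```

The membership test `'b' in s1` hashes the item, which is why it averages O(1) instead of scanning the container.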
11. A bit boring, isn’t it?
12. Let's do something more appealing
13. txt = """El desconocido módulo Collections
Todo el mundo conoce los tipos básicos de Python y sus
contenedores más comunes (list, tuple, dict y set). En cambio,
poca gente sabe que para implementar una cola debería utilizar
un deque, que con un defaultdict su código quedaría más limpio y
sería un poco más eficiente o que podría utilizar namedtuples en
lugar de crear nuevas clases. En esta charla repasaremos las
estructuras del módulo collections de la librería estándar:
namedtuple, deque, Counter, OrderedDict y defaultdict. Veremos
su funcionalidad, particularidades y casos prácticos de uso.
Pablo Enfedaque Vidal
Trabajo como R&D SW Engineer en Telefónica PDI en Barcelona, y
desde hace más de 5 años casi exclusivamente con Python, un
lenguaje que me encanta"""
During the talk we will use this str
14. >>> initials = {}
def classify_words(text):
    for word in text.split():
        word = word.lower()
        if word[0] in initials:
            initials[word[0]].append(word)
        else:
            initials[word[0]] = [word]
    for letter, letter_words in initials.items():
        print(letter, letter_words)
>>> classify_words(txt)
y ['y', 'y', 'y', 'y', 'y', 'y']
s ['sus', 'set).', 'sabe', 'su', 'sería', 'su', 'sw']
r ['repasaremos', 'r&d']
q ['que', 'que', 'quedaría', 'que', 'que']
...
Let’s classify words
15. Does it look pythonic?
16. >>> initials = {}
def classify_words(text):
    for word in text.split():
        word = word.lower()
        initials.setdefault(word[0], []).append(word)
    for letter, letter_words in initials.items():
        print(letter, letter_words)
>>> classify_words(txt)
y ['y', 'y', 'y', 'y', 'y', 'y']
s ['sus', 'set).', 'sabe', 'su', 'sería', 'su', 'sw']
r ['repasaremos', 'r&d']
q ['que', 'que', 'quedaría', 'que', 'que']
...
What about now?
17. from collections import defaultdict
>>> initials = defaultdict(list)
def classify_words(text):
    for word in text.split():
        word = word.lower()
        initials[word[0]].append(word)
    for letter, letter_words in initials.items():
        print(letter, letter_words)
>>> classify_words(txt)
y ['y', 'y', 'y', 'y', 'y', 'y']
s ['sus', 'set).', 'sabe', 'su', 'sería', 'su', 'sw']
r ['repasaremos', 'r&d']
q ['que', 'que', 'quedaría', 'que', 'que']
...
collections.defaultdict
18. >>> initials.default_factory
<class 'list'>
collections.defaultdict
19. > defaultdict is a subclass of the built-in dict class
> The first argument provides the initial value for the default_factory attribute (it defaults to None)
> All remaining arguments are treated the same as if they were passed to the dict constructor
> It also overrides the __missing__ method to call the default_factory when a key is not found
> If default_factory raises an exception (e.g. KeyError), it propagates unchanged
> Since Python 2.5
collections.defaultdict
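The points above can be checked with a short sketch. The variable names (`counts`, `probe`, `plain`) and the sample keys are illustrative only; the behaviour shown is that of the standard library's defaultdict.

```python
from collections import defaultdict

# First argument is the default_factory; the rest is handled like dict(...)
counts = defaultdict(int, a=1)
counts["b"] += 1            # missing key: __missing__ calls int(), giving 0

# Only item lookup (d[key]) triggers __missing__; get() and `in` do not,
# so neither of these creates the key "z"
probe = counts.get("z")
has_z = "z" in counts

# With default_factory left as None, a missing key raises KeyError as in dict
plain = defaultdict()
try:
    plain["missing"]
    raised = False
except KeyError:
    raised = True

print(dict(counts), probe, has_z, raised)
```

One consequence worth keeping in mind: reading `counts["x"]` for a missing key inserts it, so a lookup that should not modify the mapping is better written with `get()`.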
20. Let’s continue classifying words
21. class WordsByInitial():
    "Holds initial letter and a set and a list of words"
    def __init__(self, letter):
        self.letter = letter
        self.words = []
        self.unique_words = set()
    def append(self, word):
        self.words.append(word)
        self.unique_words.add(word)
    def __str__(self):
        return "<{}: {} {}>".format(self.letter,
                                    self.unique_words,
                                    self.words)
>>> a_words = WordsByInitial('a')
>>> a_words.append('ahora')
>>> a_words.append('adios')
>>> a_words.append('ahora')
>>> print(a_words)
<a: {'adios', 'ahora'} ['ahora', 'adios', 'ahora']>
Now we have this custom class
22. What if we want to use our class with
defaultdict?
23. How do we get the letter?
24. What if we want the default_factory
to receive the missing key?
25. class WordsDict(dict):
    def __missing__(self, key):
        res = self[key] = WordsByInitial(key)
        return res
initials = WordsDict()
def classify_words(text):
    for word in text.split():
        word = word.lower()
        initials[word[0]].append(word)
    for letter, letter_words in initials.items():
        print(letter, letter_words)
>>> classify_words(txt)
y <y: {'y'} ['y', 'y', 'y', 'y', 'y', 'y']>
s <s: {'sería', 'sus', 'set).', 'sabe', 'sw', 'su'} ['sus'...
r <r: {'r&d', 'repasaremos'} ['repasaremos', 'r&d']>
q <q: {'quedaría', 'que'} ['que', 'que', 'quedaría', 'que'...
...
Time to code our custom dict
26. Subclass overriding __missing__
27. Let's move on to something different
28. from collections import defaultdict
def wordcount(s):
    wc = defaultdict(int)
    for word in s.split():
        wc[word] += 1
    return wc
>>> wc = wordcount(txt)
>>> for word, num in wc.items():
...     print(word, num)
del 1
implementar 1
exclusivamente 1
más 4
y 6
...
>>> sorted(wc.items(), reverse=True, key=lambda x: x[1])[:3]
[('y', 6), ('de', 5), ('más', 4)]
Let’s count words
29. from collections import Counter
def wordcount(s):
    return Counter(s.split())
>>> wc = wordcount(txt)
>>> for word, num in wc.items():
...     print(word, num)
del 1
implementar 1
exclusivamente 1
más 4
y 6
...
>>> wc.most_common(3)
[('y', 6), ('de', 5), ('más', 4)]
collections.Counter
33. > Counter is a dict subclass for counting hashable objects
> Same interface as dict, but missing keys return 0 instead of raising KeyError
> Three additional methods: most_common, elements, subtract
> The update method has been overridden
> Support for mathematical operators: +, -, &, |
> Since Python 2.7
collections.Counter
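A small sketch of the Counter features listed above, using made-up sample data. All names (`c`, `tally`, `x`, `y` and so on) are illustrative.

```python
from collections import Counter

c = Counter("abracadabra")        # counts: a=5, b=2, r=2, c=1, d=1
missing = c["z"]                  # 0 instead of KeyError; "z" is not added
top = c.most_common(1)            # the single most frequent item

# update() adds counts instead of replacing values, unlike dict.update()
tally = Counter(a=1)
tally.update("aab")               # a: 1 + 2, b: 0 + 1
tally.subtract(a=5)               # counts may go negative: a becomes -2

# elements() repeats each item as many times as its (positive) count
expanded = sorted(Counter(a=2, b=1).elements())

# Operators work per key; results keep only positive counts
x, y = Counter(a=3, b=1), Counter(a=1, b=2)
added = x + y                     # sum of counts
subtracted = x - y                # difference, non-positive counts dropped
common = x & y                    # minimum of counts
either = x | y                    # maximum of counts

print(top, dict(tally), expanded, added, subtracted, common, either)
```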
34. Let's go back to words classification
35. from collections import defaultdict
>>> initials = defaultdict(list)
def classify_words(text):
    for word in text.split():
        word = word.lower()
        initials[word[0]].append(word)
    for letter, letter_words in initials.items():
        print(letter, letter_words)
>>> classify_words(txt)
y ['y', 'y', 'y', 'y', 'y', 'y']
s ['sus', 'set).', 'sabe', 'su', 'sería', 'su', 'sw']
r ['repasaremos', 'r&d']
q ['que', 'que', 'quedaría', 'que', 'que']
...
Classify words with defaultdict
36. What if we only want to keep the
last three words for each letter?
37. from collections import defaultdict, deque
>>> initials = defaultdict(lambda: deque(maxlen=3))
def classify_words(text):
    for word in text.split():
        word = word.lower()
        initials[word[0]].append(word)
    for letter, letter_words in initials.items():
        print(letter, letter_words)
>>> classify_words(txt)
y deque(['y', 'y', 'y'], maxlen=3)
s deque(['sería', 'su', 'sw'], maxlen=3)
r deque(['repasaremos', 'r&d'], maxlen=3)
q deque(['quedaría', 'que', 'que'], maxlen=3)
...
collections.deque
39. >>> d = deque(maxlen=5)
>>> d.extend(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
>>> d
deque(['d', 'e', 'f', 'g', 'h'], maxlen=5)
>>> d.append('i')
>>> d
deque(['e', 'f', 'g', 'h', 'i'], maxlen=5)
>>> d.appendleft('Z')
>>> d
deque(['Z', 'e', 'f', 'g', 'h'], maxlen=5)
>>> d.rotate(3)
>>> d
deque(['f', 'g', 'h', 'Z', 'e'], maxlen=5)
>>> d.popleft()
'f'
>>> d
deque(['g', 'h', 'Z', 'e'], maxlen=5)
More on collections.deque
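Since the talk recommends a deque for implementing a queue, here is a minimal FIFO sketch. The names `queue`, `served` and `waiting` and the sample items are assumptions made for this example.

```python
from collections import deque

queue = deque()

# Enqueue at the right end
queue.append("first")
queue.append("second")
queue.append("third")

# Dequeue from the left end; popleft() is O(1),
# whereas list.pop(0) has to shift every remaining item
served = queue.popleft()
waiting = list(queue)

print(served, waiting)
```

The same object works as a stack by pairing `append()` with `pop()`, since both ends support constant-time insertion and removal.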
41. OPERATION                  AVERAGE        AMORTIZED WORST
    append('b')                O(1)*          O(1)*
    insert(index, 'b')         O(n)           O(n)
    Get item d[4]              O(1)           O(1)
    Set item d[4] = 'd'        O(1)           O(1)
    Delete item del d[4]       O(n)           O(n)
    extend(iterable)           O(k)*          O(k)*
    Check item 'b' in list     O(n)           O(n)
    Sort                       O(n log n)     O(n log n)
> Represented internally as an array
> *: Amortized cost. Individual ops may be really slow
> Ideal to implement stacks (LIFO)
list performance
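To make the "ideal for stacks" remark concrete, the sketch below uses a plain list as a LIFO stack and contrasts it with queue-style removal from the front. The names `stack`, `top` and `first` are illustrative.

```python
stack = []

# Push: append() works at the end of the underlying array, amortized O(1)
for item in ("a", "b", "c"):
    stack.append(item)

# Pop: removing the last item is O(1)
top = stack.pop()

# Removing from the front is valid but O(n),
# because every remaining item is shifted one slot left
first = stack.pop(0)

print(top, first, stack)
```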
42. Let’s move to a different example
43. CACHE = {}
def set_key(key, value):
    "Set a key value"
    CACHE[key] = value
def get_key(key):
    "Retrieve a key value from the cache, or None if not found"
    return CACHE.get(key, None)
>>> set_key("my_key", "the_value")
>>> print(get_key("my_key"))
the_value
>>> print(get_key("not_found_key"))
None
Let’s implement a SW cache
44. What if we want to limit its size?
45. from collections import OrderedDict
CACHE = OrderedDict()
MAX_SIZE = 3
def set_key(key, value):
    "Set a key value, removing oldest key if MAX_SIZE exceeded"
    CACHE[key] = value
    if len(CACHE) > MAX_SIZE:
        CACHE.popitem(last=False)
def get_key(key):
    "Retrieve a key value from the cache, or None if not found"
    return CACHE.get(key, None)
>>> set_key("my_key", "the_value")
>>> print(get_key("my_key"))
the_value
>>> print(get_key("not_found_key"))
None
>>> for value, key in enumerate("abcde", 1):  # assumed calls, not shown on the slide
...     set_key(key, value)
>>> CACHE
OrderedDict([('c', 3), ('d', 4), ('e', 5)])
collections.OrderedDict
48. > OrderedDict is a subclass of the built-in dict class
> Remembers the order in which keys were first inserted
> Updating a key does not modify its order
> Two additional methods: popitem, move_to_end
> Also supports reverse iteration using reversed
> Since Python 2.7
collections.OrderedDict
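The ordering rules above can be observed directly. This is a small sketch with made-up keys; note that `move_to_end` requires Python 3.2 or later.

```python
from collections import OrderedDict

od = OrderedDict([("a", 1), ("b", 2), ("c", 3)])

# Updating an existing key keeps its original position
od["a"] = 10
order_after_update = list(od)

# move_to_end() is the explicit way to reorder (Python 3.2+)
od.move_to_end("a")
order_after_move = list(od)

# Reverse iteration over the keys
reversed_keys = list(reversed(od))

# popitem(last=False) removes the oldest entry (FIFO);
# the default last=True removes the newest (LIFO)
oldest = od.popitem(last=False)

print(order_after_update, order_after_move, reversed_keys, oldest)
```

Calling `move_to_end(key)` on every read would turn a size-limited cache into a least-recently-used one, since `popitem(last=False)` would then evict the entry that has gone longest without access.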
49. And finally, one last example
50. class Color:
    def __init__(self, r, g, b):
        self.r = r
        self.g = g
        self.b = b
class Image:
    def __init__(self, w, h, pixels):
        self.w = w
        self.h = h
        self.pixels = pixels
    def rotate(self):
        pass
>>> pixels = [Color(127, 127, 127),
...           Color(127, 100, 100),
...           Color(127, 75, 75)]
>>> picture = Image(1280, 720, pixels)
Let’s implement an image
51. Do we really need a class?
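One possible answer, given that the talk lists namedtuple among the structures it covers, is to replace the data-only Color class with a namedtuple. This is a sketch of that idea and not necessarily the code the speaker shows; the variable names are illustrative.

```python
from collections import namedtuple

# A data-only class such as Color can be declared in one line
Color = namedtuple("Color", ["r", "g", "b"])

grey = Color(127, 127, 127)

# Fields are reachable by name and by position, and unpack like a tuple
red_channel = grey.r
red, green, blue = grey

# Instances are immutable; _replace() returns a modified copy
darker = grey._replace(r=100)

# _asdict() gives a mapping of field names to values
as_dict = grey._asdict()

print(grey, darker, dict(as_dict))
```

Image keeps a `rotate` method, so it still has a reason to be a regular class; Color only holds three values, which is the case namedtuple is meant for.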