Python dictionary
past, present, future
Dmitry Alimov
Senior Software Engineer
Zodiac Interactive
SPb Python Interest Group
Dictionary in Python
>>> d = {} # the same as d = dict()
>>> d['a'] = 123
>>> d['b'] = 345
>>> d['c'] = 678
>>> d
{'a': 123, 'c': 678, 'b': 345}
>>> d['b']
345
>>> del d['c']
>>> d
{'a': 123, 'b': 345}
Dictionary keys must be hashable
An object is hashable if it has a hash value which never changes during its lifetime
>>> d[list()] = 1
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
>>> d[set()] = 2
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'set'
>>> d[dict()] = 3
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'dict'
All of Python’s immutable built-in objects are hashable
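A common workaround, sketched here for Python 3, is to convert an unhashable container into an immutable counterpart before using it as a key. The variable names are illustrative only.

```python
# Immutable counterparts of list and set can serve as dictionary keys.
d = {}
d[tuple([1, 2, 3])] = 'from a list'
d[frozenset({1, 2, 3})] = 'from a set'

# A tuple is hashable only if every element it holds is hashable too.
try:
    d[(1, [2, 3])] = 'tuple holding a list'
except TypeError as exc:
    error = str(exc)
```

The same conversion has to be applied at lookup time, since the key stored in the dictionary is the tuple or frozenset, not the original container.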
import random
class A(object):
    def __init__(self, index):
        self.index = index
    def __eq__(self, other):
        return True
    def __hash__(self):
        return random.randint(0, 3)
    def __repr__(self):
        return 'A%d' % self.index
d = {A(0): 0, A(1): 1, A(2): 2}
print('keys: %s' % d.keys())
print('values: %s' % d.values())
for k in d:
    print('%s = %s' % (k, d.get(k, 'not found')))
Random hash is a bad idea
Run 1
keys: [A1, A2, A0]
values: [1, 2, 0]
A1 = 1
A2 = not found
A0 = 0
Run 2
keys: [A1, A0]
values: [2, 0]
A1 = not found
A0 = not found
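For contrast, a minimal sketch of a well-behaved key type: the hash is derived from the same immutable data that equality compares, so it never changes during the object's lifetime. The class `Point` and its attribute `_key` are hypothetical names used only for this illustration.

```python
class Point(object):
    """Hypothetical key type: equality and hash use the same immutable data."""

    def __init__(self, x, y):
        self._key = (x, y)  # treated as read-only after construction

    def __eq__(self, other):
        return isinstance(other, Point) and self._key == other._key

    def __ne__(self, other):  # needed on Python 2, harmless on Python 3
        return not self == other

    def __hash__(self):
        return hash(self._key)

    def __repr__(self):
        return 'Point%r' % (self._key,)


d = {Point(0, 0): 'origin', Point(1, 2): 'p'}
found = d.get(Point(1, 2), 'not found')
```

Here an equal but distinct `Point(1, 2)` finds the stored entry on every run, which the random-hash class cannot guarantee.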
Three kinds of slots in the table:
1) Unused
2) Active
3) Dummy
typedef struct {
    Py_ssize_t me_hash;
    PyObject *me_key;
    PyObject *me_value;
} PyDictEntry;
- Hash table
- Open addressing collision resolution strategy
- Initial size = 8
- Load factor = 2/3
- Growth rate = 2 or 4 (depending on the number of cells used)
- “/Include/dictobject.h”, “/Objects/dictobject.c”, “/Objects/dictnotes.txt”
Dictionary in CPython >2.1
ma_fill – number of non-NULL keys (Active + Dummy slots)
ma_used – number of Active items
ma_mask – table size - 1 (initially PyDict_MINSIZE - 1)
ma_lookup – lookup function (lookdict_string by default)
#define PyDict_MINSIZE 8
typedef struct _dictobject PyDictObject;
struct _dictobject {
    PyObject_HEAD
    Py_ssize_t ma_fill;
    Py_ssize_t ma_used;
    Py_ssize_t ma_mask;
    PyDictEntry *ma_table;
    PyDictEntry *(*ma_lookup)(PyDictObject *mp, PyObject *key,
                              long hash);
    PyDictEntry ma_smalltable[PyDict_MINSIZE];
};
Good hash functions are needed
>>> map(hash, [0, 1, 2, 3, 4])
[0, 1, 2, 3, 4]
>>> map(hash, ['abca', 'abcb', 'abcc', 'abcd', 'abce'])
[1540938117, 1540938118, 1540938119, 1540938112, 1540938113]
Modified FNV (Fowler–Noll–Vo) hash function for strings
“-R” option – turns on hash randomization, so that the __hash__() values of str,
bytes and datetime objects are “salted” with an unpredictable random value
>>> map(hash, ['abca', 'abcb', 'abcc', 'abcd', 'abce'])
[-218138032, -218138029, -218138030, -218138027, -218138028]
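The effect of the salt can be observed from Python 3 with the `PYTHONHASHSEED` environment variable, which fixes the seed that "-R" would otherwise randomize. This is a sketch only: the helper name `string_hash` is made up for the example, and it starts a fresh interpreter for each measurement.

```python
import os
import subprocess
import sys


def string_hash(seed, text='abca'):
    """Return hash(text) as computed by a fresh interpreter with the given seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.check_output(
        [sys.executable, '-c', 'print(hash(%r))' % text], env=env)
    return int(out.decode().strip())


# The same seed gives the same string hash in separate processes.
same_seed = (string_hash(0), string_hash(0))
```

With different seeds the values are expected to differ, which is what makes the order of string keys in a hash table hard to predict from outside.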
Hash functions
Collision resolution
A collision occurs when two distinct keys have the same hash value or map to
the same slot of the table.
Probing is a scheme for resolving collisions in an open-addressing hash table:
when the slot for a key is occupied, a sequence of alternative slots is examined
until the key or an Unused slot is found.
In CPython pseudo-random probing is used
PERTURB_SHIFT = 5
perturb = hash(key)
j = hash(key) % 2**i    # first slot to try; the table has 2**i slots
while True:
    j = (5 * j) + 1 + perturb
    perturb >>= PERTURB_SHIFT
    index = j % 2**i    # next slot to try
See “/Objects/dictobject.c”
CPython <2.2 used polynomial-based index computation
>>> PyDict_MINSIZE = 8
>>> key = 123
>>> hash(key) % PyDict_MINSIZE
3
Index computing
>>> mask = PyDict_MINSIZE - 1
>>> hash(key) & mask
3
Instead of the modulo operation, use bitwise "AND" with the mask
Get the least significant bits of the hash:
2 ** i = PyDict_MINSIZE, hence i = 3, i.e. the three least significant bits are enough
hash(123) = 123 = 0b1111011
mask = PyDict_MINSIZE - 1 = 8 - 1 = 7 = 0b111
index = hash(123) & mask = 0b1111011 & 0b111 = 0b011 = 3
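The equivalence of the two index computations can be checked with a short sketch. The helper names `index_by_modulo` and `index_by_mask` are illustrative; the equivalence relies on the table size being a power of two.

```python
PyDict_MINSIZE = 8
mask = PyDict_MINSIZE - 1


def index_by_modulo(h, size=PyDict_MINSIZE):
    # Python's % returns a non-negative result for a positive size
    # (unlike C's % for negative operands).
    return h % size


def index_by_mask(h, mask=mask):
    # Keeps only the least significant bits of the hash.
    return h & mask


samples = [123, 0, 7, 8, -1, -1297030748]
pairs = [(index_by_modulo(h), index_by_mask(h)) for h in samples]
```

Every pair holds two equal numbers, including those for negative hash values.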
mask = PyDict_MINSIZE - 1
index = hash(123) & mask
Dictionary in CPython >2.1
Dictionary initialization
PyDict_New():
    ma_used = 0
    ma_fill = 0
    ma_mask = PyDict_MINSIZE - 1
    ma_table = ma_smalltable
    ma_lookup = lookdict_string
Add an item
PyDict_SetItem() -> insertdict():
    ma_used += 1
    ma_fill += 1
    dictresize() if ma_fill >= 2/3 * size
Delete an item
PyDict_DelItem():
    ma_used -= 1
Add item
hash('!!!') = -1297030748
i = -1297030748 & 7 = 4
perturb = -1297030748
# i = (i * 5) + 1 + perturb
i = (4 * 5) + 1 + (-1297030748) = -1297030727
index = -1297030727 & 7 = 1
# perturb = perturb >> PERTURB_SHIFT
perturb = -1297030748 >> 5 = -40532211
# i = (i * 5) + 1 + perturb
i = (-1297030727 * 5) + 1 + (-40532211) = -6525685845
index = -6525685845 & 7 = 3
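The probing steps above can be sketched as a Python generator. This is an approximation under stated assumptions, not the CPython source: the name `probe_sequence` is made up, and `SIZE_T_BITS = 64` assumes a 64-bit build, where the C code keeps perturb in an unsigned variable so that it eventually shifts down to zero.

```python
from itertools import islice

PERTURB_SHIFT = 5
SIZE_T_BITS = 64  # assumption: 64-bit build


def probe_sequence(hash_value, mask):
    """Yield slot indexes in the order a CPython 2.x-style lookup tries them."""
    i = hash_value & mask
    yield i
    # Treat perturb as unsigned so that repeated shifts reach zero.
    perturb = hash_value & (2 ** SIZE_T_BITS - 1)
    while True:
        i = (i * 5) + 1 + perturb
        yield i & mask
        perturb >>= PERTURB_SHIFT


first_three = list(islice(probe_sequence(-1297030748, 7), 3))
```

For the key from the example, the first three slots come out as 4, 1 and 3, matching the manual calculation. Once perturb reaches zero the recurrence becomes i = 5*i + 1, which visits every slot of a power-of-two table.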
>>> d
{'python': 2, 'article': 4, '!!!': 5, 'dict': 3, 'a key': 1}
>>> d.__sizeof__()
248
Add item
Hash table resize
>>> d
{'!!!': 5, 'python': 2, 'dict': 3, 'a key': 1, 'article': 4, ';)': 6}
>>> d.__sizeof__()
1016
Hash table resize
PyDict_SetItem(...) {
    ...
    dictresize(mp, (mp->ma_used > 50000 ? 2 : 4) * mp->ma_used);
    ...
}
dictresize(PyDictObject *mp, Py_ssize_t minused) {
    ...
    /* Find the smallest table size > minused. */
    for (newsize = 8;
         newsize <= minused && newsize > 0;
         newsize <<= 1)
        ;
    ...
}
In the example:
ma_fill = 6 > (8 * 2 / 3)
ma_used = 6
Hence minused = 4 * 6 = 24, therefore newsize = 32
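The size calculation can be replayed in Python. The sketch below mirrors the two C fragments shown on this slide; the function name `new_table_size` is an assumption made for the example, and the overflow guard (`newsize > 0`) is omitted because Python integers do not overflow.

```python
def new_table_size(ma_used, min_size=8):
    """Return the table size chosen after a resize, following the C loop."""
    # Growth factor is 4 for small dicts and 2 above 50000 used entries.
    minused = (2 if ma_used > 50000 else 4) * ma_used
    newsize = min_size
    # Find the smallest power-of-two size strictly greater than minused.
    while newsize <= minused:
        newsize <<= 1
    return newsize


size_after_sixth_item = new_table_size(6)
```

With ma_used = 6 this gives minused = 24 and a new size of 32, as in the example.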
Addition order
>>> d1 = {'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5}
>>> d2 = {'three': 3, 'two': 2, 'five': 5, 'four': 4, 'one': 1}
>>> d1 == d2
True
>>> d1.keys()
['four', 'three', 'five', 'two', 'one']
>>> d2.keys()
['four', 'one', 'five', 'three', 'two']
The position of an item added to the dictionary depends on the items already in it,
so equal dictionaries can iterate in different orders
>>> 7.0 == 7 == (7+0j)
True
>>> d = {}
>>> d[7.0] = 'float'
>>> d
{7.0: 'float'}
>>> d[7] = 'int'
>>> d
{7.0: 'int'}
>>> d[7+0j] = 'complex'
>>> d
{7.0: 'complex'}
>>> type(d.keys()[0])
<type 'float'>
int, float, complex
>>> hash(7)
7
>>> hash(7.0)
7
>>> hash(7+0j)
7
>>> d = {'a': 1}
>>> for i in d:
...     d['new item'] = 123
...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: dictionary changed size during iteration
Adding item during iteration
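One way to avoid the error, sketched for Python 3, is to iterate over a snapshot of the keys so that the dictionary can grow inside the loop. The key names are illustrative.

```python
d = {'a': 1, 'b': 2}

# list(d) copies the keys up front; the loop no longer reads the live table.
for key in list(d):
    d[key + '_copy'] = d[key]
```

The snapshot costs one extra list of keys, and the loop does not see the items added while it runs.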
Delete item
dummy = PyString_FromString("<dummy key>");
Interesting case
ma_fill = 6 > (8 * 2 / 3), so dictresize() is called
ma_used = 1
hence minused = 4 * 1 = 4, therefore newsize = 8:
the table keeps its size and only the Dummy slots are dropped
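The case can be reproduced with a counter-only model. This is a simplified sketch: the class `TableCounters` is hypothetical, it stores no keys, and it assumes that every new key lands in an Unused slot rather than reusing a Dummy one.

```python
class TableCounters(object):
    """Tracks only the table size, ma_fill and ma_used."""

    def __init__(self):
        self.size = 8
        self.ma_fill = 0
        self.ma_used = 0

    def add_to_unused_slot(self):
        self.ma_fill += 1
        self.ma_used += 1
        if self.ma_fill * 3 >= self.size * 2:
            self._resize()

    def delete(self):
        self.ma_used -= 1  # the slot becomes Dummy, ma_fill is unchanged

    def _resize(self):
        minused = (2 if self.ma_used > 50000 else 4) * self.ma_used
        newsize = 8
        while newsize <= minused:
            newsize <<= 1
        self.size = newsize
        self.ma_fill = self.ma_used  # Dummy slots are not copied


t = TableCounters()
for _ in range(5):
    t.add_to_unused_slot()
for _ in range(5):
    t.delete()
before = (t.size, t.ma_fill, t.ma_used)
t.add_to_unused_slot()  # sixth filled slot triggers the resize
after = (t.size, t.ma_fill, t.ma_used)
```

Before the last insertion the counters are (8, 5, 0); afterwards they are (8, 1, 1), so the resize acts as a cleanup of Dummy slots rather than as growth.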
PyDictEntry ma_smalltable[8];
On x86 with 64 bytes per cache line and three 4-byte fields per entry:
64 / (4 * 3) = 5.333 entries
typedef struct {
    Py_ssize_t me_hash;
    PyObject *me_key;
    PyObject *me_value;
} PyDictEntry;
Cache locality and collisions
See “/Objects/dictnotes.txt”
Source      Access time
L1 Cache    1 ns
L2 Cache    4 ns
RAM         100 ns
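The cache-line arithmetic is easy to repeat for other field sizes. In the sketch below the 8-byte case is an assumption about 64-bit builds and is not taken from the slide; the function name is illustrative.

```python
CACHE_LINE_BYTES = 64
FIELDS_PER_ENTRY = 3  # me_hash, me_key, me_value


def entries_per_cache_line(field_bytes):
    """How many PyDictEntry structs fit into one cache line."""
    return CACHE_LINE_BYTES / float(field_bytes * FIELDS_PER_ENTRY)


per_line_32bit = entries_per_cache_line(4)
per_line_64bit = entries_per_cache_line(8)  # assumption: 8-byte fields
```

The fewer entries share a cache line, the more likely it is that a second probe touches a different line, which is why collisions in a small table are comparatively cheap.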
Open addressing vs separate chaining
Note: the illustration shows linear probing rather than the pseudo-random probing used in CPython
from collections import OrderedDict
- Internal dict
- Circular doubly linked list
- “/Lib/collections/”
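A short Python 3 usage sketch shows the operations that the linked list makes cheap: moving a key to the end and popping from the front. The keys and values are illustrative.

```python
from collections import OrderedDict

od = OrderedDict()
od['one'] = 1
od['two'] = 2
od['three'] = 3

od.move_to_end('one')           # relink 'one' at the tail of the list
first = od.popitem(last=False)  # unlink and return the head of the list
remaining = list(od)
```

After these calls `first` is `('two', 2)` and the remaining keys are `'three'` and `'one'`, in that order.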
Dictionary in CPython 3.5
- PEP 412 - Key-Sharing Dictionary
- The DictObject can be in one of two forms: combined table or split table
- Initial size = 4 (split table) or 8 (combined table)
- Maximum dictionary load = (2*n+1)/3
- Growth rate = used*2 + capacity/2
- “/Objects/dict-common.h”, “/Include/dictobject.h”, “/Objects/dictobject.c”,
“/Objects/dictnotes.txt”
typedef struct {
    Py_hash_t me_hash;
    PyObject *me_key;
    PyObject *me_value; /* only meaningful for combined tables */
} PyDictKeyEntry;
struct _dictkeysobject {
    Py_ssize_t dk_refcnt;
    Py_ssize_t dk_size;
    dict_lookup_func dk_lookup;
    Py_ssize_t dk_usable;
    PyDictKeyEntry dk_entries[1];
};
typedef struct {
    PyObject_HEAD
    Py_ssize_t ma_used;
    PyDictKeysObject *ma_keys;
    PyObject **ma_values;
} PyDictObject;
Combined table vs split table
Combined table
- For explicit dictionaries (dict() and {})
- ma_values = NULL, dk_refcnt = 1
- Never becomes a split-table dictionary
Split table
- For attribute dictionaries (the __dict__ attribute of an object)
- ma_values != NULL, dk_refcnt >= 1
- Only string (unicode) keys are allowed
- Values are stored in the ma_values array
- When a split dictionary is resized, it is converted to a combined table (but if
the resize results from storing an instance attribute, and there is only one
instance of the class, the dictionary is re-split immediately)
- Lookup function = lookdict_split
Dictionary in CPython 3.5
A new kind of slot:
1) Unused
2) Active
3) Dummy
4) Pending (me_key != NULL, me_key != dummy and me_value == NULL)
typedef struct {
    Py_hash_t me_hash;
    PyObject *me_key;
    PyObject *me_value; /* only meaningful for combined tables */
} PyDictKeyEntry;
Split table
Initial size = 4
Maximum dictionary load = (2*n+1)/3 = (2*4+1)/3 = 3,
i.e. initially ma_keys->dk_usable = 3
Split table
class A():
    def __init__(self):
        self.a = 1
        self.b = 2
        self.c = 3
a = A()
print(a.__dict__.__sizeof__()) # 72
setattr(a, 'd', 4) # re-split
print(a.__dict__.__sizeof__()) # 168
print({}.__sizeof__()) # 264
Initial size = 4
Maximum dictionary load = (2*n+1)/3 = (2*4+1)/3 = 3
Growth rate = used*2 + capacity/2 = 3*2 + 4/2 = 8, hence minused = 8,
therefore newsize = 16 (see dictresize)
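The arithmetic above can be replayed step by step. The function names below are made up for the sketch, and starting the size search at 8 (the combined-table initial size) is an assumption based on the resize loop shown earlier for CPython 2.x.

```python
MIN_COMBINED_SIZE = 8  # assumption: the size search starts at 8


def usable_entries(n):
    """Maximum dictionary load for a table of n slots: (2*n+1)/3."""
    return (2 * n + 1) // 3


def growth_target(used, capacity):
    """Growth rate from the slide: used*2 + capacity/2."""
    return used * 2 + capacity // 2


def new_size(minused):
    newsize = MIN_COMBINED_SIZE
    while newsize <= minused:
        newsize <<= 1
    return newsize


usable = usable_entries(4)
minused = growth_target(usable, 4)
newsize = new_size(minused)
```

For the initial split table of 4 slots this gives 3 usable entries, minused = 8 and a new size of 16, which agrees with the numbers on the slide.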
class A():
    def __init__(self):
        self.a = 1
        self.b = 2
        self.c = 3
a = A()
print(a.__dict__.__sizeof__()) # 72
b = A()
setattr(a, 'd', 4) # no re-split because of b
print(a.__dict__.__sizeof__()) # 456
Split table
Split table is converted to a combined table
Key differences between this implementation and CPython 2.x:
- The table can be split into two parts – the keys and the values
- A new kind of slot
- No more ma_smalltable embedded in the dict
- General dictionaries are slightly larger
- All object dictionaries of a single class can share a single key-table, saving
about 60% memory for such cases (according to
Bugs still happen: unbounded memory growth when resizing split-table dicts
Hash functions in CPython 3.5
SipHash for strings and bytes (>= CPython 3.4)
- Resistant against hash flooding DoS attacks
- Successfully used in many other languages
Slightly modified hash function for float
PEP 456 – Secure and interchangeable hash algorithm
hash(float("+inf")) == 314159,
hash(float("-inf")) == -314159, was -271828
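On Python 3.4 or later these values can be inspected at runtime through `sys.hash_info`. The sketch below only reads documented attributes; the variable names are illustrative.

```python
import sys

inf_hash = hash(float('inf'))
neg_inf_hash = hash(float('-inf'))

# Name of the string/bytes hash algorithm this interpreter was built with.
algorithm = sys.hash_info.algorithm
```

The value of `algorithm` depends on the build, so it is better printed and checked than assumed.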
OrderedDict in CPython 3.5
- Doubly-linked-list
- od_fast_nodes hash table that mirrors the od_dict table
- “/Include/odictobject.h”, “/Objects/odictobject.c”
Alternative versions
Dictionary in PyPy
- Starting from PyPy 2.5.0 – ordereddict is used by default
- Initial size = 16
- Load factor up to 2/3
- Growth rate = 4 (up to 30000 items) or 2
- If a lot of items are deleted the compaction is performed
- “/rpython/rtyper/lltypesystem/”
struct dicttable {
    int num_live_items;
    int num_ever_used_items;
    int resize_counter;
    variable_int *indexes; // byte, short, int, long
    dictentry *entries;
    ...
}
struct dictentry {
    PyObject *key;
    PyObject *value;
    long hash;
    bool valid;
}
Dictionary in PyPy
struct dicttable {
    variable_int *indexes;
    dictentry *entries;
    ...
}
FREE = 0
DELETED = 1
VALID_OFFSET = 2
PyDictionary in Jython
- Based on ConcurrentHashMap
- Separate chaining collision resolution
- Initial size = 16, load factor = 0.75, growth rate = 2
- Segments and thread safety
PythonDictionary in IronPython
- Based on Dictionary (.NET)
- Separate chaining collision resolution
- Initial size = 0, load factor = 1.0
- Rehashing if the number of collisions >= 100
- Growth rate = 2 (the new size is the next higher prime number taken from a set of
primes = {3, 7, 11, 17, 23, 29, 37, 47, 59, 71, 89, 107,… , 4999559, 5999471, 7199369})
Raymond Hettinger is happy
Dictionary in CPython 3.6
typedef struct {
    Py_hash_t me_hash;
    PyObject *me_key;
    PyObject *me_value; /* only meaningful for combined tables */
} PyDictKeyEntry;
typedef struct {
    PyObject_HEAD
    Py_ssize_t ma_used; /* number of items in the dictionary */
    uint64_t ma_version_tag; /* unique, changes when dict modified */
    PyDictKeysObject *ma_keys;
    PyObject **ma_values;
} PyDictObject;
- ma_version_tag is added (PEP 509 – Add a private version to dict)
- Initial size = 8 (for split table too)
- Maximum dictionary load = (2*n)/3
- Contributed by INADA Naoki in
Four kinds of slots in the table:
1) Unused (index == DKIX_EMPTY == -1)
2) Active (index >= 0 , me_key != NULL and me_value != NULL)
3) Dummy (index == DKIX_DUMMY == -2, only for combined table)
4) Pending (index >= 0 , me_key != NULL and me_value == NULL, only for split table)
Dictionary in CPython 3.6
- Added dk_nentries and dk_indices
struct _dictkeysobject {
    Py_ssize_t dk_refcnt;
    Py_ssize_t dk_size; /* Size of the hash table (dk_indices) */
    dict_lookup_func dk_lookup; /* Function to lookup in dk_indices */
    Py_ssize_t dk_usable; /* Number of usable entries in dk_entries */
    Py_ssize_t dk_nentries; /* Number of used entries in dk_entries */
    union {
        int8_t as_1[8];
        int16_t as_2[4];
        int32_t as_4[2];
#if SIZEOF_VOID_P > 4
        int64_t as_8[1];
#endif
    } dk_indices;
    PyDictKeyEntry dk_entries[dk_usable]; /* using DK_ENTRIES macro */
};
Dictionary in CPython 3.6
(Combined table)
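The combined-table layout can be illustrated with a toy model: a sparse table of small integers (like dk_indices) that point into a dense, insertion-ordered list (like dk_entries). This is a sketch and not the CPython implementation: the class `CompactDict` is hypothetical, it uses plain linear probing instead of the perturb-based scheme, and it supports neither deletion nor split tables.

```python
EMPTY = -1  # plays the role of DKIX_EMPTY


class CompactDict(object):
    """Toy model: sparse index table plus dense, insertion-ordered entries."""

    def __init__(self, size=8):
        self.indices = [EMPTY] * size  # positions in self.entries, or EMPTY
        self.entries = []              # (hash, key, value) in insertion order

    def _slots(self, h):
        mask = len(self.indices) - 1
        i = h & mask
        while True:
            yield i
            i = (i + 1) & mask  # linear probing, for brevity only

    def _find(self, key, h):
        """Return (slot, entry position or None if the key is absent)."""
        for slot in self._slots(h):
            pos = self.indices[slot]
            if pos == EMPTY:
                return slot, None
            entry_hash, entry_key, _ = self.entries[pos]
            if entry_hash == h and entry_key == key:
                return slot, pos

    def _resize(self):
        # Only the small index table is rebuilt; entries stay where they are.
        self.indices = [EMPTY] * (len(self.indices) * 2)
        for pos, (h, _, _) in enumerate(self.entries):
            for slot in self._slots(h):
                if self.indices[slot] == EMPTY:
                    self.indices[slot] = pos
                    break

    def __setitem__(self, key, value):
        h = hash(key)
        slot, pos = self._find(key, h)
        if pos is not None:
            self.entries[pos] = (h, key, value)  # update keeps the position
            return
        # Keep the load at or below (2*n)/3 so an empty slot always exists.
        if (len(self.entries) + 1) * 3 > len(self.indices) * 2:
            self._resize()
            slot, _ = self._find(key, h)
        self.indices[slot] = len(self.entries)
        self.entries.append((h, key, value))

    def __getitem__(self, key):
        _, pos = self._find(key, hash(key))
        if pos is None:
            raise KeyError(key)
        return self.entries[pos][2]

    def keys(self):
        return [key for _, key, _ in self.entries]
```

Iteration walks the dense entries list, which is why the order of insertion is preserved, and a resize has to rebuild only the index table.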
Key differences between this implementation and CPython 3.5:
- Compact and ordered
- Added dk_indices with type, depending on the size of dictionary
- Added ma_version_tag (PEP 509)
- Initial size for split table is changed to 8
- Maximum dictionary load changed to (2*n)/3
- Deleting an item converts the dict to a combined table
- Preserving the order of **kwargs in a function (PEP 468) is implemented
- Preserving Class Attribute Definition Order (PEP 520) is implemented
- The memory usage of the new dict() is between 20% and 25% smaller compared
to Python 3.5 (
1. The implementation of a dictionary in Python 2.7
2. Python hash calculation algorithms
3. PEP 412 - Key-Sharing Dictionary
4. PEP 456 - Secure and interchangeable hash algorithm
5. Mirror of the CPython repository
6. Faster, more memory efficient and more ordered dictionaries on PyPy
7. PyDictionary (Jython API documentation)
8. Jython repository
9. Java theory and practice: Building a better HashMap
10. Back to basics: Dictionary part 2, .NET implementation
15. PEP 509 - Add a private version to dict
16. Compact and ordered dict
17. What’s New In Python 3.6
18. PEP 468 - Preserving the order of **kwargs in a function
19. PEP 520 - Preserving Class Attribute Definition Order
Q & A
SPb Python Interest Group
Additional slides
Separate chaining collision resolution
Open addressing collision resolution
(pseudo-random probing)

  2016