[PyCon KR 2019] Pickle & Custom Binary Serializer

PyCon Korea 2019
Pickle & Custom Binary Serializer
Young Seok Tony Kim
(김영석)

About the speaker
www.aitrics.com
contact@aitrics.com
Soyoung Yoon <lovelife@kaist.ac.kr>
● Software Engineer
● Machine Learning Research Engineer
Master’s degree in Data Science

Objective of this talk

• Understand the general idea of how pickle works internally

• Know when pickle is the best choice and when to use other serialization methods

• Build a simple custom serializer

What is Pickle? What is serialization?
Python object → Serialization → 0110000101101100 … → De-serialization → Python object

Pickle module
Serialization: pickle.dump(), pickle.dumps()
De-serialization: pickle.load(), pickle.loads()

Other serialization methods
JSON example from Wikipedia
XML example from W3schools
Protobuf example from Google

Other serialization methods
                    Pickle                   JSON   Protobuf   MessagePack
Human-readable      No (except protocol 0)   Yes    No         No
Python-specific     Yes                      No     No         No
User-defined class  Yes                      No     No         No

Pickle API
pickle.dump(obj, file, protocol=None, *, fix_imports=True)
pickle.dumps(obj, protocol=None, *, fix_imports=True)
pickle.load(file, *, fix_imports=True, encoding="ASCII", errors="strict")
pickle.loads(bytes_object, *, fix_imports=True, encoding="ASCII", errors="strict")

• pickle.dump(obj, file): writes the pickled Python object to a file (10110001 01110111 010111…)
• pickle.dumps(obj): returns the pickled Python object as bytes, for example b'\x80\x03K\n.'
• pickle.load(file): reads a pickle from a file and returns the Python object
• pickle.loads(bytes_object): reads a pickle from bytes, for example b'\x80\x03K\n.', and returns the Python object
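The four API calls above can be tried with a minimal sketch like the one below. The sample dictionary is made up for illustration, and `io.BytesIO` stands in for a real file opened in binary mode; the last line reproduces the bytes shown on the slides by pickling the integer 10 with protocol 3.

```python
import io
import pickle

# Illustrative sample object (not from the talk).
obj = {"name": "PyCon KR", "year": 2019, "tags": ["pickle", "struct"]}

# dumps()/loads(): serialize to and from a bytes object in memory.
payload = pickle.dumps(obj)
restored = pickle.loads(payload)

# dump()/load(): serialize to and from a binary file-like object.
# io.BytesIO stands in for open(path, "wb") / open(path, "rb").
buffer = io.BytesIO()
pickle.dump(obj, buffer)
buffer.seek(0)
restored_from_file = pickle.load(buffer)

# The integer 10 pickled with protocol 3: PROTO 3, BININT1 10, STOP.
ten = pickle.dumps(10, protocol=3)
print(ten)
```

Passing `protocol=` explicitly keeps the output stable across interpreter versions, because the default protocol has changed over time.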

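Pickle Protocols

Protocol 0
• Introduced in: -
• Relevant PEP: -
• What’s added: human-readable, original protocol

Protocol 1
• Introduced in: -
• Relevant PEP: -
• What’s added: old binary format

Protocol 2
• Introduced in: Python 2.3
• Relevant PEP: PEP 307
• What’s added: provides much more efficient pickling of new-style classes

Protocol 3
• Introduced in: Python 3.0
• Relevant PEP: What’s new in Python 3.0
• What’s added: explicit support for bytes objects

Protocol 4
• Introduced in: Python 3.4
• Relevant PEP: PEP 3154
• What’s added: support for very large objects, pickling more kinds of objects, and some data format optimizations

Protocol 5
• Introduced in: Python 3.8
• Relevant PEP: PEP 574
• What’s added: supports out-of-band data

Notes
• Protocol 0 was called “text mode”
• Protocol 1 was called “binary mode”
• Protocol 3 cannot be unpickled by Python 2.x
• Support for Unicode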

Pickle Protocols

• Pickle protocol opcodes never change. Only new ones are introduced.

• This makes sure old pickles continue to be readable forever.

• If an older unpickler tries to read a pickle generated by a newer protocol, it will either

  • Work well, if the pickle does not use any opcode from the newer protocol.

  • Explicitly give you an error by raising an exception (UnpicklingError for an unknown opcode, ValueError for an unsupported protocol number).
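The compatibility rules above can be checked with a small sketch. The sample object and the protocol number 99 are made up for illustration; the hand-written bytes imitate a pickle produced by a future protocol that the running interpreter does not know.

```python
import pickle

# Illustrative sample object (not from the talk).
obj = {"a": [1, 2, 3], "b": ("x", b"y")}

# Every protocol this interpreter supports round-trips the same object.
results = {
    protocol: pickle.loads(pickle.dumps(obj, protocol=protocol))
    for protocol in range(pickle.HIGHEST_PROTOCOL + 1)
}

# From protocol 2 on, a pickle starts with the PROTO opcode (0x80)
# followed by the protocol number.
header = pickle.dumps(obj, protocol=2)[:2]

# Hand-written pickle claiming the made-up protocol 99 (0x63),
# followed by BININT1 10 and STOP.
future_pickle = b"\x80\x63K\n."
try:
    pickle.loads(future_pickle)
    error = None
except ValueError as exc:
    # The unpickler refuses a protocol number it does not support.
    error = exc

print(header, repr(error))
```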

Opcode examples

Opcode Name   Opcode (Byte)
INT           I
LONG          L
LONG1         \x8a
BININT        J
BININT1       K
STRING        S
NONE          N
NEWTRUE       \x88
NEWFALSE      \x89
…             …
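The same opcode table can be read from the standard library. The sketch below looks up the names listed above in `pickletools.opcodes` and also prints the protocol in which each opcode was introduced; the variable names are illustrative.

```python
import pickletools

# pickletools.opcodes is a list of OpcodeInfo objects, one per opcode.
by_name = {op.name: op for op in pickletools.opcodes}

names = ["INT", "LONG", "LONG1", "BININT", "BININT1",
         "STRING", "NONE", "NEWTRUE", "NEWFALSE"]

# (name, one-character code, protocol that introduced the opcode)
rows = [(name, by_name[name].code, by_name[name].proto) for name in names]

for name, code, proto in rows:
    print(f"{name:<9} {code!r:<7} introduced in protocol {proto}")
```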

LONG (0)
‘L’ before and after the decimal value
HEX 4C 33 4C 0A 2E
ASCII L 3 L \n .
INT 76 51 76 10 46

LONG (0)
HEX 4C 31 32 33 34 35 36 37 38 39 30 … 4C 0A 2E
ASCII L 1 2 3 4 5 6 7 8 9 0 … L \n .
INT 76 49 50 51 52 53 54 55 56 57 48 … 76 10 46

Unpickling a LONG takes time quadratic in the number of digits.

FLOAT (0)
HEX 46 33 2E 31 34 0A 2E
ASCII F 3 . 1 4 \n .
INT 70 51 46 49 52 10 46
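The LONG and FLOAT byte sequences from the slides above can be fed straight to the unpickler. This sketch decodes the hex rows as given and lists the opcodes with `pickletools.genops`. It deliberately loads the slide bytes instead of calling `pickle.dumps(..., protocol=0)`, because newer Python versions may choose a different opcode for small integers.

```python
import pickle
import pickletools

# Hex rows copied from the LONG (0) and FLOAT (0) slides.
long_pickle = bytes.fromhex("4C 33 4C 0A 2E")          # L 3 L \n .
float_pickle = bytes.fromhex("46 33 2E 31 34 0A 2E")   # F 3 . 1 4 \n .

long_value = pickle.loads(long_pickle)
float_value = pickle.loads(float_pickle)

# genops yields (opcode info, decoded argument, byte position).
long_ops = [(op.name, arg) for op, arg, _pos in pickletools.genops(long_pickle)]
float_ops = [(op.name, arg) for op, arg, _pos in pickletools.genops(float_pickle)]

print(long_value, long_ops)
print(float_value, float_ops)
```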

TUPLE (0)
HEX 28 4C 30 4C 0A 4C 31 4C 0A 4C 32 4C 0A 74 70 30 0A 2E
ASCII ( L 0 L \n L 1 L \n L 2 L \n t p 0 \n .
INT 40 76 48 76 10 76 49 76 10 76 50 76 10 116 112 48 10 46

Step  Bytes   Opcode  Stack (bottom → top)  Memo
1     (       MARK    MARK                  -
2     L0L\n   LONG    MARK, 0               -
3     L1L\n   LONG    MARK, 0, 1            -
4     L2L\n   LONG    MARK, 0, 1, 2         -
5     t       TUPLE   (0,1,2)               -
6     p0\n    PUT     (0,1,2)               index 0: (0,1,2)
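The stack-and-memo walk-through above can be imitated with a toy interpreter. The sketch below is not how `pickle` itself is implemented: `toy_load` and `MARK` are hypothetical names, only the handful of opcodes used on these slides are supported, and `pickletools.genops` is used to split the bytes into opcodes.

```python
import pickletools

MARK = object()  # sentinel pushed by the MARK opcode


def toy_load(data: bytes):
    """Interpret a tiny subset of pickle opcodes with a stack and a memo."""
    stack, memo = [], {}
    for opcode, arg, _pos in pickletools.genops(data):
        name = opcode.name
        if name == "MARK":
            stack.append(MARK)
        elif name in ("LONG", "INT"):
            stack.append(arg)                      # push the decoded number
        elif name in ("TUPLE", "LIST"):
            # Collect everything above the topmost MARK into one object.
            start = max(i for i, item in enumerate(stack) if item is MARK)
            items = stack[start + 1:]
            del stack[start:]
            stack.append(tuple(items) if name == "TUPLE" else list(items))
        elif name == "PUT":
            memo[arg] = stack[-1]                  # remember the stack top
        elif name == "GET":
            stack.append(memo[arg])                # push a remembered object
        elif name == "APPEND":
            value = stack.pop()
            stack[-1].append(value)
        elif name == "STOP":
            return stack.pop()
        else:
            raise ValueError(f"unsupported opcode: {name}")
    raise ValueError("pickle ended without STOP")


tuple_result = toy_load(b"(L0L\nL1L\nL2L\ntp0\n.")
print(tuple_result)
```

Because PUT and GET share one memo dictionary, the same function can also rebuild objects that refer to themselves.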

LIST (0)
HEX 28 6C 70 30 0A 4C 30 4C 0A 61 4C 31 4C 0A 61 67 30 0A 61 2E
ASCII ( l p 0 \n L 0 L \n a L 1 L \n a g 0 \n a .
INT 40 108 112 48 10 76 48 76 10 97 76 49 76 10 97 103 48 10 97 46

Step  Bytes   Opcode  Stack (bottom → top)     Memo
1     (       MARK    MARK                     -
2     l       LIST    list1 = []               -
3     p0\n    PUT     list1                    index 0: list1
4     L0L\n   LONG    list1, 0                 index 0: list1
5     a       APPEND  list1 = [0]              index 0: list1
6     L1L\n   LONG    list1, 1                 index 0: list1
7     a       APPEND  list1 = [0, 1]           index 0: list1
8     g0\n    GET     list1, list1             index 0: list1
9     a       APPEND  list1 = [0, 1, list1]    index 0: list1
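The LIST example ends with a list that contains itself, which is what the GET opcode makes possible. A short sketch with the real unpickler, using the hex row from the slide, confirms this; `expected` is the same structure built by hand for comparison.

```python
import pickle

# Hex row copied from the LIST (0) slide.
list_pickle = bytes.fromhex(
    "28 6C 70 30 0A 4C 30 4C 0A 61 4C 31 4C 0A 61 67 30 0A 61 2E"
)

restored = pickle.loads(list_pickle)

# The same object built by hand: a list whose third element is the list itself.
expected = []
expected.append(0)
expected.append(1)
expected.append(expected)

print(restored[:2], restored[2] is restored)
```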

BININT2 (1)
HEX 4D 00 01 2E
ASCII M \x00 \x01 .
INT 77 0 1 46
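BININT2 stores its argument as a 2-byte little-endian unsigned integer, so the bytes 00 01 above decode to 256. A minimal sketch that checks this against both `struct` and the unpickler:

```python
import pickle
import struct

# Hex row copied from the BININT2 (1) slide: M \x00 \x01 .
bin_pickle = bytes.fromhex("4D 00 01 2E")

value = pickle.loads(bin_pickle)

# The two bytes after the opcode are an unsigned little-endian short.
(argument,) = struct.unpack("<H", bin_pickle[1:3])

# Pickling 256 with protocol 1 yields the same four bytes.
produced = pickle.dumps(256, protocol=1)

print(value, argument, produced)
```

Compared with the text opcodes of protocol 0, the binary form needs no digit-by-digit parsing and has a fixed size.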

PyCon
HEX
ASCII
INT
Picture from http://rahmonov.me/posts/python-decorators/

pickletools

• Official doc: https://docs.python.org/3/library/pickletools.html

• Usage
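A minimal usage sketch for `pickletools`, assuming a small list as the sample object: `pickletools.dis` prints a symbolic disassembly of a pickle, and `pickletools.optimize` returns an equivalent pickle with unused PUT opcodes removed. The same disassembly is available from the command line with `python -m pickletools <file>`.

```python
import io
import pickle
import pickletools

# Illustrative sample pickle (not from the talk).
payload = pickle.dumps([1, 2, 3], protocol=2)

# Disassemble into a string buffer instead of printing to stdout directly.
out = io.StringIO()
pickletools.dis(payload, out=out)
listing = out.getvalue()
print(listing)

# Drop unused memo entries; the result still unpickles to the same object.
smaller = pickletools.optimize(payload)
print(len(payload), len(smaller))
```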

PEP 574 -- Pickle protocol 5 with out-of-band data

• The new pickle protocol 5 can carry the extra metadata needed for out-of-band data buffers.

• A new PickleBuffer type lets __reduce_ex__ implementations return out-of-band data buffers.

• This reduces unnecessary memory copies.

https://github.com/numpy/numpy/issues/11161
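A rough sketch of out-of-band pickling, assuming Python 3.8 or newer. A plain `bytearray` wrapped in `pickle.PickleBuffer` stands in for a large array; real libraries would normally return the `PickleBuffer` from their own `__reduce_ex__`. With `buffer_callback`, the payload is handed to the caller instead of being copied into the pickle stream.

```python
import pickle

# Stand-in for a large buffer such as the memory of a numeric array.
data = bytearray(b"x" * 65536)

# In-band: the whole buffer is copied into the pickle stream.
in_band = pickle.dumps(pickle.PickleBuffer(data), protocol=5)

# Out-of-band: the pickle only records "a buffer goes here";
# the buffer itself is collected through buffer_callback.
buffers = []
out_of_band = pickle.dumps(
    pickle.PickleBuffer(data), protocol=5, buffer_callback=buffers.append
)

# The receiver must supply the same buffers, in the same order.
restored = pickle.loads(out_of_band, buffers=buffers)

print(len(in_band), len(out_of_band), len(buffers))
```

How the collected buffers travel to the receiver (shared memory, a separate socket write, and so on) is up to the application.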

Custom Binary Serializer

• I was implementing an ICLR 2016 paper: “Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding” by Song Han, Huizi Mao, William J. Dally

• This paper is about compressing neural networks.

• For example, compressing AlexNet from 233MB to 8.9MB without loss of accuracy

• The official repository did not contain Huffman coding

Your custom object (for example, a quantized sparse numpy array)
→ Custom serialization logic
→ bytearray object (1A FF 8B 9C)

Inside the custom serialization logic:
Binary string “0110110101010101000…” to bytearray object (1A FF 8B 9C)
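One way to turn a binary string into a bytearray is sketched below. This is an assumption about the approach, not the speaker's code: `bits_to_bytes` and `bytes_to_bits` are hypothetical helpers, and the one-byte header that stores the padding length is an arbitrary design choice that makes the conversion reversible when the bit count is not a multiple of 8.

```python
def bits_to_bytes(bits: str) -> bytearray:
    """Pack a string of '0'/'1' characters into a bytearray.

    Layout (an assumption of this sketch): the first byte holds the number
    of zero bits added as padding, then the packed bits follow.
    """
    pad = (8 - len(bits) % 8) % 8
    padded = bits + "0" * pad
    out = bytearray([pad])
    for i in range(0, len(padded), 8):
        out.append(int(padded[i:i + 8], 2))
    return out


def bytes_to_bits(data: bytes) -> str:
    """Inverse of bits_to_bytes: recover the original bit string."""
    pad = data[0]
    bits = "".join(f"{byte:08b}" for byte in data[1:])
    return bits[:len(bits) - pad] if pad else bits


# 00011010 11111111 10001011 10011100 is 1A FF 8B 9C in hex.
packed = bits_to_bytes("00011010111111111000101110011100")
print(packed.hex())

# A bit string whose length is not a multiple of 8 still round-trips.
odd = "0110110101010101000"
round_trip = bytes_to_bits(bits_to_bytes(odd))
```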

struct module

• The struct module performs conversions between Python values and C structs represented as Python bytes objects.

Python object → struct.pack() → bytes object (1A FF 8B 9C)
bytes object (1A FF 8B 9C) → struct.unpack() → Python object

https://docs.python.org/3/library/struct.html
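A short sketch of `struct.pack()` and `struct.unpack()`. The record layout (a 2-byte index, a 1-byte code, and a 4-byte float) is a made-up example of what a custom serializer might store; the `<` prefix selects little-endian byte order without padding.

```python
import struct

# Hypothetical record layout: unsigned short, unsigned char, float32.
RECORD = struct.Struct("<HBf")

packed = RECORD.pack(513, 7, 0.5)
index, code, weight = RECORD.unpack(packed)
print(RECORD.size, packed.hex(), (index, code, weight))

# A single 32-bit unsigned integer packed little-endian gives 1A FF 8B 9C.
word = struct.pack("<I", 0x9C8BFF1A)
(number,) = struct.unpack("<I", word)
print(word.hex(), hex(number))
```

Fixing the byte order in the format string keeps the output identical across machines, which matters once the bytes are written to disk or sent over a network.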

FAQ

• What is the difference between cPickle and pickle?

• What can be pickled? What cannot be pickled?

• When should we use pickle?

What is the difference between cPickle and pickle?
• _pickle is a pickle module implemented in C. It is faster than the Python implementation.

• _pickle was known as cPickle in Python 2.

• In Python 2, the cPickle and pickle modules were separate.

• In Python 3, the pickle module imports and uses _pickle, if available. Otherwise, it uses the Python implementation.

• TL;DR: Use Python 3 and you don’t need to worry about it.
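Whether the C implementation is in use can be checked at runtime. The sketch below assumes CPython's layout, where `pickle.Pickler` is the class from `_pickle` when the accelerator is available; on other Python implementations the import may fail, which the code treats as "pure Python".

```python
import pickle

try:
    import _pickle  # C accelerator; may be missing on some implementations
except ImportError:
    _pickle = None

# True when pickle.Pickler is the class provided by the C module.
uses_c_pickler = _pickle is not None and pickle.Pickler is _pickle.Pickler
print("C pickler in use:", uses_c_pickler)

# Either way, the public API behaves the same.
round_trip = pickle.loads(pickle.dumps({"k": [1, 2, 3]}))
```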

What can be pickled and unpickled?

• None, True, and False

• integers, floating point numbers, complex numbers

• strings, bytes, bytearrays

• tuples, lists, sets, and dictionaries containing only picklable objects

• Classes, functions (both built-in and user-defined) defined at the top level of a module (using def, not lambda)

• Instances of such classes whose __dict__ or the result of calling __getstate__() is picklable (see section Pickling Class Instances for details).
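The list above can be exercised with a quick round trip. The sample values are made up, and `try_pickle` is a hypothetical helper that reports the exception type instead of raising; a generator is used as one example of an object that cannot be pickled.

```python
import pickle

# One value for each built-in category named in the list above.
picklable = {
    "none": None,
    "flag": True,
    "integer": 42,
    "float": 3.14,
    "complex": 1 + 2j,
    "string": "text",
    "bytes": b"raw",
    "bytearray": bytearray(b"buf"),
    "tuple": (1, 2),
    "list": [1, 2],
    "set": {1, 2},
}
restored = pickle.loads(pickle.dumps(picklable))


def try_pickle(obj):
    """Return None on success, or the name of the exception raised."""
    try:
        pickle.dumps(obj)
        return None
    except Exception as exc:
        return type(exc).__name__


# Generators hold running interpreter state and are rejected.
generator_error = try_pickle(x for x in range(3))
print(restored == picklable, generator_error)
```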

Can a function be pickled?
10110001
01110111
010111…
function.pickle
File

Function content is not pickled

• Only the function’s name is pickled, along with the name of the module the function is defined in.

• Functions are pickled by name reference, not by value. This means you cannot pickle lambda functions.

• Similarly, classes, methods, and decorators (which are also functions) are pickled by reference, not by value.
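Pickling by reference can be observed directly. In this sketch `json.dumps` is just a convenient standard-library function to pickle: the resulting bytes contain the module and function names, and unpickling imports the module and looks the name up again. A lambda has no importable name, so pickling it fails.

```python
import json
import pickle

# Pickle a function: only "json" and "dumps" end up in the byte stream.
payload = pickle.dumps(json.dumps)
print(payload)

# Unpickling resolves the name again and returns the very same object.
restored = pickle.loads(payload)

# A lambda cannot be found by name, so pickling raises an error.
try:
    pickle.dumps(lambda x: x + 1)
    lambda_error = None
except (pickle.PicklingError, AttributeError) as exc:
    lambda_error = exc

print(restored is json.dumps, type(lambda_error).__name__)
```

A practical consequence is that the unpickling side must be able to import the same module, and it will run whatever version of the function it finds there.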

Summary / Takeaways

• Pickle is probably most useful when you want to quickly store Python objects and restore them later.

• JSON is more readable, simpler, and cross-platform.

• Protobuf is good for its performance, especially for network communication.

• MessagePack is an alternative to JSON when you want faster and smaller serialization.

• Only the function’s name is pickled, along with the name of the module the function is defined in. A function’s content is NOT pickled.

• When building a custom serializer, I recommend looking at the struct module in Python.

References
• Official Python documentation

• https://docs.python.org/3/library/pickle.html

• CPython source code (Great content. The extensive comments are quite easy to understand.)

• https://github.com/python/cpython/blob/master/Lib/pickle.py

• https://github.com/python/cpython/blob/master/Lib/pickletools.py

pickle.py
(from the CPython implementation)

Talks from AITRICS
• Writing Python web server REST API documentation quickly and easily (Yongseon Lee)
18th (Sun) 11:55 ~ 12:35
• Advanced Python testing techniques (Jaeman An)
18th (Sun) 11:55 ~ 12:35
• Django Query Optimization for real-time medical AI data processing (Soyoung Yoon)
18th (Sun) 13:55 ~ 14:35
• Pickle & Custom Binary Serializer (Young Seok Kim)
18th (Sun) 14:55 ~ 15:35