Nitty Gritty of Data Serialisation

Talk delivered at CodeStar Night, discussing details of data representation and serialisation.

Published in: Software

  1. 1. Video
  2. 2. The nitty gritty of data serialisation Maxim Zaks @iceX33
  3. 3. _
  4. 4. One bit 2^1 true / false
  5. 5. _ _ _ _ _ _ _ _
  6. 6. One byte 2^8 = 256
  7. 7. {0..255} {-128..0..127} {ASCII / UTF-8}
  8. 8. _ _ _ _ _ _ _ _ | _ _ _ _ _ _ _ _
  9. 9. ___ | ___
  10. 10. Little endian Big endian
  11. 11. 005 | 000 000 | 005
  12. 12. Two bytes 2^16 = 65,536
  13. 13. {0 .. 65,535} {-32,768 .. 0 .. 32,767} {UTF-16}
  14. 14. ___ | ___ | ___ | ___
  15. 15. 2^32 = 4,294,967,296
  16. 16. {0 .. 4,294,967,295} {-2,147,483,648 .. 0 .. 2,147,483,647} {0, -0, NaN, Inf, -Inf, SinglePrecision} {UTF-32} {Pointer on arch-32}
  17. 17. ___ | ___ | ___ | ___ | ___ | ___ | ___ | ___
  18. 18. 2^64 = 18,446,744,073,709,551,616 (≈ 1.844674407E19)
  19. 19. {0 .. 18,446,744,073,709,551,615} {-9,223,372,036,854,775,808 .. 0 .. 9,223,372,036,854,775,807} {0, -0, NaN, Inf, -Inf, DoublePrecision} {Pointer on arch-64}
  20. 20. Real numbers / IEEE 754
  21. 21. https://en.wikipedia.org/wiki/Single-precision_floating-point_format
  22. 22. https://en.wikipedia.org/wiki/Double-precision_floating-point_format
  23. 23. https://en.wikipedia.org/wiki/Half-precision_floating-point_format
  24. 24. String / Text
  25. 25. https://en.wikipedia.org/wiki/ASCII
  26. 26. Unicode • Code • UTF-8 • UTF-16 • UTF-32 • etc...
  27. 27. https://en.wikipedia.org/wiki/Unicode
  28. 28. https://en.wikipedia.org/wiki/UTF-8
  29. 29. https://en.wikipedia.org/wiki/UTF-16
  30. 30. Human readable / text based serialisation • CSV • EDIFACT • XML • JSON • YAML • TOML • Java property file
  31. 31. Second level encoding
  32. 32. Int as text • Base 10 • Worst case 3x (* UTF-?) size • Parsing overhead
  33. 33. Real number as text • Min 3 bytes to ... • What you see is not what you get * • Parsing overhead
  34. 34. Bytes as Text • Base 16 (2^4) - 2x • Base 64 (2^6) - 1.3x
  35. 35. Form of structured data representation • Table (row / column based) • Tree / hierarchy • Graph
  36. 36. Bit packing techniques
  37. 37. Variable length quantity https://en.wikipedia.org/wiki/Variable-length_quantity
  38. 38. VLQ + ZigZag
  39. 39. Pack Multiple values in one byte • Array of booleans • Multiple enum values
  40. 40. Deduplication • Lookup table as in GIF • Most frequent values only
  41. 41. Storing Deltas
  42. 42. Memory alignment • Fast/direct access • Padding
  43. 43. Good Talk on Parquet and Arrow https://www.youtube.com/watch?v=dPb2ZXnt2_U
  44. 44. Binary serialisation • Format specification • Code • Schema • Tooling for human readable
  45. 45. Format specification • Support for endianness and archs • Bit-packed or memory aligned • Need for temporary memory allocation • Form of data representation (see schema) • Evolution strategies
  46. 46. Code • Code generation / library • Supported languages • Computational efficiency • Memory vs. speed
  47. 47. Schema • Schema based, or self described • Language and tooling • High level concepts in regards to data representation and evolution • Capabilities for documentation and meta data
  48. 48. Tooling for human readable • Convert to and from a text based serialisation format • GUI • Data importer (e.g. DB integration)
  49. 49. Benefits of binary formats • Compact size • Efficient read and write • Data representation • Human readable
  50. 50. Environment • Data centres consume 4% of world wide electricity
  51. 51. Thank you! Maxim Zaks @iceX33 Questions?
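
Slides 10 and 11 show the value 5 stored in two bytes as 005 | 000 (little endian) and 000 | 005 (big endian). A minimal sketch of the same layout, assuming Python's standard `struct` module and an unsigned 16-bit integer (the slides do not name a language or a signedness):

```python
import struct

value = 5  # the value shown on the endianness slide

# "<H" is a little-endian unsigned 16-bit integer, ">H" is the big-endian one.
little = struct.pack("<H", value)
big = struct.pack(">H", value)

print(list(little))  # [5, 0]  least significant byte first
print(list(big))     # [0, 5]  most significant byte first

# int.to_bytes / int.from_bytes express the same thing without format strings.
same_little = value.to_bytes(2, "little")
round_trip = int.from_bytes(big, "big")
```

A reader has to know which order the writer used: decoding `little` as big endian would give 1280 instead of 5.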
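
Slides 32 to 34 list the size cost of writing integers, real numbers and raw bytes as text. The numbers can be checked with a short sketch, assuming ASCII digits and Python's standard `base64` and `binascii` modules; the sample payload is made up for illustration:

```python
import base64
import binascii
from decimal import Decimal

# Int as text: the largest one-byte value needs three ASCII digits (3x).
one_byte_as_text = str(255).encode("ascii")

# Real number as text: "0.1" is three bytes, but the stored double is only
# the nearest binary fraction, so the text and the value differ.
shortest_real = "0.1".encode("ascii")
exact_match = Decimal(0.1) == Decimal("0.1")  # False

# Bytes as text: base 16 doubles the size, base 64 grows it by about a third.
payload = bytes(range(256)) * 3               # 768 arbitrary bytes
hex_text = binascii.hexlify(payload)          # 2 characters per byte
b64_text = base64.b64encode(payload)          # 4 characters per 3 bytes

print(len(one_byte_as_text))                  # 3
print(len(hex_text) / len(payload))           # 2.0
print(len(b64_text) / len(payload))
```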
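
Slides 37 and 38 name variable length quantities and ZigZag without fixing a variant. The sketch below assumes the little-endian 7-bit grouping used by LEB128 and Protocol Buffers varints (the Wikipedia VLQ article also describes a big-endian grouping); all function names are hypothetical:

```python
def zigzag_encode(n: int) -> int:
    """Map signed to unsigned so small magnitudes stay small: 0,-1,1,-2 -> 0,1,2,3."""
    return (n << 1) if n >= 0 else ((-n) << 1) - 1


def zigzag_decode(z: int) -> int:
    """Inverse of zigzag_encode."""
    return (z >> 1) ^ -(z & 1)


def varint_encode(u: int) -> bytes:
    """Encode a non-negative int in 7-bit groups, least significant group first.
    The high bit of each byte says whether another byte follows."""
    if u < 0:
        raise ValueError("varint_encode expects a non-negative integer")
    out = bytearray()
    while True:
        group = u & 0x7F
        u >>= 7
        if u:
            out.append(group | 0x80)  # more bytes follow
        else:
            out.append(group)         # last byte, high bit clear
            return bytes(out)


def varint_decode(data: bytes) -> int:
    """Decode one varint from the start of data."""
    result = 0
    shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7
    raise ValueError("truncated varint")


print(varint_encode(300).hex())                 # ac02
print(varint_encode(zigzag_encode(-1)).hex())   # 01
```

Without ZigZag, a small negative number in two's complement has all high bits set and would need the maximum number of groups; with it, -1 fits in one byte.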
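
Slides 39 and 41 mention packing several values into one byte and storing deltas. A minimal sketch of both ideas, assuming bit 0 of each byte holds the first boolean and that deltas are taken against the previous value starting from 0; the function names and sample data are hypothetical:

```python
def pack_bools(flags):
    """Store eight booleans per byte, first flag in the lowest bit."""
    out = bytearray((len(flags) + 7) // 8)
    for i, flag in enumerate(flags):
        if flag:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def unpack_bools(data, count):
    """Read back `count` booleans; the count must be stored or known separately."""
    return [bool((data[i // 8] >> (i % 8)) & 1) for i in range(count)]


def delta_encode(values):
    """Replace each value with its difference from the previous one."""
    previous = 0
    deltas = []
    for value in values:
        deltas.append(value - previous)
        previous = value
    return deltas


def delta_decode(deltas):
    """Rebuild the original values with a running sum."""
    total = 0
    values = []
    for delta in deltas:
        total += delta
        values.append(total)
    return values


flags = [True, False, True, True, False, False, False, False, True]
packed = pack_bools(flags)                       # 9 booleans in 2 bytes
deltas = delta_encode([1000, 1002, 1005, 1005])  # [1000, 2, 3, 0]
```

Deltas of slowly changing data are small numbers, which is what makes them a good fit for a variable length encoding.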
