2. Overview
• PHP was used to build a prototype data collection system for an
Intensive Care Unit in a large children’s hospital in Canada
• The system collected approximately 100,000 values per second from
about 40 bed spaces
• Data was compressed to a custom format
• Arrays were manipulated to reduce their size/memory footprint
3. Data collection in an Intensive Care Unit
• Collecting data continuously from 42 beds (expanding to
~100 over the next year)
• Data collected for research purposes
• Approximately 30 different measurements from each patient
• ECG (500 Hz)
• EEG (1 kHz)
• Heart Rate (1 Hz)
• etc.
4. Scale of data collection
• ~35 beds occupied at any time
• ~2000 samples per second per patient from 30 different sensors
• Total of ~70,000 samples per second
• 31.5 million seconds in a year
• 2.2 trillion samples per year
• Each value has a time with millisecond precision
• Data arrives as JSON messages (polled from a RabbitMQ queue)
• ~100 TB/year (~30 GB/day)
5. Motivation – Physiological Data Collection
• Apply engineering, data science and computer science approaches to
physiological data collection, storage and analysis in the CCU
• Capture, store, structure and analyse all physiological data
• Accelerate research by “unlocking the data” and making it more usable
• Make this process cost effective
• Improve patient care by bringing real time insight to the bedside
6. Streaming Data, Data Retention, Rapid Data Retrieval
• Retaining physiological data for research and analysis
• Large, labelled databases are required for some kinds of analysis
7. Storage Options for Physiological Time Series
• InfluxDB
• Raw Binary Data
• Relational Databases (MySQL, Postgres)
• MongoDB
• Physiological file formats (eg Physionet)
• CSV
• Zipped CSV
8. Storage Options - InfluxDB
• Specifically designed for time series storage
• Recently introduced compression
• Relatively slow performance (we are I/O-bound)
• Required 3 to 4 TB of disk space per year
9. Storage Options - MySQL
• MySQL version 5.6.4 or later allows columns with fractional-second time
datatypes
• Each row is at least 8 bytes:
• 4 bytes for the timestamp
• 2 bytes for the fractional seconds
• 2 bytes for the value
• Rows may also need information about sensors and patients
• Plus indexes
• ≈17 TB of MySQL tables per year (8 bytes × 2.2 trillion rows)!
• Can be compressed, but row based compression is not very effective
• Tried a custom partitioning scheme, ended up with 300,000 tables
10. Storage Options - MongoDB
• Storage format is essentially JSON (or BSON)
• Large disk footprint
• Scales well, but is disk hungry
• Allows for unstructured data, but our data is highly structured
• Very slow read rates (~100,000 values per second)
11. Storage options – pure binary data?
• Most values are 2 byte integers
• Each time could be stored in, say, 4 bytes
• 2.2 trillion time/value pairs at 6 bytes per value is 13.2 TB/year
• How do we index the binary data?
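As a sketch of that pure-binary layout in PHP, assuming a 4-byte millisecond timestamp and a 2-byte unsigned value per sample (the exact field layout here is an assumption for illustration, giving the 6 bytes per pair mentioned above):

```php
<?php
// Sketch: store time/value pairs as raw binary, 6 bytes per pair.
// 'V' = 32-bit unsigned little-endian (time in ms), 'v' = 16-bit unsigned (value).
function pack_pairs(array $pairs): string {
    $bin = '';
    foreach ($pairs as [$timeMs, $value]) {
        $bin .= pack('Vv', $timeMs, $value);
    }
    return $bin;
}

function unpack_pairs(string $bin): array {
    $pairs = [];
    foreach (str_split($bin, 6) as $chunk) {
        $u = unpack('Vtime/vvalue', $chunk);
        $pairs[] = [$u['time'], $u['value']];
    }
    return $pairs;
}

$pairs = [[0, 120], [1000, 117], [2000, 112]];
$bin = pack_pairs($pairs);
assert(strlen($bin) === 18);           // 3 pairs x 6 bytes
assert(unpack_pairs($bin) === $pairs); // round-trips exactly
```

Compact, but a flat stream like this carries no index, hence the question above.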
12. Storage Options – Custom File Format
• Created a custom compressed binary format called TSC
• Stored metadata about the compressed files in a MySQL database
• NOSQL solution (Not Only SQL)
• Stores each time-value pair in an average of ~2.1 bits
• 2.2 trillion values/year ≈ 0.57 TB/year
• File headers provide a map of the contents
13. Comparative Data Size on Disk
[Bar chart: Data Size on Disk for 1 Billion Values (GB), comparing MongoDB, MySQL, InfluxDB, Gzipped CSV, Physionet and TSC; y-axis 0 to 60 GB]
14. Benefits of using a customised approach
• Store 10 years of data (25 trillion “rows” of data) on a single server
• Server cost approx. $50k
• Smaller disk footprint means faster disks can be used (RAID array of NVME
SSDs)
• Backup costs are significantly reduced
• Data structures and indexes are tailored to our use case
• Parallelised and distributed decompression
• Still I/O bound, but can send 100 million values per second to a distributed
computing system
15. How did we do this?
• Physiological data is relatively predictable (and therefore compressible)
• Make a prediction about the next value, then encode the error
• Custom data compression was developed for each data type
• PHP was used to develop the prototype system
• PHP Prototype system has been in use for 14 months, relatively stable
• Now being rewritten in Java (StreamSets) and C++
16. Difference Encoding
• Because we are dealing with integer values we can precisely
encode the differences between the values
• These differences are more predictable than the original array
(lower entropy) and are therefore more compressible
• The range of values is smaller so the array can be represented
using fewer bits
• E.g., if the range of the array is <8, then all values can be
represented using 3 bits
Time HR Difference
1 120 0
2 117 -3
3 112 -5
4 107 -5
5 108 1
6 112 4
7 112 0
8 112 0
9 110 -2
10 112 2
11 116 4
12 118 2
13 119 1
14 117 -2
15 122 5
16 122 0
17 125 3
18 122 -3
19 126 4
20 125 -1
21 122 -3
22 117 -5
23 115 -2
24 120 5
25 118 -2
26 118 0
27 118 0
28 119 1
29 117 -2
30 122 5
31 122 0
32 120 -2
33 119 -1
34 120 1
35 115 -5
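A minimal sketch of the encode/decode pair behind the table above (function names are illustrative, not from the production system):

```php
<?php
// Difference encoding: the first element is the raw value,
// each subsequent element is the delta from its predecessor.
function diff_encode(array $values): array {
    $out = [];
    $prev = 0;
    foreach ($values as $v) {
        $out[] = $v - $prev; // first "delta" is v - 0, i.e. the value itself
        $prev = $v;
    }
    return $out;
}

// Decoding is a running sum of the deltas.
function diff_decode(array $deltas): array {
    $out = [];
    $prev = 0;
    foreach ($deltas as $d) {
        $prev += $d;
        $out[] = $prev;
    }
    return $out;
}

$hr = [120, 117, 112, 107, 108];
$deltas = diff_encode($hr);           // [120, -3, -5, -5, 1]
assert(diff_decode($deltas) === $hr); // lossless round trip
```

The deltas cluster near zero (lower entropy), which is what makes the later packing and gzip steps effective.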
17. Packing data
• PHP uses a hash table (dictionary) for its arrays, which is very
memory-inefficient
• Requires approximately 40 bytes per array element
• If the array can be represented using just 3 bits per value, how
can we efficiently pack those values end to end in a binary file?
• PHP uses ~40 MB to store one million elements in an array
• But 1 million elements @ 3 bits per element should only need
3 million bits (375 KB)
• I.e., we can represent the array 100 times more efficiently
bits range
1 2
2 4
3 8
4 16
5 32
6 64
7 128
8 256
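As a sketch of the packing idea (an illustrative bit buffer, not the actual TSC packer): values whose range fits in n bits are written end to end, most significant bit first, into a plain binary string.

```php
<?php
// Pack small non-negative integers into a fixed number of bits each.
function bit_pack(array $values, int $bits): string {
    $buf = ''; $acc = 0; $nbits = 0;
    foreach ($values as $v) {
        $acc = ($acc << $bits) | ($v & ((1 << $bits) - 1));
        $nbits += $bits;
        while ($nbits >= 8) {           // emit full bytes as they fill up
            $nbits -= 8;
            $buf .= chr(($acc >> $nbits) & 0xFF);
        }
    }
    if ($nbits > 0) {                   // flush the final partial byte
        $buf .= chr(($acc << (8 - $nbits)) & 0xFF);
    }
    return $buf;
}

function bit_unpack(string $buf, int $bits, int $count): array {
    $out = []; $acc = 0; $nbits = 0; $i = 0;
    while (count($out) < $count) {
        while ($nbits < $bits) {        // refill the accumulator a byte at a time
            $acc = ($acc << 8) | ord($buf[$i++]);
            $nbits += 8;
        }
        $nbits -= $bits;
        $out[] = ($acc >> $nbits) & ((1 << $bits) - 1);
    }
    return $out;
}

$vals = [7, 0, 3, 5, 1, 6, 2, 4];           // all fit in 3 bits (range < 8)
$packed = bit_pack($vals, 3);
assert(strlen($packed) === 3);              // 24 bits -> 3 bytes, vs 8 PHP array slots
assert(bit_unpack($packed, 3, 8) === $vals);
```

Eight 3-bit values occupy 3 bytes instead of ~320 bytes of PHP array overhead.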
22. Notes on Speed
• PHP7 is approximately 3 times faster than PHP5 for these functions
• PHP7 can process ~1M array elements per second
• Further compression (e.g. gzip) is fast
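That further-compression step can be seen directly with PHP's bundled zlib functions; a minimal sketch (the sample data here is invented for illustration):

```php
<?php
// Further compress a packed binary string with the built-in zlib functions.
// Stand-in for packed deltas: some incompressible bytes plus a long zero run.
$packed = random_bytes(1024) . str_repeat("\x00", 4096);

$gz = gzcompress($packed, 6);          // DEFLATE, compression level 6
assert(gzuncompress($gz) === $packed); // lossless
assert(strlen($gz) < strlen($packed)); // the zero run compresses well
```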
23. But WHY?
• PHP is fast and effective for rapid prototyping
• Sometimes useful to preload data from a compressed binary file
• Caching information in a local script – minimise database server load
• Useful for command line scripts that may run for a long time
(minutes? hours? months?)
• Compressed arrays can be stored to disk en-masse, forming the beginnings
of a NOSQL system
24. Why NOT?
• PHP is relatively slow – rewrite in C or Java
• PHP doesn’t have strict data types which gets messy
• PHP’s garbage collection is … well … garbage
• This script runs as a daemon 24/7 – PHP7 stability is ok, but not great
25. Toy Example – Animating /r/place experiment
• Crowdsourced artwork created at reddit.com
• Archive of 50,000 image snapshots
• 1 million pixels per image (1000x1000px)
• Total of 50 billion pixels – each pixel is 3 bytes
(RGB)
• 150 GB of data if stored as uncompressed bytes
• Much less data if only the changes (differences)
are stored
• Resulted in a 30MB binary file that contained
the same information
• 30MB differences file can be loaded into
“extended memory” quickly
26. <?php
// Load the compressed /r/place archive and query a random pixel's history.
ini_set('memory_limit', '1024M');

include "class.place.php";
include "config_spacescience.php";

$p = new place($archive_file);
$p->initiate_palette($palette);
$p->set_frame_details($frame_details);

$irand = mt_rand(0, 999);
$jrand = mt_rand(0, 999);
echo $p->pixelinfo($irand, $jrand);
?>
Information for pixel(840,76)
COLOR COUNT PERCENTAGE
Grey 19065 78.16%
Light Purple 2040 8.36%
Yellow/Green 1369 5.61%
Brown 825 3.38%
Light Grey 736 3.02%
White 150 0.61%
Light Green 103 0.42%
Turquoise 102 0.42%
Dark Grey 2 0.01%
Tan/Orange 0 0%
Light Blue 0 0%
Purple 0 0%
Green 0 0%
Blue 0 0%
Red 0 0%
Pink 0 0%
edited 15 times from frame 0 to 24391
first color White
last color Grey
Shannon Entropy of this pixel is 1.2407
Mode of this pixel is color 7 (Grey)
29. Colour activity calculations performed for 50 billion pixels
• PHP7 did this in about 35 minutes on a single thread
• Fading the colours required calculations that spanned multiple
frames (i.e., each frame couldn't be considered independently)
• This required some kind of persistent (and large) memory
structure in PHP
30. Conclusions
• PHP’s array structures consume a lot of memory
• Pack arrays into binary objects to minimise their RAM footprint
• Save only the differences between subsequent array elements to
minimise the number of bits required to represent the data
• Packed arrays can be further compressed using gzip or other built-in
compression functions
• Packed arrays can be saved to disk for later retrieval
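The conclusions above can be strung together in one short, self-contained sketch (the packing format and file handling here are illustrative, not the production TSC code):

```php
<?php
// End-to-end sketch: difference-encode an integer array, pack it into a
// compact binary string, gzip it, save it to disk, then reverse each step.
$values = [120, 117, 112, 107, 108, 112, 112, 112, 110, 112];

// 1. Differences between subsequent elements (first element kept as-is).
$deltas = [];
$prev = 0;
foreach ($values as $v) { $deltas[] = $v - $prev; $prev = $v; }

// 2. Pack into 2-byte signed integers instead of ~40-byte PHP array slots.
$packed = pack('s*', ...$deltas);

// 3. Built-in compression, then persist to disk for later retrieval.
$file = tempnam(sys_get_temp_dir(), 'tsc');
file_put_contents($file, gzcompress($packed));

// Retrieval: uncompress, unpack, then a running sum restores the values.
$deltas2 = array_values(unpack('s*', gzuncompress(file_get_contents($file))));
$restored = [];
$prev = 0;
foreach ($deltas2 as $d) { $prev += $d; $restored[] = $prev; }
assert($restored === $values); // lossless round trip
unlink($file);
```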