Extended Memory Access in PHP
SydPHP Meetup - Thursday April 27, 2017
Overview
• PHP was used to build a prototype data collection system for an
Intensive Care Unit in a large children’s hospital in Canada
• The system collected approximately 100,000 values per second from
about 40 bed spaces
• Data was compressed to a custom format
• Arrays were manipulated to reduce their size/memory footprint
Data collection in an Intensive Care Unit
• Collecting data continuously from 42 beds (expanding to
~100 over the next year)
• Data collected for research purposes
• Approximately 30 different measurements from each patient
• ECG (500Hz)
• EEG (1KHz)
• Heart Rate (1Hz)
• etc.
Scale of data collection
• ~35 beds occupied at any time
• ~2000 samples per second per patient from 30 different sensors
• Total of ~70,000 samples per second
• 31.5 million seconds in a year
• 2.2 trillion samples per year
• Each value has a time with millisecond precision
• Data arrives as JSON messages (polled from a RabbitMQ queue)
• ~100TB/year (~30GB/day)
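These figures can be sanity-checked with a few lines of PHP, using the rounded numbers from this slide:

```php
<?php
// Sanity check of the scale figures above, using the slide's rounded numbers.
$beds           = 35;                      // beds occupied at any time
$samplesPerBed  = 2000;                    // samples per second per patient
$samplesPerSec  = $beds * $samplesPerBed;  // ~70,000
$secondsPerYear = 365 * 24 * 3600;         // ~31.5 million
$samplesPerYear = $samplesPerSec * $secondsPerYear;

printf("%d samples/s, %.1fM s/year, %.1f trillion samples/year\n",
    $samplesPerSec, $secondsPerYear / 1e6, $samplesPerYear / 1e12);
// → 70000 samples/s, 31.5M s/year, 2.2 trillion samples/year
```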
Motivation – Physiological Data Collection
• Apply engineering, data science and computer science approaches to
physiological data collection, storage and analysis in the CCU
• Capture → Store → Structure → Analyse all physiological data
• Accelerate research by “unlocking the data” and making it more usable
• Make this process cost effective
• Improve patient care by bringing real time insight to the bedside
Streaming Data • Data Retention • Rapid Data Retrieval
Retaining Physiological data for research and analysis
Large, labelled databases are required for some kinds of analysis
Storage Options for Physiological Time Series
• InfluxDB
• Raw Binary Data
• Relational Databases (MySQL, Postgres)
• MongoDB
• Physiological file formats (e.g. PhysioNet)
• CSV
• Zipped CSV
Storage Options - InfluxDB
• Specifically designed for time series storage
• Recently introduced compression
• Relatively slow performance (we are I/O bound)
• Required 3 to 4 TB of disk space per year
Storage Options - MySQL
• MySQL version 5.6.4 or later allows columns with fractional-second time
datatypes
• Each row is at least 8 bytes:
• 4 bytes for the timestamp
• 2 bytes for the fractional seconds
• 2 bytes for the value
• Rows may also need information about sensors and patients
• Plus indexes
• 17TB of MySQL tables per year!
• Can be compressed, but row based compression is not very effective
• Tried a custom partitioning scheme, ended up with 300,000 tables
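The 17 TB figure follows from the row size alone; a quick check (indexes and per-row InnoDB overhead push the real number higher):

```php
<?php
// 2.2 trillion rows at 8 bytes of column data each.
$rows        = 2.2e12;
$bytesPerRow = 4 + 2 + 2;   // timestamp + fractional seconds + value
printf("%.1f TB/year before indexes and row overhead\n",
    $rows * $bytesPerRow / 1e12);
// → 17.6 TB/year before indexes and row overhead
```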
Storage Options - MongoDB
• Storage format is essentially JSON (or BSON)
• Large disk footprint
• Scales well, but is disk hungry
• Allows for unstructured data, but our data is highly structured
• Very slow read rates (~100,000 values per second)
Storage options – pure binary data?
• Most values are 2 byte integers
• Each time could be stored in say 4 bytes
• 2.2 trillion time/value pairs at 6 bytes per value is 13.2 TB/year
• How do we index the binary data?
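A minimal sketch of this fixed-width layout using PHP's pack(). The field widths come from the slide; the endianness and the function/field names are assumptions for illustration:

```php
<?php
// Pack one time/value pair into 6 bytes: a 4-byte little-endian unsigned
// time (e.g. a millisecond offset) followed by a 2-byte unsigned value.
function packSample(int $time, int $value): string {
    return pack('Vv', $time, $value);   // 'V' = uint32 LE, 'v' = uint16 LE
}

function unpackSample(string $bin): array {
    return unpack('Vtime/vvalue', $bin);
}

$bin = packSample(123456, 512);
assert(strlen($bin) === 6);             // 6 bytes per pair
$pair = unpackSample($bin);             // ['time' => 123456, 'value' => 512]
```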
Storage Options – Custom File Format
• Created a custom compressed binary format called TSC
• Stored metadata about the compressed files in a MySQL database
• NOSQL solution (Not Only SQL)
• Stores each time-value pair in an average of ~2.1 bits
• 2.2 trillion values/year = 0.57 TB/year
• File headers provide a map of the contents
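The claimed sizes are easy to verify: raw binary at 6 bytes per pair (from the previous slide) versus ~2.1 bits per pair for TSC:

```php
<?php
// One year of data: 2.2 trillion time/value pairs.
$pairs = 2.2e12;
$rawTB = $pairs * 6 / 1e12;         // 6 bytes per pair, uncompressed
$tscTB = $pairs * 2.1 / 8 / 1e12;   // ~2.1 bits per pair in TSC
printf("raw: %.1f TB/year, TSC: %.2f TB/year (~%.0fx smaller)\n",
    $rawTB, $tscTB, $rawTB / $tscTB);
// → raw: 13.2 TB/year, TSC: 0.58 TB/year (~23x smaller)
```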
Comparative Data Size on Disk
[Chart: data size on disk for 1 billion values (GB), by format: MongoDB, MySQL, InfluxDB, Gzipped CSV, Physionet, TSC]
Benefits of using a customised approach
• Store 10 years of data (25 trillion “rows” of data) on a single server
• Server cost approx. $50k
• Smaller disk footprint means faster disks can be used (RAID array of NVME
SSDs)
• Backup costs are significantly reduced
• Data structures and indexes are tailored to our use case
• Parallelised and distributed decompression
• Still I/O bound, but can send 100 million values per second to a distributed
computing system
How did we do this?
• Physiological data is relatively predictable (and therefore compressible)
• Make a prediction about the next value, then encode the error
• Custom data compression was developed for each data type
• PHP was used to develop the prototype system
• PHP Prototype system has been in use for 14 months, relatively stable
• Now being rewritten in Java (StreamSets) and C++
Difference Encoding
• Because we are dealing with integer values we can precisely
encode the differences between the values
• These differences are more predictable than the original array
(lower entropy) and are therefore more compressible
• The range of values is smaller so the array can be represented
using fewer bits
• E.g., if the range of the array is <8, then all values can be
represented using 3 bits
Time HR Difference
1 120 0
2 117 -3
3 112 -5
4 107 -5
5 108 1
6 112 4
7 112 0
8 112 0
9 110 -2
10 112 2
11 116 4
12 118 2
13 119 1
14 117 -2
15 122 5
16 122 0
17 125 3
18 122 -3
19 126 4
20 125 -1
21 122 -3
22 117 -5
23 115 -2
24 120 5
25 118 -2
26 118 0
27 118 0
28 119 1
29 117 -2
30 122 5
31 122 0
32 120 -2
33 119 -1
34 120 1
35 115 -5
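The table above can be reproduced with a short difference encoder. On the first 20 HR values the raw range needs 5 bits per value while the differences fit in 4 (the helper names are illustrative):

```php
<?php
// Difference-encode an integer series and compute the bits needed
// to represent each value, based on the range of the array.
function diffEncode(array $values): array {
    $diffs = [];
    $prev  = $values[0];
    foreach ($values as $v) {
        $diffs[] = $v - $prev;   // first entry is always 0
        $prev    = $v;
    }
    return $diffs;
}

function bitsForRange(array $values): int {
    $range = max($values) - min($values) + 1;
    return max(1, (int)ceil(log($range, 2)));
}

$hr = [120, 117, 112, 107, 108, 112, 112, 112, 110, 112,
       116, 118, 119, 117, 122, 122, 125, 122, 126, 125];
$diffs = diffEncode($hr);   // [0, -3, -5, -5, 1, 4, ...] as in the table
echo bitsForRange($hr), " bits raw vs ", bitsForRange($diffs), " bits diffed\n";
// → 5 bits raw vs 4 bits diffed
```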
Packing data
• PHP uses a hash table (dictionary) for its arrays, which is very
memory-inefficient
• Requires approximately 40 bytes per array element
• If each value can be represented using just 3 bits, how can
we efficiently pack those values end to end in a binary file?
• PHP uses ~40MB to store one million elements in an array
• But 1 million elements @ 3 bits per element should only need
3 million bits (375KB)
• I.e., we can represent the array ~100 times more efficiently
bits range
1 2
2 4
3 8
4 16
5 32
6 64
7 128
8 256
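One way to pack n-bit values end to end is to accumulate bits into an integer and flush whole bytes. This sketch is not the production TSC packer, just an illustration of the idea:

```php
<?php
// Pack an array of unsigned $bits-wide integers end to end into a binary string.
function packBits(array $values, int $bits): string {
    $out = '';
    $acc = 0;        // bit accumulator
    $n   = 0;        // number of bits currently in $acc
    foreach ($values as $v) {
        $acc = ($acc << $bits) | ($v & ((1 << $bits) - 1));
        $n  += $bits;
        while ($n >= 8) {                     // flush complete bytes
            $n   -= 8;
            $out .= chr(($acc >> $n) & 0xFF);
            $acc &= (1 << $n) - 1;            // keep only the unflushed bits
        }
    }
    if ($n > 0) {                             // pad the final partial byte
        $out .= chr(($acc << (8 - $n)) & 0xFF);
    }
    return $out;
}

// 8 values x 3 bits = 24 bits = 3 bytes (vs ~320 bytes as a PHP array).
$packed = packBits([5, 2, 7, 0, 3, 6, 1, 4], 3);
echo strlen($packed), " bytes\n";   // → 3 bytes
```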
Notes on Speed
• PHP7 is approximately 3 times faster than PHP5 for these functions
• PHP7 can process ~1M array elements per second
• Further compression (e.g. gzip) is fast
But WHY?
• PHP is fast and effective for rapid prototyping
• Sometimes useful to preload data from a compressed binary file
• Caching information in a local script – minimise database server load
• Useful for command line scripts that may run for a long time (minutes?
hours? months?)
• Compressed arrays can be stored to disk en-masse, forming the beginnings
of a NOSQL system
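The "store compressed arrays to disk" idea can be sketched with pack() plus gzcompress(); the filename and the toy series are illustrative:

```php
<?php
// Pack 10,000 16-bit values, gzip the result, and persist it.
$values = [];
for ($i = 0; $i < 10000; $i++) {
    $values[] = 120 + ($i % 7) - 3;    // toy low-entropy series
}

$binary = pack('v*', ...$values);      // 2 bytes per value
$gz     = gzcompress($binary, 6);
file_put_contents('/tmp/series.bin.gz', $gz);

// Reading it back is the reverse: gzuncompress() then unpack().
$restored = array_values(
    unpack('v*', gzuncompress(file_get_contents('/tmp/series.bin.gz')))
);
assert($restored === $values);

printf("packed: %d KB, gzipped: %.1f KB\n",
    strlen($binary) / 1024, strlen($gz) / 1024);
```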
Why NOT?
• PHP is relatively slow – rewrite in C or Java
• PHP doesn’t have strict data types, which gets messy
• PHP’s garbage collection is … well … garbage
• This script runs as a daemon 24/7 – PHP7 stability is ok, but not great
Toy Example – Animating /r/place experiment
• Crowdsourced artwork created at reddit.com
• Archive of 50,000 image snapshots
• 1 million pixels per image (1000x1000px)
• Total of 50 billion pixels – each pixel is 3 bytes
(RGB)
• 150GB of data if stored as uncompressed bytes
• Much less data if only the changes (differences)
are stored
• Resulted in a 30MB binary file that contained
the same information
• 30MB differences file can be loaded into
“extended memory” quickly
<?php
ini_set('memory_limit', '1024M');
include "class.place.php";
include "config_spacescience.php";

$p = new place($archive_file);
$p->initiate_palette($palette);
$p->set_frame_details($frame_details);

$irand = mt_rand(0, 999);
$jrand = mt_rand(0, 999);
echo $p->pixelinfo($irand, $jrand);
?>
Information for pixel(840,76)
COLOR COUNT PERCENTAGE
Grey 19065 78.16%
Light Purple 2040 8.36%
Yellow/Green 1369 5.61%
Brown 825 3.38%
Light Grey 736 3.02%
White 150 0.61%
Light Green 103 0.42%
Turquoise 102 0.42%
Dark Grey 2 0.01%
Tan/Orange 0 0%
Light Blue 0 0%
Purple 0 0%
Green 0 0%
Blue 0 0%
Red 0 0%
Pink 0 0%
edited 15 times from frame 0 to 24391
first color White
last color Grey
Shannon Entropy of this pixel is 1.2407
Mode of this pixel is color 7 (Grey)
output
Example – Animating /r/place experiment
Pixel Activity – Lower right corner
Colour activity calculations performed for 50 billion pixels
PHP7 did this in about 35 minutes on a single thread
Fading the colours required calculations that spanned multiple frames
(i.e., each frame couldn’t be considered independently)
This required some kind of persistent (and large) memory structure in PHP
Conclusions
• PHP’s array structures consume a lot of memory
• Pack arrays into binary objects to minimise their RAM footprint
• Save only the differences between subsequent array elements to
minimise the number of bits required to represent the data
• Packed arrays can be further compressed using gzip or other built in
compression functions
• Packed arrays can be saved to disk for later retrieval
Thanks – Questions?
  • 3. Data collection in an Intensive Care Unit • Collecting data continuously from 42 beds (expanding to ~100 over the next year) • Data collected for research purposes • Approximately 30 different measurements from each patient • ECG (500 Hz) • EEG (1 kHz) • Heart Rate (1 Hz) • etc.
  • 4. Scale of data collection • ~35 beds occupied at any time • ~2000 samples per second per patient from 30 different sensors • Total of ~70,000 samples per second • 31.5 million seconds in a year • 2.2 trillion samples per year • Each value has a timestamp with millisecond precision • Data arrives as JSON messages (polled from RabbitMQ) • ~100 TB/year (~30 GB/day)
  • 5. Motivation – Physiological Data Collection • Apply engineering, data science and computer science approaches to physiological data collection, storage and analysis in the CCU • Capture → Store → Structure → Analyse all physiological data • Accelerate research by “unlocking the data” and making it more usable • Make this process cost effective • Improve patient care by bringing real time insight to the bedside
  • 6. Retaining physiological data for research and analysis • Streaming data • Data retention • Rapid data retrieval • Large, labelled databases are required for some kinds of analysis
  • 7. Storage Options for Physiological Time Series • InfluxDB • Raw binary data • Relational databases (MySQL, Postgres) • MongoDB • Physiological file formats (e.g. Physionet) • CSV • Zipped CSV
  • 8. Storage Options - InfluxDB • Specifically designed for time series storage • Recently introduced compression • Relatively slow performance (we are I/O bound) • Required 3 to 4 TB of disk space per year
  • 9. Storage Options - MySQL • MySQL version 5.6.4 or later allows columns with fractional-second time datatypes • Each row is at least 8 bytes: • 4 bytes for the timestamp • 2 bytes for the fractional seconds • 2 bytes for the value • Rows may also need information about sensors and patients • Plus indexes • 17 TB of MySQL tables per year! • Can be compressed, but row-based compression is not very effective • Tried a custom partitioning scheme, ended up with 300,000 tables
  • 10. Storage Options - MongoDB • Storage format is essentially JSON (or BSON) • Large disk footprint • Scales well, but is disk hungry • Allows for unstructured data, but our data is highly structured • Very slow read rates (~100,000 values per second)
  • 11. Storage options – pure binary data? • Most values are 2-byte integers • Each time could be stored in, say, 4 bytes • 2.2 trillion time/value pairs at 6 bytes per pair is 13.2 TB/year • How do we index the binary data?
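The 6-bytes-per-pair layout above can be sketched with PHP's built-in pack()/unpack(). The field widths follow the slide (4-byte time, 2-byte value), but the function names and the little-endian layout are illustrative assumptions, not the system's actual format:

```php
<?php
// Sketch: store each (time, value) pair in 6 bytes —
// a 32-bit timestamp plus a 16-bit sample value.

function packPairs(array $pairs): string
{
    $bin = '';
    foreach ($pairs as [$time, $value]) {
        // 'V' = unsigned 32-bit little-endian, 'v' = unsigned 16-bit
        $bin .= pack('Vv', $time, $value);
    }
    return $bin;
}

function unpackPairs(string $bin): array
{
    $pairs = [];
    foreach (str_split($bin, 6) as $chunk) {
        $u = unpack('Vtime/vvalue', $chunk);
        $pairs[] = [$u['time'], $u['value']];
    }
    return $pairs;
}

$pairs = [[0, 120], [2, 117], [4, 112]];
$bin = packPairs($pairs);
// 3 pairs x 6 bytes = 18 bytes on disk, versus ~40 bytes
// per element if kept in a live PHP array
assert(strlen($bin) === 18);
assert(unpackPairs($bin) === $pairs);
```

Raw binary like this is compact, but as the slide asks, it carries no index: finding a time range still needs an external structure such as a file-header map or a metadata database.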
  • 12. Storage Options – Custom File Format • Created a custom compressed binary format called TSC • Stored metadata about the compressed files in a MySQL database • NoSQL solution (Not Only SQL) • Stores each time-value pair in an average of ~2.1 bits • 2.2 trillion values/year = 0.57 TB/year • File headers provide a map of the contents
  • 13. Comparative Data Size on Disk • [Bar chart: data size on disk for 1 billion values (GB), 0–60 GB axis, by format: MongoDB, MySQL, InfluxDB, Gzip CSV, Physionet, TSC]
  • 14. Benefits of using a customised approach • Store 10 years of data (25 trillion “rows” of data) on a single server • Server cost approx. $50k • Smaller disk footprint means faster disks can be used (RAID array of NVMe SSDs) • Backup costs are significantly reduced • Data structures and indexes are tailored to our use case • Parallelised and distributed decompression • Still I/O bound, but can send 100 million values per second to a distributed computing system
  • 15. How did we do this? • Physiological data is relatively predictable (and therefore compressible) • Make a prediction about the next value, then encode the error • Custom data compression was developed for each data type • PHP was used to develop the prototype system • The PHP prototype system has been in use for 14 months and is relatively stable • Now being rewritten in Java (StreamSets) and C++
  • 16. Difference Encoding • Because we are dealing with integer values we can precisely encode the differences between consecutive values • These differences are more predictable than the original array (lower entropy) and are therefore more compressible • The range of values is smaller, so the array can be represented using fewer bits • E.g., if the range of the array is <8, then all values can be represented using 3 bits

    Time  HR   Difference
    1     120    0
    2     117   -3
    3     112   -5
    4     107   -5
    5     108    1
    6     112    4
    7     112    0
    8     112    0
    9     110   -2
    10    112    2
    11    116    4
    12    118    2
    13    119    1
    14    117   -2
    15    122    5
    16    122    0
    17    125    3
    18    122   -3
    19    126    4
    20    125   -1
    21    122   -3
    22    117   -5
    23    115   -2
    24    120    5
    25    118   -2
    26    118    0
    27    118    0
    28    119    1
    29    117   -2
    30    122    5
    31    122    0
    32    120   -2
    33    119   -1
    34    120    1
    35    115   -5
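A minimal PHP sketch of the difference encoding on this slide: store the first value, then only the deltas between consecutive samples. The function names are hypothetical, and production code would also handle empty arrays:

```php
<?php
// Encode an integer series as (first value, array of deltas).
function diffEncode(array $values): array
{
    $deltas = [];
    $prev = $values[0];
    foreach (array_slice($values, 1) as $v) {
        $deltas[] = $v - $prev; // difference to the previous sample
        $prev = $v;
    }
    return [$values[0], $deltas];
}

// Reverse the encoding by cumulatively summing the deltas.
function diffDecode(int $first, array $deltas): array
{
    $values = [$first];
    foreach ($deltas as $d) {
        $values[] = end($values) + $d;
    }
    return $values;
}

// The first few heart-rate values from the table above
$hr = [120, 117, 112, 107, 108, 112];
[$first, $deltas] = diffEncode($hr);
assert($deltas === [-3, -5, -5, 1, 4]); // matches the Difference column
assert(diffDecode($first, $deltas) === $hr);
```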
  • 17. Packing data • PHP implements arrays as hash tables (dictionaries), which is very memory-inefficient • Requires approximately 40 bytes per array element • If each value can be represented in just 3 bits, how can we efficiently pack those values end to end into a binary string? • PHP uses ~40 MB to store one million elements in an array • But 1 million elements @ 3 bits per element should only need 3 million bits (375 KB) • I.e., we can represent the array ~100 times more efficiently

    bits  range
    1     2
    2     4
    3     8
    4     16
    5     32
    6     64
    7     128
    8     256
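The bit-packing idea can be sketched in PHP as follows. This string-of-bits approach is for clarity only (a real implementation would use bitwise operations on a byte buffer), and the +4 offset that maps signed deltas into a 3-bit range is an illustrative assumption:

```php
<?php
// Pack small non-negative integers end to end at a fixed bit width,
// instead of ~40 bytes per element in a live PHP array.
function bitPack(array $values, int $bits): string
{
    $bitstring = '';
    foreach ($values as $v) {
        // Fixed-width binary representation, MSB first
        $bitstring .= str_pad(decbin($v), $bits, '0', STR_PAD_LEFT);
    }
    // Pad to a whole number of bytes, then emit one byte per 8 bits
    $padded = (int)ceil(strlen($bitstring) / 8) * 8;
    $bitstring = str_pad($bitstring, $padded, '0');
    $packed = '';
    foreach (str_split($bitstring, 8) as $byte) {
        $packed .= chr(bindec($byte));
    }
    return $packed;
}

// Signed deltas in -4..3 can be offset into 0..7 to fit 3 bits
$deltas = [-3, 2, 0, -1, 3, 1];
$offset = array_map(function ($d) { return $d + 4; }, $deltas);
$packed = bitPack($offset, 3);
// 6 values x 3 bits = 18 bits -> 3 bytes
assert(strlen($packed) === 3);
```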
  • 18.
  • 19.
  • 20.
  • 21.
  • 22. Notes on Speed • PHP7 is approximately 3 times faster than PHP5 for these functions • PHP7 can process ~1M array elements per second • Further compression (e.g. gzip) is fast
  • 23. But WHY? • PHP is fast and effective for rapid prototyping • Sometimes useful to preload data from a compressed binary file • Caching information in a local script minimises database server load • Useful for command line scripts that may run for a long time (minutes? hours? months?) • Compressed arrays can be stored to disk en masse, forming the beginnings of a NoSQL system
  • 24. Why NOT? • PHP is relatively slow – rewrite in C or Java • PHP doesn’t have strict data types which gets messy • PHP’s garbage collection is … well … garbage • This script runs as a daemon 24/7 – PHP7 stability is ok, but not great
  • 25. Toy Example – Animating the /r/place experiment • Crowdsourced artwork created at reddit.com • Archive of 50,000 image snapshots • 1 million pixels per image (1000x1000 px) • Total of 50 billion pixels – each pixel is 3 bytes (RGB) • 150 GB of data if stored uncompressed • Much less data if only the changes (differences) are stored • Resulted in a 30 MB binary file that contained the same information • The 30 MB differences file can be loaded into “extended memory” quickly
  • 26.

    <?php
    ini_set('memory_limit', '1024M');
    include "class.place.php";
    include "config_spacescience.php";
    $p = new place($archive_file);
    $p->initiate_palette($palette);
    $p->set_frame_details($frame_details);
    $irand = mt_rand(0, 999);
    $jrand = mt_rand(0, 999);
    echo $p->pixelinfo($irand, $jrand);
    ?>

    Output:

    Information for pixel(840,76)
    COLOR          COUNT   PERCENTAGE
    Grey           19065   78.16%
    Light Purple    2040    8.36%
    Yellow/Green    1369    5.61%
    Brown            825    3.38%
    Light Grey       736    3.02%
    White            150    0.61%
    Light Green      103    0.42%
    Turquise         102    0.42%
    Dark Grey          2    0.01%
    Tan/Orange         0    0%
    Light Blue         0    0%
    Purple             0    0%
    Green              0    0%
    Blue               0    0%
    Red                0    0%
    Pink               0    0%
    edited 15 times from frame 0 to 24391
    first color White
    last color Grey
    Shannon Entropy of this pixel is 1.2407
    Mode of this pixel is color 7 (Grey)
  • 27. Example – Animating /r/place experiment
  • 28. Pixel Activity – Lower right corner
  • 29. • Colour activity calculations for 50 billion pixels were performed • PHP7 did this in about 35 minutes on a single thread • Fading the colours required calculations that spanned multiple frames (i.e., each frame couldn’t be considered independently) • This required some kind of persistent (and large) memory structure in PHP
  • 30. Conclusions • PHP’s array structures consume a lot of memory • Pack arrays into binary objects to minimise their RAM footprint • Save only the differences between subsequent array elements to minimise the number of bits required to represent the data • Packed arrays can be further compressed using gzip or other built-in compression functions • Packed arrays can be saved to disk for later retrieval
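The pack-then-compress pipeline from these conclusions might look like this in plain PHP, using only built-in functions. The 's*' signed-16-bit format matches the 2-byte values mentioned earlier; everything else is an illustrative sketch, not the actual TSC implementation:

```php
<?php
// Pack an integer array into a binary string with pack(), gzip it
// with PHP's built-in gzcompress(), and reverse both steps on read.

$values = range(0, 9999); // smoothly varying data compresses well

// 's*' packs every element as a signed 16-bit integer: 2 bytes each
$packed = pack('s*', ...$values);
assert(strlen($packed) === 2 * count($values));

// Further compression with the built-in zlib wrapper
$compressed = gzcompress($packed, 9);
assert(strlen($compressed) < strlen($packed));

// Round-trip: decompress, then unpack back to a PHP array
// (unpack() returns a 1-indexed array, hence array_values())
$restored = array_values(unpack('s*', gzuncompress($compressed)));
assert($restored === $values);
```

The $compressed string can be written straight to disk with file_put_contents(), giving a simple save/restore path for packed arrays.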