© 2014 MapR Technologies
Agenda
• The Internet is turning upside down
• A story
• The last (mile) shall be first
• Time series on NoSQL
• Faster time series on NoSQL
• Summary
How the Internet Works
• Big content servers feed data across the backbone to
• Regional caches and servers feed data across neighborhood
transport to
• The “last mile”
• Bits are nearly conserved, $ are concentrated centrally
– But total $ mass at the edge is much higher
How The Internet Works
[Diagram: Server, Cache, Gateway, Switch, Firewall, and client (c1, c2) nodes]
Conservation of Bits Decreases Bandwidth
[Diagram: Server, Cache, Gateway, Switch, Firewall, and client (c1, c2) nodes]
Total Investment Dominated by Last Mile
[Diagram: Server, Cache, Gateway, Switch, Firewall, and client (c1, c2) nodes]
The Rub
• What's the problem?
– Speed (end-to-end latency, backbone bw)
– Feasibility (cost for consumer links)
– Caching
• What do we need?
– Cheap last-mile hardware
– Good caches
First:
An apology for going
off-script
Now, the story
By the 1840’s, the NY-SF
sailing time was down to
130-180 days
In 1851, the record was
set at 89 days by the
Flying Cloud
The difference was due
(in part) to big data
and a primitive kind of
time-series database
These charts were free …
If you donated your data
But how does this apply
today?
What has changed?
Where will it lead?
Things
Emitting data
How The Internet Works
[Diagram: Server, Cache, Gateway, Switch, Firewall, and client (c1, c2) nodes]
How the Internet is Going to Work
[Diagram: Server, Cache, Gateway, Switch, Controller, and machine (m1 to m6) nodes]
Where Will The $ Go?
[Diagram: Server, Cache, Gateway, Switch, Controller, and machine (m1 to m6) nodes]
Sensors
Controllers
The Problems
• Sensors and controllers have little processing or space
– SIM cards = 20 MHz processor, 128 kb of space (= 16 kB)
– Arduino mini = 15kB RAM (more EPROM)
– BeagleBone/Raspberry Pi = 500 MB RAM
• Sensors and controllers have little power
– Very common to power down 99% of the time
• Sensors and controllers often have very low bandwidth
– Mesh networks with base rates << 1Mb/s
– Power line networking
– Intermittent 3G/4G/LTE connectivity
What Do We Need to Do With a Time Series
• Acquire
– Measurement, transmission, reception
– Mostly not our problem
• Store
– We own this
• Retrieve
– We have to allow this
• Analyze and visualize
– We facilitate this via retrieval
Retrieval Requirements
• Retrieve by time-series, time range, tags
– Possibly pull millions of data points at a time
– Possibly do on-the-fly windowed aggregations
• Search by unstructured data
– Typically requires time-windowed faceting after search
– Also need to dive in with first kind of retrieval
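The first kind of retrieval (by series, time range, and tags, followed by an on-the-fly windowed aggregation) can be sketched as below. This is a minimal in-memory illustration, not how any particular database implements it; the `POINTS` list, the metric names, and the `retrieve` and `windowed_mean` helpers are all hypothetical.

```python
from collections import defaultdict

# Hypothetical in-memory store: each point is (metric, tags, timestamp, value).
POINTS = [
    ("sys.load", {"host": "a"}, 0, 1.0),
    ("sys.load", {"host": "a"}, 10, 3.0),
    ("sys.load", {"host": "b"}, 20, 5.0),
    ("sys.load", {"host": "a"}, 70, 2.0),
    ("net.packets", {"host": "a"}, 30, 9.0),
]


def retrieve(points, metric, start, end, tags=None):
    """Return (timestamp, value) pairs for one metric in [start, end) whose tags match."""
    tags = tags or {}
    return [
        (ts, value)
        for m, t, ts, value in points
        if m == metric
        and start <= ts < end
        and all(t.get(k) == v for k, v in tags.items())
    ]


def windowed_mean(samples, window):
    """Average samples in fixed-size windows; keys are window start times."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % window].append(value)
    return {start: sum(vs) / len(vs) for start, vs in sorted(buckets.items())}


selected = retrieve(POINTS, "sys.load", 0, 120, tags={"host": "a"})
means = windowed_mean(selected, window=60)
```

A real store would push the range and tag filters down into a table scan rather than filtering a list, but the shape of the query is the same.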
Storage choices and trade-offs
• Flat files
– Great for rapid ingest with massive data
– Handles essentially any data type
– Less good for data requiring frequent updates
– Harder to find specific ranges
• Traditional relational db
– Ingests up to tens of thousands of rows/sec; prefers well-structured (numerical) data; expensive
• Non-relational db: Tables (such as MapR tables in M7 or HBase)
– Ingests up to 100,000 rows/sec
– Handles wide variety of data
– Good for frequent updates
– Easily scanned in a range
Specific Example
• Consider a server farm
• Lots of system metrics
• Typically 100-300 stats / 30 s
• Loads, RPC’s, packets, requests/s
• Common to have 100 – 10,000 machines
The General Outline
• 10 samples / second / machine
x 1,000 machines
= 10,000 samples / second
• This is what OpenTSDB was designed to handle
• Install and go, but don’t test at scale
Specific Example
• Consider oil drilling rigs
• When drilling wells, there are *lots* of moving parts
• Typically a drilling rig makes about 10K samples/s
• Temperatures, pressures, magnetics,
machine vibration levels, salinity, voltage,
currents, many others
• Typical project has 100 rigs
The General Outline
• 10K samples / second / rig
x 100 rigs
= 1M samples / second
• But wait, there’s more
– Suppose you want to test your system
– Perhaps with a year of data
– And you want to load that data in << 1 year
• 100x real-time = 100M samples / second
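The arithmetic behind these rates can be checked in a few lines. The per-rig rate, rig count, and 100x factor come from the slide; the 365-day year and the variable names are assumptions made for illustration.

```python
SAMPLES_PER_RIG_PER_SECOND = 10_000
RIGS = 100
SPEEDUP = 100                        # load test data at 100x real time
SECONDS_PER_YEAR = 365 * 24 * 3600   # assumes a 365-day year

# Real-time rate for the whole project: 1M samples / second.
real_time_rate = SAMPLES_PER_RIG_PER_SECOND * RIGS

# Rate needed to replay history at 100x real time: 100M samples / second.
test_load_rate = real_time_rate * SPEEDUP

# One year of data, and how long it takes to load at the 100x rate.
samples_per_year = real_time_rate * SECONDS_PER_YEAR
load_days = samples_per_year / test_load_rate / 86_400
```

Under these assumptions a year of history is about 3.15 x 10^13 samples, and loading it at 100x real time takes roughly 3.65 days.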
How Should That Work?
[Diagram: Samples, Message queue, Collector, MapR table, Web service, Users]
A First Attempt
OpenTSDB is a distributed Time Series Database built on top of
HBase, enabling you …
– to store & index, as well as
– to query & plot
… metrics at scale.
Design Goals
• Distributed storage of metrics
• Metrics query fast and easy
• Scale out to thousands of machines and billions of data points
• No SPOF
Key concepts
(00:38, 56) mysql.com_delete schema=userdb
Key concepts
data point: (timestamp, value)
+ metric
+ tag: key=value
→ time series
Example TS
...
1409497082 327810227706 mysql.bytes_received schema=foo host=db1
1409497099 6604859181710 mysql.bytes_sent schema=foo host=db1
1409497106 327812421706 mysql.bytes_received schema=foo host=db1
1409497113 6604901075387 mysql.bytes_sent schema=foo host=db
...
UNIX epoch timestamp: $(date +%s)
a metric (often hierarchical)
two tags
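The example lines above can be split into the key concepts (timestamp, value, metric, tags) with a short parser. This is a sketch that follows the field order shown on the slide; `DataPoint`, `parse_line`, and `series_key` are hypothetical names, not part of OpenTSDB.

```python
from typing import NamedTuple


class DataPoint(NamedTuple):
    timestamp: int
    value: int
    metric: str
    tags: dict


def parse_line(line: str) -> DataPoint:
    """Parse 'timestamp value metric key=value ...' as laid out in the example."""
    timestamp, value, metric, *tag_fields = line.split()
    tags = dict(field.split("=", 1) for field in tag_fields)
    return DataPoint(int(timestamp), int(value), metric, tags)


def series_key(point: DataPoint) -> tuple:
    """A time series is identified by its metric plus its sorted tag pairs."""
    return (point.metric, tuple(sorted(point.tags.items())))


point = parse_line("1409497082 327810227706 mysql.bytes_received schema=foo host=db1")
```

Two data points belong to the same time series exactly when `series_key` returns the same tuple for both.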
© 2014 MapR Technologies 54
Declare metric
$ tsdb mkmetric mysql.bytes_sent mysql.bytes_received
metrics mysql.bytes_sent: [0, 0, 1]
metrics mysql.bytes_received: [0, 0, 2]
… or use --auto-metric
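The bracketed triples printed by mkmetric read like fixed-width numeric ids assigned in order. The toy dictionary below reproduces that output under the assumption that each metric name gets the next sequential 3-byte id; `UidTable` is a hypothetical stand-in for illustration, not OpenTSDB code.

```python
class UidTable:
    """Hypothetical name-to-id dictionary with fixed-width, sequential ids."""

    def __init__(self, width=3):
        self.width = width      # id width in bytes (assumed to be 3)
        self.by_name = {}

    def assign(self, name):
        """Return the id for a name, allocating the next one on first use."""
        if name not in self.by_name:
            next_id = len(self.by_name) + 1
            self.by_name[name] = list(next_id.to_bytes(self.width, "big"))
        return self.by_name[name]


uids = UidTable()
sent = uids.assign("mysql.bytes_sent")
received = uids.assign("mysql.bytes_received")
```

Storing a short id instead of the full metric name keeps row keys compact, which matters when the same name prefixes billions of rows.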
© 2014 MapR Technologies 55
Collect metric
• tcollector: gathers data from local
collectors, pushes it to TSDs, and
provides deduplication
• many collectors bundled
– General: iostat, netstat, etc.
– Others: MySQL, HBase, etc.
• … or roll your own
© 2014 MapR Technologies 56
The Whole Picture
[Diagram: overall architecture, backed by HBase or MapR-DB]
Wide Table Design: Point-by-Point
© 2014 MapR Technologies 58
Wide Table Design: Hybrid Point-by-Point + Blob
Insertion of data as blob makes original columns redundant
Non-relational, but you can query these tables with Drill
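A wide-table layout of this kind can be sketched as follows. The sketch assumes one row per series per hour, with the column name being the sample's offset in seconds within that hour, and a hypothetical blob encoding of (uint16 offset, float64 value) pairs; the actual on-disk formats used by OpenTSDB and the MapR extensions may differ.

```python
import struct

ROW_SPAN = 3600  # seconds covered by one row (assumed: one row per series per hour)


def row_key(series_id: str, timestamp: int) -> tuple:
    """Row = series id + start of the hour that contains the timestamp."""
    return (series_id, timestamp - timestamp % ROW_SPAN)


def column_offset(timestamp: int) -> int:
    """Column = offset in seconds from the start of the row's hour."""
    return timestamp % ROW_SPAN


def pack_blob(columns: dict) -> bytes:
    """Pack {offset: value} into one blob of (uint16 offset, float64 value) pairs."""
    return b"".join(struct.pack(">Hd", off, val) for off, val in sorted(columns.items()))


def unpack_blob(blob: bytes) -> dict:
    """Recover {offset: value} from a blob written by pack_blob."""
    return {off: val for off, val in struct.iter_unpack(">Hd", blob)}


# Point-by-point: each sample lands in its own column of the hourly row.
table = {}
for ts, value in [(7205, 1.5), (7200, 2.5)]:
    table.setdefault(row_key("s1", ts), {})[column_offset(ts)] = value

# Hybrid: the same row compacted into a single blob column.
blob = pack_blob(table[("s1", 7200)])
```

Once the blob is written, the per-sample columns carry no extra information, which is why the slide calls them redundant.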
© 2014 MapR Technologies 59
Status to This Point
• Each sample requires one insertion, compaction requires
another
• Typical performance on SE cluster
– 1 edge node + 4 cluster nodes
– 20,000 samples per second observed
– Would be faster on performance cluster, possibly not a lot
• Suitable for server monitoring
• Not suitable for large scale history ingestion
• Bulk load helps a little, but not much
• Still 1000x too slow for industrial work
© 2014 MapR Technologies 60
Speeding up OpenTSDB
20,000 data points per second per node in the test cluster
Why can’t it be faster?
Speeding up OpenTSDB: open source MapR extensions
Available on Github: https://github.com/mapr-demos/opentsdb
© 2014 MapR Technologies 62
Status to This Point
• 3600 samples require one insertion
• Typical results on SE cluster
– 1 edge node + 4 cluster nodes
– 14 million samples per second observed
– ~700x faster ingestion
• Typical results on performance cluster
– 2-4 edge nodes + 4-9 cluster nodes
– 110 million samples/s (4 nodes) to >200 million samples/s (8 nodes)
• Suitable for large scale history ingestion
• 30 million data points retrieved in 20s
• Ready for industrial work
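The effect of batching on insertion count can be illustrated with a small simulation. It assumes samples arrive once per second and that a batched writer buffers each series in memory and writes one row per hour; the series names and function names are hypothetical. The last line recomputes the observed speed-up from the two throughput figures quoted in the slides.

```python
def point_by_point_inserts(samples):
    """One table insertion per sample."""
    return len(samples)


def batched_inserts(samples, row_span=3600):
    """One insertion per (series, row window): each hourly row is written once."""
    rows = {(series, ts - ts % row_span) for series, ts in samples}
    return len(rows)


# Two hypothetical series sampled once per second for two hours.
samples = [(s, t) for s in ("rig1.temp", "rig1.pressure") for t in range(7200)]

slow = point_by_point_inserts(samples)
fast = batched_inserts(samples)
insert_reduction = slow / fast

# Observed on the SE cluster: 14 million vs. 20,000 samples per second.
observed_speedup = 14_000_000 / 20_000
```

The insertion count drops by a factor of 3600 in this idealized case, while the measured end-to-end gain is about 700x.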
© 2014 MapR Technologies 63
Key Results
• Ingestion is network limited
– Edge nodes are the critical resource
– Number of edge nodes defines a limit to scaling
• With enough edge nodes scaling is near perfect
• Performance of raw OpenTSDB is limited by stateless daemon
• Modified OpenTSDB can run 1000x faster
© 2014 MapR Technologies 64
Overall Ingestion Rate
[Chart: total ingestion rate (millions of points/second) vs. number of nodes (4, 5, 8, 9), for one and two ingestors]
Normalized Ingestion Rate
[Chart: ingestion rate per node (millions of points/second) vs. number of nodes (4, 5, 8, 9), for one and two ingestors]
Why MapR?
• MapR tables are inherently faster, safer
– Sustained > 1GB/s ingest rate in tests
• Mirror to M5 or M7 cluster to isolate analytics load
• Transaction logs involve frequent appends, many files
When is this All Wrong?
• In some cases, retrieval by series-id + time range is not sufficient
• May need very flexible retrieval of events based on text-like
criteria
• Search may be better than a classic time-series database
• Can scale Lucene-based search to > 1 million events / second
When is it Even More Right?
• In many industrial settings, data rates from individual sensors are
relatively high
– Latency to view is still measured in seconds, not sample points
• This allows batching at source
• Common requirement for highly variable sample rates
– 1 sample/s baseline, switching to 10 k samples/s
– Small batches during slow times are just fine since number of sensors is
constant
– Requires variable window sizes
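Batching at the source with variable window sizes can be sketched as a buffer that flushes either when it is full or when its oldest sample exceeds a latency bound. The class name, the 1000-sample limit, and the 5-second bound are assumptions chosen for illustration; the `batches` list stands in for a network send.

```python
class SourceBatcher:
    """Buffer samples at the source; flush when the batch is full or too old."""

    def __init__(self, max_samples=1000, max_age=5.0):
        self.max_samples = max_samples  # size bound, reached at high sample rates
        self.max_age = max_age          # latency bound in seconds, reached at low rates
        self.buffer = []
        self.batches = []               # stand-in for sending a batch upstream

    def add(self, timestamp, value):
        # Close the current window first if this sample would make it too old.
        if self.buffer and timestamp - self.buffer[0][0] >= self.max_age:
            self.flush()
        self.buffer.append((timestamp, value))
        if len(self.buffer) >= self.max_samples:
            self.flush()

    def flush(self):
        if self.buffer:
            self.batches.append(self.buffer)
            self.buffer = []


batcher = SourceBatcher()
for t in range(10):                  # slow phase: 1 sample/s for 10 s
    batcher.add(float(t), 0.0)
for i in range(2500):                # fast phase: 10 k samples/s for 0.25 s
    batcher.add(10.0 + i / 10_000, 0.0)
batcher.flush()
batch_sizes = [len(b) for b in batcher.batches]
```

During the slow phase the latency bound produces small batches; during the burst the size bound takes over, so the time window covered by each batch shrinks as the sample rate rises.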
Summary
• The internet is turning upside down
• This will make time series ubiquitous
• Current open source systems are much too slow
• We can fix that with modern NoSQL systems
– (I wear a red hat for a reason)
Questions
Thank You
Ted Dunning, Chief Application Architect
MapR Technologies
tdunning@mapr.com
tdunning@apache.org
@mapr, maprtech, mapr-technologies

Engage Usergroup 2024 - The Good The Bad_The UglyEngage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The Ugly
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about us
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - Infographic
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 

How the Internet of Things is Transforming Time Series Databases

  • 1. © 2014 MapR Technologies
  • 2. Agenda
      • The Internet is turning upside down
      • A story
      • The last (mile) shall be first
      • Time series on NoSQL
      • Faster time series on NoSQL
      • Summary
  • 3. How the Internet Works
      • Big content servers feed data across the backbone to
      • Regional caches and servers feed data across neighborhood transport to
      • The “last mile”
      • Bits are nearly conserved, $ are concentrated centrally
          – But total $ mass at the edge is much higher
  • 4. How The Internet Works [diagram: server, caches, gateways, switches, and firewalls fanning out to clients c1 and c2]
  • 5. Conservation of Bits Decreases Bandwidth [diagram: same topology as slide 4]
  • 6. Total Investment Dominated by Last Mile [diagram: same topology as slide 4]
  • 7. The Rub
      • What's the problem?
          – Speed (end-to-end latency, backbone bandwidth)
          – Feasibility (cost for consumer links)
          – Caching
      • What do we need?
          – Cheap last-mile hardware
          – Good caches
  • 8. First: An apology for going off-script
  • 9. Now, the story
  • 10. (no text on this slide)
  • 11. By the 1840s, the NY-SF sailing time was down to 130-180 days
  • 12. (no text on this slide)
  • 13. In 1851, the record was set at 89 days by the Flying Cloud
  • 14. The difference was due (in part) to big data and a primitive kind of time-series database
  • 15. (no text on this slide)
  • 16. (no text on this slide)
  • 17. (no text on this slide)
  • 18. These charts were free … if you donated your data
  • 19. But how does this apply today?
  • 20. What has changed? Where will it lead?
  • 21. (no text on this slide)
  • 22. (no text on this slide)
  • 23. (no text on this slide)
  • 24. (no text on this slide)
  • 25. (no text on this slide)
  • 26. (no text on this slide)
  • 27. (no text on this slide)
  • 28. (no text on this slide)
  • 29. (no text on this slide)
  • 30. (no text on this slide)
  • 31. Things
  • 32. Emitting data
  • 33. How The Internet Works [diagram: server, caches, gateways, switches, and firewalls fanning out to clients c1 and c2]
  • 34. How the Internet is Going to Work [diagram: server, caches, gateways, switches, and controllers connected to machines m1 through m6]
  • 35. Where Will The $ Go? [diagram: same topology as slide 34]
  • 36. Sensors
  • 37. Controllers
  • 38. The Problems
      • Sensors and controllers have little processing or space
          – SIM cards = 20 MHz processor, 128 kb of space = 16 kB
          – Arduino mini = 15 kB RAM (more EPROM)
          – BeagleBone/Raspberry Pi = 500 kB RAM
      • Sensors and controllers have little power
          – Very common to power down 99% of the time
      • Sensors and controllers often have very low bandwidth
          – Mesh networks with base rates << 1 Mb/s
          – Power line networking
          – Intermittent 3G/4G/LTE connectivity
  • 39. What Do We Need to Do With a Time Series
      • Acquire
          – Measurement, transmission, reception
          – Mostly not our problem
      • Store
          – We own this
      • Retrieve
          – We have to allow this
      • Analyze and visualize
          – We facilitate this via retrieval
  • 40. Retrieval Requirements
      • Retrieve by time series, time range, tags
          – Possibly pull millions of data points at a time
          – Possibly do on-the-fly windowed aggregations
      • Search by unstructured data
          – Typically requires time-windowed faceting after search
          – Also need to dive in with the first kind of retrieval
  • 41. Storage choices and trade-offs
      • Flat files
          – Great for rapid ingest with massive data
          – Handles essentially any data type
          – Less good for data requiring frequent updates
          – Harder to find specific ranges
      • Traditional relational db
          – Ingests up to tens of thousands of rows/sec; prefers well-structured (numerical) data; expensive
      • Non-relational db: tables (such as MapR tables in M7 or HBase)
          – Ingests up to 100,000 rows/sec
          – Handles wide variety of data
          – Good for frequent updates
          – Easily scanned in a range
  • 42. Specific Example
      • Consider a server farm
      • Lots of system metrics
      • Typically 100-300 stats / 30 s
      • Loads, RPCs, packets, requests/s
      • Common to have 100 to 10,000 machines
  • 43. The General Outline
      • 10 samples / second / machine x 1,000 machines = 10,000 samples / second
      • This is what OpenTSDB was designed to handle
      • Install and go, but don’t test at scale
  • 44. Specific Example
      • Consider oil drilling rigs
      • When drilling wells, there are *lots* of moving parts
      • Typically a drilling rig produces about 10K samples/s
      • Temperatures, pressures, magnetics, machine vibration levels, salinity, voltage, currents, many others
      • Typical project has 100 rigs
  • 45. The General Outline
      • 10K samples / second / rig x 100 rigs = 1M samples / second
  • 46. The General Outline
      • 10K samples / second / rig x 100 rigs = 1M samples / second
      • But wait, there’s more
          – Suppose you want to test your system
          – Perhaps with a year of data
          – And you want to load that data in << 1 year
      • 100x real-time = 100M samples / second
  • 47. How Should That Work? [diagram: Samples, Message queue, Collector, MapR table, Web service, Users]
  • 48. A First Attempt
      OpenTSDB is a distributed Time Series Database built on top of HBase, enabling you …
          – to store & index, as well as
          – to query & plot
      … metrics at scale.
  • 49. Design Goals
      • Distributed storage of metrics
      • Metrics query fast and easy
      • Scale out to thousands of machines and billions of data points
      • No SPOF (single point of failure)
  • 50. Key concepts
  • 51. Key concepts [diagram: (00:38, 56) mysql.com_delete schema=userdb]
  • 52. Key concepts
      data point: (timestamp, value) + metric + tag: key=value → time series
  • 53. Example TS
      ...
      1409497082 327810227706 mysql.bytes_received schema=foo host=db1
      1409497099 6604859181710 mysql.bytes_sent schema=foo host=db1
      1409497106 327812421706 mysql.bytes_received schema=foo host=db1
      1409497113 6604901075387 mysql.bytes_sent schema=foo host=db
      ...
      Callouts: UNIX epoch timestamp: $(date +%s); a metric (often hierarchical); two tags
  • 54. Declare metric
      $ tsdb mkmetric mysql.bytes_sent mysql.bytes_received
      metrics mysql.bytes_sent: [0, 0, 1]
      metrics mysql.bytes_received: [0, 0, 2]
      … or use --auto-metric
  • 55. Collect metric
      • tcollector: gathers data from local collectors, pushes it to TSDs, and provides deduplication
      • Lots of collectors bundled
          – General: iostat, netstat, etc.
          – Others: MySQL, HBase, etc.
      • … or roll your own
  • 56. The Whole Picture [diagram: HBase or MapR-DB]
  • 57. Wide Table Design: Point-by-Point
  • 58. Wide Table Design: Hybrid Point-by-Point + Blob
      • Insertion of data as blob makes original columns redundant
      • Non-relational, but you can query these tables with Drill
  • 59. Status to This Point
      • Each sample requires one insertion, compaction requires another
      • Typical performance on SE cluster
          – 1 edge node + 4 cluster nodes
          – 20,000 samples per second observed
          – Would be faster on performance cluster, possibly not a lot
      • Suitable for server monitoring
      • Not suitable for large scale history ingestion
      • Bulk load helps a little, but not much
      • Still 1000x too slow for industrial work
  • 60. Speeding up OpenTSDB
      • 20,000 data points per second per node in the test cluster
      • Why can’t it be faster?
  • 61. Speeding up OpenTSDB: open source MapR extensions
      Available on GitHub: https://github.com/mapr-demos/opentsdb
  • 62. Status to This Point
      • 3600 samples require one insertion
      • Typical results on SE cluster
          – 1 edge node + 4 cluster nodes
          – 14 million samples per second observed
          – ~700x faster ingestion
      • Typical results on performance cluster
          – 2-4 edge nodes + 4-9 cluster nodes
          – 110 million samples/s (4 nodes) to >200 million samples/s (8 nodes)
      • Suitable for large scale history ingestion
      • 30 million data points retrieved in 20 s
      • Ready for industrial work
  • 63. Key Results
      • Ingestion is network limited
          – Edge nodes are the critical resource
          – Number of edge nodes defines a limit to scaling
      • With enough edge nodes scaling is near perfect
      • Performance of raw OpenTSDB is limited by the stateless daemon
      • Modified OpenTSDB can run 1000x faster
  • 64. Overall Ingestion Rate [chart: total ingestion rate (millions of points/second) vs. number of nodes (4, 5, 8, 9); series: one ingestor, two ingestors]
  • 65. Normalized Ingestion Rate [chart: ingestion per node (millions of points/second) vs. number of nodes (4, 5, 8, 9); series: one ingestor, two ingestors]
  • 66. Why MapR?
      • MapR tables are inherently faster, safer
          – Sustained > 1 GB/s ingest rate in tests
      • Mirror to M5 or M7 cluster to isolate analytics load
      • Transaction logs involve frequent appends, many files
  • 67. When is this All Wrong?
      • In some cases, retrieval by series-id + time range is not sufficient
      • May need very flexible retrieval of events based on text-like criteria
      • Search may be better than a classic time-series database
      • Can scale Lucene-based search to > 1 million events / second
  • 68. When is it Even More Right
      • In many industrial settings, data rates from individual sensors are relatively high
          – Latency to view is still measured in seconds, not sample points
      • This allows batching at source
      • Common requirement for highly variable sample rates
          – 1 sample/s baseline, switch to 10 k samples/s
          – Small batches during slow times are just fine since number of sensors is constant
          – Requires variable window sizes
  • 69. Summary
      • The internet is turning upside down
      • This will make time series ubiquitous
      • Current open source systems are much too slow
      • We can fix that with modern NoSQL systems
          – (I wear a red hat for a reason)
  • 70. Questions
  • 71. Thank You
      Ted Dunning, Chief Application Architect, MapR Technologies
      tdunning@mapr.com, tdunning@apache.org
      @mapr, maprtech, mapr-technologies
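
The sizing arithmetic on slides 43 to 46, 59, and 62 can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the rates are the ones quoted on the slides, while the function names and the choice of a 365-day year are assumptions made for illustration.

```python
# Back-of-the-envelope ingest sizing using the figures quoted on the slides.
# Function names are illustrative; only the numbers come from the talk.

SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY  # assumption: 365-day year


def required_rate(samples_per_sec_per_source, sources, replay_factor=1):
    """Aggregate samples/second, optionally replaying history faster than real time."""
    return samples_per_sec_per_source * sources * replay_factor


def load_time_days(total_samples, ingest_rate):
    """Days needed to load total_samples at a sustained ingest_rate (samples/second)."""
    return total_samples / ingest_rate / SECONDS_PER_DAY


server_farm = required_rate(10, 1_000)              # slide 43
drilling = required_rate(10_000, 100)               # slide 45
drilling_replay = required_rate(10_000, 100, 100)   # slide 46, 100x real time

# One year of data from the 100-rig project.
year_of_rig_data = drilling * SECONDS_PER_YEAR

# Observed ingest rates quoted on slides 59 and 62.
observed = {
    "point-by-point, SE cluster": 20_000,
    "direct blob, SE cluster": 14_000_000,
    "direct blob, performance cluster (4 nodes)": 110_000_000,
}

if __name__ == "__main__":
    print("server farm:", server_farm, "samples/s")
    print("drilling project:", drilling, "samples/s")
    print("100x replay:", drilling_replay, "samples/s")
    for name, rate in observed.items():
        days = load_time_days(year_of_rig_data, rate)
        print(f"{name}: {days:,.1f} days to load one year of rig data")
```

At the point-by-point rate, a year of rig data would take decades to load, which is why the slides call it unsuitable for large-scale history ingestion; at the 100x replay target the same load takes under four days.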

Editor's Notes

  1. In the HBase shell: scan 'tsdb-uid', {STARTROW => "\0\0\1"}
  2. Try out the MySQL collector or, for advanced users, write your own in Python
  3. Ted’s original talk notes: OpenTSDB consists of a Time Series Daemon (TSD) as well as a set of command line utilities. Interaction with OpenTSDB is primarily achieved by running one or more of the TSDs. Each TSD is independent. There is no master and no shared state, so you can run as many TSDs as required to handle any load you throw at it. Each TSD uses the open source database HBase to store and retrieve time-series data. The HBase schema is highly optimized for fast aggregations of similar time series to minimize storage space. Users of the TSD never need to access HBase directly. You can communicate with the TSD via a simple telnet-style protocol, an HTTP API or a simple built-in GUI. All communications happen on the same port (the TSD figures out the protocol of the client by looking at the first few bytes it receives).
  4. Key ideas: Each row has a unique row key based on an ID for its time series (looked up from a separate lookup table). An important part of the design's efficiency is that each column is a time offset from the start time shown in the row key. Note that data is stored point-by-point in this wide table design. Ted’s notes from his original slide: One technique for increasing the rate at which data can be retrieved from a time series database is to store many values in each row. Doing this allows data points to be retrieved at a higher speed. Because both HBase and MapR-DB store data ordered by the primary key, this design will cause rows containing data from a single time series to wind up near one another on disk. Retrieving data from a particular time series for a time range will involve largely sequential disk operations and therefore will be much faster than would be the case if the rows were widely scattered. Typically, the time window is adjusted so that 100–1,000 samples are in each row.
  5. Ted’s notes from original slide: The table design is improved by collapsing all of the data for a row into a single data structure known as a blob. This blob can be highly compressed so that less data needs to be read from disk. Also, having a single column per row decreases the per-column overhead incurred by the on-disk format that HBase uses, which further increases performance. Data can be progressively converted to the compressed format as soon as it is known that little or no new data is likely to arrive for that time series and time window. Commonly, once the time window ends, new data will only arrive for a few more seconds, and the compression of the data can begin. Since compressed and uncompressed data can coexist in the same row, if a few samples arrive after the row is compressed, the row can simply be compressed again to merge the blob and the late-arriving samples.
  6. Richard: This is based on a figure from Chapter 3 of our book. The point here is to show that with standard OpenTSDB, data is loaded into the wide table point-by-point, then pulled out and compressed to a blob, then reloaded to form the hybrid table. This is a fairly efficient arrangement. The next slide will show how this is sped up with the MapR open source extensions. Here are Ted’s original notes: Since data is inserted in the uncompressed format, the arrival of each data point requires a row update operation to insert the value into the database. The data is then read again by the blob maker, so reads are approximately equal to writes. Once data is compressed to blobs, it is again written to the database. This row update can limit the insertion rate for data to as little as 20,000 data points per second per node in the cluster.
  7. Richard: Also based on a figure from Chapter 3 of the book. This slide shows the increased performance using the code MapR made open source on GitHub. I’ve added the GitHub link. The key difference is that the blob production occurs upstream, before the data is ever loaded into the table. The restart logs are useful so that if there were ever a glitch with the process of compressing data to blobs and insertion, you would not lose the original data. Note that there is still the delay while blobs are made… see the explanation in the book, chapters 3 and 4. Richard: Please preserve the rest of the material on fast ingestion with MapR extensions (direct blob loading) for Ted’s talk on Sat. Use this slide as a preview and mention that Ted will be talking about this on Friday. Ted’s original notes: The direct blob insertion data flow allows the insertion rate to be increased by as much as roughly 1,000-fold. How does the direct blob approach get this bump in performance? The essential difference is that the blob maker has been moved into the data flow between the catcher and the NoSQL time series database. This way, the blob maker can use incoming data from a memory cache rather than extracting its input from wide table rows already stored in the storage tier. The full data stream is only written to the memory cache, which is fast, rather than to the database. Data is not written to the storage tier until it’s compressed into blobs, so writing can be much faster. The number of database operations is decreased by the average number of data points in each of the compressed data blobs. This decrease can easily be a factor in the thousands.
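
The difference between the two data flows described in notes 4 to 7 can be illustrated by counting database operations in a toy model. The sketch below is not the OpenTSDB or MapR implementation: `CountingTable`, `row_key`, `make_blob`, and both ingest functions are hypothetical names, JSON plus zlib stands in for whatever blob encoding the real code uses, a one-hour row window with one sample per second is assumed, and the restart logs mentioned in note 7 are left out.

```python
import json
import zlib
from collections import defaultdict

WINDOW = 3600  # assumed row window in seconds (one sample per second -> 3600 per row)


class CountingTable:
    """Hypothetical stand-in for an HBase / MapR-DB table that counts operations."""

    def __init__(self):
        self.rows = defaultdict(dict)  # row key -> {column: value}
        self.writes = 0
        self.reads = 0

    def put(self, key, column, value):
        self.rows[key][column] = value
        self.writes += 1

    def get_row(self, key):
        self.reads += 1
        return dict(self.rows[key])


def row_key(series_id, ts):
    """Row key = series id + start of the time window; columns are offsets from it."""
    return (series_id, ts - ts % WINDOW)


def make_blob(points):
    """Collapse {offset: value} into one compressed value (illustrative encoding)."""
    return zlib.compress(json.dumps(sorted(points.items())).encode())


def read_blob(blob):
    return {offset: value for offset, value in json.loads(zlib.decompress(blob))}


def ingest_point_by_point(table, series_id, samples):
    """Standard flow: one write per sample, then read back, compress, rewrite."""
    touched = set()
    for ts, value in samples:
        key = row_key(series_id, ts)
        table.put(key, ts - key[1], value)
        touched.add(key)
    for key in touched:
        points = table.get_row(key)
        table.rows[key].clear()  # point columns become redundant (delete not counted)
        table.put(key, "blob", make_blob(points))


def ingest_direct_blob(table, series_id, samples):
    """Direct flow: buffer in memory, write one compressed blob per row."""
    cache = defaultdict(dict)
    for ts, value in samples:
        key = row_key(series_id, ts)
        cache[key][ts - key[1]] = value
    for key, points in cache.items():
        table.put(key, "blob", make_blob(points))


if __name__ == "__main__":
    start = 3600 * 391_527  # a window-aligned UNIX timestamp
    samples = [(start + i, float(i % 7)) for i in range(2 * WINDOW)]
    slow, fast = CountingTable(), CountingTable()
    ingest_point_by_point(slow, "series-1", samples)
    ingest_direct_blob(fast, "series-1", samples)
    print("point-by-point writes:", slow.writes, "reads:", slow.reads)
    print("direct blob writes:", fast.writes)
```

In this toy model the write count drops by roughly the number of samples per blob, which matches the reasoning in note 7; the real speedup also depends on network and edge-node capacity, as the Key Results slide points out.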