Sparks Ignite, Inc.
A technology consulting firm. We build outcomes.
IoT Under the Hood
There are any number of vendors and publications stating
that IT departments need to invest big in Big Data and Big
Analytics to meet the challenges of the Internet of Things.
Saying something does not make it true, even if you say
it very loudly and very often. It just makes it noisy.
Let's swap out marketing and hype for logic and math and
separate the signal from the noise.
We'll start with a clear problem definition and then work out
an algorithmic approach to the problem.
Once we have a framework, we can more intelligently choose
an implementation.
Essentially, there have been two fundamental changes:
one fundamental technical change: more devices are reporting more data more
frequently.
one fundamental business change: provide more frequent and robust analytics.
Let's break down the requirements into something measurable.
More devices are reporting more data more frequently.
What are these “devices” again?
Sensors, programmable logic controllers (PLCs), RFID readers, etc.
What do we mean by “reporting”?
Each device is really only capable of generating a text-based log file. It could be
fixed or variable length, XML or JSON, but it will be text.
Most importantly, all of these devices now have an IP address.
More devices are reporting more data more frequently.
What do we mean by “more”?
For a device to be considered part of your Internet of Things, it must be
connected to the network. At this point, “more” could mean connecting some or all
of your existing devices, which has technical and security issues. It could mean
connecting your supply chain partner's devices; same issues magnified. It could
mean adding more devices. Plan for it to mean all of these things, and come up
with a reasonable strategy for a staged onboarding process.
What do we mean by “data”?
This type of data is called time series data, which Wikipedia tells us “is a
sequence of data points, measured typically at successive points in time spaced
at uniform time intervals.” It's typically geolocated as well.
More devices are reporting more data more frequently.
What do we mean by “frequently”?
Normally, we are talking about anywhere between near real time and fifteen
minutes. Time, or frequency, is money and storage space is inexpensive, but not
free. The maximum speed at which a device can report data is not necessarily the
speed at which you are best served receiving the data.
Frequency is a measurement that should be agreed upon by both the business
and the data scientists. I have found that planning on an average frequency of
one minute is reasonable and makes for easy estimations.
From experience, I have also found that although devices are more than happy to
talk nonstop, not everything they say is worth listening to. There are some
technical advantages to pushing some basic quality control logic to the device,
assuming it has the ability.
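As one hedged illustration of device-side quality control, the sketch below applies a simple deadband filter: a reading is reported only when it differs from the last reported value by more than a threshold. The deck does not prescribe this technique; the function name `deadband_filter`, the threshold, and the sample readings are assumptions made for illustration.

```python
from typing import Iterable, List


def deadband_filter(readings: Iterable[float], threshold: float) -> List[float]:
    """Keep a reading only when it moves more than `threshold` away
    from the last reading that was reported."""
    reported: List[float] = []
    for value in readings:
        if not reported or abs(value - reported[-1]) > threshold:
            reported.append(value)
    return reported


# Hypothetical temperature samples from a single sensor.
samples = [20.0, 20.1, 20.05, 21.5, 21.6, 19.0]
filtered = deadband_filter(samples, threshold=0.5)
print(filtered)
```

Only three of the six samples would leave the device, which reduces both network traffic and the number of inserts downstream.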
Provide more frequent and robust analytics.
What do we mean by “provide”?
There will need to be both ad-hoc and structured analytics and reporting. It is
worth noting that data at scale is often not amenable to the same types of reports
that are used for more modest, enterprise-size data.
What do we mean by “more frequent”?
For most use cases, the difference between an “advanced” and a “standard”
analytics platform is the speed at which the data can be made available and
actionable, not necessarily the level of detail. This difference can mean a report
that provides advance warning of a potential system failure versus a detailed post
mortem of broken equipment.
Provide more frequent and robust analytics.
What do we mean by “robust”?
A robust data set, to a data scientist, means less work spent on cleansing,
processing, and massaging the data and more time spent running, comparing,
and fine-tuning algorithms. Many data scientists estimate that 80% of their work
involves data munging. There is no way to classify that time other than
'wasted'.
What do we mean by “analytics”?
Analytics, or analysis of data, is a process of inspecting, cleaning, transforming,
and modeling data with the goal of discovering useful information, suggesting
conclusions, and supporting decision making. The major categories of analytics
that are typically performed on time series data are on the following slides.
Time Series Data Analytics
Summarization
Given a time series Q containing n data points where n is an extremely large
number, create a (possibly graphic) approximation of Q which retains its essential
features but fits on a single page, screen, etc.
Anomaly Detection
Given a time series Q, assumed to be normal, and an unannotated time series R,
find all sections of R which contain anomalies or
“surprising/interesting/unexpected” occurrences.
Segmentation
Given a time series Q containing n data points, construct a model Q1 from K
piecewise segments (K << n), such that Q1 closely approximates Q.
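To make the anomaly detection definition concrete, here is a minimal sketch that uses a z-score threshold: points in R are flagged when they lie more than k standard deviations from the mean of the normal series Q. This is one simple approach chosen for illustration, not the method the deck endorses; the function name `find_anomalies`, the default k = 3, and the sample values are assumptions.

```python
from statistics import mean, stdev
from typing import List, Sequence


def find_anomalies(normal: Sequence[float], candidate: Sequence[float],
                   k: float = 3.0) -> List[int]:
    """Return the indices in `candidate` that lie more than k standard
    deviations from the mean of the `normal` series."""
    mu = mean(normal)
    sigma = stdev(normal)  # sample standard deviation of the normal series
    return [i for i, x in enumerate(candidate) if abs(x - mu) > k * sigma]


# Q: readings assumed to be normal. R: unannotated readings to inspect.
q = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7]
r = [10.1, 9.9, 14.0, 10.0, 5.5]
anomalies = find_anomalies(q, r)
print(anomalies)
```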
Time Series Data Analytics
Indexing
Given a query time series Q, and some similarity/dissimilarity measure D(Q, C),
find the most similar time series in database DB.
Clustering
Find natural groupings of the time series in database DB under some
similarity/dissimilarity measure D(Q, C).
Classification
Given an unlabeled time series Q, assign it to one of two or more predefined
classes.
Prediction
Given a time series Q containing n data points, predict the value at time n + 1.
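The indexing definition can be sketched with Euclidean distance standing in for D(Q, C). The example below is a plain linear scan rather than a true index, and the series names and values are hypothetical; it only shows what "most similar under D" means.

```python
import math
from typing import Dict, Sequence


def euclidean(q: Sequence[float], c: Sequence[float]) -> float:
    """Dissimilarity D(Q, C) for two equal-length series."""
    if len(q) != len(c):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, c)))


def most_similar(query: Sequence[float], db: Dict[str, Sequence[float]]) -> str:
    """Return the name of the series in `db` closest to `query`."""
    return min(db, key=lambda name: euclidean(query, db[name]))


# Hypothetical database of stored series keyed by device name.
db = {
    "pump_a": [1.0, 2.0, 3.0, 4.0],
    "pump_b": [4.0, 3.0, 2.0, 1.0],
    "pump_c": [1.1, 2.1, 2.9, 4.2],
}
best = most_similar([1.0, 2.0, 3.0, 4.1], db)
print(best)
```

The same distance function could serve as the measure for clustering or nearest-neighbor classification.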
Let's assume that we want our analytic platform to run on data refreshed every
minute rather than data that was batch processed the day before. A day has 1,440
minutes, so that's roughly a 10^3 performance increase.
We have discussed before that a new device can be:
● An existing device previously not connected to the network
● A new device installed on a new product type
● An interface to a partner's device
● An interface to a customer's device
Let's just say you have to deal with a million new devices.
We now have a problem definition with an order of magnitude estimation: Provide
analytic capability 10^3 times faster on time series data from 10^6 more devices.
Are there data structures and algorithms that can provide for such an increase in
time series data?
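The order-of-magnitude figures can be checked with a few lines of arithmetic. The sketch assumes one reading per device per minute, the average frequency suggested earlier in the deck; the constant names are illustrative.

```python
import math

MINUTES_PER_DAY = 24 * 60   # 1,440 refreshes per day instead of one daily batch
DEVICES = 10**6             # the deck's "million new devices"

speedup = MINUTES_PER_DAY   # daily batch -> per-minute refresh
readings_per_day = DEVICES * MINUTES_PER_DAY

print(f"refresh speedup: {speedup}x (about 10^{round(math.log10(speedup))})")
print(f"readings per day: {readings_per_day:,}")
```

Under these assumptions the platform has to absorb on the order of a billion inserts per day.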
Big O Review
Storing time series data means taking a large number of small files and persisting
them using the time (at a minimum) as the natural order.
The emphasis is on insert, since there is no mandatory prerequisite for updating
or deleting time series data.
This data will come from multiple sources so time (and possibly location) is really
the only metric that they must have in common.
In order to rapidly query time series data, there needs to be a clear and relevant
sequential index.
A reasonable key could be composed of metric name : timestamp : random 64-bit
integer : geo tag. If you needed to retrieve a block of data containing all of
the sensor readings from a particular device at a particular manufacturing plant
in the third quarter of 2014, that block would be readily available.
Note that this key can also feed a strong hashing function. Since a hash table
provides the fastest possible data retrieval (constant time on average), it is very
important to ensure that the hash is well distributed. A bad hash degrades a hash
table to a linked list, and we will get data in O(n) rather than O(1).
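A minimal sketch of such a key, assuming a string layout of `metric:timestamp:random64:geo` with a zero-padded epoch timestamp so that string order matches time order. The padding widths, the helper names `make_key` and `range_scan`, and the sample metric and plant names are assumptions; the sorted list stands in for whatever ordered index the storage engine provides.

```python
import bisect
import secrets
from datetime import datetime, timezone
from typing import List


def make_key(metric: str, ts: datetime, geo: str) -> str:
    """Build 'metric:timestamp:random64:geo' with fixed-width numeric parts."""
    epoch = int(ts.timestamp())
    nonce = secrets.randbits(64)  # avoids collisions within the same second
    return f"{metric}:{epoch:012d}:{nonce:020d}:{geo}"


def range_scan(sorted_keys: List[str], metric: str,
               start: datetime, end: datetime) -> List[str]:
    """Return keys for `metric` with start <= timestamp < end."""
    lo = bisect.bisect_left(sorted_keys, f"{metric}:{int(start.timestamp()):012d}:")
    hi = bisect.bisect_left(sorted_keys, f"{metric}:{int(end.timestamp()):012d}:")
    return sorted_keys[lo:hi]


utc = timezone.utc
keys = sorted([
    make_key("temp", datetime(2014, 6, 30, 23, 59, tzinfo=utc), "plant-7"),
    make_key("temp", datetime(2014, 7, 1, 0, 0, tzinfo=utc), "plant-7"),
    make_key("temp", datetime(2014, 9, 30, 23, 59, tzinfo=utc), "plant-7"),
    make_key("temp", datetime(2014, 10, 1, 0, 0, tzinfo=utc), "plant-7"),
    make_key("vibration", datetime(2014, 8, 15, tzinfo=utc), "plant-7"),
])
# Third quarter of 2014: 1 July (inclusive) to 1 October (exclusive).
q3 = range_scan(keys, "temp",
                datetime(2014, 7, 1, tzinfo=utc),
                datetime(2014, 10, 1, tzinfo=utc))
print(len(q3))
```

Because the geo tag is the last component, restricting the result to one plant would be a filter over the returned block in this layout.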
Time series data unfortunately tends to be heavy on disk I/O.
● Seek time
● The time it takes to position the drive head over the location where data will be
read or written.
● Transfer rate
● The speed at which data moves between the controller and the host system
(external rate) as well as between the disk surface and the controller (internal
rate).
As far as disk operations are concerned, it is better to transfer than to seek and
even then it's helpful to minimize the frequency of the transfers in favor of larger
payloads.
Since the data is both sequential and immutable, we can have a reasonable
expectation that this can be optimized.
Access times for RAM are in nanoseconds (1E-09 s) rather than the milliseconds
(1E-03 s) of a disk seek, so we can short-circuit any deep conversations about
partial stroking and hybrid drives.
We need a storage system that
1. minimizes seek time
2. is optimized for index searches
3. is optimized for sequential searches
4. is optimized for inserts
A storage system in this context refers to database engines, file systems,
messaging platforms, etc.: anything that would store the data.
For estimation purposes, we are looking at a storage system of O(log n) or better.
Since we are going to be looking at data at scale, we will need our performance
to level off.
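A quick way to see why O(log n) "levels off" is to count the comparisons a binary search needs as n grows. The sizes below are illustrative and the variable names are assumptions.

```python
import math

# Number of halvings needed to locate one item among n sorted items.
steps_by_size = {e: math.ceil(math.log2(10**e)) for e in (3, 6, 9, 12)}

for exponent, steps in steps_by_size.items():
    print(f"n = 10^{exponent:<2} -> about {steps} comparisons")
```

Each thousandfold increase in n adds only about ten more comparisons, which is the behavior we want from a storage system at scale.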
What algorithm are you using today for your storage systems?
Almost certainly a B (or B+) Tree. The primary value of a B tree is in storing data
for efficient retrieval in a block-oriented storage context. Most file systems use
this structure and it is used by every major relational database vendor for their
key indexes.
B trees are great for random access. When inserting a record into a B Tree, you
need to search the tree to find the location to insert the record. Since B-Trees are
designed to be wide and shallow, there should be a minimal number of drive
seeks.
B Tree inserts are O(log n), which is arguably the mathematical lower bound
for balanced trees.
So can we do better?
There is a data structure called a Log Structured Merge Tree (LSM) and a storage
model called Log Structured Storage (LSS) that provide the same estimated
performance, O(log n), as a B Tree but offer two key potential areas of
improvement over B Trees for systems that are going to do large
quantities of sequential writes:
● Moving seek time from disk to memory
● Moving from block data to log structured data
LSM Tree
An LSM-Tree is a hybrid tree model that uses two trees: C0 and C1. C0 is smaller
and entirely resident in memory, whereas C1 is resident on disk. New records are
written to C0 and then migrated from C0 to C1 once C0 reaches a size threshold.
Insertions now run primarily at RAM rather than HDD speeds, or 1E-09 rather
than 1E-03 seconds. Of course, they are written to disk, but that is where the LSS
comes in.
Note that many production systems concurrently write to a commit log on
disk and to C0, with the commit log getting deleted after flushing.
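The write path described above can be sketched as a toy two-level structure. This is a teaching sketch under stated assumptions, not how any particular database implements it: the class name `TinyLSM`, the size threshold, and the use of Python lists as stand-ins for the on-disk C1 component and the commit log are all hypothetical, and the merging of C1 runs is omitted.

```python
import bisect
from typing import Dict, List, Optional, Tuple


class TinyLSM:
    """Toy LSM: C0 is an in-memory dict, C1 is a list of sorted,
    immutable runs standing in for on-disk files."""

    def __init__(self, c0_limit: int = 3) -> None:
        self.c0_limit = c0_limit
        self.c0: Dict[str, float] = {}
        self.c1_runs: List[List[Tuple[str, float]]] = []
        self.commit_log: List[Tuple[str, float]] = []  # stand-in for a disk log

    def put(self, key: str, value: float) -> None:
        self.commit_log.append((key, value))  # record the write for recovery
        self.c0[key] = value                  # the insert itself happens in memory
        if len(self.c0) >= self.c0_limit:
            self._flush()

    def _flush(self) -> None:
        # Write C0 out as one sorted run, then clear C0 and the commit log.
        self.c1_runs.append(sorted(self.c0.items()))
        self.c0.clear()
        self.commit_log.clear()

    def get(self, key: str) -> Optional[float]:
        if key in self.c0:
            return self.c0[key]
        for run in reversed(self.c1_runs):           # newest run first
            i = bisect.bisect_left(run, (key,))      # binary search in the run
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None


db = TinyLSM(c0_limit=3)
for i, reading in enumerate([20.1, 20.4, 20.2, 20.9]):
    db.put(f"temp:{i:04d}", reading)
print(len(db.c1_runs), len(db.c0), db.get("temp:0001"))
```

After four writes with a threshold of three, one run has been flushed and the newest record is still held only in C0 and the commit log.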
LSS
In a traditional storage system, there needs to be a considerable amount of
overhead for updating and deleting existing members. In a log structured storage
system, this overhead does not exist because a log structured storage system
provides for an append-only sequence of data entries. Unlike a B+ Tree-based
system, you don't find a location for new data, you merely append it to the end.
Because new records are always added to the end, there is never any need for
searching a tree for insertion, like in a B-tree storage structure. This allows for
extremely predictable horizontal scaling.
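The append-only idea can be shown with a small log over a file-like object. The class name `AppendOnlyLog`, the JSON-lines record format, and the in-memory offset list are assumptions for illustration; an in-memory `io.BytesIO` stands in for a file on disk.

```python
import io
import json
from typing import BinaryIO, Dict, List


class AppendOnlyLog:
    """Append-only record log with an in-memory list of byte offsets,
    so records can be read back by sequence number."""

    def __init__(self, stream: BinaryIO) -> None:
        self.stream = stream
        self.offsets: List[int] = []

    def append(self, record: Dict) -> int:
        self.stream.seek(0, io.SEEK_END)      # always write at the end
        self.offsets.append(self.stream.tell())
        self.stream.write(json.dumps(record).encode("utf-8") + b"\n")
        return len(self.offsets) - 1          # sequence number of the record

    def read(self, seq: int) -> Dict:
        self.stream.seek(self.offsets[seq])
        return json.loads(self.stream.readline())


log = AppendOnlyLog(io.BytesIO())
first = log.append({"device": "plc-1", "ts": 1404172800, "value": 20.1})
second = log.append({"device": "plc-1", "ts": 1404172860, "value": 20.4})
print(log.read(second))
```

No existing bytes are ever rewritten: each append costs one seek to the end and one sequential write.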
LSS
Providing concurrency and transactional semantics using Multiversion
Concurrency Control (MVCC) is easier in LSS than in a B-tree since existing data
is not modified. A view of the system at state Q at time A is just as valid as a view
of Q at time B.
Being able to manage concurrency and transactions in a distributed environment
just by using immutable objects is a key to successful software development
projects.
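The snapshot-read idea behind MVCC can be sketched with a store that only ever appends versions. This is a simplified illustration, with no transactions or conflict detection; the class name `VersionedStore` and its methods are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


class VersionedStore:
    """Each write appends a (version, value) pair; nothing is overwritten."""

    def __init__(self) -> None:
        self.version = 0
        self.history: Dict[str, List[Tuple[int, float]]] = {}

    def write(self, key: str, value: float) -> int:
        self.version += 1
        self.history.setdefault(key, []).append((self.version, value))
        return self.version

    def read(self, key: str, as_of: Optional[int] = None) -> Optional[float]:
        """Return the newest value whose version is <= as_of
        (or the latest value when as_of is omitted)."""
        limit = self.version if as_of is None else as_of
        for version, value in reversed(self.history.get(key, [])):
            if version <= limit:
                return value
        return None


store = VersionedStore()
time_a = store.write("temp", 20.1)
time_b = store.write("temp", 20.9)
print(store.read("temp", as_of=time_a), store.read("temp", as_of=time_b))
```

A reader holding the version from time A keeps seeing a consistent view even after the write at time B, and no locking is needed because old versions are never modified.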
LSS/LSM versus B-Tree
While both options provide O(log n) performance, the LSS/LSM algorithm and
data structure solution is clearly optimized for the IoT use case and can give us
the order of magnitude speed increases that we need.
Because insertions into a Log Structured Merge Tree occur in memory rather
than on disk, we are inserting into a medium that takes 1E-09 rather than 1E-03
seconds. For relational databases, that insertion can also occur in main
memory, but only for GB-size datasets, not TB-size ones.
Once the LSM tree flushes to disk, the Log Structured Storage system will always
write to the end of the file with no searching or sorting. This append occurs in O(1)
time.
We have identified an optimal data structure and algorithm; now we need to
identify the level of compliance needed for the data. You may have seen the CAP
Theorem:
The CAP Theorem
Consistency
All clients see the same view of the data, even in the presence
of updates.
Availability
All clients can find some replica of the data, even in the
presence of failure.
Partition Tolerance
The system property holds even if the system is
partitioned.
Now, define your problem set(s) and pick two. The easiest way to identify where a
use case falls on the CAP Theorem is to identify the consistency model you need.
For time series data from devices, availability and partition tolerance are key
drivers. The data should never be lost and the system should not be unavailable.
The data should be partitioned across multiple nodes based on a reasonable hash in
order to avoid the hot-spotting problem that can arise with time series data
indexes. This is an AP model.
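Hash partitioning can be sketched as follows. The choice of SHA-256, the eight partitions, and the device identifiers are assumptions for illustration; the point is only that hashing a device identifier spreads writes across nodes, whereas partitioning by raw timestamp would send every current write to the same node.

```python
import hashlib
from collections import Counter


def partition_for(key: str, partitions: int) -> int:
    """Map a key to a partition with a stable hash. Python's built-in
    hash() is salted per process for strings, so it is avoided here."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partitions


# Hypothetical device identifiers spread over 8 partitions.
device_ids = [f"device-{i:06d}" for i in range(10_000)]
load = Counter(partition_for(d, 8) for d in device_ids)
print(sorted(load.items()))
```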
For example, consider a banking system. If a customer makes a transfer from
checking to savings, anyone who looks at that data must see the same result.
This would not be guaranteed if the checking and savings accounts were separately
partitioned, so this is a CA model.
By the way, traditional relational databases are generally classified as CA, and
CA is the usual home of ACID guarantees. NoSQL databases are typically either CP
or AP.
So what are our choices?
Consistency
Commits are available across entire distributed system
Availability
System remains accessible and operational at all times
Partition Tolerance
Only a total system failure can cause the system to respond incorrectly
Now, define your problem set(s) and pick two.
CA
Traditional relational databases
AP
Dynamo-like systems, Cassandra, CouchDB, Voldemort, Riak
CP
BigTable-like systems, MongoDB, HBase, Memcached, Redis
There are caveats here. For your company, you may value CP over AP, in which
case you may prefer MongoDB or HBase. Your company may already have a
Hadoop stack, and HBase is part of the basic ecosystem. MongoDB is very easy
to query and uses BSON (binary JSON) as its storage format, making it very
easy to use JSON across the stack.
My personal bias is towards DataStax's Cassandra offering. It provides elastic
scalability and a flexible data model, and the Cassandra Query Language looks a lot
like SQL.
IBM has started offering Cloudant, which is based on CouchDB, in its Bluemix
platform since acquiring Cloudant. CouchDB is similar to MongoDB in its JSON
integration, but (at the time of this writing) queries needed to be structured as
map-reduce rather than SQL.
To summarize,
The best data structure for inserting into our persistent storage engine for time
series data would run in O(log n) or logarithmic time.
B+ Trees and Log Structured Merge Trees are both appropriate, but the LSM
Tree will deliver better performance for inserting time series data. Proper
configuration of the LSM Tree engine could move a substantial amount of the
operations from disk to memory (1E-09 rather than 1E-03).
Since we need to process 1E06 times more data, moving as much processing as
possible from 1E-03 to 1E-09 will absolutely get us there.
The consistency model will likely be availability and partition tolerance (AP).
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Data Analyst Tasks to do the internship.pdf
Data Analyst Tasks to do the internship.pdfData Analyst Tasks to do the internship.pdf
Data Analyst Tasks to do the internship.pdf
 
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATIONCapstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
 
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptxThe-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
 

IoT Under the Hood

  • 1. Sparks Ignite, Inc. A technology consulting firm. We build outcomes. Sparks Ignite, Inc.
  • 2. IoT Under the Hood. Any number of vendors and publications state that IT departments need to invest big in Big Data and Big Analytics to meet the challenges of the Internet of Things. Saying something does not make it true, even if you say it very loudly and very often. It just makes it noisy. Let's swap out marketing and hype for logic and math and separate the signal from the noise. We'll come up with a clear problem definition and an algorithmic approach to the problem. Once we have a framework, we can more intelligently choose an implementation.
  • 3. Essentially, there have been two fundamental changes. One is technical: more devices are reporting more data more frequently. The other is a business change: provide more frequent and robust analytics. Let's break down the requirements into something measurable.
  • 4. More devices are reporting more data more frequently. What are these “devices” again? Sensors, programmable logic controllers (PLCs), RFID readers, etc. What do we mean by “reporting”? Each device is really only capable of generating a text-based log file. It could be fixed or variable length, XML or JSON, but it will be text. Most importantly, all of these devices now have an IP address.
  • 5. More devices are reporting more data more frequently. What do we mean by “more”? For a device to be considered part of your Internet of Things, it must be connected to the network. At this point, “more” could mean connecting some or all of your existing devices, which has technical and security issues. It could mean connecting your supply chain partners' devices: same issues, magnified. It could mean adding more devices. Plan for it to mean all of these things, and come up with a reasonable strategy for a staged onboarding process. What do we mean by “data”? This type of data is called time series data, which Wikipedia tells us “is a sequence of data points, measured typically at successive points in time spaced at uniform time intervals.” It's typically geolocated as well.
  • 6. More devices are reporting more data more frequently. What do we mean by “frequently”? Normally, we are talking about anywhere between near real time and fifteen minutes. Time, or frequency, is money, and storage space is inexpensive but not free. The maximum speed at which a device can report data is not necessarily the speed at which you are best served receiving the data. Frequency is a measurement that the business and the data scientists should arrive at together. I have found that planning on an average frequency of one minute is reasonable and makes for easy estimations. From experience, I have also found that although devices are more than happy to talk nonstop, not everything they say is worth listening to. There are some technical advantages to pushing some basic quality control logic to the device, assuming it has the ability.
  • 7. Provide more frequent and robust analytics. What do we mean by “provide”? There will need to be both ad hoc and structured analytics and reporting. It is worth noting that data at scale is often not amenable to the same types of reports that are used for more modest, enterprise-size data. What do we mean by “more frequent”? For most use cases, the difference between an “advanced” and a “standard” analytics platform is the speed at which the data can be made available and actionable, not necessarily the level of detail. This difference can mean a report that provides advance warning of a potential system failure versus a detailed post mortem of broken equipment.
  • 8. Provide more frequent and robust analytics. What do we mean by “robust”? A robust data set, to a data scientist, means less time spent cleansing, processing, and massaging the data and more time spent running, comparing, and fine-tuning algorithms. Many data scientists estimate that 80% of their work involves data munging. There is no way to classify that time other than “wasted”. What do we mean by “analytics”? Analytics, or analysis of data, is a process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision making. The major categories of analytics that are typically performed on time series data are on the following slides.
  • 9. Time Series Data Analytics. Summarization: given a time series Q containing n data points, where n is an extremely large number, create a (possibly graphic) approximation of Q which retains its essential features but fits on a single page, screen, etc. Anomaly Detection: given a time series Q, assumed to be normal, and an unannotated time series R, find all sections of R which contain anomalies or “surprising/interesting/unexpected” occurrences. Segmentation: given a time series Q containing n data points, construct a model Q1 from K piecewise segments (K << n), such that Q1 closely approximates Q.
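As a rough illustration of summarization and segmentation, the sketch below replaces a series with K segment means. This technique (piecewise aggregate approximation) is only one common choice and is an assumption here, not something the slides prescribe; the function name `piecewise_means` and the sample readings are hypothetical.

```python
from statistics import mean


def piecewise_means(series, k):
    """Approximate `series` with k segment means (piecewise aggregate approximation)."""
    n = len(series)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(series)")
    # Segment i covers indices [i*n//k, (i+1)*n//k), so segment sizes differ by at most one.
    bounds = [i * n // k for i in range(k + 1)]
    return [mean(series[lo:hi]) for lo, hi in zip(bounds, bounds[1:])]


# Hypothetical sensor readings: twelve points reduced to three segment means.
readings = [1, 1, 1, 1, 5, 5, 5, 5, 2, 2, 2, 2]
summary = piecewise_means(readings, 3)
```

With K much smaller than n, the result fits on a screen while keeping the overall shape of the series; a production summarizer would likely choose segment boundaries adaptively rather than at equal widths.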
  • 10. Time Series Data Analytics. Indexing: given a query time series Q and some similarity/dissimilarity measure D(Q, C), find the most similar time series in database DB. Clustering: find natural groupings of the time series in database DB under some similarity/dissimilarity measure D(Q, C). Classification: given an unlabeled time series Q, assign it to one of two or more predefined classes. Prediction: given a time series Q containing n data points, predict the value at time n + 1.
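The indexing task can be sketched with a brute-force search, assuming Euclidean distance as the measure D(Q, C) and equal-length series. The slides do not fix a particular measure, so that choice, the function names, and the device names are all assumptions made for illustration.

```python
import math


def euclidean(q, c):
    """One possible dissimilarity measure D(Q, C) for equal-length series."""
    if len(q) != len(c):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, c)))


def most_similar(query, database):
    """Return (name, distance) of the series in `database` closest to `query`."""
    name = min(database, key=lambda k: euclidean(query, database[k]))
    return name, euclidean(query, database[name])


# Hypothetical database of three short series keyed by device name.
db = {
    "pump_a": [1.0, 2.0, 3.0],
    "pump_b": [3.0, 2.0, 1.0],
    "pump_c": [1.0, 2.0, 4.0],
}
best, dist = most_similar([1.0, 2.0, 3.2], db)
```

A linear scan like this is O(size of DB) per query; a real index would prune candidates, but the same D(Q, C) would drive clustering and nearest-neighbor classification as well.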
  • 11. Let's assume that we want our analytic platform to run on data refreshed every minute rather than data that was batch processed the day before. A day has 1,440 minutes, so that's roughly a 10^3 performance increase. We have discussed before that a new device can be: an existing device previously not connected to the network; a new device installed on a new product type; an interface to a partner's device; or an interface to a customer's device. Let's just say you have to deal with a million new devices. We now have a problem definition with an order of magnitude estimation: provide analytic capability 10^3 times faster based on 10^6 times more time series data. Are there data structures and algorithms that can provide for such an increase in time series data?
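The order-of-magnitude estimate can be checked with a few lines of arithmetic. The sketch assumes one reading per device per minute, which matches the one-minute planning frequency suggested earlier; the variable names are illustrative.

```python
import math

MINUTES_PER_DAY = 24 * 60  # 1440

# Moving from a daily batch to a one-minute refresh.
refresh_speedup = MINUTES_PER_DAY / 1
speedup_magnitude = round(math.log10(refresh_speedup))  # order of magnitude of the speedup

# Assumption: a million new devices, each reporting once per minute.
new_devices = 1_000_000
readings_per_day = new_devices * MINUTES_PER_DAY
```

Under these assumptions the platform has to absorb about 1.44 billion readings per day, which is why insert performance dominates the rest of the discussion.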
  • 14. Storing time series data means taking a large number of small files and persisting them using the time (at a minimum) as the natural order. The emphasis is on insert, since there is no mandatory prerequisite for updating or deleting time series data. This data will come from multiple sources, so time (and possibly location) is really the only metric that they must have in common.
  • 15. In order to rapidly query time series data, there needs to be a clear and relevant sequential index. A reasonable key could be composed of metric name : timestamp : random 64-bit integer : geo tag. If you needed to retrieve a block of data that would provide all of the sensor readings from a particular device from a particular manufacturing plant in the third quarter of 2014, that would certainly be readily available. Note that this key also represents a strong hashing function. Since a hash table provides the fastest possible data retrieval (constant time on average), it is very important to ensure that the hash is well generated. A bad hash degrades a hash table to a linked list, and we will get data in O(n) rather than O(1).
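A minimal sketch of such a composite key follows, assuming a string encoding with a zero-padded millisecond timestamp so that lexicographic order matches time order. The exact layout, the `make_key` and `range_scan` helpers, and the `plant7` geo tag are hypothetical; the sketch also assumes metric names contain no colon.

```python
import bisect
import secrets
from datetime import datetime, timezone


def make_key(metric, ts, geo_tag):
    """Build 'metric:timestamp:random64:geo' with a fixed-width timestamp."""
    epoch_ms = int(ts.timestamp() * 1000)
    nonce = secrets.randbits(64)  # keeps keys unique when readings share a timestamp
    return f"{metric}:{epoch_ms:013d}:{nonce:016x}:{geo_tag}"


def range_scan(sorted_keys, metric, start, end):
    """Return keys for `metric` with start <= timestamp < end, via binary search."""
    lo = f"{metric}:{int(start.timestamp() * 1000):013d}"
    hi = f"{metric}:{int(end.timestamp() * 1000):013d}"
    return sorted_keys[bisect.bisect_left(sorted_keys, lo):bisect.bisect_left(sorted_keys, hi)]


utc = timezone.utc
# Hypothetical readings in June, August, September and October 2014.
keys = sorted(make_key("temp", datetime(2014, m, 15, tzinfo=utc), "plant7") for m in (6, 8, 9, 10))
# Third quarter of 2014: only the August and September readings fall in range.
q3 = range_scan(keys, "temp", datetime(2014, 7, 1, tzinfo=utc), datetime(2014, 10, 1, tzinfo=utc))
```

Because the keys sort by metric and then by time, the quarter's readings sit in one contiguous block, which is what makes the sequential read cheap.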
  • 16. Time series data unfortunately tends toward heavy disk I/O. Two measures matter. Seek rate: the time it takes to position the drive head before data can be read from or written to disk. Transfer rate: the time it takes to move data between the controller and the host system (external rate) as well as between the disk surface and the controller (internal rate). As far as disk operations are concerned, it is better to transfer than to seek, and even then it's helpful to minimize the frequency of the transfers in favor of larger payloads. Since the data is both sequential and immutable, we can have a reasonable expectation that this can be optimized. The seek times for RAM are in nanoseconds (1E-09) rather than milliseconds (1E-03), so we can short-circuit any deep conversations about partial stroking and hybrid drives.
  • 17. We need a storage system that (1) minimizes seek time, (2) is optimized for index searches, (3) is optimized for sequential searches, and (4) is optimized for inserts. A storage system in this context refers to database engines, file systems, messaging platforms, etc.: anything that would store the data. For estimation purposes, we are looking at a storage system of O(log n) or better. Since we are going to be looking at data at scale, we will need our performance to level off.
  • 18. What data structure are you using today for your storage systems? Almost certainly a B (or B+) Tree. The primary value of a B Tree is in storing data for efficient retrieval in a block-oriented storage context. Most file systems use this structure, and it is used by every major relational database vendor for their key indexes. B Trees are great for random access. When inserting a record into a B Tree, you need to search the tree to find the location to insert the record. Since B Trees are designed to be wide and shallow, there should be a minimal number of drive seeks. B Tree inserts are O(log n), which, it can be argued, is the mathematical lower bound for balanced trees. So can we do better?
  • 19. There is a data structure called a Log Structured Merge Tree (LSM) and a storage model called Log Structured Storage (LSS) that provide the same estimated performance, O(log n), as a B Tree but offer two key potential improvements over B Trees for systems that are going to do large quantities of sequential writes: moving seek time from disk to memory, and moving from block data to log structured data.
  • 20. LSM Tree. An LSM Tree is a hybrid tree model that uses two trees: C0 and C1. C0 is smaller and entirely resident in memory, whereas C1 is resident on disk. New records are written to C0 and are migrated from C0 to C1 once C0 passes a size threshold. Insertions now run primarily at RAM rather than HDD speeds, or 1E-09 rather than 1E-03 seconds. Of course, they are eventually written to disk, but that is where the LSS comes in. Note that many production systems concurrently write to a commit log on disk and to C0, with the commit log getting deleted after flushing.
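The C0/C1 split can be sketched in a few lines. This is a toy model under stated assumptions, not how any particular engine implements it: C1 is represented by in-memory sorted lists standing in for on-disk runs, the commit log and the merging of runs (compaction) are omitted, and the class name `TinyLSM` and its parameters are hypothetical.

```python
import bisect


class TinyLSM:
    """Toy LSM tree: an in-memory C0 that flushes to sorted C1 runs at a size threshold."""

    def __init__(self, memtable_limit=4):
        self.memtable_limit = memtable_limit
        self.c0 = {}        # in-memory component
        self.c1_runs = []   # stand-in for on-disk sorted runs, oldest first

    def put(self, key, value):
        self.c0[key] = value  # inserts touch memory only
        if len(self.c0) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Write C0 out as one sorted, immutable run, then start a fresh C0.
        self.c1_runs.append(sorted(self.c0.items()))
        self.c0 = {}

    def get(self, key):
        if key in self.c0:
            return self.c0[key]
        # Search the newest run first so the latest value for a key wins.
        for run in reversed(self.c1_runs):
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None


lsm = TinyLSM(memtable_limit=3)
for t in range(7):
    lsm.put(t, t * 10)
```

After seven inserts with a threshold of three, two runs have been flushed and one record remains in C0; reads check memory first and fall back to a binary search in each run.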
  • 21. LSS. In a traditional storage system, there needs to be a considerable amount of overhead for updating and deleting existing members. In a log structured storage system, this overhead does not exist because the system provides an append-only sequence of data entries. Unlike a B+ Tree-based system, you don't find a location for new data; you merely append it to the end. Because new records are always added to the end, there is never any need to search a tree for insertion, as there is in a B Tree storage structure. This allows for extremely predictable horizontal scaling.
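The append-only behavior can be illustrated with an ordinary file opened in append mode. This is a minimal sketch, assuming one JSON record per line; the class name `AppendOnlyLog`, the file name, and the record fields are hypothetical, and real log structured stores add segmenting, checksums, and compaction.

```python
import json
import os
import tempfile


class AppendOnlyLog:
    """Minimal log structured store: records are only ever appended, never updated."""

    def __init__(self, path):
        self.path = path

    def append(self, record):
        # Mode "a" always writes at the end of the file, so no insert location is searched for.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")

    def scan(self):
        # A sequential read returns records in the order they were written.
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield json.loads(line)


log_dir = tempfile.mkdtemp()
log = AppendOnlyLog(os.path.join(log_dir, "readings.log"))
log.append({"ts": 1, "value": 20.5})
log.append({"ts": 2, "value": 20.7})
replayed = list(log.scan())
```

Each write costs the same regardless of how much data the log already holds, which is the property the slide relies on.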
  • 22. LSS. Providing concurrency and transactional semantics using Multiversion Concurrency Control (MVCC) is easier in LSS than in a B Tree, since existing data is not modified. A view of the system at state Q at time A is just as valid as a view of Q at time B. Being able to manage concurrency and transactions in a distributed environment just by using immutable objects is a key to successful software development projects.
  • 23. LSS/LSM versus B Tree. While both options provide O(log n) performance, the LSS/LSM algorithm and data structure solution is clearly optimized for the IoT use case and can give us the order of magnitude speed increases that we need. Because insertions into a Log Structured Merge Tree occur in memory rather than on disk, we are inserting into a medium that takes 1E-09 rather than 1E-03 seconds. For relational databases, that insertion would also occur in main memory, but only for GB-size datasets, not TB-size ones. Once the LSM writes to disk, the Log Structured Storage system will always write to the end of the file with no searching or sorting. This actually occurs in O(1) time.
  • 24. We have identified an optimal data structure and algorithm; now we need to identify the level of compliance needed for the data. You may have seen the CAP Theorem:
  • 25. The CAP Theorem. Consistency: all clients see the same view of the data, even in the presence of updates. Availability: all clients can find some replica of the data, even in the presence of failure. Partition Tolerance: the system property holds even if the system is partitioned. Now, define your problem set(s) and pick two. The easiest way to identify where a use case falls on the CAP Theorem is to identify the consistency model you need.
  • 26. For time series data from devices, availability and partition tolerance are key drivers. The data should never be lost and the system should not be unavailable. The data should be partitioned across multiple nodes based on a reasonable hash in order to avoid the hot-spotting problem that can arise with time series data indexes. This is an AP model. By contrast, consider a banking system. If a customer makes a transfer from checking to savings, anyone who looks at that data must see the same result. This would not be the case if the checking and savings accounts were separately partitioned, so this is a CA model. By the way, traditional relational databases are all CA, and CA is the only way to be ACID. NoSQL databases are either CP or AP.
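To illustrate hash-based partitioning, the sketch below assigns each device to a partition from a stable hash of its identifier rather than from the timestamp. Hashing the device id with SHA-256 is an assumption made for illustration; the function name `partition_for` and the `device-N` ids are hypothetical.

```python
import hashlib
from collections import Counter


def partition_for(device_id, num_partitions):
    """Map a device id to a partition using a stable hash (same input, same partition)."""
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


# Hypothetical fleet of 10,000 devices spread over four partitions.
counts = Counter(partition_for(f"device-{i}", 4) for i in range(10_000))
```

If the partition were chosen from the timestamp instead, every device reporting in the current minute would write to the same partition; hashing on the device id spreads concurrent writes roughly evenly.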
  • 27. So what are our choices?
  • 28. Consistency: commits are available across the entire distributed system. Availability: the system remains accessible and operational at all times. Partition Tolerance: only a total system failure can cause the system to respond incorrectly. Now, define your problem set(s) and pick two. CA: traditional relational databases. AP: Dynamo-like systems, Cassandra, CouchDB, Voldemort, Riak. CP: BigTable-like systems, MongoDB, HBase, Memcached, Redis.
  • 29. There are caveats here. For your company, you may value CP over AP, in which case you may prefer MongoDB or HBase. Your company may already have a Hadoop stack, and HBase is part of the basic ecosystem. MongoDB is very easy to query and uses BSON (binary JSON) as its storage format, making it very easy to use JSON across the stack. My personal bias is towards DataStax's Cassandra offering. It provides elastic scalability and a flexible data model, and the Cassandra Query Language looks a lot like SQL. IBM has now started offering CouchDB in their BlueMix offering since they acquired Cloudant. CouchDB is similar to MongoDB in its JSON integration, but (at the time of this writing) queries needed to be structured as map-reduce rather than SQL.
  • 30. To summarize: the best data structure for inserting into our persistent storage engine for time series data would run in O(log n), or logarithmic, time. B+ Trees and Log Structured Merge Trees are both appropriate, but the LSM Tree will deliver better performance for inserting time series data. Proper configuration of the LSM Tree engine could move a substantial amount of the operations from disk to memory (1E-09 rather than 1E-03 seconds). Since we need to process 1E06 times more data, moving as much processing as possible from 1E-03 to 1E-09 seconds will absolutely get us there. The consistency model will likely be availability and partition tolerance (AP).