1. Sparks Ignite, Inc.
A technology consulting firm. We build outcomes.
IoT Under the Hood
There are any number of vendors and publications stating
that IT departments need to invest big in Big Data and Big
Analytics to meet the challenges of the Internet of Things.
Saying something does not make it true, even if you say
it very loudly and very often. It just makes it noisy.
Let's swap out marketing and hype for logic and math and
separate the signal from the noise.
We'll come up with a clear problem definition and an
algorithmic approach to the problem.
Once we have a framework, we can more intelligently choose
an implementation.
Essentially, there have been two fundamental changes:
One fundamental technical change: more devices are reporting more data more
frequently.
One fundamental business change: the business is expected to provide more
frequent and robust analytics.
Let's break down the requirements into something measurable.
More devices are reporting more data more frequently.
What are these “devices” again?
Sensors, programmable logic controllers (PLCs), RFID readers, etc.
What do we mean by “reporting”?
Each device is really only capable of generating a text-based log file. It could be
fixed or variable length, XML or JSON, but it will be text.
Most importantly, all of these devices now have an IP address.
What do we mean by “more”?
For a device to be considered part of your Internet of Things, it must be
connected to the network. At this point, “more” could mean connecting some or all
of your existing devices, which has technical and security issues. It could mean
connecting your supply chain partner's devices; same issues magnified. It could
mean adding more devices. Plan for it to mean all of these things, and come up
with a reasonable strategy for a staged onboarding process.
What do we mean by “data”?
This type of data is called time series data, which Wikipedia tells us “is a
sequence of data points, measured typically at successive points in time spaced
at uniform time intervals.” It is typically geolocated as well.
What do we mean by “frequently”?
Normally, we are talking about anywhere between near real time and fifteen
minutes. Time, or frequency, is money, and storage space is inexpensive but not
free. The maximum speed at which a device can report data is not necessarily the
speed at which you are best served receiving the data.
Frequency is a measurement that should be agreed upon by both the business
and the data scientists. I have found that planning on an average frequency of
one minute is reasonable and makes for easy estimations.
From experience, I have also found that although devices are more than happy to
talk nonstop, not everything they say is worth listening to. There are some
technical advantages to pushing some basic quality control logic to the device,
assuming it has the ability.
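One common form of device-side quality control is a deadband filter: the device only reports a reading when it differs enough from the last value it reported. The sketch below is illustrative only; the function name, the threshold and the sample readings are assumptions, not taken from any particular device.

```python
def deadband_filter(readings, threshold):
    """Yield only readings that moved more than `threshold` since the last report."""
    last_reported = None
    for timestamp, value in readings:
        if last_reported is None or abs(value - last_reported) > threshold:
            last_reported = value
            yield timestamp, value


# Hypothetical (timestamp, temperature) readings from one sensor.
raw = [(0, 20.0), (1, 20.1), (2, 20.05), (3, 21.5), (4, 21.6), (5, 19.0)]
reported = list(deadband_filter(raw, threshold=0.5))
print(reported)  # three of the six readings survive the filter
```

The trade-off is that the threshold becomes a business decision as much as a technical one, which is why frequency should be agreed upon by both groups.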
Provide more frequent and robust analytics.
What do we mean by “provide”?
There will need to be both ad-hoc and structured analytics and reporting. It is
worth noting that data at scale is often not amenable to the same types of reports
that are used for more modest, enterprise-size data.
What do we mean by “more frequent”?
For most use cases, the difference between an “advanced” and a “standard”
analytics platform is the speed at which the data can be made available and
actionable, not necessarily the level of detail. This difference can mean a report
that provides advance warning of a potential system failure versus a detailed
post-mortem of broken equipment.
What do we mean by “robust”?
A robust data set, to a data scientist, means less time spent on cleansing,
processing, and massaging the data and more time spent running, comparing,
and fine-tuning algorithms. Many data scientists estimate that 80% of their work
involves data munging. There is no way to classify that time other than
'wasted'.
What do we mean by “analytics”?
Analytics, or analysis of data, is a process of inspecting, cleaning, transforming,
and modeling data with the goal of discovering useful information, suggesting
conclusions, and supporting decision making. The major categories of analytics
that are typically performed on time series data are on the following slides.
Time Series Data Analytics
Summarization
Given a time series Q containing n data points where n is an extremely large
number, create a (possibly graphic) approximation of Q which retains its essential
features but fits on a single page, screen, etc.
Anomaly Detection
Given a time series Q, assumed to be normal, and an unannotated time series R,
find all sections of R which contain anomalies or
“surprising/interesting/unexpected” occurrences.
Segmentation
Given a time series Q containing n data points, construct a model Q1 from K
piecewise segments (K << n), such that Q1 closely approximates Q.
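As a rough illustration of these definitions, the sketch below approximates a series by K segment means (a very simple piecewise model) and flags points in R that sit far from the mean of a "normal" series Q. Both techniques are deliberately naive stand-ins chosen for brevity; the function names, the 3-sigma limit and the sample data are assumptions.

```python
from statistics import mean, pstdev


def piecewise_means(series, k):
    """Approximate n points by k segment means (assumes 1 <= k <= n)."""
    n = len(series)
    bounds = [i * n // k for i in range(k + 1)]
    return [mean(series[bounds[i]:bounds[i + 1]]) for i in range(k)]


def find_anomalies(normal, unlabeled, z_limit=3.0):
    """Indexes of points in `unlabeled` more than z_limit sigmas from the mean of `normal`."""
    mu = mean(normal)
    sigma = pstdev(normal)
    return [i for i, x in enumerate(unlabeled) if abs(x - mu) > z_limit * sigma]


# Hypothetical data: a step-shaped series, a "normal" series Q and an unannotated R.
steps = [1, 1, 1, 1, 5, 5, 5, 5]
q = [10, 11, 9, 10, 10, 11, 9, 10]
r = [10, 9, 25, 11, 10, -4]

print(piecewise_means(steps, k=2))
print(find_anomalies(q, r))
```

Real summarization and anomaly detection methods are far more sophisticated, but they share this shape: a small model built from Q, then a comparison against it.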
Indexing
Given a query time series Q, and some similarity/dissimilarity measure D(Q, C),
find the most similar time series in database DB.
Clustering
Find natural groupings of the time series in database DB under some
similarity/dissimilarity measure D(Q, C).
Classification
Given an unlabeled time series Q, assign it to one of two or more predefined
classes.
Prediction
Given a time series Q containing n data points, predict the value at time n + 1.
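To make two of these definitions concrete, the sketch below uses Euclidean distance as the measure D(Q, C) for a brute-force similarity search, and a moving average as a naive predictor of the value at time n + 1. Both choices are assumptions made for illustration, as are the function names and the sample database of equal-length series.

```python
import math


def euclidean(q, c):
    """One possible D(Q, C); assumes both series have the same length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, c)))


def most_similar(query, database):
    """Brute-force indexing: return the name of the closest series in the database."""
    return min(database, key=lambda name: euclidean(query, database[name]))


def predict_next(series, window=3):
    """Naive prediction: the mean of the last `window` points."""
    recent = series[-window:]
    return sum(recent) / len(recent)


# Hypothetical database DB of named series.
db = {
    "pump_a": [1, 2, 3, 4],
    "pump_b": [4, 3, 2, 1],
    "pump_c": [1, 2, 3, 5],
}
print(most_similar([1, 2, 3, 4.2], db))
print(predict_next([10, 12, 14, 16]))
```

A brute-force scan like this is O(size of DB) per query, which is exactly why the choice of index and storage structure matters at scale.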
Let's assume that we want our analytic platform to run on data refreshed every
minute rather than data that was batch processed the day before. A day has 1,440
minutes, so that is roughly a 10^3 performance increase.
We have discussed before that a new device can be:
● An existing device previously not connected to the network
● A new device installed on a new product type
● An interface to a partner's device
● An interface to a customer's device
Let's just say you have to deal with a million new devices.
We now have a problem definition with an order of magnitude estimation: Provide
analytic capability 10^3 times faster on 10^6 times more time series data.
Are there data structures and algorithms that can provide for such an increase in
time series data?
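The order-of-magnitude estimate above can be checked with a few lines of arithmetic. The one-minute frequency and the million devices come from the discussion so far; the 100 bytes per reading is purely an assumed figure, included only to show how a storage estimate would follow.

```python
import math

MINUTES_PER_DAY = 24 * 60          # daily batch versus one-minute refresh
DEVICES = 10**6                    # "a million new devices"
BYTES_PER_READING = 100            # assumption, not from the source

speedup_order = round(math.log10(MINUTES_PER_DAY))
readings_per_day = DEVICES * MINUTES_PER_DAY
bytes_per_day = readings_per_day * BYTES_PER_READING

print(f"refresh speedup is about 10^{speedup_order}")
print(f"readings per day: {readings_per_day:,}")
print(f"raw volume per day: {bytes_per_day / 10**9:.0f} GB")
```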
Storing time series data means taking a large number of small files and persisting
them using the time (at a minimum) as the natural order.
The emphasis is on inserts, since there is no inherent requirement to update
or delete time series data.
This data will come from multiple sources so time (and possibly location) is really
the only metric that they must have in common.
In order to rapidly query time series data, there needs to be a clear and relevant
sequence-based index.
A reasonable key could be composed of metric name : timestamp : random 64-bit
integer : geo tag. If you needed to retrieve a block of data that would provide all of
the sensor readings from a particular device from a particular manufacturing plant
in the third quarter of 2014, that would certainly be readily available.
Note that this key also makes a strong input to a hashing function. Since a hash
table provides the fastest possible data retrieval (constant time on average), it is
very important to ensure that the hash is well distributed. A bad hash degrades a
hash table to a linked list and we will get data in O(n) rather than O(1).
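A minimal sketch of such a composite key, assuming it is modeled as a Python tuple kept in sorted order, shows why a time-range query becomes a cheap slice rather than a scan. The helper names, the metric names and the "plant-7" geo tag are hypothetical, and timestamps are plain integers (seconds) for simplicity.

```python
import bisect
import random


def make_key(metric, timestamp, geo_tag, rng):
    """Build (metric name, timestamp, random 64-bit integer, geo tag)."""
    return (metric, timestamp, rng.getrandbits(64), geo_tag)


def range_scan(sorted_keys, metric, start, end):
    """Return keys for `metric` with start <= timestamp < end using binary search."""
    lo = bisect.bisect_left(sorted_keys, (metric, start))
    hi = bisect.bisect_left(sorted_keys, (metric, end))
    return sorted_keys[lo:hi]


rng = random.Random(42)  # seeded so the sketch is repeatable
keys = sorted(
    make_key(metric, ts, "plant-7", rng)
    for metric in ("temp", "vibration")
    for ts in range(0, 600, 60)  # one reading per minute for ten minutes
)

hits = range_scan(keys, "temp", 120, 300)
print([key[1] for key in hits])  # timestamps of the matching readings
```

Because the metric name and timestamp lead the key, all readings for one metric in one time window sit next to each other in the sorted order.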
Time series data unfortunately tends to be heavy on disk I/O. Two measures matter:
● Seek time: the time it takes the drive head to reach the location on disk where
data will be read or written.
● Transfer rate: the speed at which data moves between the controller and the host
system (external rate) as well as between the disk surface and the controller
(internal rate).
As far as disk operations are concerned, it is better to transfer than to seek, and
even then it is helpful to minimize the frequency of the transfers in favor of larger
payloads.
Since the data is both sequential and immutable, we can have a reasonable
expectation that this can be optimized.
The seek times for RAM are in nanoseconds (1E-09) rather than milliseconds
(1E-03), so we can short-circuit any deep conversations about partial stroking and
hybrid drives.
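A back-of-the-envelope calculation shows what these orders of magnitude mean for a million inserts. The 1E-03 and 1E-09 second figures are the ones quoted above; the batch size of 10,000 is an assumption, and transfer time is ignored entirely, so treat the output as a rough comparison rather than a benchmark.

```python
DISK_SEEK_S = 1e-3     # order of magnitude for a disk seek
RAM_ACCESS_S = 1e-9    # order of magnitude for a memory access
INSERTS = 10**6
BATCH_SIZE = 10_000    # assumption: records buffered before one sequential write

seek_per_insert_total = INSERTS * DISK_SEEK_S          # one seek per record
in_memory_total = INSERTS * RAM_ACCESS_S               # all inserts land in RAM
batched_seeks = INSERTS // BATCH_SIZE
batched_total = batched_seeks * DISK_SEEK_S            # one seek per batch

print(f"one seek per insert: {seek_per_insert_total:.0f} s")
print(f"in memory:           {in_memory_total:.3f} s")
print(f"batched to disk:     {batched_total:.1f} s of seeking")
```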
We need a storage system that
1. minimizes seek time
2. is optimized for index searches
3. is optimized for sequential searches
4. is optimized for inserts
A storage system in this context refers to database engines, file systems,
messaging platforms, etc.: anything that would store the data.
For estimation purposes, we are looking at a storage system of O(log n) or better.
Since we are going to be looking at data at scale, we will need our performance
to level off.
What algorithm are you using today for your storage systems?
Almost certainly a B (or B+) Tree. The primary value of a B tree is in storing data
for efficient retrieval in a block-oriented storage context. Most file systems use
this structure and it is used by every major relational database vendor for their
key indexes.
B trees are great for random access. When inserting a record into a B Tree, you
need to search the tree to find the location to insert the record. Since B-Trees are
designed to be wide and shallow, there should be a minimal number of drive
seeks.
B-tree inserts are O(log n), which is arguably the mathematical lower bound
for balanced trees.
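The "wide and shallow" property is easy to quantify. The sketch below counts how many levels a tree needs so that fanout^levels covers n keys, which approximates the number of node visits (and potential disk seeks) per lookup or insert. The fanout of 100 is an assumed, illustrative value; real B-trees vary with page and key size.

```python
def tree_levels(n_keys, fanout):
    """Smallest number of levels such that fanout ** levels >= n_keys."""
    levels = 1
    capacity = fanout
    while capacity < n_keys:
        capacity *= fanout
        levels += 1
    return levels


n = 10**9
print("binary tree levels:", tree_levels(n, fanout=2))
print("wide B-tree levels:", tree_levels(n, fanout=100))
```

Both are O(log n), but the base of the logarithm is what turns dozens of seeks into a handful.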
So can we do better?
There is a data structure called a Log Structured Merge Tree (LSM) and a storage
model called Log Structured Storage (LSS) that provide the same estimated
performance, O(log n), as a B-tree but offer two key potential improvements
over B-trees for systems that are going to do large quantities of sequential
writes:
● Moving seek time from disk to memory
● Moving from block data to log structured data
LSM Tree
An LSM-Tree is a hybrid tree model that uses two trees: C0 and C1. C0 is smaller
and entirely resident in memory, whereas C1 is resident on disk. New records are
written to C0 and migrated from C0 to C1 once C0 passes a size threshold.
Insertions now run primarily at RAM rather than HDD speeds, or 1E-09 rather
than 1E-03 seconds. Of course, they are eventually written to disk, but that is
where the LSS comes in.
Note that many production systems concurrently write to a commit log on
disk and to C0, with the commit log being deleted after C0 is flushed.
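A toy model may help fix the moving parts. In the sketch below, a dict stands in for the in-memory component C0, a list of sorted runs stands in for the on-disk component C1, and a plain list stands in for the on-disk commit log. The class name, the threshold of three records and the linear scan on reads are all simplifications assumed for illustration; nothing here touches a real disk or merges runs.

```python
class TinyLSM:
    def __init__(self, memtable_limit=3):
        self.memtable_limit = memtable_limit
        self.commit_log = []   # stand-in for the durable commit log
        self.c0 = {}           # in-memory component
        self.c1_runs = []      # stand-in for sorted, immutable on-disk runs

    def put(self, key, value):
        self.commit_log.append((key, value))  # logged first for recovery
        self.c0[key] = value                  # then inserted at memory speed
        if len(self.c0) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Write C0 out as one sorted run, then clear C0 and the commit log.
        self.c1_runs.append(sorted(self.c0.items()))
        self.c0 = {}
        self.commit_log = []

    def get(self, key):
        if key in self.c0:
            return self.c0[key]
        for run in reversed(self.c1_runs):    # newest run first
            for k, v in run:                  # linear scan, for brevity only
                if k == key:
                    return v
        return None


db = TinyLSM(memtable_limit=3)
for ts in range(5):
    db.put(("temp", ts), 20 + ts)

print(len(db.c1_runs), "flushed run(s);", len(db.c0), "record(s) still in C0")
print(db.get(("temp", 1)), db.get(("temp", 4)))
```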
LSS
In a traditional storage system, there needs to be a considerable amount of
overhead for updating and deleting existing members. In a log structured storage
system, this overhead does not exist because a log structured storage system
provides for an append-only sequence of data entries. Unlike a B+ Tree-based
system, you don't find a location for new data, you merely append it to the end.
Because new records are always added to the end, there is never any need for
searching a tree for insertion, like in a B-tree storage structure. This allows for
extremely predictable horizontal scaling.
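The append-only idea can be sketched in a few lines. Here an in-memory `io.BytesIO` buffer stands in for the log file, each record is written with a 4-byte length prefix, and a list of offsets serves as the index. The class and its record format are assumptions for illustration, not the format of any particular storage engine.

```python
import io


class AppendOnlyLog:
    def __init__(self):
        self._buf = io.BytesIO()   # stand-in for a file opened in append mode
        self._offsets = []         # record number -> byte offset

    def append(self, record: bytes) -> int:
        """Write the record at the end of the log; no search for a location."""
        self._buf.seek(0, io.SEEK_END)
        offset = self._buf.tell()
        self._buf.write(len(record).to_bytes(4, "big") + record)
        self._offsets.append(offset)
        return len(self._offsets) - 1

    def read(self, index: int) -> bytes:
        self._buf.seek(self._offsets[index])
        size = int.from_bytes(self._buf.read(4), "big")
        return self._buf.read(size)


log = AppendOnlyLog()
first = log.append(b"temp=20.1")
second = log.append(b"temp=20.4")
print(log.read(second))
```

Note that `append` never inspects existing records, which is where the constant-time write comes from.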
Providing concurrency and transactional semantics using Multiversion
Concurrency Control (MVCC) is easier in LSS than in a B-tree since existing data
is not modified. A view of the system at state Q at time A is just as valid as a view
of Q at time B.
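One way to picture this, assuming a single append-only list of entries: a snapshot is nothing more than the length of the log at the moment it was taken, and a reader at that snapshot simply ignores anything appended later. The class below is a hypothetical sketch of that idea, not a full MVCC implementation (there are no transactions, conflicts or garbage collection).

```python
class VersionedLog:
    def __init__(self):
        self._entries = []   # append-only; existing entries are never modified

    def append(self, key, value):
        self._entries.append((key, value))

    def snapshot(self):
        """A snapshot is just the current length of the log."""
        return len(self._entries)

    def read(self, key, snapshot):
        """Latest value for `key` as of `snapshot`, ignoring later appends."""
        for k, v in reversed(self._entries[:snapshot]):
            if k == key:
                return v
        return None


state = VersionedLog()
state.append("valve", "open")
time_a = state.snapshot()
state.append("valve", "closed")
time_b = state.snapshot()

print(state.read("valve", time_a), state.read("valve", time_b))
```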
Being able to manage concurrency and transactions in a distributed environment
just by using immutable objects is a key to successful software development
projects.
LSS/LSM versus B-Tree
While both options provide O(log n) performance, the LSS/LSM algorithm and
data structure solution is clearly optimized for the IoT use case and can give us
the order of magnitude speed increases that we need.
Because insertions into a Log Structured Merge Tree occur in memory rather
than on disk, we are inserting into a medium that takes 1E-09 rather than 1E-03
seconds. For relational databases, that insertion would also occur in main
memory, but only for GB-size datasets, not TB-size ones.
Once the LSM writes to disk, the Log Structured Storage system will always write
to the end of the file with no searching or sorting. This actually occurs in O(1)
time.
We have identified an optimal data structure and algorithm. Now we need to
identify the level of compliance needed for the data. You may have seen the CAP
Theorem:
The CAP Theorem
Consistency
All clients see the same view of the data, even in the presence
of updates
Availability
All clients can find some replica of the data, even in the
presence of failure
Partition Tolerance
The system property holds even if the system is
partitioned.
Now, define your problem set(s) and pick two. The easiest way to identify where a
use case falls on the CAP Theorem is to identify the consistency model you need.
For time series data from devices, availability and partition tolerance are key
drivers. The data should never be lost and the system should never be unavailable.
The data should be partitioned across multiple nodes based on a reasonable hash in
order to avoid the hot-spotting problem that can arise with time series data
indexes. This is an AP model.
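The hot-spotting problem is easy to reproduce in miniature. In the sketch below, partitioning by time window sends every write in the current window to the same partition, while partitioning by a hash of the device id spreads the same writes out. The four partitions, the one-hour window, the device ids and the use of SHA-256 are all assumptions chosen for illustration.

```python
import hashlib
from collections import Counter

PARTITIONS = 4


def partition_by_time(timestamp, window=3600):
    """Naive scheme: all writes in the same time window share a partition."""
    return (timestamp // window) % PARTITIONS


def partition_by_hash(device_id):
    """Hash scheme: the partition depends on a digest of the device id."""
    digest = hashlib.sha256(device_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % PARTITIONS


# Hypothetical burst: 1,000 devices all reporting within the same minute.
writes = [(f"device-{i}", 1_400_000_000 + i % 60) for i in range(1000)]

by_time = Counter(partition_by_time(ts) for _, ts in writes)
by_hash = Counter(partition_by_hash(dev) for dev, _ in writes)

print("by time:", dict(by_time))
print("by hash:", dict(by_hash))
```

The cost of the hash scheme is that a time-range query now has to touch every partition, which is one reason practical keys combine a hashed or bucketed prefix with a timestamp.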
By contrast, consider a banking system. If a customer makes a transfer from
checking to savings, anyone who looks at that data must see the same result.
This would not be the case if the checking and savings accounts were separately
partitioned, so this is a CA model.
By the way, relational databases are all CA, and CA is the only way to be ACID.
NoSQL databases are either CP or AP.
So what are our choices?
Consistency
Commits are available across entire distributed system
Availability
System remains accessible and operational at all times
Partition Tolerance
Only a total system failure can cause the system to respond incorrectly
Now, define your problem set(s) and pick two.
CA: Traditional relational databases
AP: Dynamo-like systems, Cassandra, CouchDB, Voldemort, Riak
CP: BigTable-like systems, MongoDB, HBase, Memcached, Redis
There are caveats here. For your company, you may value CP over AP, in which
case you may prefer MongoDB or HBase. Your company may already have a
Hadoop stack, and HBase is part of the basic ecosystem. MongoDB is very easy
to query and uses BSON (binary JSON) as its storage format, making it very
easy to use JSON across the stack.
My personal bias is towards DataStax's Cassandra offering. It provides elastic
scalability and a flexible data model, and the Cassandra Query Language looks a
lot like SQL.
IBM has now started offering CouchDB in their BlueMix offering since they
acquired Cloudant. CouchDB is similar to MongoDB in its JSON integration, but
(at the time of this writing) queries needed to be structured as map-reduce rather
than SQL.
To summarize,
The best data structure for inserting into our persistent storage engine for time
series data would run in O(log n), or logarithmic, time.
B+ Trees and Log Structured Merge Trees are both appropriate, but the LSM
Tree will deliver better performance for inserting time series data. Proper
configuration of the LSM Tree engine could move a substantial amount of the
operations from disk to memory (1E-09 rather than 1E-03 seconds).
Since we need to process 1E06 times more data, moving as much processing
as possible from 1E-03 to 1E-09 seconds will absolutely get us there.
The consistency model will likely be availability and partition tolerance (AP).