IoT and Big Data - IoT Asia 2014

© 2014 MapR Technologies
The Internet of Things and Big Data: Intro

What This Is; What This Is Not
• It’s not specific to IoT
– It’s not about any specific type of data or protocol
– It’s not specific to any particular industry
• It’s about processing big data
– IoT data can be big data
– IoT might be the biggest data of the coming decade
– But it’s just big data
– Same strategies & technologies apply

When Does Data Become “Big”?
• When the size of the data, itself, becomes a problem
• When the “old way” of processing data just doesn’t work effectively
• It’s “big” when we have to rethink:
– How we store that much data
– How we move that much data
– How we extract, load & transform that much data
– How we explore and analyze that much data
– How we process and get meaningful insights from that much data

C’mon! What does that mean in size?
• Not gigabytes
• Most likely not a few terabytes
• Possibly not 10’s of terabytes
• Probably 100’s of terabytes
• Definitely petabytes

So How Do We Handle Big Data?
• Distribute & parallelize!
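
As a minimal sketch of "distribute & parallelize", the example below splits a dataset into partitions, computes a partial result on each one independently, and then combines the partials. It is only an illustration: threads stand in for cluster nodes, and the function names and temperature readings are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(partition):
    """Work done independently on one partition: return (sum, count)."""
    return sum(partition), len(partition)


def distributed_mean(partitions, workers=4):
    """Run partial_sum on every partition in parallel, then combine."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum, partitions))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


# Hypothetical temperature readings, already split into three partitions.
partitions = [[20.0, 22.0], [18.0, 24.0, 21.0], [21.0]]
mean_temperature = distributed_mean(partitions)
```

Each partition only ever returns a small (sum, count) pair, so the combining step stays cheap no matter how large the partitions grow.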

MPP Analytic Databases or Hadoop

Big Data Analytics
Bridging classic & big data worlds
Classic Method: Structured & Repeatable Analysis
• “Capture only what’s needed”
• SQL performance and structure
• Business determines what questions to ask
• IT structures the data to answer those questions
Big Data Method: Multi-structured & Iterative Analysis
• “Capture in case it’s needed”
• Hadoop scale and flexibility
• IT delivers a platform for storing, refining, and analyzing all data sources
• Business explores data for questions worth answering

Philosophical Differences
Traditional Methods
• More power
• Summarize data
• Transform and store
• Pre-defined schema
• Move data -> compute
• Less data / more complex
algorithms
Big Data
• More machines
• Keep all data
• Transform on demand
• Flexible / no schema
• Move compute -> data
• More data / simple
algorithms

answer = f(all data)
• Save all raw data
• Data immutability
• Transform as needed
• Result is based on the raw data
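
One way to picture answer = f(all data) is sketched below: the raw log is an immutable tuple that is only ever extended, and every answer is a function recomputed from that raw log. The `Reading` type, the sample values, and the two query functions are assumptions made for illustration.

```python
from typing import NamedTuple


class Reading(NamedTuple):
    """One raw, immutable sensor event (hypothetical shape)."""
    sensor_id: str
    celsius: float


# The raw log is a tuple: it can be extended, never edited in place.
RAW_LOG = (
    Reading("s1", 20.0),
    Reading("s2", 30.0),
    Reading("s1", 22.0),
)


def append(log, reading):
    """Return a new log with one more event; the old log is untouched."""
    return log + (reading,)


def max_per_sensor(log):
    """One possible f: highest reading seen for each sensor."""
    result = {}
    for r in log:
        result[r.sensor_id] = max(result.get(r.sensor_id, r.celsius), r.celsius)
    return result


def mean_fahrenheit(log):
    """Another f: the unit conversion happens on demand, not at load time."""
    values = [r.celsius * 9 / 5 + 32 for r in log]
    return sum(values) / len(values)


log2 = append(RAW_LOG, Reading("s2", 28.0))
answer_a = max_per_sensor(log2)
answer_b = mean_fahrenheit(log2)
```

Because nothing is summarized or overwritten at load time, a new question later only needs a new function, not a new ingest.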

Q&A
@mapr maprtech
jberns@mapr.com
Engage with us!
MapR
maprtech
mapr-technologies

IoT and Big Data:
Hadoop as a Data Platform

Hadoop: The Disruptive Technology at the Core of Big Data

Forces of Adoption
Hadoop TAM comes from disrupting enterprise data warehouse and storage spending
[Chart, 2013-2017: data growing at 40%; IT budgets growing at 2.5%]
$ per terabyte:
• Enterprise storage: $9,000
• Database warehouse: $40,000
• Hadoop: <$1,000
Sources:
• Gartner, “Forecast Analysis: Enterprise IT Spending by Vertical Industry Market, Worldwide, 2010-2016, 3Q12 Update.”
• Wall Street Journal, “Financial Services Companies Firms See Results from Big Data Push”, Jan. 27, 2014

Hadoop 101 (External Presentation)

Hadoop Hardware

Typical Compute Node
• Two CPUs, each with 4-8 cores per CPU
• 32-128 GB Memory
• 6-24 hard disks
• 2-4 10 Gb network cards

Hadoop Ecosystem

Ecosystem of Projects Built on Hadoop

SQL On Hadoop

SQL on Hadoop
• Generally data has no inherent “schema”
• Schema is defined by user / interpreted from structure
• Schema is applied during processing
• One file can have many schemas applied
• Works for many kinds of data—but not all
– Temperature sensor data? Sure
– Video feeds? Not really
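
The schema-on-read idea above can be sketched in a few lines: the raw file stays untyped text, and a schema is applied only while reading, so the same file can be read under more than one schema. This is a plain-Python illustration, not how any particular SQL-on-Hadoop engine works; the file contents, column names, and `read_with_schema` helper are hypothetical.

```python
import csv
import io

# Raw file contents: plain delimited text with no schema attached.
RAW_FILE = "s1,2014-04-21T10:00:00,21.5\ns2,2014-04-21T10:00:05,19.0\n"


def read_with_schema(raw_text, schema):
    """Apply a schema at read time.

    schema is a list of (column_name, converter) pairs; fields beyond
    the end of the schema are simply ignored.
    """
    rows = []
    for fields in csv.reader(io.StringIO(raw_text)):
        row = {name: convert(value)
               for (name, convert), value in zip(schema, fields)}
        rows.append(row)
    return rows


# Two different schemas applied to the same raw file.
full_schema = [("sensor_id", str), ("timestamp", str), ("celsius", float)]
ids_only_schema = [("sensor_id", str)]

typed_rows = read_with_schema(RAW_FILE, full_schema)
id_rows = read_with_schema(RAW_FILE, ids_only_schema)
```

This works for delimited sensor readings because each line splits cleanly into fields; a video feed has no comparable row-and-column structure to interpret.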

Key Use Cases
Big Data Exploration
• Exploratory analysis on large-scale raw data
• Unknown value
• No defined schema
• Variety of data types
Big Data Analysis
• Large-scale SQL queries on long history
• Well-defined schema
• Known value, but high cost in existing systems

What is Driving the Need for SQL-on-Hadoop?
Organizations are looking for:
• Reuse of existing tools and skills to unlock Hadoop data for a broader audience
• Analysis of new types of data
• More complete data analysis
• More up-to-date and real-time data analysis (not just “after the fact”)

SQL on Hadoop: Many Options
Flexibility to choose when to use which based on use case

| | Drill 1.0 | Hive 0.13 with Tez | Impala 1.x | Presto 0.56 | Shark 0.8 | Vertica |
| Latency | Low | Medium | Low | Low | Medium | Low |
| Files | Yes (all Hive file formats) | Yes (all Hive file formats) | Yes (Parquet, Sequence, …) | Yes (RC, Sequence, Text) | Yes (all Hive file formats) | Yes (all Hive file formats) |
| HBase/M7 | Yes | Yes | Various issues | No | Yes | No |
| Schema | Hive or schema-less | Hive | Hive | Hive | Hive | Proprietary or Hive |
| SQL support | ANSI SQL | HiveQL | HiveQL (subset) | ANSI SQL | HiveQL | ANSI SQL + advanced analytics |
| Client support | ODBC/JDBC | ODBC/JDBC | ODBC/JDBC | ODBC/JDBC | ODBC/JDBC | ODBC/JDBC, ADO.NET, … |
| Large joins | Yes | Yes | No | No | No | Yes |
| Nested data | Yes | Limited | No | Limited | Limited | Limited |
| Hive UDFs | Yes | Yes | Limited | No | Yes | No |
| Transactions | No | No | No | No | No | Yes |
| Optimizer | Limited | Limited | Limited | Limited | Limited | Yes |
| Concurrency | Limited | Limited | Limited | Limited | Limited | Yes |

Proven Hadoop Production Success
Enterprise Data Hub
• Multi-structured data staging & archive
• ETL / DW optimization
• Mainframe optimization
• Data exploration
Marketing Analytics
• Recommendation engines & targeting
• Ad optimization
• Pricing analysis
• Lead scoring
Risk Analytics
• Network security monitoring
• Security information & event management
• Fraudulent behavioral analysis
Operations Intelligence
• Supply chain & logistics
• System log analysis
• Manufacturing quality assurance
• Preventative maintenance
• Sensor analysis

Other Tools & Frameworks of Note

Pig
• Procedural Language
• Loops, if-then statements

Cascading
• MapReduce framework
• Lingual: SQL-like operations
• Pattern: Machine Learning Applications
• Scalding: Cascading for Scala
• Cascalog: Cascading for Clojure

Spark
• Python, Scala and Java
• Spark powers a stack of high-level tools including
– Shark for SQL,
– MLlib for machine learning,
– GraphX, and
– Spark Streaming.
• You can combine these frameworks seamlessly in the same
application.

• Machine Learning / Predictive Analytics
– Collaborative Filtering
– Linear / Logistic Regression
– Naïve Bayes
– Random Forests
– K-Means Clustering
– Canopy Clustering
– Principal Component Analysis

• Database on Hadoop
• Highly scalable
• Columnar – Flexible schema
• Data source for MapReduce and Spark jobs

IoT and Big Data:
Architectures & Use Cases

NoSQL

NoSQL Databases
• No-SQL or “Not only” SQL
• Give up some of the functionality of traditional relational
databases for speed and scalability
• Types
– Key-Value
– Columnar
– Document
– Graph
• NoSQL databases favor flexible schemas
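
To make the four types a little more concrete, here is a rough sketch of how the same sensor information might be shaped in each model, using plain Python structures rather than any real database. The keys, field names, and values are hypothetical, and "columnar" is read here as the wide-column (column family) layout.

```python
# Key-value: an opaque value looked up by a single key.
key_value = {"sensor:s1:latest": "21.5"}

# Columnar (wide-column): row key -> column family -> column -> value.
columnar = {
    "s1": {"reading": {"celsius": 21.5}, "meta": {"site": "plant-a"}},
    "s2": {"reading": {"celsius": 19.0}},  # no "meta" family on this row
}

# Document: a self-describing nested record.
document = {"_id": "s1", "site": "plant-a", "readings": [{"celsius": 21.5}]}

# Graph: nodes plus the edges connecting them (adjacency lists).
graph = {"s1": ["gateway-1"], "s2": ["gateway-1"], "gateway-1": ["s1", "s2"]}

# Flexible schema: one row gains a new column without touching the others.
columnar["s1"]["reading"]["humidity"] = 40

neighbours_of_gateway = sorted(graph["gateway-1"])
```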

HBase

Queues

Queues
• Just like a queue at an amusement park
• First in, first out (FIFO)
• Queues messages or events
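
A minimal in-memory sketch of first-in, first-out behaviour, using Python's standard `collections.deque`. A real message queue adds persistence and networking on top of this; the event names are hypothetical.

```python
from collections import deque

queue = deque()

# Producers add events at the back of the line.
for event in ("door-open", "temp-21.5", "door-closed"):
    queue.append(event)

# Consumers take events from the front, in arrival order.
first = queue.popleft()
second = queue.popleft()
remaining = list(queue)
```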

Message Queue

Stream Processing

Stream Processing
• Handles data at high velocity
• If Hadoop is the ocean, streams are the firehose
• Processing in near real-time
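
As a small sketch of the idea, the generator below handles one event at a time as it arrives and emits an updated result immediately, rather than waiting for a complete batch. It is plain Python, not Storm; the window size and readings are hypothetical.

```python
from collections import deque


def rolling_mean(stream, window=3):
    """Yield the mean of the last `window` values after each new event."""
    recent = deque(maxlen=window)  # old values fall off automatically
    for value in stream:
        recent.append(value)
        yield sum(recent) / len(recent)


# A finite iterator stands in for an unbounded stream of readings.
readings = iter([20.0, 22.0, 24.0, 26.0])
means = list(rolling_mean(readings, window=3))
```

Only the last few values are held in memory, which is what lets this style of processing keep up with data that never stops arriving.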

Storm

Batch Processing

Combination Architectures

Lambda Architecture

Complex Architectures Using Many Big Data Technologies

Wanna Play?
• http://www.mapr.com/products/mapr-sandbox-hadoop

