© 2014 MapR Technologies
The Internet of Things and Big Data: Intro
What This Is; What This Is Not
• It’s not specific to IoT
– It’s not about any specific type of data or protocol
– It’s not specific to any particular industry
• It’s about processing big data
– IoT data can be big data
– IoT might be the biggest data of the coming decade
– But it’s just big data
– Same strategies & technologies apply
When Does Data Become “Big?”
• When the size of the data, itself, becomes a problem
• When the “old way” of processing data just doesn’t work effectively
• It’s “big” when we have to rethink:
– How we store that much data
– How we move that much data
– How we extract, load & transform that much data
– How we explore and analyze that much data
– How we process and get meaningful insights from that much data
C’mon! What does that mean in size?
• Not gigabytes
• Most likely not a few terabytes
• Possibly not tens of terabytes
• Probably hundreds of terabytes
• Definitely petabytes
So How Do We Handle Big Data?
• Distribute & parallelize!
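As a rough illustration of the distribute-and-parallelize idea, the sketch below splits a dataset into partitions, computes a partial result for each partition in parallel, and then merges the partials. The names `partition`, `partial_sum`, and `parallel_sum` are hypothetical, and a local thread pool stands in for a cluster of machines; none of this comes from the deck itself.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, n_parts):
    """Split data into at most n_parts chunks (a stand-in for blocks spread across nodes)."""
    size = max(1, (len(data) + n_parts - 1) // n_parts)
    return [data[i:i + size] for i in range(0, len(data), size)]


def partial_sum(chunk):
    """Work that each worker performs independently on its own partition."""
    return sum(chunk)


def parallel_sum(data, n_parts=4):
    """Distribute the chunks to workers, then merge the partial results."""
    chunks = partition(data, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(partial_sum, chunks))
    return sum(partials)
```

The same split, compute, and merge shape applies whether the workers are threads on one machine or nodes in a cluster; only the cost of moving data between them changes.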
MPP Analytic Databases or Hadoop
Big Data Analytics
Bridging classic & big data worlds
Classic Method: Structured & Repeatable Analysis
• Business determines what questions to ask
• IT structures the data to answer those questions
• “Capture only what’s needed”
• SQL performance and structure
Big Data Method: Multi-structured & Iterative Analysis
• IT delivers a platform for storing, refining, and analyzing all data sources
• Business explores data for questions worth answering
• “Capture in case it’s needed”
• Hadoop scale and flexibility
Philosophical Differences
Traditional Methods
• More power
• Summarize data
• Transform and store
• Pre-defined schema
• Move data -> compute
• Less data / more complex algorithms
Big Data
• More machines
• Keep all data
• Transform on demand
• Flexible / no schema
• Move compute -> data
• More data / simple algorithms
answer = f(all data)
• Save all raw data
• Data immutability
• Transform as needed
• Result is based on the raw data
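One way to picture "answer = f(all data)" is an append-only log of raw records with every answer recomputed from that log. The sketch below assumes hypothetical sensor readings shaped as `(sensor_id, timestamp, temperature_c)`; the record layout and the function names are illustrative only.

```python
from collections import defaultdict

# Hypothetical raw readings: (sensor_id, timestamp, temperature_c).
# A tuple is used so that records cannot be modified in place.
RAW_EVENTS = (
    ("s1", 1, 20.0),
    ("s1", 2, 22.0),
    ("s2", 1, 30.0),
)


def append(raw, event):
    """Return a new log with the event added; existing records stay untouched."""
    return raw + (event,)


def average_by_sensor(raw):
    """One possible f(all data): recompute the average from the raw records."""
    totals = defaultdict(lambda: [0.0, 0])
    for sensor_id, _ts, temp in raw:
        totals[sensor_id][0] += temp
        totals[sensor_id][1] += 1
    return {s: total / count for s, (total, count) in totals.items()}


def max_by_sensor(raw):
    """A different question, asked later of the same raw data."""
    result = {}
    for sensor_id, _ts, temp in raw:
        result[sensor_id] = max(temp, result.get(sensor_id, temp))
    return result
```

Because nothing is summarized away at load time, a question nobody anticipated (here, the maximum) can still be answered from the same records.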
Q&A
@mapr maprtech
jberns@mapr.com
Engage with us!
MapR
maprtech
mapr-technologies
IoT and Big Data:
Hadoop as a Data Platform
Hadoop: The Disruptive Technology at the Core of Big Data
Forces of Adoption
Hadoop TAM comes from disrupting enterprise data warehouse and storage spending
[Chart, 2013-2017: data growing at 40%; IT budgets growing at 2.5%]
$ per terabyte:
• Enterprise storage: $9,000
• Database warehouse: $40,000
• Hadoop: <$1,000
Sources:
• Gartner, “Forecast Analysis: Enterprise IT Spending by Vertical Industry Market, Worldwide, 2010-2016, 3Q12 Update.”
• Wall Street Journal, “Financial Services Companies Firms See Results from Big Data Push”, Jan. 27, 2014
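The gap between the two growth rates on this slide (data at 40%, IT budgets at 2.5%) compounds quickly. The short calculation below assumes both rates are annual and hold constant from 2013 to 2017; that assumption and the helper name `compound` are not from the deck.

```python
def compound(start, annual_rate, years):
    """Value after compounding a constant annual growth rate."""
    return start * (1 + annual_rate) ** years


YEARS = 4  # 2013 -> 2017, assuming the slide's rates are annual

data_growth = compound(1.0, 0.40, YEARS)     # about 3.84x the 2013 data volume
budget_growth = compound(1.0, 0.025, YEARS)  # about 1.10x the 2013 budget

# How much more data each budget dollar has to cover by 2017.
pressure = data_growth / budget_growth

print(f"data x{data_growth:.2f}, budget x{budget_growth:.2f}, ratio x{pressure:.2f}")
```

Under these assumptions each budget dollar has to cover roughly three and a half times as much data after four years, which is the pressure on cost per terabyte that the slide points at.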
Hadoop 101 (External Presentation)
Hadoop Hardware
Typical Compute Node
• Two CPUs, each with 4-8 cores
• 32-128 GB memory
• 6-24 hard disks
• 2-4 10 GbE network cards
Hadoop Ecosystem
Ecosystem of Projects Built on Hadoop
SQL On Hadoop
SQL on Hadoop
• Generally data has no inherent “schema”
• Schema is defined by user / interpreted from structure
• Schema is applied during processing
• One file can have many schemas applied
• Works for many kinds of data—but not all
– Temperature sensor data? Sure
– Video feeds? Not really
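The idea that a schema is applied during processing, and that one file can have many schemas applied, can be sketched as follows. The raw text, field names, and the `read_with_schema` helper are hypothetical; real SQL-on-Hadoop engines do this at far larger scale and with their own metadata.

```python
import csv
import io

# Hypothetical raw file contents: no schema is stored with the data.
RAW = "s1,2014-04-01T00:00:00,21.5\ns2,2014-04-01T00:00:05,19.0\n"


def read_with_schema(raw_text, schema):
    """Apply a schema, given as a list of (name, converter) pairs, while reading."""
    rows = []
    for fields in csv.reader(io.StringIO(raw_text)):
        # zip stops at the shorter sequence, so a schema may cover only the leading fields.
        rows.append({name: conv(value) for (name, conv), value in zip(schema, fields)})
    return rows


# Schema A: every field, with the temperature parsed as a number.
TYPED = [("sensor", str), ("ts", str), ("temp_c", float)]

# Schema B: only the sensor id, for example to list which sensors reported.
IDS_ONLY = [("sensor", str)]

typed_rows = read_with_schema(RAW, TYPED)
id_rows = read_with_schema(RAW, IDS_ONLY)
```

The file itself never changes; only the interpretation chosen at read time does.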
Key Use Cases
Big Data Exploration
• Exploratory analysis on large-scale raw data
• Unknown value
• No defined schema
• Variety of data types
Big Data Analysis
• Large-scale SQL queries on long history
• Well-defined schema
• Known value, but high cost in existing systems
What is Driving the Need for SQL-on-Hadoop?
Organizations are looking for:
• Reuse of existing tools and skills to unlock Hadoop data for a broader audience
• Analysis of new types of data
• More complete data analysis
• More up-to-date and real-time data analysis (not just “after the fact”)
SQL on Hadoop: Many Options
Flexibility to choose when to use which based on use case
| | Drill 1.0 | Hive 0.13 with Tez | Impala 1.x | Presto 0.56 | Shark 0.8 | Vertica |
| Latency | Low | Medium | Low | Low | Medium | Low |
| Files | Yes (all Hive file formats) | Yes (all Hive file formats) | Yes (Parquet, Sequence, …) | Yes (RC, Sequence, Text) | Yes (all Hive file formats) | Yes (all Hive file formats) |
| HBase/M7 | Yes | Yes | Various issues | No | Yes | No |
| Schema | Hive or schema-less | Hive | Hive | Hive | Hive | Proprietary or Hive |
| SQL support | ANSI SQL | HiveQL | HiveQL (subset) | ANSI SQL | HiveQL | ANSI SQL + advanced analytics |
| Client support | ODBC/JDBC | ODBC/JDBC | ODBC/JDBC | ODBC/JDBC | ODBC/JDBC | ODBC/JDBC, ADO.NET, … |
| Large joins | Yes | Yes | No | No | No | Yes |
| Nested data | Yes | Limited | No | Limited | Limited | Limited |
| Hive UDFs | Yes | Yes | Limited | No | Yes | No |
| Transactions | No | No | No | No | No | Yes |
| Optimizer | Limited | Limited | Limited | Limited | Limited | Yes |
| Concurrency | Limited | Limited | Limited | Limited | Limited | Yes |
Proven Hadoop Production Success
ENTERPRISE DATA HUB
• Multi-structured data staging & archive
• ETL / DW optimization
• Mainframe optimization
• Data exploration
MARKETING ANALYTICS
• Recommendation engines & targeting
• Ad optimization
• Pricing analysis
• Lead scoring
RISK ANALYTICS
• Network security monitoring
• Security information & event management
• Fraudulent behavioral analysis
OPERATIONS INTELLIGENCE
• Supply chain & logistics
• System log analysis
• Manufacturing quality assurance
• Preventative maintenance
• Sensor analysis
Other Tools & Frameworks of Note
Pig
• Procedural Language
• Loops, if-then statements
Cascading
• MapReduce framework
• Lingual: SQL-like operations
• Pattern: Machine Learning Applications
• Scalding: Cascading for Scala
• Cascalog: Cascading for Clojure
Spark
• APIs in Python, Scala and Java
• Spark powers a stack of high-level tools including
– Shark for SQL,
– MLlib for machine learning,
– GraphX, and
– Spark Streaming.
• You can combine these frameworks seamlessly in the same
application.
• Machine Learning / Predictive Analytics
– Collaborative Filtering
– Linear / Logistic Regression
– Naïve Bayes
– Random Forests
– K-Means Clustering
– Canopy Clustering
– Principal Component Analysis
• Database on Hadoop
• Highly scalable
• Columnar – Flexible schema
• Data source for MapReduce and Spark jobs
IoT and Big Data:
Architectures & Use Cases
NoSQL
NoSQL Databases
• No-SQL or “Not only” SQL
• Give up some of the functionality of traditional relational
databases for speed and scalability
• Types
– Key-Value
– Columnar
– Document
– Graph
• NoSQL databases favor flexible schemas
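The four NoSQL types listed above differ mainly in how a record is shaped. The sketch below models each with plain Python structures; the sample keys and values are hypothetical, "columnar" is modeled as a wide-column layout (an assumption), and `put_column` is an illustrative helper rather than any real database API.

```python
# Key-value: an opaque value looked up by a single key.
key_value = {"sensor:s1:latest": "21.5"}

# Columnar, modeled here as wide-column: row key -> column family -> column -> value.
columnar = {"s1": {"reading": {"temp_c": 21.5}}}

# Document: a self-describing nested record.
document = {"_id": "s1", "site": "A", "readings": [{"ts": 1, "temp_c": 21.5}]}

# Graph: nodes plus edges, stored here as an adjacency list.
graph = {"gateway1": ["s1", "s2"], "s1": [], "s2": []}


def put_column(table, row_key, family, column, value):
    """Add a column to one row without declaring it anywhere first (flexible schema)."""
    table.setdefault(row_key, {}).setdefault(family, {})[column] = value


# Row "s2" gains a humidity column that row "s1" does not have.
put_column(columnar, "s2", "reading", "humidity", 0.40)
```

Rows in the same table can carry different columns, which is the flexibility traded against the guarantees of a fixed relational schema.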
HBase
Queues
Queues
• Just like a queue at an amusement park
• First in, first out (FIFO)
• Queues messages or events
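The first-in, first-out behavior can be shown with a minimal in-memory queue. The `MessageQueue` class below is a hypothetical sketch built on Python's `collections.deque`; it leaves out what production message queues add, such as persistence, multiple consumers, and delivery guarantees.

```python
from collections import deque


class MessageQueue:
    """A minimal in-memory FIFO queue of messages or events."""

    def __init__(self):
        self._items = deque()

    def publish(self, message):
        """Producers add messages at the back of the line."""
        self._items.append(message)

    def consume(self):
        """Consumers take the oldest message from the front, or None if the queue is empty."""
        return self._items.popleft() if self._items else None

    def __len__(self):
        return len(self._items)


queue = MessageQueue()
for event in ("door-open", "temp-21.5", "door-closed"):
    queue.publish(event)

first = queue.consume()   # the earliest event published
second = queue.consume()  # the next one, in arrival order
```

The queue decouples producers from consumers: devices can keep publishing while a slower consumer works through the backlog in arrival order.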
Message Queue
Stream Processing
Stream Processing
• Handles data at high velocity
• If Hadoop is the ocean, streams are the firehose
• Processing in near real-time
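Stream processing handles each event as it arrives instead of waiting for a complete dataset. The generator below is a small sketch of that idea, a rolling average over the most recent readings; the function name `rolling_average` and the window size are assumptions made for illustration.

```python
from collections import deque


def rolling_average(stream, window=3):
    """Yield the average of the last `window` values as each new value arrives."""
    recent = deque(maxlen=window)  # older values fall off automatically
    for value in stream:
        recent.append(value)
        yield sum(recent) / len(recent)


# Any iterable works as the stream, including one that never ends.
readings = [10, 20, 30, 40]
averages = list(rolling_average(readings, window=2))
```

Only a small window of state is kept, so a result is available after every event rather than after a batch job finishes.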
Storm
Batch Processing
Combination Architectures
Lambda Architecture
Complex Architectures Using Many Big Data Technologies
Wanna Play?
• http://www.mapr.com/products/mapr-sandbox-hadoop

IoT and Big Data - IoT Asia 2014

Editor's Notes

  • #15 Let’s start with this chart, to reinforce that you’re in the right room and picked the right session: Hadoop. Not only is it the fastest-growing big data technology, it is one of the fastest-growing technologies, period. Hadoop adoption is happening across industries and across a wide range of application areas. What’s driving this adoption?
  • #16 Need a platform that serves the broadest set of use cases.
  • #25 Large media company: 30 days’ worth of data in GP; 90 days in Hadoop (5 petabytes). They want to make all data available for analysis, which they cannot do with GP (400 nodes would be required). They want to make SQL-on-Hadoop available; if people are happy with the performance, they will transition workloads to Hadoop. They have 200 nodes of GP today (analytics platform); aggregates in the DW are 40 nodes.
  • #27 Hadoop is being used in lots of different use cases across a variety of industries. One way to think of this is by functional area of an organization (from left to right: CIO/chief data officer; CMO (marketing); CSO or CRO (chief security or risk); and the COO, head of quality, or IT operations). We have many customers in each of these areas. Here are some example customers of MapR (give example snippets of each).