SlideShare a Scribd company logo
1 of 58
Download to read offline
A REFERENCE
ARCHITECTURE
FOR INTERNET
OF THINGS
SUJEE MANIYAM
FOUNDER / PRINCIPAL @ ELEPHANTSCALE
SUJEE@ELEPHANTSCALE.COM
(c) Elephant Scale 2015
HI, I’M SUJEE MANIYAM
•  Founder / Principal @ ElephantScale
•  Consulting & Training in Big Data
•  Training in Spark / Hadoop / NoSQL /
Data Science
•  Author
•  “Hadoop illuminated” open source book
•  “HBase Design Patterns
•  Open Source contributor: github.com/sujee
•  sujee@elephantscale.com
•  http://sujee.net
(c) Elephant Scale 2015
Spark
Training
available!
INTERNET OF THINGS – A
REALITY
(c) Elephant Scale 2015
DATA INFRASTRUCTURE
?
(c) Elephant Scale 2015
DATA VOLUME ?
A NAPKIN CALCULATION
Say we have
•  Million sensors
•  Each sensor reports every minute
•  data size 1KB
This will result in :
•  1.44 Billions events / day !
•  1.44 TB / day !!
(c) Elephant Scale 2015
SENSOR DATA WORKSHEET
IoT Temperature Sensor Projection
variables description
sensors 1M 1.00E+06 1 million
signal frequency every min / 60 secs 60 secs
event size 1KB 1000 bytes
events per day per sensor 1440
total events per day 1.44E+09 1440$$millions1.44$$billion
total events / sec 1.67E+04 16,666.67$$
total data size per day 1.44E+12 1440$$$GB 1.44$$TB
(c) Elephant Scale 2015
SENSOR DATA : TEXAS UTILITIES
SMART METER DATA
Texas Smart Meter Projections
variables description
sensors 10 million customers 1.00E+07 10 million
signal frequency every 15 mins 900 secs
event size 1.4 K 1400 bytes
events per day per sensor 96
total events per day 9.60E+08 960$$millions 0.96$$billion
total events / sec 1.11E+04 11,111.11$$
total data size per day 1.34E+12 1344$$$GB 1.344$$TB
(c) Elephant Scale 2015
BIG ‘DATA’
(c) Elephant Scale 2015
DATA VELOCITY
Say we have
•  Million sensors
•  Each sensor reports every minute
•  data size 1KB
è
•  Millions events / minute
•  ~ 17,000 events / sec
(c) Elephant Scale 2015
DATA PROCESSING SPEED
•  Need (near) real time processing most of the time
•  E.g. Need to alert if temperature suddenly spikes
temp
Time
Alert
(c) Elephant Scale 2015
CHALLENGE = BIG DATA +
REAL TIME
•  Don’t loose events !
•  Any event could be important
•  Most events are mundane (e.g. temperature stays
between 68’F – 72’ F)
•  Process them in near real time
•  Store the events for a long time
•  Audit
•  Diagnose
•  Support various queries
•  Real time (what is the latest temperature for sensor id
123?)
•  Aggregate (what is the avg. temp in zipcode 12345)
(c) Elephant Scale 2015
HIGH LEVEL ARCHITECTURE
(c) Elephant Scale 2015
NEXT : (1) CAPTURE
(c) Elephant Scale 2015
(1) CAPTURE
REQUIREMENTS
Requirements:
•  Capture events coming at high speed
•  Tens of thousands events / sec (some times millions / sec)
•  Don’t loose events
•  Tolerate hardware / software failure
•  Tolerate intermittent connectivity issues
•  Scale ‘easily’
(c) Elephant Scale 2015
(1) CAPTURE
CHOICES
•  MQ (RabbitMQ ..etc)
•  Good adoption in enterprises / durable
•  FluentD
•  Data collector for various sources
•  Flume
•  Part of Hadoop eco system
•  Good for collecting logs from many sources
•  AWS Kinesis
•  Queue system in Amazon Cloud
•  Kafka
•  Distributed queue
(c) Elephant Scale 2015
(1) CAPTURE
MEET KAFKA
•  Apache Kafka is a distributed messaging system
•  Came out of LinkedIn… open sourced in 2011
•  Built to tolerate hardware / software / network failures
•  Built for high throughput and scale
•  LinkedIn : 220 Billion messages / day
•  At peak : 3+ million messages / day
(c) Elephant Scale 2015
(1) CAPTURE
KAFKA ARCHITECTURE
•  Publisher - subscriber / producer – consumer model
(c) Elephant Scale 2015
(1) CAPTURE
KAFKA ARCHITECTURE
•  Producers write data
to brokers
•  Consumers read data
from brokers
•  All of this is
distributed / parallel
•  Failure tolerant
•  Data is stored as topics
•  “sensor_data”
•  “alerts”
•  “emails”
(c) Elephant Scale 2015
(1) CAPTURE
KAFKA USERS
•  Linked In
•  Track user activities
•  Sending emails
•  Netflix
•  Real time monitoring
•  Spotify
•  Ship logs to hadoop
•  Uber… AirBnB….
(c) Elephant Scale 2015
(1) CAPTURE
ARCHITECTURE WITH KAFKA
(c) Elephant Scale 2015
NEXT : (2) PROCESSING
(c) Elephant Scale 2015
(2) PROCESSING
REQUIREMENTS
•  Process events in real time or near real time
•  High velocity
•  Tens of thousands à millions of events / sec
•  Guaranteed processing
•  Process an event at-least-once
•  Exactly-once (harder to achieve)
•  Failure tolerant
•  Scale ‘easily’
(c) Elephant Scale 2015
(2) PROCESSING
CHOICES
•  Storm
•  Process streams
•  Events based
•  Came out of twitter
•  Apache Samza
•  Stream processing framework based on Kafka + Hadoop
YARN
•  Apache NiFi
•  Data flow
•  New project / incubating
•  Spark streaming
(c) Elephant Scale 2015
(2) PROCESSING
MEET SPARK
•  Spark is the new darling of ‘Big Data’ world
•  Lot’s of activity and interest
•  Fast and Expressive Cluster Compute Engine
•  “First Big Data platform to integrate batch, streaming and
interactive computations in a unified framework” –
stratio.com
Hadoop
Spark
(c) Elephant Scale 2015
(2) PROCESSING
SPARK ECO-SYSTEM
(c) Elephant Scale 2015
Spark Core
Spark
SQL
Spark
Streaming
ML lib
Schema /
sql
Real Time
Machine
Learning
Stand alone YARN MESOS
Cluster
managers
GraphX
Graph
processing
(2) PROCESSING
SPARK DATA SOURCE
ABSTRACTION
Spark
(compute
engine)
HDFS
Amazon
S3
Cassandra ???
RDD
Hadoop
RDD
Cassandra
RDD
(c) Elephant Scale 2015
(2) PROCESSING
SPARK : ‘UNIFIED’ STACK
Spark supports multiple programming models
•  Map reduce style batch processing
•  Streaming / real time processing
•  Querying via SQL
•  Machine learning
All modules are tightly integrated
•  Facilitates rich applications
Spark can be only stack you need !
(c) Elephant Scale 2015
Image: buymeposters.com
(2) PROCESSING
SPARK STREAMING
Streaming
Sources
Storage
(c) Elephant Scale 2015
(2) PROCESSING
SPARK STREAMING
•  Provides ‘high level’ operations in time windows
•  E.g. ‘calculate X for the past 10 seconds’
•  Good adoption
(c) Elephant Scale 2015
(2) PROCESSING
ARCHITECTURE WITH SPARK
STREAMING
(c) Elephant Scale 2015
NEXT : STORAGE
(c) Elephant Scale 2015
(3) STORAGE
REQUIREMENTS
•  Handle ‘Big Data’ ( 1 TB / day !)
•  Traditional storages are not effective (or too expensive)
•  Need two types of storage
1.  ‘forever’ storage
•  Store multi terabytes of data for a long periods
•  Support Batch queries
2.  ‘fast / real-time lookup’ storage
•  Query in real time (milliseconds)
“what is the latest reading for sensor-123 ?”
•  Store latest / new data (e.g. last 3 months)
•  Flexible schema for semi-structured data
•  Both need to scale
(c) Elephant Scale 2015
(3) STORAGE
REQUIREMENTS
(c) Elephant Scale 2015
(3) STORAGE
CHOICES
•  ‘forever’ storage
•  Scalable distributed file systems
•  Hadoop ! (HDFS actually)
•  ‘real time store’
•  Traditional RDBMS won’t work
•  Don’t scale well (or too expensive)
•  Rigid schema layout
•  NoSQL !
(c) Elephant Scale 2015
(3) STORAGE
HDFS (IN 20 SECS)
•  Distributed file system
•  Runs on commodity servers
•  à high ROI
•  Can keep ticking even when nodes go down
•  à fault tolerant
•  Replicates data to prevent data loss in case of node
failures
•  à built in backup J
•  Scales to Peta bytes (horizontal scalability)
•  Proven in the field
(c) Elephant Scale 2015
(3) STORAGE
HDFS ARCHITECTURE
(c) Elephant Scale 2015
(3) STORAGE
COST OF BIG DATA
Source : hortonworks
(c) Elephant Scale 2015
(3) STORAGE
HDFS
•  Can handle big data
•  Scales easily
•  Cost effective
•  “Source of Truth”
•  Files are immutable within HDFS (new data is ‘appended’ )
•  Audit friendly
(c) Elephant Scale 2015
(3) STORAGE
CAPACITY PLANNING (HADOOP)
Variables (tweak these) description value units
Average daily ingest 1000 GB
raw data node storage eg. 12 disks x 3TB 36 TB
replication default 3 3
space allocated for HDFS HDFS 75% + Mapreduce 25% 75.00%
growth per month (not calculated) 0
Calculation
effective data storage per node 27 TB
growth 1 month 6 month 1 yr 2 yr
data size (TB) 90 540 1,080 2,160
# data nodes 3.333333333 20 40 80
(c) Elephant Scale 2015
(3) STORAGE (REAL TIME)
CHOICES FOR NOSQL
•  Too many ! J
•  HBase
•  Part of Hadoop eco system
•  Uses HDFS for storage
•  Provides consistent view of data
•  Cassandra
•  Popular NoSQL store
•  No Single Point of Failure (SPOF) – ring architecture
•  No dependency on Hadoop
•  Accumulo
•  Came out of NSA !
•  Uses HDFS for storage
•  Provides very good security (naturally !)
(c) Elephant Scale 2015
(3) STORAGE
CAP THEOREM
(c) Elephant Scale 2015
(3) STORAGE
ARCHITECTURE SO FAR
(c) Elephant Scale 2015
NEXT : QUERY
(c) Elephant Scale 2015
(3) QUERY
REQUIREMENTS
•  Real Time queries
•  “what is the latest reading for the sensor id = 123”
•  Useful for building applications /
dashboards
•  Latency : milli-seconds
•  Batch / Aggregate queries
•  “What is the average temperature in
zip code = 12345” ?
•  May need to go through large data points
•  Latency : ‘batch’ (minutes / hours)
(c) Elephant Scale 2015
(3) QUERY
SOLUTIONS
•  Batch queries
•  Query data in HDFS (and or NoSQL)
•  Hadoop mapreduce (Pig / Hive)
•  Spark batch analytics
•  Real time queries
•  Queries to go NoSQL store
HDFS
NoSQL
Real time
queries
Batch queries
(c) Elephant Scale 2015
(3) QUERY
ARCHITECTURE
(c) Elephant Scale 2015
FINAL ARCHITECTURE
(c) Elephant Scale 2015
LAMBDA ARCHITECTURE
(c) Elephant Scale 2015
LAMBDA ARCHITECTURE
EXPLAINED
1.  All new data is sent to both batch layer and speed layer
2.  Batch layer
•  Holds master data set (immutable , append-only)
•  Answers batch queries
3.  Serving layer
•  updates batch views so they can be queried adhoc
4.  Speed Layer
•  Handles new data
•  Facilitates fast / real-time queries
5.  Query layer
•  Answers queries using batch & real-time views
(c) Elephant Scale 2015
INCORPORATING LAMBDA
ARCHITECTURE
(c) Elephant Scale 2015
OUR ARCHITECTURE
•  Each component is scalable
•  Each component is fault tolerant
•  Incorporates best practices
•  All open source !
(c) Elephant Scale 2015
… AND ONE MORE THING…
•  Security !
(c) Elephant Scale 2015
Source : businessinsider.com
HOW EVER…
(c) Elephant Scale 2015
At scale nothing works as advertised !
GOOD NEWS !
•  We’d like to build an open source, reference data platform for IoT /
connected devices!
•  Yes, open source ! J
•  ElephantScale is a strong believer in open source
•  “hadoop illuminated” – open source Hadoop book
•  Github.com/elephantscale
•  Best practices
•  Bringing together lots of expertise in Big Data systems
•  Register your interest
http://elephantscale.com/iotx/
(c) Elephant Scale 2015
GOALS FOR IOTX
http://elephantscale.com/iotx/
•  Use open-source proven components
•  Capture :
•  Kafka
•  Kinesis (AWS)
•  Processing : Spark Streaming
•  Batch storage : Hadoop / HDFS
•  Real Time Store : support multiple data stores
•  Cassandra
•  Hbase
•  Accumulo
•  ???
(c) Elephant Scale 2015
GOALS FOR IOTX…
http://elephantscale.com/iotx/
•  Query templates using
•  Spark
•  Hadoop Map Reduce (Pig / Hive)
•  Incorporate third party libraries for
•  Outlier detection (temperature is outside norms)
•  Trend detection (stock price is trending up)
•  Alerts (fire !)
•  Monitoring & Metrics (key !!)
•  What’s going in the system?
•  Host / system level (cpu / network ..etc) – easier
•  application level (e.g. find slow queries) – harder
•  Incorporate third party libraries
(c) Elephant Scale 2015
THANKS AND QUESTIONS?
“A Reference Architecture for Internet of
Things (IoT)”
Sujee Maniyam
Founder / Principal @ ElephantScale
Expert Consulting + Training in Big Data technologies
sujee@elephantscale.com
Elephantscale.com
Project sign up page : http://elephantscale.com/iotx/
(c) Elephant Scale 2015
IMAGE CREDITS
•  www.engadget.com
•  Xfinity.com
•  Tesla.com
(c) Elephant Scale 2015

More Related Content

What's hot

Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Databricks
 

What's hot (20)

Data Science at Scale by Sarah Guido
Data Science at Scale by Sarah GuidoData Science at Scale by Sarah Guido
Data Science at Scale by Sarah Guido
 
myHadoop - Hadoop-on-Demand on Traditional HPC Resources
myHadoop - Hadoop-on-Demand on Traditional HPC ResourcesmyHadoop - Hadoop-on-Demand on Traditional HPC Resources
myHadoop - Hadoop-on-Demand on Traditional HPC Resources
 
Apache Storm vs. Spark Streaming – two Stream Processing Platforms compared
Apache Storm vs. Spark Streaming – two Stream Processing Platforms comparedApache Storm vs. Spark Streaming – two Stream Processing Platforms compared
Apache Storm vs. Spark Streaming – two Stream Processing Platforms compared
 
Opal: Simple Web Services Wrappers for Scientific Applications
Opal: Simple Web Services Wrappers for Scientific ApplicationsOpal: Simple Web Services Wrappers for Scientific Applications
Opal: Simple Web Services Wrappers for Scientific Applications
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code Workshop
 
Accelerating Analytics for the Future of Genomics
Accelerating Analytics for the Future of GenomicsAccelerating Analytics for the Future of Genomics
Accelerating Analytics for the Future of Genomics
 
Rethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For ScaleRethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For Scale
 
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
 
Lessons Learned: Using Spark and Microservices
Lessons Learned: Using Spark and MicroservicesLessons Learned: Using Spark and Microservices
Lessons Learned: Using Spark and Microservices
 
Data Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and RData Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and R
 
Hadoop and Spark-Perfect Together-(Arun C. Murthy, Hortonworks)
Hadoop and Spark-Perfect Together-(Arun C. Murthy, Hortonworks)Hadoop and Spark-Perfect Together-(Arun C. Murthy, Hortonworks)
Hadoop and Spark-Perfect Together-(Arun C. Murthy, Hortonworks)
 
RISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time DecisionsRISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time Decisions
 
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesDeep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best Practices
 
Pandas UDF: Scalable Analysis with Python and PySpark
Pandas UDF: Scalable Analysis with Python and PySparkPandas UDF: Scalable Analysis with Python and PySpark
Pandas UDF: Scalable Analysis with Python and PySpark
 
STORM as an ETL Engine to HADOOP
STORM as an ETL Engine to HADOOPSTORM as an ETL Engine to HADOOP
STORM as an ETL Engine to HADOOP
 
Virdata: lessons learned from the Internet of Things and M2M Cloud Services @...
Virdata: lessons learned from the Internet of Things and M2M Cloud Services @...Virdata: lessons learned from the Internet of Things and M2M Cloud Services @...
Virdata: lessons learned from the Internet of Things and M2M Cloud Services @...
 
Apache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsApache Spark in Scientific Applciations
Apache Spark in Scientific Applciations
 
Fast and Scalable Python
Fast and Scalable PythonFast and Scalable Python
Fast and Scalable Python
 
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)
 
Real time big data analytics with Storm by Ron Bodkin of Think Big Analytics
Real time big data analytics with Storm by Ron Bodkin of Think Big AnalyticsReal time big data analytics with Storm by Ron Bodkin of Think Big Analytics
Real time big data analytics with Storm by Ron Bodkin of Think Big Analytics
 

Viewers also liked

Wice Whats New Office 2007
Wice Whats New Office 2007Wice Whats New Office 2007
Wice Whats New Office 2007
ederrico
 
SharePoint 2010 public facing sites
SharePoint 2010 public facing sitesSharePoint 2010 public facing sites
SharePoint 2010 public facing sites
Chris Riley ☁
 
Stampabile msn 2009 mobile camp def [modalità compatibilità]
Stampabile msn 2009 mobile camp def [modalità compatibilità]Stampabile msn 2009 mobile camp def [modalità compatibilità]
Stampabile msn 2009 mobile camp def [modalità compatibilità]
Laura Cavallaro
 
PENGARUH TERAPI RELAKSASI OTOT PROGRESIF TERHADAP PERUBAHAN TINGKAT INSOMNIA ...
PENGARUH TERAPI RELAKSASI OTOT PROGRESIF TERHADAP PERUBAHAN TINGKAT INSOMNIA ...PENGARUH TERAPI RELAKSASI OTOT PROGRESIF TERHADAP PERUBAHAN TINGKAT INSOMNIA ...
PENGARUH TERAPI RELAKSASI OTOT PROGRESIF TERHADAP PERUBAHAN TINGKAT INSOMNIA ...
Ratih Aini
 

Viewers also liked (16)

Jehe
JeheJehe
Jehe
 
Wice Whats New Office 2007
Wice Whats New Office 2007Wice Whats New Office 2007
Wice Whats New Office 2007
 
Contra los Collares dañinos - Emma infante Derecho animal y sociedad www.futu...
Contra los Collares dañinos - Emma infante Derecho animal y sociedad www.futu...Contra los Collares dañinos - Emma infante Derecho animal y sociedad www.futu...
Contra los Collares dañinos - Emma infante Derecho animal y sociedad www.futu...
 
Dinas kesehatan
Dinas kesehatanDinas kesehatan
Dinas kesehatan
 
Аудит 2013-2014
Аудит 2013-2014Аудит 2013-2014
Аудит 2013-2014
 
SharePoint 2010 public facing sites
SharePoint 2010 public facing sitesSharePoint 2010 public facing sites
SharePoint 2010 public facing sites
 
ჩვენ და კომპიუტერი
ჩვენ და კომპიუტერიჩვენ და კომპიუტერი
ჩვენ და კომპიუტერი
 
Отчет куратора Школы, 2014 год
Отчет куратора Школы, 2014 годОтчет куратора Школы, 2014 год
Отчет куратора Школы, 2014 год
 
Stampabile msn 2009 mobile camp def [modalità compatibilità]
Stampabile msn 2009 mobile camp def [modalità compatibilità]Stampabile msn 2009 mobile camp def [modalità compatibilità]
Stampabile msn 2009 mobile camp def [modalità compatibilità]
 
Act 8
Act 8Act 8
Act 8
 
PENGARUH TERAPI RELAKSASI OTOT PROGRESIF TERHADAP PERUBAHAN TINGKAT INSOMNIA ...
PENGARUH TERAPI RELAKSASI OTOT PROGRESIF TERHADAP PERUBAHAN TINGKAT INSOMNIA ...PENGARUH TERAPI RELAKSASI OTOT PROGRESIF TERHADAP PERUBAHAN TINGKAT INSOMNIA ...
PENGARUH TERAPI RELAKSASI OTOT PROGRESIF TERHADAP PERUBAHAN TINGKAT INSOMNIA ...
 
Qualità per Il Web - La Metodologia Mediabeta
Qualità per Il Web - La Metodologia MediabetaQualità per Il Web - La Metodologia Mediabeta
Qualità per Il Web - La Metodologia Mediabeta
 
Pohon Produksi Pohon Kelapa
Pohon Produksi Pohon KelapaPohon Produksi Pohon Kelapa
Pohon Produksi Pohon Kelapa
 
Innovative extension approaches in india
Innovative extension approaches in indiaInnovative extension approaches in india
Innovative extension approaches in india
 
Wh questions
Wh questionsWh questions
Wh questions
 
The present progressive, wh questions
The present progressive, wh questionsThe present progressive, wh questions
The present progressive, wh questions
 

Similar to Reference architecture for Internet Of Things

First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNAFirst Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
Tomas Cervenka
 
Big Data Analytics Strategy and Roadmap
Big Data Analytics Strategy and RoadmapBig Data Analytics Strategy and Roadmap
Big Data Analytics Strategy and Roadmap
Srinath Perera
 
The Background Noise of the Internet
The Background Noise of the InternetThe Background Noise of the Internet
The Background Noise of the Internet
Andrew Morris
 
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Open Analytics
 
Open Data Summit Presentation by Joe Olsen
Open Data Summit Presentation by Joe OlsenOpen Data Summit Presentation by Joe Olsen
Open Data Summit Presentation by Joe Olsen
Christopher Whitaker
 

Similar to Reference architecture for Internet Of Things (20)

First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNAFirst Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
 
Elastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogElastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @Datadog
 
BIO IT 15 - Are Your Researchers Paying Too Much for Their Cloud-Based Data B...
BIO IT 15 - Are Your Researchers Paying Too Much for Their Cloud-Based Data B...BIO IT 15 - Are Your Researchers Paying Too Much for Their Cloud-Based Data B...
BIO IT 15 - Are Your Researchers Paying Too Much for Their Cloud-Based Data B...
 
Data Analysis on AWS
Data Analysis on AWSData Analysis on AWS
Data Analysis on AWS
 
Chirp 2010: Scaling Twitter
Chirp 2010: Scaling TwitterChirp 2010: Scaling Twitter
Chirp 2010: Scaling Twitter
 
Big Data Analytics Strategy and Roadmap
Big Data Analytics Strategy and RoadmapBig Data Analytics Strategy and Roadmap
Big Data Analytics Strategy and Roadmap
 
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionTugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
 
Data platform architecture
Data platform architectureData platform architecture
Data platform architecture
 
4Developers 2018: Przetwarzanie Big Data w oparciu o architekturę Lambda na p...
4Developers 2018: Przetwarzanie Big Data w oparciu o architekturę Lambda na p...4Developers 2018: Przetwarzanie Big Data w oparciu o architekturę Lambda na p...
4Developers 2018: Przetwarzanie Big Data w oparciu o architekturę Lambda na p...
 
Data Science
Data ScienceData Science
Data Science
 
Getting Started with Real-time Analytics
Getting Started with Real-time AnalyticsGetting Started with Real-time Analytics
Getting Started with Real-time Analytics
 
AWS를 활용한 첫 빅데이터 프로젝트 시작하기(김일호)- AWS 웨비나 시리즈 2015
AWS를 활용한 첫 빅데이터 프로젝트 시작하기(김일호)- AWS 웨비나 시리즈 2015AWS를 활용한 첫 빅데이터 프로젝트 시작하기(김일호)- AWS 웨비나 시리즈 2015
AWS를 활용한 첫 빅데이터 프로젝트 시작하기(김일호)- AWS 웨비나 시리즈 2015
 
Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...
Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...
Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...
 
The Background Noise of the Internet
The Background Noise of the InternetThe Background Noise of the Internet
The Background Noise of the Internet
 
AWS Community Nordics Virtual Meetup
AWS Community Nordics Virtual MeetupAWS Community Nordics Virtual Meetup
AWS Community Nordics Virtual Meetup
 
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
 
Open Data Summit Presentation by Joe Olsen
Open Data Summit Presentation by Joe OlsenOpen Data Summit Presentation by Joe Olsen
Open Data Summit Presentation by Joe Olsen
 
Fixing twitter
Fixing twitterFixing twitter
Fixing twitter
 
Fixing_Twitter
Fixing_TwitterFixing_Twitter
Fixing_Twitter
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
 

More from elephantscale

More from elephantscale (6)

AI for Kids
AI for KidsAI for Kids
AI for Kids
 
How to obtain the Cloudera Data Engineer Certification
How to obtain the Cloudera Data Engineer CertificationHow to obtain the Cloudera Data Engineer Certification
How to obtain the Cloudera Data Engineer Certification
 
Building a Big Data Team
Building a Big Data TeamBuilding a Big Data Team
Building a Big Data Team
 
Petrophysics and Big Data by Elephant Scale training and consultin
Petrophysics and Big Data by Elephant Scale training and consultinPetrophysics and Big Data by Elephant Scale training and consultin
Petrophysics and Big Data by Elephant Scale training and consultin
 
Changing the game with cloud dw
Changing the game with cloud dwChanging the game with cloud dw
Changing the game with cloud dw
 
Oil & Gas Big Data use cases
Oil & Gas Big Data use casesOil & Gas Big Data use cases
Oil & Gas Big Data use cases
 

Recently uploaded

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 

Recently uploaded (20)

Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 

Reference architecture for Internet Of Things

  • 1. A REFERENCE ARCHITECTURE FOR INTERNET OF THINGS SUJEE MANIYAM FOUNDER / PRINCIPAL @ ELEPHANTSCALE SUJEE@ELEPHANTSCALE.COM (c) Elephant Scale 2015
  • 2. HI, I’M SUJEE MANIYAM •  Founder / Principal @ ElephantScale •  Consulting & Training in Big Data •  Training in Spark / Hadoop / NoSQL / Data Science •  Author •  “Hadoop illuminated” open source book •  “HBase Design Patterns •  Open Source contributor: github.com/sujee •  sujee@elephantscale.com •  http://sujee.net (c) Elephant Scale 2015 Spark Training available!
  • 3. INTERNET OF THINGS – A REALITY (c) Elephant Scale 2015
  • 5. DATA VOLUME ? A NAPKIN CALCULATION Say we have •  Million sensors •  Each sensor reports every minute •  data size 1KB This will result in : •  1.44 Billions events / day ! •  1.44 TB / day !! (c) Elephant Scale 2015
  • 6. SENSOR DATA WORKSHEET IoT Temperature Sensor Projection variables description sensors 1M 1.00E+06 1 million signal frequency every min / 60 secs 60 secs event size 1KB 1000 bytes events per day per sensor 1440 total events per day 1.44E+09 1440$$millions1.44$$billion total events / sec 1.67E+04 16,666.67$$ total data size per day 1.44E+12 1440$$$GB 1.44$$TB (c) Elephant Scale 2015
  • 7. SENSOR DATA : TEXAS UTILITIES SMART METER DATA Texas Smart Meter Projections variables description sensors 10 million customers 1.00E+07 10 million signal frequency every 15 mins 900 secs event size 1.4 K 1400 bytes events per day per sensor 96 total events per day 9.60E+08 960$$millions 0.96$$billion total events / sec 1.11E+04 11,111.11$$ total data size per day 1.34E+12 1344$$$GB 1.344$$TB (c) Elephant Scale 2015
  • 9. DATA VELOCITY Say we have •  Million sensors •  Each sensor reports every minute •  data size 1KB è •  Millions events / minute •  ~ 17,000 events / sec (c) Elephant Scale 2015
  • 10. DATA PROCESSING SPEED •  Need (near) real time processing most of the time •  E.g. Need to alert if temperature suddenly spikes temp Time Alert (c) Elephant Scale 2015
  • 11. CHALLENGE = BIG DATA + REAL TIME •  Don’t loose events ! •  Any event could be important •  Most events are mundane (e.g. temperature stays between 68’F – 72’ F) •  Process them in near real time •  Store the events for a long time •  Audit •  Diagnose •  Support various queries •  Real time (what is the latest temperature for sensor id 123?) •  Aggregate (what is the avg. temp in zipcode 12345) (c) Elephant Scale 2015
  • 12. HIGH LEVEL ARCHITECTURE (c) Elephant Scale 2015
  • 13. NEXT : (1) CAPTURE (c) Elephant Scale 2015
  • 14. (1) CAPTURE REQUIREMENTS Requirements: •  Capture events coming at high speed •  Tens of thousands events / sec (some times millions / sec) •  Don’t loose events •  Tolerate hardware / software failure •  Tolerate intermittent connectivity issues •  Scale ‘easily’ (c) Elephant Scale 2015
  • 15. (1) CAPTURE CHOICES •  MQ (RabbitMQ ..etc) •  Good adoption in enterprises / durable •  FluentD •  Data collector for various sources •  Flume •  Part of Hadoop eco system •  Good for collecting logs from many sources •  AWS Kinesis •  Queue system in Amazon Cloud •  Kafka •  Distributed queue (c) Elephant Scale 2015
  • 16. (1) CAPTURE MEET KAFKA •  Apache Kafka is a distributed messaging system •  Came out of LinkedIn… open sourced in 2011 •  Built to tolerate hardware / software / network failures •  Built for high throughput and scale •  LinkedIn : 220 Billion messages / day •  At peak : 3+ million messages / day (c) Elephant Scale 2015
  • 17. (1) CAPTURE KAFKA ARCHITECTURE •  Publisher - subscriber / producer – consumer model (c) Elephant Scale 2015
  • 18. (1) CAPTURE KAFKA ARCHITECTURE •  Producers write data to brokers •  Consumers read data from brokers •  All of this is distributed / parallel •  Failure tolerant •  Data is stored as topics •  “sensor_data” •  “alerts” •  “emails” (c) Elephant Scale 2015
  • 19. (1) CAPTURE KAFKA USERS •  Linked In •  Track user activities •  Sending emails •  Netflix •  Real time monitoring •  Spotify •  Ship logs to hadoop •  Uber… AirBnB…. (c) Elephant Scale 2015
  • 20. (1) CAPTURE ARCHITECTURE WITH KAFKA (c) Elephant Scale 2015
  • 21. NEXT : (2) PROCESSING (c) Elephant Scale 2015
  • 22. (2) PROCESSING REQUIREMENTS •  Process events in real time or near real time •  High velocity •  Tens of thousands à millions of events / sec •  Guaranteed processing •  Process an event at-least-once •  Exactly-once (harder to achieve) •  Failure tolerant •  Scale ‘easily’ (c) Elephant Scale 2015
  • 23. (2) PROCESSING CHOICES •  Storm •  Process streams •  Events based •  Came out of twitter •  Apache Samza •  Stream processing framework based on Kafka + Hadoop YARN •  Apache NiFi •  Data flow •  New project / incubating •  Spark streaming (c) Elephant Scale 2015
  • 24. (2) PROCESSING MEET SPARK •  Spark is the new darling of ‘Big Data’ world •  Lot’s of activity and interest •  Fast and Expressive Cluster Compute Engine •  “First Big Data platform to integrate batch, streaming and interactive computations in a unified framework” – stratio.com Hadoop Spark (c) Elephant Scale 2015
  • 25. (2) PROCESSING SPARK ECO-SYSTEM (c) Elephant Scale 2015 Spark Core Spark SQL Spark Streaming ML lib Schema / sql Real Time Machine Learning Stand alone YARN MESOS Cluster managers GraphX Graph processing
  • 26. (2) PROCESSING SPARK DATA SOURCE ABSTRACTION Spark (compute engine) HDFS Amazon S3 Cassandra ??? RDD Hadoop RDD Cassandra RDD (c) Elephant Scale 2015
  • 27. (2) PROCESSING SPARK : ‘UNIFIED’ STACK Spark supports multiple programming models •  Map reduce style batch processing •  Streaming / real time processing •  Querying via SQL •  Machine learning All modules are tightly integrated •  Facilitates rich applications Spark can be only stack you need ! (c) Elephant Scale 2015 Image: buymeposters.com
  • 29. (2) PROCESSING SPARK STREAMING •  Provides ‘high level’ operations in time windows •  E.g. ‘calculate X for the past 10 seconds’ •  Good adoption (c) Elephant Scale 2015
  • 30. (2) PROCESSING ARCHITECTURE WITH SPARK STREAMING (c) Elephant Scale 2015
  • 31. NEXT : STORAGE (c) Elephant Scale 2015
  • 32. (3) STORAGE REQUIREMENTS •  Handle ‘Big Data’ ( 1 TB / day !) •  Traditional storages are not effective (or too expensive) •  Need two types of storage 1.  ‘forever’ storage •  Store multi terabytes of data for a long periods •  Support Batch queries 2.  ‘fast / real-time lookup’ storage •  Query in real time (milliseconds) “what is the latest reading for sensor-123 ?” •  Store latest / new data (e.g. last 3 months) •  Flexible schema for semi-structured data •  Both need to scale (c) Elephant Scale 2015
  • 34. (3) STORAGE CHOICES •  ‘forever’ storage •  Scalable distributed file systems •  Hadoop ! (HDFS actually) •  ‘real time store’ •  Traditional RDBMS won’t work •  Don’t scale well (or too expensive) •  Rigid schema layout •  NoSQL ! (c) Elephant Scale 2015
  • 35. (3) STORAGE HDFS (IN 20 SECS) •  Distributed file system •  Runs on commodity servers •  à high ROI •  Can keep ticking even when nodes go down •  à fault tolerant •  Replicates data to prevent data loss in case of node failures •  à built in backup J •  Scales to Peta bytes (horizontal scalability) •  Proven in the field (c) Elephant Scale 2015
  • 36. (3) STORAGE HDFS ARCHITECTURE (c) Elephant Scale 2015
  • 37. (3) STORAGE COST OF BIG DATA Source : hortonworks (c) Elephant Scale 2015
  • 38. (3) STORAGE HDFS •  Can handle big data •  Scales easily •  Cost effective •  “Source of Truth” •  Files are immutable within HDFS (new data is ‘appended’ ) •  Audit friendly (c) Elephant Scale 2015
  • 39. (3) STORAGE CAPACITY PLANNING (HADOOP) Variables (tweak these) description value units Average daily ingest 1000 GB raw data node storage eg. 12 disks x 3TB 36 TB replication default 3 3 space allocated for HDFS HDFS 75% + Mapreduce 25% 75.00% growth per month (not calculated) 0 Calculation effective data storage per node 27 TB growth 1 month 6 month 1 yr 2 yr data size (TB) 90 540 1,080 2,160 # data nodes 3.333333333 20 40 80 (c) Elephant Scale 2015
  • 40. (3) STORAGE (REAL TIME) CHOICES FOR NOSQL •  Too many ! J •  HBase •  Part of Hadoop eco system •  Uses HDFS for storage •  Provides consistent view of data •  Cassandra •  Popular NoSQL store •  No Single Point of Failure (SPOF) – ring architecture •  No dependency on Hadoop •  Accumulo •  Came out of NSA ! •  Uses HDFS for storage •  Provides very good security (naturally !) (c) Elephant Scale 2015
  • 41. (3) STORAGE CAP THEOREM (c) Elephant Scale 2015
  • 42. (3) STORAGE ARCHITECTURE SO FAR (c) Elephant Scale 2015
  • 43. NEXT : QUERY (c) Elephant Scale 2015
  • 44. (3) QUERY REQUIREMENTS •  Real Time queries •  “what is the latest reading for the sensor id = 123” •  Useful for building applications / dashboards •  Latency : milli-seconds •  Batch / Aggregate queries •  “What is the average temperature in zip code = 12345” ? •  May need to go through large data points •  Latency : ‘batch’ (minutes / hours) (c) Elephant Scale 2015
  • 45. (3) QUERY SOLUTIONS •  Batch queries •  Query data in HDFS (and or NoSQL) •  Hadoop mapreduce (Pig / Hive) •  Spark batch analytics •  Real time queries •  Queries to go NoSQL store HDFS NoSQL Real time queries Batch queries (c) Elephant Scale 2015
  • 49. LAMBDA ARCHITECTURE EXPLAINED 1.  All new data is sent to both batch layer and speed layer 2.  Batch layer •  Holds master data set (immutable , append-only) •  Answers batch queries 3.  Serving layer •  updates batch views so they can be queried adhoc 4.  Speed Layer •  Handles new data •  Facilitates fast / real-time queries 5.  Query layer •  Answers queries using batch & real-time views (c) Elephant Scale 2015
  • 51. OUR ARCHITECTURE •  Each component is scalable •  Each component is fault tolerant •  Incorporates best practices •  All open source ! (c) Elephant Scale 2015
  • 52. … AND ONE MORE THING… •  Security ! (c) Elephant Scale 2015 Source : businessinsider.com
  • 53. HOW EVER… (c) Elephant Scale 2015 At scale nothing works as advertised !
  • 54. GOOD NEWS ! •  We’d like to build an open source, reference data platform for IoT / connected devices! •  Yes, open source ! J •  ElephantScale is a strong believer in open source •  “hadoop illuminated” – open source Hadoop book •  Github.com/elephantscale •  Best practices •  Bringing together lots of expertise in Big Data systems •  Register your interest http://elephantscale.com/iotx/ (c) Elephant Scale 2015
  • 55. GOALS FOR IOTX http://elephantscale.com/iotx/ •  Use open-source proven components •  Capture : •  Kafka •  Kinesis (AWS) •  Processing : Spark Streaming •  Batch storage : Hadoop / HDFS •  Real Time Store : support multiple data stores •  Cassandra •  Hbase •  Accumulo •  ??? (c) Elephant Scale 2015
  • 56. GOALS FOR IOTX… http://elephantscale.com/iotx/ •  Query templates using •  Spark •  Hadoop Map Reduce (Pig / Hive) •  Incorporate third party libraries for •  Outlier detection (temperature is outside norms) •  Trend detection (stock price is trending up) •  Alerts (fire !) •  Monitoring & Metrics (key !!) •  What’s going in the system? •  Host / system level (cpu / network ..etc) – easier •  application level (e.g. find slow queries) – harder •  Incorporate third party libraries (c) Elephant Scale 2015
  • 57. THANKS AND QUESTIONS? “A Reference Architecture for Internet of Things (IoT)” Sujee Maniyam Founder / Principal @ ElephantScale Expert Consulting + Training in Big Data technologies sujee@elephantscale.com Elephantscale.com Project sign up page : http://elephantscale.com/iotx/ (c) Elephant Scale 2015
  • 58. IMAGE CREDITS •  www.engadget.com •  Xfinity.com •  Tesla.com (c) Elephant Scale 2015