SlideShare a Scribd company logo
1 of 31
1 © 2016, Conversant, LLC. All rights reserved.
DATA MODELING FOR IOT
APACHECON IOT NORTH AMERICA 2017 PRESENTED BY:
JAYESH THAKRAR
SENIOR SOFTWARE ENGINEER
2
WHY DATA MODELING FOR IOT?
1. IoT is the next big wave after social media
(e.g. connected cars, smart homes & appliances)
2. Interesting challenges of volume, velocity and variety
3. Can be applied to other big data problems
3
DATA MODELING FOR IOT
1. Discuss sample IoT application
2. Discuss data model
3. Discuss application architecture
4
Sample
Application
5
INTELLIGENT VEHICLES
Cloud (Internet)
Road-side
infrastructure
• V2V: Vehicle to Vehicle
• V2C : Vehicle to Cloud
• V2I: Vehicle to Infrastructure
• Event = single, discrete
communication message
exchanged between a vehicle
and infrastructure
Communication Endpoints:
6
V2I: DATA & APPLICATION ASSUMPTIONS
• 1+ billion vehicles
• 500+ events per vehicle/day, based on
avg. time on road = 3 hours = 180 min
1 event per 10-30 seconds (avg = 3 per min) = 180*3 = 540 events/vehicle
• Avg. event size = 250-500+ bytes
Total raw data size = 150-300 TB / day
• Cassandra datastore
can be applied to HBase or other similarly
scalable datastore with appropriate testing
• Streaming for ingestion/processing/ETL
• Adhoc and batched analytics, extraction, etc
• Avoid schema-level indexes
for maintainability, efficiency, size, storage, etc.
7
SAMPLE APPLICATION ARCHITECTURE
Ingestion pipeline
Stream processing and analytics
Data storage
8
DATA MODEL CONSTRAINTS / REQUIREMENTS
• Efficient, low-latency writes and reads
• Sample queries:
- Events for a vehicle between two dates (or timestamps)
- Events for an infrastructure between two dates (or timestamps)
- Events by all infrastructure on a specific road-segment in a region
• Short, adhoc query characteristics/needs (guesstimate)
- volume = 100 – 100,000 rows
- response time = 100 ms – 100 seconds (proportional to result size)
9
SCHEMA VISUALIZATION: STAR SCHEMA
Vehicle
Event
Infrastructure
Road SegmentTime / Calendar
Region
10
CAN ALSO BE APPLIED TO: ADVERTISING/SEARCH
Cookie
Event
URL
LocationTime / Calendar
Region
11
CN ALSO BE APPLIED TO : SOCIAL NETWORKS
User
Action
Page
LocationTime / Calendar
Region
12
IoT Data Model
13
INSPIRATION: UNIX FILESYSTEM INODE
14
CASSANDRA: TABLE BASICS
• Data stored in tables with pre-defined schema
• Data types: primitives, collections, user-defined type
– Collections = sets, maps, lists
– Map keys and set and list values sorted
• Every table has primary key (PK)
– PK = single column or multi-column (composite)
– Data distributed on cluster nodes based on hash of first part of PK
• Keyspace = collection of (related) tables
• PK based queries = very fast
because of key cache, bloom filter, and sstable indexes
15
DATA ASSUMPTIONS (SIMPLISTIC MODEL)
16
TABLE DESIGN OPTIONS
Traditional table structure - column for each field
INSERT INTO event(id, timestamp, vehicle_id, infra_id,...)
INSERT INTO event JSON '{ "id" : 1234, "timestamp" : "...", ....)
All data fields serialized into a single column
INSERT INTO event(id, data)
VALUES (1234, "JSON/blob/serialized avro/etc") // data = blob or text
All data field stored in a collection field (e.g. map and/or set)
INSERT INTO event(id, data)
VALUES (1234, {'timestamp': ...}) // data = map<text, text>
17
STAR SCHEMA: DIMENSION TABLES
18
STAR SCHEMA: EVENT NAVIGATION TABLES
19
VEHICLE -> EVENTS : VEH_EVENT
CREATE TABLE veh_event(id TEXT PRIMARY KEY, map_data MAP <TEXT, TEXT>, set_data SET <TEXT>, ...)
eb5071d8-0e35-4a82-ad37-543d3da66de7 set_data: (2017062408, 2017062409, ...)
eb5071d8-0e35-4a82-ad37-543d3da66de7, 2017062408 map_data: (08:23:16.732 -> 25b6a3f4-5eec-4b04-954e-6d6bf85c4776, ...)
25b6a3f4-5eec-4b04-954e-6d6bf85c4776 data : ......
Level 0: Map of pointers to hourly data for each vehicle
Level 1: Map of pointers to actual event data for a vehicle for a given hour interval
Actual event data
vehicle_id = eb5071d8-0e35-4a82-ad37-543d3da66de7
event_id = 25b6a3f4-5eec-4b04-954e-6d6bf85c4776
20
INFRASTRUCTURE -> EVENTS: INFRA_EVENT
CREATE TABLE infra_event(id text PRIMARY KEY, map_data MAP <TEXT, TEXT>, set_data SET <TEXT>, ...)
infra_id = ffe0bdbb-3b89-4337-a477-4a17f719b559
vehicle_id = eb5071d8-0e35-4a82-ad37-543d3da66de7
event_id = 25b6a3f4-5eec-4b04-954e-6d6bf85c4776
Level 0: Map of pointers to hourly data for each infrastructure
ffe0bdbb-3b89-4337-a477-4a17f719b559 set_data: (2017062408, 2017062409, ...)
ffe0bdbb-3b89-4337-a477-4a17f719b559, 2017062408 map_data: (23:16.732, eb5071d8-0e35-4a82-ad37-543d3da66de7 ->
25b6a3f4-5eec-4b04-954e-6d6bf85c4776, ...)
Level 1: Map of pointers to actual event data by vehicle for an infrastructure for a given hour interval
25b6a3f4-5eec-4b04-954e-6d6bf85c4776 data : ......
Actual event data
21
LOCATION -> EVENTS: LOC_INFRA_EVENT
CREATE TABLE loc_infra_event(id text PRIMARY KEY, map_data MAP <TEXT, TEXT>, set_data SET <TEXT>, ...)
3aa40699-357e-48db-888b-af2ff7856949 set_data: (60b57655-0670-4969-9eec-99bcf8c8a034, ...)
60b57655-0670-4969-9eec-99bcf8c8a034 set_data: (ffe0bdbb-3b89-4337-a477-4a17f719b559, ...)
Level 0: Map of pointers to road-segments by region
Level 1: Map of pointers to infrastructure by road-segment
region_id = 3aa40699-357e-48db-888b-af2ff7856949
road_seg_id = 60b57655-0670-4969-9eec-99bcf8c8a034
infra_id = ffe0bdbb-3b89-4337-a477-4a17f719b559
map_data can be used above if there is a need to store any data (e.g. timestamp) along with road-segment or infra id
22
LOGICAL & PHYSICAL DESIGN CONSIDERATIONS
• Split each "level" of (logical) event navigation table into physical tables
– E.g. vehicle_event into vehicle_event_lo, vehicle_event_l1
Allows tuning parameters like cache, partition size, bloom filter as well as ease maintenance, etc.
• Primary keys for tables – combine process-level UUID + counter E.g.
– <uuid>-<NNNN> (reduces number of UUID generation calls)
– Further compact primary key by using binary encoding instead of string
(e.g 16 bytes for UUID + 8 bytes for counter)
• Short column names and appropriate data formats
– CREATE TABLE vehicle_event(id BLOB PRIMARY KEY, m MAP <TEXT, TEXT>, s SET <TEXT>, ...)
– Compact data e.g. time-of-day timestamps as integer i.e. ms of the day)
• Data immutability (helps reduce Cassandra entropy & ghost data concerns)
– Immutable event level data (insert-only into event and navigation tables)
– TTL to "age-out/purge" old data
• Keyspace sharding by time period and Cassandra compaction strategy
– Keyspace by day/week Compaction strategy = STCS v/s TWCS
23
KEY TAKEAWAYS OF DATA MODEL
• Single column primary keys
• Short primary key and column names
• All access (single row or range scan) via primary keys only
• Range scan (when necessary) appropriately paginated
• Immutable data (no updates/deletes) and idempotent inserts
• Data purge (TTL v/s keyspace by time period)
24
The Big Picture
Data Architecture +
App Architecture
25
SINGLE CLUSTER, CENTRALIZED INGESTION & PROCESSING
Single, centralized Cassandra cluster with
data-pipeline from different locations
26
MULTI-DATACENTER CLUSTER, INGESTION & PROCESSING
27
MULTIPLE INDEPENDENT, MODULAR SYSTEMS
Multiple, independent Cassandra clusters at different
datacenters along with an optional central cluster
containing select and/or aggregated data.
28
Reference & Misc
29
SAMPLE OF V2I REFERENCE INFORMATION
• https://www.its.dot.gov/index.htm
• https://www.its.dot.gov/v2i/
• https://www.its.dot.gov/communications/media/15cv_future.htm
• https://www.iso.org/committee/54706/x/catalogue/
• https://www.iso.org/standard/69897.html
30
SCALA SAMPLE TO MAP SET DATA INTO
INDIVIDUAL CASSANDRA ROW ACCESS
case class Data(key: String, values: Set[String]) extends
Iterator[Tuple2[String, String]] {
private val i = values.iterator
def hasNext = i.hasNext
def next = Tuple2[String, String](key, i.next)
}
val d = Seq[(String, Set[String])](("a",
Set[String]("a-1", "a-2", "a-3")))
scala> d.flatMap(i => Data(i._1, i._2))
res3: Seq[(String, String)] = List((a,a-1), (a,a-2), (a,a-3))
31

More Related Content

Similar to Data Modeling for IoT and Big Data

Mobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in SwitzerlandMobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
François Garillot
 
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
confluent
 
SplunkLive! Dallas Nov 2012 - Metro PCS
SplunkLive! Dallas Nov 2012 - Metro PCSSplunkLive! Dallas Nov 2012 - Metro PCS
SplunkLive! Dallas Nov 2012 - Metro PCS
Splunk
 

Similar to Data Modeling for IoT and Big Data (20)

Real Time Analytics with Apache Cassandra - Cassandra Day Berlin
Real Time Analytics with Apache Cassandra - Cassandra Day BerlinReal Time Analytics with Apache Cassandra - Cassandra Day Berlin
Real Time Analytics with Apache Cassandra - Cassandra Day Berlin
 
Yahoo’s next generation user profile platform
Yahoo’s next generation user profile platformYahoo’s next generation user profile platform
Yahoo’s next generation user profile platform
 
Yahoo's Next Generation User Profile Platform
Yahoo's Next Generation User Profile PlatformYahoo's Next Generation User Profile Platform
Yahoo's Next Generation User Profile Platform
 
IoT and connected devices
IoT and connected devicesIoT and connected devices
IoT and connected devices
 
MongoDB and the Internet of Things
MongoDB and the Internet of ThingsMongoDB and the Internet of Things
MongoDB and the Internet of Things
 
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in SwitzerlandMobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
 
Spark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit EU talk by Francois Garillot and Mohamed KafsiSpark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit EU talk by Francois Garillot and Mohamed Kafsi
 
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
 
IoT Analytics from Edge to Cloud - using IBM Informix
IoT Analytics from Edge to Cloud - using IBM InformixIoT Analytics from Edge to Cloud - using IBM Informix
IoT Analytics from Edge to Cloud - using IBM Informix
 
Querying Data Pipeline with AWS Athena
Querying Data Pipeline with AWS AthenaQuerying Data Pipeline with AWS Athena
Querying Data Pipeline with AWS Athena
 
[PLCUG] Splunk - complete Citrix environment monitoring
[PLCUG] Splunk - complete Citrix environment monitoring[PLCUG] Splunk - complete Citrix environment monitoring
[PLCUG] Splunk - complete Citrix environment monitoring
 
Analytics with Spark
Analytics with SparkAnalytics with Spark
Analytics with Spark
 
Webinar: SQL for Machine Data?
Webinar: SQL for Machine Data?Webinar: SQL for Machine Data?
Webinar: SQL for Machine Data?
 
Big Data Seervices in Danaos Use Case
Big Data Seervices in Danaos Use CaseBig Data Seervices in Danaos Use Case
Big Data Seervices in Danaos Use Case
 
[Kubecon 2017 Austin, TX] How We Built a Framework at Twitter to Solve Servic...
[Kubecon 2017 Austin, TX] How We Built a Framework at Twitter to Solve Servic...[Kubecon 2017 Austin, TX] How We Built a Framework at Twitter to Solve Servic...
[Kubecon 2017 Austin, TX] How We Built a Framework at Twitter to Solve Servic...
 
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
 
The GIS for Emergency and Security of Catalonia
The GIS for Emergency and Security of CataloniaThe GIS for Emergency and Security of Catalonia
The GIS for Emergency and Security of Catalonia
 
The CAOS framework: democratize the acceleration of compute intensive applica...
The CAOS framework: democratize the acceleration of compute intensive applica...The CAOS framework: democratize the acceleration of compute intensive applica...
The CAOS framework: democratize the acceleration of compute intensive applica...
 
SplunkLive! Dallas Nov 2012 - Metro PCS
SplunkLive! Dallas Nov 2012 - Metro PCSSplunkLive! Dallas Nov 2012 - Metro PCS
SplunkLive! Dallas Nov 2012 - Metro PCS
 
Programmable Exascale Supercomputer
Programmable Exascale SupercomputerProgrammable Exascale Supercomputer
Programmable Exascale Supercomputer
 

More from Jayesh Thakrar

Data Loss and Duplication in Kafka
Data Loss and Duplication in KafkaData Loss and Duplication in Kafka
Data Loss and Duplication in Kafka
Jayesh Thakrar
 
ApacheCon-Flume-Kafka-2016
ApacheCon-Flume-Kafka-2016ApacheCon-Flume-Kafka-2016
ApacheCon-Flume-Kafka-2016
Jayesh Thakrar
 
Chicago-Java-User-Group-Meetup-Some-Garbage-Talk-2015-01-14
Chicago-Java-User-Group-Meetup-Some-Garbage-Talk-2015-01-14Chicago-Java-User-Group-Meetup-Some-Garbage-Talk-2015-01-14
Chicago-Java-User-Group-Meetup-Some-Garbage-Talk-2015-01-14
Jayesh Thakrar
 

More from Jayesh Thakrar (7)

ApacheCon North America 2018: Creating Spark Data Sources
ApacheCon North America 2018: Creating Spark Data SourcesApacheCon North America 2018: Creating Spark Data Sources
ApacheCon North America 2018: Creating Spark Data Sources
 
Apache big-data-2017-spark-profiling
Apache big-data-2017-spark-profilingApache big-data-2017-spark-profiling
Apache big-data-2017-spark-profiling
 
Apache big-data-2017-scala-sql
Apache big-data-2017-scala-sqlApache big-data-2017-scala-sql
Apache big-data-2017-scala-sql
 
Data Loss and Duplication in Kafka
Data Loss and Duplication in KafkaData Loss and Duplication in Kafka
Data Loss and Duplication in Kafka
 
ApacheCon-Flume-Kafka-2016
ApacheCon-Flume-Kafka-2016ApacheCon-Flume-Kafka-2016
ApacheCon-Flume-Kafka-2016
 
ApacheCon-HBase-2016
ApacheCon-HBase-2016ApacheCon-HBase-2016
ApacheCon-HBase-2016
 
Chicago-Java-User-Group-Meetup-Some-Garbage-Talk-2015-01-14
Chicago-Java-User-Group-Meetup-Some-Garbage-Talk-2015-01-14Chicago-Java-User-Group-Meetup-Some-Garbage-Talk-2015-01-14
Chicago-Java-User-Group-Meetup-Some-Garbage-Talk-2015-01-14
 

Recently uploaded

Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
HyderabadDolls
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
gajnagarg
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
nirzagarg
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
Health
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
HyderabadDolls
 
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
HyderabadDolls
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Klinik kandungan
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
gajnagarg
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
gajnagarg
 

Recently uploaded (20)

Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
 
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
 
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 

Data Modeling for IoT and Big Data

  • 1. 1 © 2016, Conversant, LLC. All rights reserved. DATA MODELING FOR IOT APACHECON IOT NORTH AMERICA 2017 PRESENTED BY: JAYESH THAKRAR SENIOR SOFTWARE ENGINEER
  • 2. 2 WHY DATA MODELING FOR IOT? 1. IoT is the next big wave after social media (e.g. connected cars, smart homes & appliances) 2. Interesting challenges of volume, velocity and variety 3. Can be applied to other big data problems
  • 3. 3 DATA MODELING FOR IOT 1. Discuss sample IoT application 2. Discuss data model 3. Discuss application architecture
  • 5. 5 INTELLIGENT VEHICLES Cloud (Internet) Road-side infrastructure • V2V: Vehicle to Vehicle • V2C : Vehicle to Cloud • V2I: Vehicle to Infrastructure • Event = single, discrete communication message exchanged between a vehicle and infrastructure Communication Endpoints:
  • 6. 6 V2I: DATA & APPLICATION ASSUMPTIONS • 1+ billion vehicles • 500+ events per vehicle/day, based on avg. time on road = 3 hours = 180 min 1 event per 10-30 seconds (avg = 3 per min) = 180*3 = 540 events/vehicle • Avg. event size = 250-500+ bytes Total raw data size = 150-300 TB / day • Cassandra datastore can be applied to HBase or other similarly scalable datastore with appropriate testing • Streaming for ingestion/processing/ETL • Adhoc and batched analytics, extraction, etc • Avoid schema-level indexes for maintainability, efficiency, size, storage, etc.
  • 7. 7 SAMPLE APPLICATION ARCHITECTURE Ingestion pipeline Stream processing and analytics Data storage
  • 8. 8 DATA MODEL CONSTRAINTS / REQUIREMENTS • Efficient, low-latency writes and reads • Sample queries: - Events for a vehicle between two dates (or timestamps) - Events for an infrastructure between two dates (or timestamps) - Events by all infrastructure on a specific road-segment in a region • Short, adhoc query characteristics/needs (guesstimate) - volume = 100 – 100,000 rows - response time = 100 ms – 100 seconds (proportional to result size)
  • 9. 9 SCHEMA VISUALIZATION: STAR SCHEMA Vehicle Event Infrastructure Road SegmentTime / Calendar Region
  • 10. 10 CAN ALSO BE APPLIED TO: ADVERTISING/SEARCH Cookie Event URL LocationTime / Calendar Region
  • 11. 11 CN ALSO BE APPLIED TO : SOCIAL NETWORKS User Action Page LocationTime / Calendar Region
  • 14. 14 CASSANDRA: TABLE BASICS • Data stored in tables with pre-defined schema • Data types: primitives, collections, user-defined type – Collections = sets, maps, lists – Map keys and set and list values sorted • Every table has primary key (PK) – PK = single column or multi-column (composite) – Data distributed on cluster nodes based on hash of first part of PK • Keyspace = collection of (related) tables • PK based queries = very fast because of key cache, bloom filter, and sstable indexes
  • 16. 16 TABLE DESIGN OPTIONS Traditional table structure - column for each field INSERT INTO event(id, timestamp, vehicle_id, infra_id,...) INSERT INTO event JSON '{ "id" : 1234, "timestamp" : "...", ....) All data fields serialized into a single column INSERT INTO event(id, data) VALUES (1234, "JSON/blob/serialized avro/etc") // data = blob or text All data field stored in a collection field (e.g. map and/or set) INSERT INTO event(id, data) VALUES (1234, {'timestamp': ...}) // data = map<text, text>
  • 18. 18 STAR SCHEMA: EVENT NAVIGATION TABLES
  • 19. 19 VEHICLE -> EVENTS : VEH_EVENT CREATE TABLE veh_event(id TEXT PRIMARY KEY, map_data MAP <TEXT, TEXT>, set_data SET <TEXT>, ...) eb5071d8-0e35-4a82-ad37-543d3da66de7 set_data: (2017062408, 2017062409, ...) eb5071d8-0e35-4a82-ad37-543d3da66de7, 2017062408 map_data: (08:23:16.732 -> 25b6a3f4-5eec-4b04-954e-6d6bf85c4776, ...) 25b6a3f4-5eec-4b04-954e-6d6bf85c4776 data : ...... Level 0: Map of pointers to hourly data for each vehicle Level 1: Map of pointers to actual event data for a vehicle for a given hour interval Actual event data vehicle_id = eb5071d8-0e35-4a82-ad37-543d3da66de7 event_id = 25b6a3f4-5eec-4b04-954e-6d6bf85c4776
  • 20. 20 INFRASTRUCTURE -> EVENTS: INFRA_EVENT CREATE TABLE infra_event(id text PRIMARY KEY, map_data MAP <TEXT, TEXT>, set_data SET <TEXT>, ...) infra_id = ffe0bdbb-3b89-4337-a477-4a17f719b559 vehicle_id = eb5071d8-0e35-4a82-ad37-543d3da66de7 event_id = 25b6a3f4-5eec-4b04-954e-6d6bf85c4776 Level 0: Map of pointers to hourly data for each infrastructure ffe0bdbb-3b89-4337-a477-4a17f719b559 set_data: (2017062408, 2017062409, ...) ffe0bdbb-3b89-4337-a477-4a17f719b559, 2017062408 map_data: (23:16.732, eb5071d8-0e35-4a82-ad37-543d3da66de7 -> 25b6a3f4-5eec-4b04-954e-6d6bf85c4776, ...) Level 1: Map of pointers to actual event data by vehicle for an infrastructure for a given hour interval 25b6a3f4-5eec-4b04-954e-6d6bf85c4776 data : ...... Actual event data
  • 21. 21 LOCATION -> EVENTS: LOC_INFRA_EVENT CREATE TABLE loc_infra_event(id text PRIMARY KEY, map_data MAP <TEXT, TEXT>, set_data SET <TEXT>, ...) 3aa40699-357e-48db-888b-af2ff7856949 set_data: (60b57655-0670-4969-9eec-99bcf8c8a034, ...) 60b57655-0670-4969-9eec-99bcf8c8a034 set_data: (ffe0bdbb-3b89-4337-a477-4a17f719b559, ...) Level 0: Map of pointers to road-segments by region Level 1: Map of pointers to infrastructure by road-segment region_id = 3aa40699-357e-48db-888b-af2ff7856949 road_seg_id = 60b57655-0670-4969-9eec-99bcf8c8a034 infra_id = ffe0bdbb-3b89-4337-a477-4a17f719b559 map_data can be used above if there is a need to store any data (e.g. timestamp) along with road-segment or infra id
  • 22. 22 LOGICAL & PHYSICAL DESIGN CONSIDERATIONS • Split each "level" of (logical) event navigation table into physical tables – E.g. vehicle_event into vehicle_event_lo, vehicle_event_l1 Allows tuning parameters like cache, partition size, bloom filter as well as ease maintenance, etc. • Primary keys for tables – combine process-level UUID + counter E.g. – <uuid>-<NNNN> (reduces number of UUID generation calls) – Further compact primary key by using binary encoding instead of string (e.g 16 bytes for UUID + 8 bytes for counter) • Short column names and appropriate data formats – CREATE TABLE vehicle_event(id BLOB PRIMARY KEY, m MAP <TEXT, TEXT>, s SET <TEXT>, ...) – Compact data e.g. time-of-day timestamps as integer i.e. ms of the day) • Data immutability (helps reduce Cassandra entropy & ghost data concerns) – Immutable event level data (insert-only into event and navigation tables) – TTL to "age-out/purge" old data • Keyspace sharding by time period and Cassandra compaction strategy – Keyspace by day/week Compaction strategy = STCS v/s TWCS
  • 23. 23 KEY TAKEAWAYS OF DATA MODEL • Single column primary keys • Short primary key and column names • All access (single row or range scan) via primary keys only • Range scan (when necessary) appropriately paginated • Immutable data (no updates/deletes) and idempotent inserts • Data purge (TTL v/s keyspace by time period)
  • 24. 24 The Big Picture Data Architecture + App Architecture
  • 25. 25 SINGLE CLUSTER, CENTRALIZED INGESTION & PROCESSING Single, centralized Cassandra cluster with data-pipeline from different locations
  • 27. 27 MULTIPLE INDEPENDENT, MODULAR SYSTEMS Multiple, independent Cassandra clusters at different datacenters along with an optional central cluster containing select and/or aggregated data.
  • 29. 29 SAMPLE OF V2I REFERENCE INFORMATION • https://www.its.dot.gov/index.htm • https://www.its.dot.gov/v2i/ • https://www.its.dot.gov/communications/media/15cv_future.htm • https://www.iso.org/committee/54706/x/catalogue/ • https://www.iso.org/standard/69897.html
  • 30. 30 SCALA SAMPLE TO MAP SET DATA INTO INDIVIDUAL CASSANDRA ROW ACCESS case class Data(key: String, values: Set[String]) extends Iterator[Tuple2[String, String]] { private val i = values.iterator def hasNext = i.hasNext def next = Tuple2[String, String](key, i.next) } val d = Seq[(String, Set[String])](("a", Set[String]("a-1", "a-2", "a-3"))) scala> d.flatMap(i => Data(i._1, i._2)) res3: Seq[(String, String)] = List((a,a-1), (a,a-2), (a,a-3))
  • 31. 31