SlideShare a Scribd company logo
1 © 2016, Conversant, LLC. All rights reserved.
DATA MODELING FOR IOT
APACHECON IOT NORTH AMERICA 2017 PRESENTED BY:
JAYESH THAKRAR
SENIOR SOFTWARE ENGINEER
2
WHY DATA MODELING FOR IOT?
1. IoT is the next big wave after social media
(e.g. connected cars, smart homes & appliances)
2. Interesting challenges of volume, velocity and variety
3. Can be applied to other big data problems
3
DATA MODELING FOR IOT
1. Discuss sample IoT application
2. Discuss data model
3. Discuss application architecture
4
Sample
Application
5
INTELLIGENT VEHICLES
Cloud (Internet)
Road-side
infrastructure
• V2V: Vehicle to Vehicle
• V2C : Vehicle to Cloud
• V2I: Vehicle to Infrastructure
• Event = single, discrete
communication message
exchanged between a vehicle
and infrastructure
Communication Endpoints:
6
V2I: DATA & APPLICATION ASSUMPTIONS
• 1+ billion vehicles
• 500+ events per vehicle/day, based on
avg. time on road = 3 hours = 180 min
1 event per 10-30 seconds (avg = 3 per min) = 180*3 = 540 events/vehicle
• Avg. event size = 250-500+ bytes
Total raw data size = 150-300 TB / day
• Cassandra datastore
can be applied to HBase or other similarly
scalable datastore with appropriate testing
• Streaming for ingestion/processing/ETL
• Adhoc and batched analytics, extraction, etc
• Avoid schema-level indexes
for maintainability, efficiency, size, storage, etc.
7
SAMPLE APPLICATION ARCHITECTURE
Ingestion pipeline
Stream processing and analytics
Data storage
8
DATA MODEL CONSTRAINTS / REQUIREMENTS
• Efficient, low-latency writes and reads
• Sample queries:
- Events for a vehicle between two dates (or timestamps)
- Events for an infrastructure between two dates (or timestamps)
- Events by all infrastructure on a specific road-segment in a region
• Short, adhoc query characteristics/needs (guesstimate)
- volume = 100 – 100,000 rows
- response time = 100 ms – 100 seconds (proportional to result size)
9
SCHEMA VISUALIZATION: STAR SCHEMA
Vehicle
Event
Infrastructure
Road SegmentTime / Calendar
Region
10
CAN ALSO BE APPLIED TO: ADVERTISING/SEARCH
Cookie
Event
URL
LocationTime / Calendar
Region
11
CN ALSO BE APPLIED TO : SOCIAL NETWORKS
User
Action
Page
LocationTime / Calendar
Region
12
IoT Data Model
13
INSPIRATION: UNIX FILESYSTEM INODE
14
CASSANDRA: TABLE BASICS
• Data stored in tables with pre-defined schema
• Data types: primitives, collections, user-defined type
– Collections = sets, maps, lists
– Map keys and set and list values sorted
• Every table has primary key (PK)
– PK = single column or multi-column (composite)
– Data distributed on cluster nodes based on hash of first part of PK
• Keyspace = collection of (related) tables
• PK based queries = very fast
because of key cache, bloom filter, and sstable indexes
15
DATA ASSUMPTIONS (SIMPLISTIC MODEL)
16
TABLE DESIGN OPTIONS
Traditional table structure - column for each field
INSERT INTO event(id, timestamp, vehicle_id, infra_id,...)
INSERT INTO event JSON '{ "id" : 1234, "timestamp" : "...", ....)
All data fields serialized into a single column
INSERT INTO event(id, data)
VALUES (1234, "JSON/blob/serialized avro/etc") // data = blob or text
All data field stored in a collection field (e.g. map and/or set)
INSERT INTO event(id, data)
VALUES (1234, {'timestamp': ...}) // data = map<text, text>
17
STAR SCHEMA: DIMENSION TABLES
18
STAR SCHEMA: EVENT NAVIGATION TABLES
19
VEHICLE -> EVENTS : VEH_EVENT
CREATE TABLE veh_event(id TEXT PRIMARY KEY, map_data MAP <TEXT, TEXT>, set_data SET <TEXT>, ...)
eb5071d8-0e35-4a82-ad37-543d3da66de7 set_data: (2017062408, 2017062409, ...)
eb5071d8-0e35-4a82-ad37-543d3da66de7, 2017062408 map_data: (08:23:16.732 -> 25b6a3f4-5eec-4b04-954e-6d6bf85c4776, ...)
25b6a3f4-5eec-4b04-954e-6d6bf85c4776 data : ......
Level 0: Map of pointers to hourly data for each vehicle
Level 1: Map of pointers to actual event data for a vehicle for a given hour interval
Actual event data
vehicle_id = eb5071d8-0e35-4a82-ad37-543d3da66de7
event_id = 25b6a3f4-5eec-4b04-954e-6d6bf85c4776
20
INFRASTRUCTURE -> EVENTS: INFRA_EVENT
CREATE TABLE infra_event(id text PRIMARY KEY, map_data MAP <TEXT, TEXT>, set_data SET <TEXT>, ...)
infra_id = ffe0bdbb-3b89-4337-a477-4a17f719b559
vehicle_id = eb5071d8-0e35-4a82-ad37-543d3da66de7
event_id = 25b6a3f4-5eec-4b04-954e-6d6bf85c4776
Level 0: Map of pointers to hourly data for each infrastructure
ffe0bdbb-3b89-4337-a477-4a17f719b559 set_data: (2017062408, 2017062409, ...)
ffe0bdbb-3b89-4337-a477-4a17f719b559, 2017062408 map_data: (23:16.732, eb5071d8-0e35-4a82-ad37-543d3da66de7 ->
25b6a3f4-5eec-4b04-954e-6d6bf85c4776, ...)
Level 1: Map of pointers to actual event data by vehicle for an infrastructure for a given hour interval
25b6a3f4-5eec-4b04-954e-6d6bf85c4776 data : ......
Actual event data
21
LOCATION -> EVENTS: LOC_INFRA_EVENT
CREATE TABLE loc_infra_event(id text PRIMARY KEY, map_data MAP <TEXT, TEXT>, set_data SET <TEXT>, ...)
3aa40699-357e-48db-888b-af2ff7856949 set_data: (60b57655-0670-4969-9eec-99bcf8c8a034, ...)
60b57655-0670-4969-9eec-99bcf8c8a034 set_data: (ffe0bdbb-3b89-4337-a477-4a17f719b559, ...)
Level 0: Map of pointers to road-segments by region
Level 1: Map of pointers to infrastructure by road-segment
region_id = 3aa40699-357e-48db-888b-af2ff7856949
road_seg_id = 60b57655-0670-4969-9eec-99bcf8c8a034
infra_id = ffe0bdbb-3b89-4337-a477-4a17f719b559
map_data can be used above if there is a need to store any data (e.g. timestamp) along with road-segment or infra id
22
LOGICAL & PHYSICAL DESIGN CONSIDERATIONS
• Split each "level" of (logical) event navigation table into physical tables
– E.g. vehicle_event into vehicle_event_lo, vehicle_event_l1
Allows tuning parameters like cache, partition size, bloom filter as well as ease maintenance, etc.
• Primary keys for tables – combine process-level UUID + counter E.g.
– <uuid>-<NNNN> (reduces number of UUID generation calls)
– Further compact primary key by using binary encoding instead of string
(e.g 16 bytes for UUID + 8 bytes for counter)
• Short column names and appropriate data formats
– CREATE TABLE vehicle_event(id BLOB PRIMARY KEY, m MAP <TEXT, TEXT>, s SET <TEXT>, ...)
– Compact data e.g. time-of-day timestamps as integer i.e. ms of the day)
• Data immutability (helps reduce Cassandra entropy & ghost data concerns)
– Immutable event level data (insert-only into event and navigation tables)
– TTL to "age-out/purge" old data
• Keyspace sharding by time period and Cassandra compaction strategy
– Keyspace by day/week Compaction strategy = STCS v/s TWCS
23
KEY TAKEAWAYS OF DATA MODEL
• Single column primary keys
• Short primary key and column names
• All access (single row or range scan) via primary keys only
• Range scan (when necessary) appropriately paginated
• Immutable data (no updates/deletes) and idempotent inserts
• Data purge (TTL v/s keyspace by time period)
24
The Big Picture
Data Architecture +
App Architecture
25
SINGLE CLUSTER, CENTRALIZED INGESTION & PROCESSING
Single, centralized Cassandra cluster with
data-pipeline from different locations
26
MULTI-DATACENTER CLUSTER, INGESTION & PROCESSING
27
MULTIPLE INDEPENDENT, MODULAR SYSTEMS
Multiple, independent Cassandra clusters at different
datacenters along with an optional central cluster
containing select and/or aggregated data.
28
Reference & Misc
29
SAMPLE OF V2I REFERENCE INFORMATION
• https://www.its.dot.gov/index.htm
• https://www.its.dot.gov/v2i/
• https://www.its.dot.gov/communications/media/15cv_future.htm
• https://www.iso.org/committee/54706/x/catalogue/
• https://www.iso.org/standard/69897.html
30
SCALA SAMPLE TO MAP SET DATA INTO
INDIVIDUAL CASSANDRA ROW ACCESS
case class Data(key: String, values: Set[String]) extends
Iterator[Tuple2[String, String]] {
private val i = values.iterator
def hasNext = i.hasNext
def next = Tuple2[String, String](key, i.next)
}
val d = Seq[(String, Set[String])](("a",
Set[String]("a-1", "a-2", "a-3")))
scala> d.flatMap(i => Data(i._1, i._2))
res3: Seq[(String, String)] = List((a,a-1), (a,a-2), (a,a-3))
31

More Related Content

Similar to Data Modeling for IoT and Big Data

Real Time Analytics with Apache Cassandra - Cassandra Day Berlin
Real Time Analytics with Apache Cassandra - Cassandra Day BerlinReal Time Analytics with Apache Cassandra - Cassandra Day Berlin
Real Time Analytics with Apache Cassandra - Cassandra Day Berlin
Guido Schmutz
 
Yahoo’s next generation user profile platform
Yahoo’s next generation user profile platformYahoo’s next generation user profile platform
Yahoo’s next generation user profile platform
Kai (Kelvin) Liu
 
Yahoo's Next Generation User Profile Platform
Yahoo's Next Generation User Profile PlatformYahoo's Next Generation User Profile Platform
Yahoo's Next Generation User Profile Platform
DataWorks Summit/Hadoop Summit
 
IoT and connected devices
IoT and connected devicesIoT and connected devices
IoT and connected devices
Pascal Bodin
 
MongoDB and the Internet of Things
MongoDB and the Internet of ThingsMongoDB and the Internet of Things
MongoDB and the Internet of Things
MongoDB
 
Spark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit EU talk by Francois Garillot and Mohamed KafsiSpark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit
 
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in SwitzerlandMobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
François Garillot
 
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
InfluxData
 
IoT Analytics from Edge to Cloud - using IBM Informix
IoT Analytics from Edge to Cloud - using IBM InformixIoT Analytics from Edge to Cloud - using IBM Informix
IoT Analytics from Edge to Cloud - using IBM Informix
Pradeep Muthalpuredathe
 
Querying Data Pipeline with AWS Athena
Querying Data Pipeline with AWS AthenaQuerying Data Pipeline with AWS Athena
Querying Data Pipeline with AWS Athena
Yaroslav Tkachenko
 
[PLCUG] Splunk - complete Citrix environment monitoring
[PLCUG] Splunk - complete Citrix environment monitoring[PLCUG] Splunk - complete Citrix environment monitoring
[PLCUG] Splunk - complete Citrix environment monitoring
Jaroslaw Sobel
 
Analytics with Spark
Analytics with SparkAnalytics with Spark
Analytics with Spark
Probst Ludwine
 
Webinar: SQL for Machine Data?
Webinar: SQL for Machine Data?Webinar: SQL for Machine Data?
Webinar: SQL for Machine Data?
Crate.io
 
Big Data Seervices in Danaos Use Case
Big Data Seervices in Danaos Use CaseBig Data Seervices in Danaos Use Case
Big Data Seervices in Danaos Use Case
Big Data Value Association
 
[Kubecon 2017 Austin, TX] How We Built a Framework at Twitter to Solve Servic...
[Kubecon 2017 Austin, TX] How We Built a Framework at Twitter to Solve Servic...[Kubecon 2017 Austin, TX] How We Built a Framework at Twitter to Solve Servic...
[Kubecon 2017 Austin, TX] How We Built a Framework at Twitter to Solve Servic...
Vinu Charanya
 
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
confluent
 
The GIS for Emergency and Security of Catalonia
The GIS for Emergency and Security of CataloniaThe GIS for Emergency and Security of Catalonia
The GIS for Emergency and Security of Catalonia
Esri
 
The CAOS framework: democratize the acceleration of compute intensive applica...
The CAOS framework: democratize the acceleration of compute intensive applica...The CAOS framework: democratize the acceleration of compute intensive applica...
The CAOS framework: democratize the acceleration of compute intensive applica...
NECST Lab @ Politecnico di Milano
 
SplunkLive! Dallas Nov 2012 - Metro PCS
SplunkLive! Dallas Nov 2012 - Metro PCSSplunkLive! Dallas Nov 2012 - Metro PCS
SplunkLive! Dallas Nov 2012 - Metro PCS
Splunk
 
Programmable Exascale Supercomputer
Programmable Exascale SupercomputerProgrammable Exascale Supercomputer
Programmable Exascale Supercomputer
Sagar Dolas
 

Similar to Data Modeling for IoT and Big Data (20)

Real Time Analytics with Apache Cassandra - Cassandra Day Berlin
Real Time Analytics with Apache Cassandra - Cassandra Day BerlinReal Time Analytics with Apache Cassandra - Cassandra Day Berlin
Real Time Analytics with Apache Cassandra - Cassandra Day Berlin
 
Yahoo’s next generation user profile platform
Yahoo’s next generation user profile platformYahoo’s next generation user profile platform
Yahoo’s next generation user profile platform
 
Yahoo's Next Generation User Profile Platform
Yahoo's Next Generation User Profile PlatformYahoo's Next Generation User Profile Platform
Yahoo's Next Generation User Profile Platform
 
IoT and connected devices
IoT and connected devicesIoT and connected devices
IoT and connected devices
 
MongoDB and the Internet of Things
MongoDB and the Internet of ThingsMongoDB and the Internet of Things
MongoDB and the Internet of Things
 
Spark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit EU talk by Francois Garillot and Mohamed KafsiSpark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit EU talk by Francois Garillot and Mohamed Kafsi
 
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in SwitzerlandMobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
 
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
 
IoT Analytics from Edge to Cloud - using IBM Informix
IoT Analytics from Edge to Cloud - using IBM InformixIoT Analytics from Edge to Cloud - using IBM Informix
IoT Analytics from Edge to Cloud - using IBM Informix
 
Querying Data Pipeline with AWS Athena
Querying Data Pipeline with AWS AthenaQuerying Data Pipeline with AWS Athena
Querying Data Pipeline with AWS Athena
 
[PLCUG] Splunk - complete Citrix environment monitoring
[PLCUG] Splunk - complete Citrix environment monitoring[PLCUG] Splunk - complete Citrix environment monitoring
[PLCUG] Splunk - complete Citrix environment monitoring
 
Analytics with Spark
Analytics with SparkAnalytics with Spark
Analytics with Spark
 
Webinar: SQL for Machine Data?
Webinar: SQL for Machine Data?Webinar: SQL for Machine Data?
Webinar: SQL for Machine Data?
 
Big Data Seervices in Danaos Use Case
Big Data Seervices in Danaos Use CaseBig Data Seervices in Danaos Use Case
Big Data Seervices in Danaos Use Case
 
[Kubecon 2017 Austin, TX] How We Built a Framework at Twitter to Solve Servic...
[Kubecon 2017 Austin, TX] How We Built a Framework at Twitter to Solve Servic...[Kubecon 2017 Austin, TX] How We Built a Framework at Twitter to Solve Servic...
[Kubecon 2017 Austin, TX] How We Built a Framework at Twitter to Solve Servic...
 
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
 
The GIS for Emergency and Security of Catalonia
The GIS for Emergency and Security of CataloniaThe GIS for Emergency and Security of Catalonia
The GIS for Emergency and Security of Catalonia
 
The CAOS framework: democratize the acceleration of compute intensive applica...
The CAOS framework: democratize the acceleration of compute intensive applica...The CAOS framework: democratize the acceleration of compute intensive applica...
The CAOS framework: democratize the acceleration of compute intensive applica...
 
SplunkLive! Dallas Nov 2012 - Metro PCS
SplunkLive! Dallas Nov 2012 - Metro PCSSplunkLive! Dallas Nov 2012 - Metro PCS
SplunkLive! Dallas Nov 2012 - Metro PCS
 
Programmable Exascale Supercomputer
Programmable Exascale SupercomputerProgrammable Exascale Supercomputer
Programmable Exascale Supercomputer
 

More from Jayesh Thakrar

ApacheCon North America 2018: Creating Spark Data Sources
ApacheCon North America 2018: Creating Spark Data SourcesApacheCon North America 2018: Creating Spark Data Sources
ApacheCon North America 2018: Creating Spark Data Sources
Jayesh Thakrar
 
Apache big-data-2017-spark-profiling
Apache big-data-2017-spark-profilingApache big-data-2017-spark-profiling
Apache big-data-2017-spark-profiling
Jayesh Thakrar
 
Apache big-data-2017-scala-sql
Apache big-data-2017-scala-sqlApache big-data-2017-scala-sql
Apache big-data-2017-scala-sql
Jayesh Thakrar
 
Data Loss and Duplication in Kafka
Data Loss and Duplication in KafkaData Loss and Duplication in Kafka
Data Loss and Duplication in Kafka
Jayesh Thakrar
 
ApacheCon-Flume-Kafka-2016
ApacheCon-Flume-Kafka-2016ApacheCon-Flume-Kafka-2016
ApacheCon-Flume-Kafka-2016
Jayesh Thakrar
 
ApacheCon-HBase-2016
ApacheCon-HBase-2016ApacheCon-HBase-2016
ApacheCon-HBase-2016
Jayesh Thakrar
 
Chicago-Java-User-Group-Meetup-Some-Garbage-Talk-2015-01-14
Chicago-Java-User-Group-Meetup-Some-Garbage-Talk-2015-01-14Chicago-Java-User-Group-Meetup-Some-Garbage-Talk-2015-01-14
Chicago-Java-User-Group-Meetup-Some-Garbage-Talk-2015-01-14
Jayesh Thakrar
 

More from Jayesh Thakrar (7)

ApacheCon North America 2018: Creating Spark Data Sources
ApacheCon North America 2018: Creating Spark Data SourcesApacheCon North America 2018: Creating Spark Data Sources
ApacheCon North America 2018: Creating Spark Data Sources
 
Apache big-data-2017-spark-profiling
Apache big-data-2017-spark-profilingApache big-data-2017-spark-profiling
Apache big-data-2017-spark-profiling
 
Apache big-data-2017-scala-sql
Apache big-data-2017-scala-sqlApache big-data-2017-scala-sql
Apache big-data-2017-scala-sql
 
Data Loss and Duplication in Kafka
Data Loss and Duplication in KafkaData Loss and Duplication in Kafka
Data Loss and Duplication in Kafka
 
ApacheCon-Flume-Kafka-2016
ApacheCon-Flume-Kafka-2016ApacheCon-Flume-Kafka-2016
ApacheCon-Flume-Kafka-2016
 
ApacheCon-HBase-2016
ApacheCon-HBase-2016ApacheCon-HBase-2016
ApacheCon-HBase-2016
 
Chicago-Java-User-Group-Meetup-Some-Garbage-Talk-2015-01-14
Chicago-Java-User-Group-Meetup-Some-Garbage-Talk-2015-01-14Chicago-Java-User-Group-Meetup-Some-Garbage-Talk-2015-01-14
Chicago-Java-User-Group-Meetup-Some-Garbage-Talk-2015-01-14
 

Recently uploaded

一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
eudsoh
 
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
ywqeos
 
Data Scientist Machine Learning Profiles .pdf
Data Scientist Machine Learning  Profiles .pdfData Scientist Machine Learning  Profiles .pdf
Data Scientist Machine Learning Profiles .pdf
Vineet
 
Build applications with generative AI on Google Cloud
Build applications with generative AI on Google CloudBuild applications with generative AI on Google Cloud
Build applications with generative AI on Google Cloud
Márton Kodok
 
How To Control IO Usage using Resource Manager
How To Control IO Usage using Resource ManagerHow To Control IO Usage using Resource Manager
How To Control IO Usage using Resource Manager
Alireza Kamrani
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
bmucuha
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
Vietnam Cotton & Spinning Association
 
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdfNamma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
22ad0301
 
一比一原版(UofT毕业证)多伦多大学毕业证如何办理
一比一原版(UofT毕业证)多伦多大学毕业证如何办理一比一原版(UofT毕业证)多伦多大学毕业证如何办理
一比一原版(UofT毕业证)多伦多大学毕业证如何办理
exukyp
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
Vietnam Cotton & Spinning Association
 
Senior Software Profiles Backend Sample - Sheet1.pdf
Senior Software Profiles  Backend Sample - Sheet1.pdfSenior Software Profiles  Backend Sample - Sheet1.pdf
Senior Software Profiles Backend Sample - Sheet1.pdf
Vineet
 
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
eoxhsaa
 
reading_sample_sap_press_operational_data_provisioning_with_sap_bw4hana (1).pdf
reading_sample_sap_press_operational_data_provisioning_with_sap_bw4hana (1).pdfreading_sample_sap_press_operational_data_provisioning_with_sap_bw4hana (1).pdf
reading_sample_sap_press_operational_data_provisioning_with_sap_bw4hana (1).pdf
perranet1
 
Bangalore ℂall Girl 000000 Bangalore Escorts Service
Bangalore ℂall Girl 000000 Bangalore Escorts ServiceBangalore ℂall Girl 000000 Bangalore Escorts Service
Bangalore ℂall Girl 000000 Bangalore Escorts Service
nhero3888
 
Overview IFM June 2024 Consumer Confidence INDEX Report.pdf
Overview IFM June 2024 Consumer Confidence INDEX Report.pdfOverview IFM June 2024 Consumer Confidence INDEX Report.pdf
Overview IFM June 2024 Consumer Confidence INDEX Report.pdf
nhutnguyen355078
 
Drownings spike from May to August in children
Drownings spike from May to August in childrenDrownings spike from May to August in children
Drownings spike from May to August in children
Bisnar Chase Personal Injury Attorneys
 
Salesforce AI + Data Community Tour Slides - Canarias
Salesforce AI + Data Community Tour Slides - CanariasSalesforce AI + Data Community Tour Slides - Canarias
Salesforce AI + Data Community Tour Slides - Canarias
davidpietrzykowski1
 
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
z6osjkqvd
 
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
actyx
 
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
aguty
 

Recently uploaded (20)

一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
 
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
 
Data Scientist Machine Learning Profiles .pdf
Data Scientist Machine Learning  Profiles .pdfData Scientist Machine Learning  Profiles .pdf
Data Scientist Machine Learning Profiles .pdf
 
Build applications with generative AI on Google Cloud
Build applications with generative AI on Google CloudBuild applications with generative AI on Google Cloud
Build applications with generative AI on Google Cloud
 
How To Control IO Usage using Resource Manager
How To Control IO Usage using Resource ManagerHow To Control IO Usage using Resource Manager
How To Control IO Usage using Resource Manager
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
 
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdfNamma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
 
一比一原版(UofT毕业证)多伦多大学毕业证如何办理
一比一原版(UofT毕业证)多伦多大学毕业证如何办理一比一原版(UofT毕业证)多伦多大学毕业证如何办理
一比一原版(UofT毕业证)多伦多大学毕业证如何办理
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
 
Senior Software Profiles Backend Sample - Sheet1.pdf
Senior Software Profiles  Backend Sample - Sheet1.pdfSenior Software Profiles  Backend Sample - Sheet1.pdf
Senior Software Profiles Backend Sample - Sheet1.pdf
 
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
 
reading_sample_sap_press_operational_data_provisioning_with_sap_bw4hana (1).pdf
reading_sample_sap_press_operational_data_provisioning_with_sap_bw4hana (1).pdfreading_sample_sap_press_operational_data_provisioning_with_sap_bw4hana (1).pdf
reading_sample_sap_press_operational_data_provisioning_with_sap_bw4hana (1).pdf
 
Bangalore ℂall Girl 000000 Bangalore Escorts Service
Bangalore ℂall Girl 000000 Bangalore Escorts ServiceBangalore ℂall Girl 000000 Bangalore Escorts Service
Bangalore ℂall Girl 000000 Bangalore Escorts Service
 
Overview IFM June 2024 Consumer Confidence INDEX Report.pdf
Overview IFM June 2024 Consumer Confidence INDEX Report.pdfOverview IFM June 2024 Consumer Confidence INDEX Report.pdf
Overview IFM June 2024 Consumer Confidence INDEX Report.pdf
 
Drownings spike from May to August in children
Drownings spike from May to August in childrenDrownings spike from May to August in children
Drownings spike from May to August in children
 
Salesforce AI + Data Community Tour Slides - Canarias
Salesforce AI + Data Community Tour Slides - CanariasSalesforce AI + Data Community Tour Slides - Canarias
Salesforce AI + Data Community Tour Slides - Canarias
 
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
 
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
 
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
 

Data Modeling for IoT and Big Data

  • 1. 1 © 2016, Conversant, LLC. All rights reserved. DATA MODELING FOR IOT APACHECON IOT NORTH AMERICA 2017 PRESENTED BY: JAYESH THAKRAR SENIOR SOFTWARE ENGINEER
  • 2. 2 WHY DATA MODELING FOR IOT? 1. IoT is the next big wave after social media (e.g. connected cars, smart homes & appliances) 2. Interesting challenges of volume, velocity and variety 3. Can be applied to other big data problems
  • 3. 3 DATA MODELING FOR IOT 1. Discuss sample IoT application 2. Discuss data model 3. Discuss application architecture
  • 5. 5 INTELLIGENT VEHICLES Cloud (Internet) Road-side infrastructure • V2V: Vehicle to Vehicle • V2C : Vehicle to Cloud • V2I: Vehicle to Infrastructure • Event = single, discrete communication message exchanged between a vehicle and infrastructure Communication Endpoints:
  • 6. 6 V2I: DATA & APPLICATION ASSUMPTIONS • 1+ billion vehicles • 500+ events per vehicle/day, based on avg. time on road = 3 hours = 180 min 1 event per 10-30 seconds (avg = 3 per min) = 180*3 = 540 events/vehicle • Avg. event size = 250-500+ bytes Total raw data size = 150-300 TB / day • Cassandra datastore can be applied to HBase or other similarly scalable datastore with appropriate testing • Streaming for ingestion/processing/ETL • Adhoc and batched analytics, extraction, etc • Avoid schema-level indexes for maintainability, efficiency, size, storage, etc.
  • 7. 7 SAMPLE APPLICATION ARCHITECTURE Ingestion pipeline Stream processing and analytics Data storage
  • 8. 8 DATA MODEL CONSTRAINTS / REQUIREMENTS • Efficient, low-latency writes and reads • Sample queries: - Events for a vehicle between two dates (or timestamps) - Events for an infrastructure between two dates (or timestamps) - Events by all infrastructure on a specific road-segment in a region • Short, adhoc query characteristics/needs (guesstimate) - volume = 100 – 100,000 rows - response time = 100 ms – 100 seconds (proportional to result size)
  • 9. 9 SCHEMA VISUALIZATION: STAR SCHEMA Vehicle Event Infrastructure Road SegmentTime / Calendar Region
  • 10. 10 CAN ALSO BE APPLIED TO: ADVERTISING/SEARCH Cookie Event URL LocationTime / Calendar Region
  • 11. 11 CN ALSO BE APPLIED TO : SOCIAL NETWORKS User Action Page LocationTime / Calendar Region
  • 14. 14 CASSANDRA: TABLE BASICS • Data stored in tables with pre-defined schema • Data types: primitives, collections, user-defined type – Collections = sets, maps, lists – Map keys and set and list values sorted • Every table has primary key (PK) – PK = single column or multi-column (composite) – Data distributed on cluster nodes based on hash of first part of PK • Keyspace = collection of (related) tables • PK based queries = very fast because of key cache, bloom filter, and sstable indexes
  • 16. 16 TABLE DESIGN OPTIONS Traditional table structure - column for each field INSERT INTO event(id, timestamp, vehicle_id, infra_id,...) INSERT INTO event JSON '{ "id" : 1234, "timestamp" : "...", ....) All data fields serialized into a single column INSERT INTO event(id, data) VALUES (1234, "JSON/blob/serialized avro/etc") // data = blob or text All data field stored in a collection field (e.g. map and/or set) INSERT INTO event(id, data) VALUES (1234, {'timestamp': ...}) // data = map<text, text>
  • 18. 18 STAR SCHEMA: EVENT NAVIGATION TABLES
  • 19. 19 VEHICLE -> EVENTS : VEH_EVENT CREATE TABLE veh_event(id TEXT PRIMARY KEY, map_data MAP <TEXT, TEXT>, set_data SET <TEXT>, ...) eb5071d8-0e35-4a82-ad37-543d3da66de7 set_data: (2017062408, 2017062409, ...) eb5071d8-0e35-4a82-ad37-543d3da66de7, 2017062408 map_data: (08:23:16.732 -> 25b6a3f4-5eec-4b04-954e-6d6bf85c4776, ...) 25b6a3f4-5eec-4b04-954e-6d6bf85c4776 data : ...... Level 0: Map of pointers to hourly data for each vehicle Level 1: Map of pointers to actual event data for a vehicle for a given hour interval Actual event data vehicle_id = eb5071d8-0e35-4a82-ad37-543d3da66de7 event_id = 25b6a3f4-5eec-4b04-954e-6d6bf85c4776
  • 20. 20 INFRASTRUCTURE -> EVENTS: INFRA_EVENT CREATE TABLE infra_event(id text PRIMARY KEY, map_data MAP <TEXT, TEXT>, set_data SET <TEXT>, ...) infra_id = ffe0bdbb-3b89-4337-a477-4a17f719b559 vehicle_id = eb5071d8-0e35-4a82-ad37-543d3da66de7 event_id = 25b6a3f4-5eec-4b04-954e-6d6bf85c4776 Level 0: Map of pointers to hourly data for each infrastructure ffe0bdbb-3b89-4337-a477-4a17f719b559 set_data: (2017062408, 2017062409, ...) ffe0bdbb-3b89-4337-a477-4a17f719b559, 2017062408 map_data: (23:16.732, eb5071d8-0e35-4a82-ad37-543d3da66de7 -> 25b6a3f4-5eec-4b04-954e-6d6bf85c4776, ...) Level 1: Map of pointers to actual event data by vehicle for an infrastructure for a given hour interval 25b6a3f4-5eec-4b04-954e-6d6bf85c4776 data : ...... Actual event data
  • 21. 21 LOCATION -> EVENTS: LOC_INFRA_EVENT CREATE TABLE loc_infra_event(id text PRIMARY KEY, map_data MAP <TEXT, TEXT>, set_data SET <TEXT>, ...) 3aa40699-357e-48db-888b-af2ff7856949 set_data: (60b57655-0670-4969-9eec-99bcf8c8a034, ...) 60b57655-0670-4969-9eec-99bcf8c8a034 set_data: (ffe0bdbb-3b89-4337-a477-4a17f719b559, ...) Level 0: Map of pointers to road-segments by region Level 1: Map of pointers to infrastructure by road-segment region_id = 3aa40699-357e-48db-888b-af2ff7856949 road_seg_id = 60b57655-0670-4969-9eec-99bcf8c8a034 infra_id = ffe0bdbb-3b89-4337-a477-4a17f719b559 map_data can be used above if there is a need to store any data (e.g. timestamp) along with road-segment or infra id
  • 22. 22 LOGICAL & PHYSICAL DESIGN CONSIDERATIONS • Split each "level" of (logical) event navigation table into physical tables – E.g. vehicle_event into vehicle_event_lo, vehicle_event_l1 Allows tuning parameters like cache, partition size, bloom filter as well as ease maintenance, etc. • Primary keys for tables – combine process-level UUID + counter E.g. – <uuid>-<NNNN> (reduces number of UUID generation calls) – Further compact primary key by using binary encoding instead of string (e.g 16 bytes for UUID + 8 bytes for counter) • Short column names and appropriate data formats – CREATE TABLE vehicle_event(id BLOB PRIMARY KEY, m MAP <TEXT, TEXT>, s SET <TEXT>, ...) – Compact data e.g. time-of-day timestamps as integer i.e. ms of the day) • Data immutability (helps reduce Cassandra entropy & ghost data concerns) – Immutable event level data (insert-only into event and navigation tables) – TTL to "age-out/purge" old data • Keyspace sharding by time period and Cassandra compaction strategy – Keyspace by day/week Compaction strategy = STCS v/s TWCS
  • 23. 23 KEY TAKEAWAYS OF DATA MODEL • Single column primary keys • Short primary key and column names • All access (single row or range scan) via primary keys only • Range scan (when necessary) appropriately paginated • Immutable data (no updates/deletes) and idempotent inserts • Data purge (TTL v/s keyspace by time period)
  • 24. 24 The Big Picture Data Architecture + App Architecture
  • 25. 25 SINGLE CLUSTER, CENTRALIZED INGESTION & PROCESSING Single, centralized Cassandra cluster with data-pipeline from different locations
  • 27. 27 MULTIPLE INDEPENDENT, MODULAR SYSTEMS Multiple, independent Cassandra clusters at different datacenters along with an optional central cluster containing select and/or aggregated data.
  • 29. 29 SAMPLE OF V2I REFERENCE INFORMATION • https://www.its.dot.gov/index.htm • https://www.its.dot.gov/v2i/ • https://www.its.dot.gov/communications/media/15cv_future.htm • https://www.iso.org/committee/54706/x/catalogue/ • https://www.iso.org/standard/69897.html
  • 30. 30 SCALA SAMPLE TO MAP SET DATA INTO INDIVIDUAL CASSANDRA ROW ACCESS case class Data(key: String, values: Set[String]) extends Iterator[Tuple2[String, String]] { private val i = values.iterator def hasNext = i.hasNext def next = Tuple2[String, String](key, i.next) } val d = Seq[(String, Set[String])](("a", Set[String]("a-1", "a-2", "a-3"))) scala> d.flatMap(i => Data(i._1, i._2)) res3: Seq[(String, String)] = List((a,a-1), (a,a-2), (a,a-3))
  • 31. 31