Real-time analytics at IoT scale
Pramod Immaneni
Principal Architect, Data Platform
Our vehicles
An electric SUV to rule them all
• Vehicle of the Year; included in 10 Best Trucks and SUVs awards for 2023
• 2022 Truck of the Year
• Electric Vehicle of the Year
• "Best ownership experience" among premium battery electric vehicles; "R1T scored higher overall than any other vehicle in the study"
• R1T and R1S earned the highest safety rating from IIHS
Data and ML Landscape
• Analytical queries, near real-time streams, event processing, purpose-built pipelines
• Dashboards, models, applications
• Vehicle telemetry: hi/lo frequency, service, charging, …
• Data and AI Platform
Platform Architecture
• Real-time Platform: event processor, low-latency store, standardization pipelines
• Lakehouse Platform: Unity Catalog, data & ML pipelines, Delta Tables
Real-time Platform
Real-time Telemetry: availability of fresh vehicle telemetry data to applications.
Low-latency Queries: processing queries with sub-second latencies to power interactive dashboards; supports time-series and OLAP queries.
Timely Actions: taking action on data in motion for timely response.
Standardization Pipeline
• Standardization and validation of data schema, and adding vehicle context
• Data preparation for storage in the real-time distributed data store

Event Watch Service
• Filtering, dedup, sessions, aggregations
• Actions: Push Notification, PagerDuty, Event Bridge
• Event Watch action platform, some examples:
• When a critical event is detected, a PagerDuty alert is sent out.
• OTA status such as success/fail/ready-to-install is pushed to Mobile.
• A change event is sent when a controller is swapped on the vehicle, by detecting a change in device id – stateful (see the sketch below).

Custom pipelines
• Trip/session detection
• Geofence entry and exit detection

Real-time Processing
• Telemetry data lands in a distributed OLAP database behind a Query Service
• Real-time queries: time-series, bucketed aggregates, slicing and dicing
• Kafka message queue used for data handoff
• Streaming pipelines built on Apache Flink
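The controller-swap example is stateful change detection: remember the last device id seen per vehicle and controller, and emit an event when it changes. A minimal Python sketch of the pattern, with hypothetical event and emit shapes (the production version keeps this state in Flink keyed state):

# Minimal sketch of stateful change detection, as in the controller-swap
# example above. A dict keyed by (vehicle_id, controller) stands in for
# Flink keyed state; ChangeEvent fields and emit() are illustrative.

last_seen: dict[tuple[str, str], str] = {}

def on_telemetry(vehicle_id: str, controller: str, device_id: str):
    key = (vehicle_id, controller)
    previous = last_seen.get(key)
    last_seen[key] = device_id
    if previous is not None and previous != device_id:
        # Controller hardware was swapped: the same logical controller
        # now reports a different device id.
        emit({"type": "controller_swap", "vehicle": vehicle_id,
              "controller": controller, "old": previous, "new": device_id})

def emit(event: dict):
    print(event)  # stand-in for a PagerDuty/push/EventBridge action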
Streaming stack
• Business logic: Event Watch and streaming pipelines, defined via application specs and templates
• Stream processing engine: Apache Flink (streaming layer, data processing)
• Clustering platform: Kubernetes (EKS)
• Infrastructure/cloud: AWS
• Management layers run across the stack
Stream Processing
• Analyzing data and taking business actions as soon as data is produced or available.
• Message-by-message (event) processing, in contrast to batch processing.
• Enables large-scale real-time pipelines that potentially run forever*.
• Input from live sources or stores: MQ (Kafka), HTTP/socket, files, etc.
• Unbounded, continuous data streams: a batch can be processed as a stream, but a stream is not a batch.
• (In-memory) processing with temporal boundaries (windows), with support for event-time semantics.
• Stateful operations (aggregation, rules, …) -> analytics.
• Output to stores, live dashboards, downstream applications.
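A minimal Python sketch of the windowing and event-time ideas above; Flink provides windows and watermarks natively, so this is purely illustrative:

# Event-time tumbling-window aggregation over an unbounded stream.
# The event's own timestamp, not arrival time, decides the window.

WINDOW_MS = 60_000  # 1-minute tumbling windows
windows: dict[tuple[str, int], float] = {}  # (vehicle_id, window_start) -> sum

def on_event(vehicle_id: str, event_time_ms: int, value: float):
    window_start = event_time_ms - event_time_ms % WINDOW_MS
    key = (vehicle_id, window_start)
    windows[key] = windows.get(key, 0.0) + value  # stateful aggregation

def close_window(vehicle_id: str, window_start: int):
    # In a real engine a watermark triggers this: emit and clear the state.
    total = windows.pop((vehicle_id, window_start), 0.0)
    print(vehicle_id, window_start, total)

on_event("v1", 30_000, 2.0)
on_event("v1", 55_000, 3.0)
close_window("v1", 0)  # -> v1 0 5.0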
Directed Acyclic Graph (DAG)
• Application logic is broken down into stages – operators
• Multiple instances of each stage
• Data tuples are sent in a continuous stream between the operators
• Operators are connected with streams to form a DAG
(Diagram: operators and their instances, connected by streams of data tuples into a streaming pipeline/application; see the sketch below.)
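A toy sketch of the DAG model: three hypothetical operators (parse, enrich, sink) wired into a linear pipeline, with the instance of a stage chosen by key partitioning. Flink handles partitioning and data shipping for real; this only illustrates the shape.

PARALLELISM = 4

def partition(key: str) -> int:
    # Tuples with the same key always go to the same operator instance.
    return hash(key) % PARALLELISM

def parse(raw: str) -> dict:              # operator 1
    vehicle_id, value = raw.split(",")
    return {"vehicle": vehicle_id, "value": float(value)}

def enrich(t: dict) -> dict:              # operator 2
    t["instance"] = partition(t["vehicle"])
    return t

def sink(t: dict) -> None:                # operator 3
    print(t)

# The streams wire the operators into a linear DAG: parse -> enrich -> sink.
for raw in ["v1,3.2", "v2,1.5", "v1,0.7"]:
    sink(enrich(parse(raw)))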
Geofence Detection
(Pipeline: a vehicle stream and a geofence stream feed a chain of geohash match stages, level-1 through level-8, followed by a bounding-polygon match.)
https://www.geospatialworld.net/blogs/polygeohasher-an-optimized-way-to-create-geohashes/
Geohash
• Geohash is a hierarchical spatial mapping system
• Example: Sava Centar – srywbvnhkp9v
• Two locations close to each other share a common prefix
• The longer the match, the closer they are
• Geofence matching can be sped up by iteratively matching more characters of the prefix (see the sketch below)
• Iterations can be pipelined for higher throughput
• Each stage can be scaled to handle more vehicles
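A minimal sketch of the staged prefix matching; the geofence table and its geohashes are made up for illustration:

# Hierarchical geofence matching with geohash prefixes. Each stage
# compares one more character; only fences that survive all eight
# stages reach the exact bounding-polygon test.

GEOFENCES = {
    "srywbvnh": "sava_centar_area",   # 8-char geohash cell covering a fence
    "srywbvq2": "depot_north",
}

def match_stage(vehicle_hash: str, candidates: dict[str, str], level: int):
    # Keep only fences that agree with the vehicle up to `level` characters.
    return {h: name for h, name in candidates.items()
            if h[:level] == vehicle_hash[:level]}

def match(vehicle_hash: str) -> list[str]:
    candidates = GEOFENCES
    for level in range(1, 9):             # level-1 .. level-8 stages
        candidates = match_stage(vehicle_hash, candidates, level)
        if not candidates:
            return []
    return list(candidates.values())      # then: exact bounding-polygon match

print(match("srywbvnhkp9v"))              # -> ['sava_centar_area']

In the pipeline each level runs as its own operator, so the stages overlap in time (pipelining) and each can be scaled out independently.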
Event Watch
• A low-latency platform for event filtering, change detection, and actions
• Out-of-the-box supported specifications: sessionization, dedup, staleness, streaming SQL
• Notifications via PagerDuty, push notifications, EventBridge, or email
• Supports pluggable custom actions with BYOC
• Built on Apache Flink, a stateful, distributed, and fault-tolerant stream processing engine
• 2M events/sec peak, <100ms avg latency
Specifications

Staleness check
• Enable staleness to discard late-arriving data
• Keeps streams current and avoids outdated event detection

"staleness": {
  "signal": "sound_alarm",
  "ttl_ms": 3600000
}

Streaming SQL
• Describe a stream subscription using familiar SQL
• Output is a continuous stream of matching events

"query": "select `id`, `timestamp` from stream where `sound_alarm`='true'"
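A minimal sketch of the staleness check configured above, assuming a hypothetical process() downstream:

import time

# Events whose timestamp is older than ttl_ms are discarded instead of
# triggering (possibly outdated) detections. Illustrative only.

TTL_MS = 3_600_000  # matches "ttl_ms": 3600000 (1 hour)

def is_stale(event_time_ms: int, now_ms: int | None = None) -> bool:
    now_ms = int(time.time() * 1000) if now_ms is None else now_ms
    return now_ms - event_time_ms > TTL_MS

def on_event(event: dict):
    if is_stale(event["timestamp"]):
        return                  # late arrival: drop, keep the stream current
    process(event)              # hypothetical downstream detection

def process(event: dict):
    print("detected:", event)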
Specifications

Sessions
• Virtual session on event detection
• Query other signals in the context of the session
• Supports TTL on the dependency signal

"query": "select `id`, `range_threshold`, `pet_mode_status` from stream
  where `range_threshold` = 'VEHICLE_RANGE_CRITICALLY_LOW'",
…
"dependency": {
  "type": "thermal",
  "subtype": "hvac_settings",
  "signal": "pet_mode_status",
  "values": ["On"]
},

Deduplication
• Identify and discard duplicates
• Useful to avoid triggering duplicate notifications
• Provide TTL for deduplication at ms granularity

"dedup": true,
"dedup_ttl_ms": 86400000
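A minimal sketch of TTL-based dedup matching the spec above; the event key format is hypothetical:

import time

# A repeat of the same event key within the TTL window is suppressed,
# so the same notification is not fired twice. Illustrative, not the
# production Flink code.

DEDUP_TTL_MS = 86_400_000          # 24h, matching "dedup_ttl_ms" above
seen: dict[str, int] = {}          # event key -> last emission time (ms)

def should_emit(key: str, now_ms: int | None = None) -> bool:
    now_ms = int(time.time() * 1000) if now_ms is None else now_ms
    last = seen.get(key)
    if last is not None and now_ms - last < DEDUP_TTL_MS:
        return False               # duplicate within TTL: suppress
    seen[key] = now_ms             # first occurrence (or TTL expired): emit
    return True

if should_emit("v123:VEHICLE_RANGE_CRITICALLY_LOW"):
    print("send notification")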
Query Service
Real-time Query Service
● Service and remote diagnostics
○ looking at telemetry data before an appointment can identify the issue
○ replacement parts can be ordered in advance
○ sample telemetry:
■ current OTA version
■ diagnostic error codes
■ core vehicle data
● Fleet management
○ fleet customers can view telemetry data for their fleet
○ aggregating data per day or per hour
○ sample telemetry:
■ state of charge
■ charging status
■ energy added in charge session
■ estimated range
■ odometer
■ current location
(Diagram: telemetry schema preparation feeds the distributed OLAP database behind the Query Service.)
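As an illustration of the fleet-management case, an hourly aggregation could be issued against Druid's SQL endpoint (POST /druid/v2/sql); the broker host, table, and column names below are hypothetical:

import json
import urllib.request

# Hourly per-vehicle aggregation over the last day. TIME_FLOOR and the
# /druid/v2/sql endpoint are standard Druid; the schema is assumed.

BROKER = "http://druid-broker:8082/druid/v2/sql"

query = """
SELECT TIME_FLOOR(__time, 'PT1H') AS hour,
       vehicle_id,
       MAX(odometer)        AS odometer,
       AVG(state_of_charge) AS avg_soc
FROM telemetry
WHERE fleet_id = 'fleet-42' AND __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY 1, 2
"""

req = urllib.request.Request(
    BROKER,
    data=json.dumps({"query": query}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    for row in json.load(resp):
        print(row)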
Apache Druid – Real-time OLAP

Performance
• Sub-second query responses enable interactive data exploration and reporting applications
• Query both real-time and historical data together
• OLAP and time-series queries

Scalability
• Horizontal data capacity scaling by adding data nodes
• Scale usage by adding query nodes
• Data can accumulate over several months or years, and tables can grow into billions of rows

Capability
• Flexible schema, keys, metrics, and rollups
• Built-in streaming ingestion of real-time data and batch ingestion from data lakes
• Data tiering and retention with QoS
• Fault tolerant, self-healing, and balancing

Extensibility
• Extensible with plugins – Kafka, S3, Parquet, sketches
• Highly configurable at the component level
• Apache open-source, community-driven model
Kubernetes Deployment
https://imply.io/druid-architecture-concepts/
• Data Nodes
  • Historicals hold the bulk of the data
  • Middle Managers (MMs) hold data during ingestion
• Query Nodes
  • Brokers process user queries by aggregating results from MMs and Historicals
  • Routers route queries to Brokers or Master Nodes
• Master Nodes
  • Coordinators manage Historicals, data assignment, and auto-compaction
  • The Overlord manages MMs and real-time ingestion tasks

Data Nodes – Store data, including real-time data being ingested, and respond to queries.
Query Nodes – Process user queries utilizing the Data Nodes.
Master Nodes – Manage coordination, ingestion tasks and work assignment, data distribution, and recovery.
Metadata Storage stores information about data segments and dynamic configuration.
Deep Storage is where data ends up for permanent storage – the Data Lake (S3).
Real-time Ingestion
(Diagram: producers write to Kafka partitions 1..N; Druid ingestion tasks 1..M consume them, serving partial segment data during ingestion; finalized segments go to Deep Storage and are served by Historicals; query brokers answer applications from both.)
• Druid continually ingests streaming telemetry data from Kafka in real time, ~2M events/sec
• Multiple tasks running in parallel ingest from different Kafka partitions and brokers
• Tasks create data segments, which are collections of events (rows) plus indexes for search
• Tasks publish finalized segments to Deep Storage, which are picked up and served by Historicals
• Tasks run on Middle Managers and are auto-recovered on failure
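For context, Druid's Kafka ingestion is started by POSTing a supervisor spec to the Overlord (/druid/indexer/v1/supervisor), which then schedules ingestion tasks on Middle Managers. The abbreviated sketch below uses hypothetical topic, datasource, and host names and omits most of the schema:

import json
import urllib.request

# Abbreviated Kafka supervisor spec: ioConfig names the topic and
# parallelism; dataSchema (trimmed here) defines the datasource.

spec = {
    "type": "kafka",
    "spec": {
        "ioConfig": {
            "type": "kafka",
            "topic": "vehicle-telemetry",
            "consumerProperties": {"bootstrap.servers": "kafka:9092"},
            "taskCount": 4,            # parallel tasks across partitions
        },
        "dataSchema": {"dataSource": "telemetry"},  # trimmed for brevity
    },
}

req = urllib.request.Request(
    "http://druid-overlord:8081/druid/indexer/v1/supervisor",
    data=json.dumps(spec).encode(),
    headers={"Content-Type": "application/json"},
)
print(urllib.request.urlopen(req).status)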
Thank you.
