1 © Hortonworks Inc. 2011–2018. All rights reserved
Interactive Realtime Dashboards on Data Streams
Nishant Bangarwa
Hortonworks
Druid Committer, PMC
Sample Data Stream : Wikipedia Edits
Step by Step Breakdown
Consume Events → Enrich / Transform (add geolocation from IP address) → Store Events → Visualize Events
Sample Event : [[Eoghan Harris]] https://en.wikipedia.org/w/index.php?diff=792474242&oldid=787592607 * 7.114.169.238 * (+167) Added fact
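The raw line is just text; before anything else it has to be split into fields. Below is a minimal Java sketch of that parsing step (not from the original deck); the regex assumes exactly the [[title]] URL * IP * (+delta) comment layout of the sample event above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WikipediaEditParser {
    // Pattern for the "[[title]] url * ip * (+/-delta) comment" layout shown on the slide.
    private static final Pattern EDIT = Pattern.compile(
        "\\[\\[(?<page>[^\\]]+)\\]\\]\\s+(?<url>\\S+)\\s+\\*\\s+(?<ip>\\S+)\\s+\\*\\s+\\((?<delta>[+-]\\d+)\\)\\s*(?<comment>.*)");

    public static void main(String[] args) {
        String raw = "[[Eoghan Harris]] https://en.wikipedia.org/w/index.php?diff=792474242&oldid=787592607 * 7.114.169.238 * (+167) Added fact";
        Matcher m = EDIT.matcher(raw);
        if (m.matches()) {
            // The IP field is what the enrichment step later turns into a geolocation.
            System.out.printf("page=%s ip=%s delta=%s comment=%s%n",
                m.group("page"), m.group("ip"), m.group("delta"), m.group("comment"));
        }
    }
}
```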
Required Components
• Event Flow
• Event Processing
• Data Store
• Visualization Layer
Event Flow
Event Flow : Requirements
Event Producers → Queue → Event Consumers
• Low latency
• High throughput
• Failure handling
• Message delivery guarantees
  • At-least once, exactly once, at-most once
  • Varies by use case; exactly once is the holy grail and the most difficult to achieve
• Scalability
Apache Kafka
• Low latency
• High throughput
• Message delivery guarantees
  • At-least once
  • Exactly once (introduced in Apache Kafka v0.11.0, June 2017)
• Reliable design to handle failures
  • Message acks between producers and brokers
  • Data replication on brokers
  • Consumers can read from any desired offset
• Handles multiple producers/consumers
• Scalable
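As a sketch of what these guarantees look like from the producer side, here is a minimal configuration using the standard kafka-clients API; the broker address and topic name are placeholders.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ReliableProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        // Wait for acks from all in-sync replicas: an acknowledged write survives
        // as long as at least one replica survives.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // Idempotent producer (v0.11.0+): the broker de-duplicates retries, turning
        // at-least-once sends into exactly-once per partition.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("wikipedia-raw", "Eoghan Harris", "...raw edit event..."));
        }
    }
}
```

acks=all plus idempotence is what upgrades plain retries into per-partition exactly-once; full end-to-end exactly-once across a consume-process-produce pipeline additionally uses Kafka transactions.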
Event Processing
Event Processing : Requirements
• Consume-Process-Produce pattern
• Enrich and transform event streams
• Windowing
• Apply business logic
• Consume and join multiple streams into a single stream
• Failure handling
• Scalability
Consume → Process → Produce
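Before looking at Kafka Streams, here is the consume-process-produce pattern spelled out with the plain Kafka clients; a rough sketch in which the topic names come from the demo later in the deck and enrich() is a stand-in for real logic.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ConsumeProcessProduce {
    // Stand-in for a real IP-to-geolocation lookup.
    static String enrich(String rawEvent) { return rawEvent + " | geo=US/SF"; }

    public static void main(String[] args) {
        // One Properties object shared by both clients for brevity; each client
        // simply ignores (and warns about) the settings it does not use.
        Properties common = new Properties();
        common.put("bootstrap.servers", "localhost:9092");
        common.put("group.id", "enricher");
        common.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        common.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        common.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        common.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(common);
             KafkaProducer<String, String> producer = new KafkaProducer<>(common)) {
            consumer.subscribe(List.of("wikipedia-raw"));
            while (true) {
                ConsumerRecords<String, String> batch = consumer.poll(Duration.ofMillis(500)); // consume
                for (ConsumerRecord<String, String> rec : batch) {
                    String enriched = enrich(rec.value());                                      // process
                    producer.send(new ProducerRecord<>("wikipedia-enriched", rec.key(), enriched)); // produce
                }
            }
        }
    }
}
```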
Kafka Streams
• Rich, lightweight stream-processing library
• Event-at-a-time processing
• Stateful processing: windowing, joining, aggregation operators
• Local state using RocksDB
  • Backed by a changelog topic in Kafka
• Highly scalable, distributed, fault-tolerant
• Compared to a standard Kafka consumer:
  • Higher level: faster to build a sophisticated app
  • Less control for very fine-grained consumption
Kafka Streams : Wikipedia Data Enrichment
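The original slide showed this as a code screenshot. Below is a minimal Kafka Streams sketch of the same idea; lookupGeo() and extractIp() are stand-ins for a real GeoIP lookup, not the talk's actual code.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class WikipediaEnrichment {
    // Placeholder: a real app would query a GeoIP database keyed on the editor's IP.
    static String lookupGeo(String ip) { return "SF,USA"; }

    // Placeholder: pull the IP field out of the "[[title]] url * ip * (+delta) comment" line.
    static String extractIp(String raw) { return raw.split("\\*")[1].trim(); }

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("wikipedia-raw")
               // Event-at-a-time, stateless enrichment: append geolocation to each edit.
               .mapValues(raw -> raw + " | geo=" + lookupGeo(extractIp(raw)))
               .to("wikipedia-enriched");

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wikipedia-enricher");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        new KafkaStreams(builder.build(), props).start();
    }
}
```

Because this enrichment is stateless, no local state or changelog topic is involved; windowed or joined versions of the topology would bring in the RocksDB-backed state described above.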
Data Store
Data Store : Requirements
Processed Events → Data Store → Queries
• Ability to ingest streaming data
• Power interactive dashboards
• Sub-second query response time
• Ad-hoc arbitrary slicing and dicing of data
• Data freshness
  • Summarized/aggregated data is queried
• Scalability
• High availability
Druid
• Column-oriented distributed datastore
• Sub-second query times
• Realtime streaming ingestion
• Arbitrary slicing and dicing of data
• Automatic data summarization
• Approximate algorithms (HyperLogLog, theta sketches)
• Scalable to petabytes of data
• Highly available
Suitable Use Cases
• Powering interactive user-facing applications
• Arbitrary slicing and dicing of large datasets
• User behavior analysis
  • measuring distinct counts
  • retention analysis
  • funnel analysis
  • A/B testing
• Exploratory analytics / root-cause analysis
• When you are not interested in dumping the entire dataset
Druid: Segments
• Data in Druid is stored in Segment Files.
• Partitioned by time
• Ideally, segment files are each smaller than 1GB.
• If files are large, smaller time partitions are needed.
Time →
Segment 1: Monday | Segment 2: Tuesday | Segment 3: Wednesday | Segment 4: Thursday | Segment 5_1: Friday | Segment 5_2: Friday
Example Wikipedia Edit Dataset
timestamp            | page          | language | city    | country | … | added | deleted
2011-01-01T00:01:35Z | Justin Bieber | en       | SF      | USA     | … | 10    | 65
2011-01-01T00:03:63Z | Justin Bieber | en       | SF      | USA     | … | 15    | 62
2011-01-01T00:04:51Z | Justin Bieber | en       | SF      | USA     | … | 32    | 45
2011-01-01T00:05:35Z | Ke$ha         | en       | Calgary | CA      | … | 17    | 87
2011-01-01T00:06:41Z | Ke$ha         | en       | Calgary | CA      | … | 43    | 99
2011-01-02T00:08:35Z | Selena Gomes  | en       | Calgary | CA      | … | 12    | 53
Column roles: timestamp = Timestamp; page, language, city, country = Dimensions; added, deleted = Metrics
Data Rollup
Raw events:
timestamp            | page          | language | city    | country | … | added | deleted
2011-01-01T00:01:35Z | Justin Bieber | en       | SF      | USA     | … | 10    | 65
2011-01-01T00:03:63Z | Justin Bieber | en       | SF      | USA     | … | 15    | 62
2011-01-01T00:04:51Z | Justin Bieber | en       | SF      | USA     | … | 32    | 45
2011-01-01T00:05:35Z | Ke$ha         | en       | Calgary | CA      | … | 17    | 87
2011-01-01T00:06:41Z | Ke$ha         | en       | Calgary | CA      | … | 43    | 99
2011-01-02T00:08:35Z | Selena Gomes  | en       | Calgary | CA      | … | 12    | 53

Rolled up by hour:
timestamp            | page          | language | city    | country | count | sum_added | sum_deleted | min_added | max_added | …
2011-01-01T00:00:00Z | Justin Bieber | en       | SF      | USA     | 3     | 57        | 172         | 10        | 32        |
2011-01-01T00:00:00Z | Ke$ha         | en       | Calgary | CA      | 2     | 60        | 186         | 17        | 43        |
2011-01-02T00:00:00Z | Selena Gomes  | en       | Calgary | CA      | 1     | 12        | 53          | 12        | 12        |
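To make the rollup concrete, here is a small Java sketch (not from the deck; field names assumed, Java 16+ for records) that performs the same group-by-hour aggregation over the example rows. For brevity it groups only by (hour, page); Druid groups by all dimensions.

```java
import java.util.IntSummaryStatistics;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class RollupSketch {
    record Edit(String timestamp, String page, int added) {
        // Truncate the event time to its hour: the rollup bucket.
        String hour() { return timestamp.substring(0, 13) + ":00:00Z"; }
    }

    public static void main(String[] args) {
        List<Edit> edits = List.of(
            new Edit("2011-01-01T00:01:35Z", "Justin Bieber", 10),
            new Edit("2011-01-01T00:04:51Z", "Justin Bieber", 32),
            new Edit("2011-01-01T00:05:35Z", "Ke$ha", 17));

        // Group by (hour, page) and summarize: count, sum, min, max of "added".
        Map<String, IntSummaryStatistics> rollup = edits.stream().collect(
            Collectors.groupingBy(e -> e.hour() + " / " + e.page(),
                                  Collectors.summarizingInt(Edit::added)));

        rollup.forEach((key, s) -> System.out.printf(
            "%s -> count=%d sum_added=%d min_added=%d max_added=%d%n",
            key, s.getCount(), s.getSum(), s.getMin(), s.getMax()));
    }
}
```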
Dictionary Encoding
• Create and store IDs for each value
• e.g. page column
⬢ Values - Justin Bieber, Ke$ha, Selena Gomes
⬢ Encoding - Justin Bieber: 0, Ke$ha: 1, Selena Gomes: 2
⬢ Column Data - [0 0 0 1 1 2]
• city column - [0 0 0 1 1 1]
timestamp            | page          | language | city    | country | … | added | deleted
2011-01-01T00:01:35Z | Justin Bieber | en       | SF      | USA     | … | 10    | 65
2011-01-01T00:03:63Z | Justin Bieber | en       | SF      | USA     | … | 15    | 62
2011-01-01T00:04:51Z | Justin Bieber | en       | SF      | USA     | … | 32    | 45
2011-01-01T00:05:35Z | Ke$ha         | en       | Calgary | CA      | … | 17    | 87
2011-01-01T00:06:41Z | Ke$ha         | en       | Calgary | CA      | … | 43    | 99
2011-01-02T00:08:35Z | Selena Gomes  | en       | Calgary | CA      | … | 12    | 53
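A minimal Java sketch of the same encoding (not from the deck), assigning IDs in first-seen order exactly as in the page-column example above:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DictionaryEncoding {
    public static void main(String[] args) {
        String[] pageColumn = {"Justin Bieber", "Justin Bieber", "Justin Bieber",
                               "Ke$ha", "Ke$ha", "Selena Gomes"};

        Map<String, Integer> dictionary = new LinkedHashMap<>();
        List<Integer> encoded = new ArrayList<>();
        for (String value : pageColumn) {
            // computeIfAbsent assigns the next id the first time a value appears;
            // every later occurrence reuses the stored id.
            encoded.add(dictionary.computeIfAbsent(value, v -> dictionary.size()));
        }
        System.out.println(dictionary); // {Justin Bieber=0, Ke$ha=1, Selena Gomes=2}
        System.out.println(encoded);    // [0, 0, 0, 1, 1, 2]
    }
}
```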
Bitmap Indices
• Store bitmap indices for each value
⬢ Justin Bieber -> rows [0, 1, 2] -> [1 1 1 0 0 0]
⬢ Ke$ha -> rows [3, 4] -> [0 0 0 1 1 0]
⬢ Selena Gomes -> rows [5] -> [0 0 0 0 0 1]
• Queries
⬢ Justin Bieber or Ke$ha -> [1 1 1 0 0 0] OR [0 0 0 1 1 0] -> [1 1 1 1 1 0]
⬢ language = en and country = CA -> [1 1 1 1 1 1] AND [0 0 0 1 1 1] -> [0 0 0 1 1 1]
• Indexes compressed with Concise or Roaring encoding
timestamp            | page          | language | city    | country | … | added | deleted
2011-01-01T00:01:35Z | Justin Bieber | en       | SF      | USA     | … | 10    | 65
2011-01-01T00:03:63Z | Justin Bieber | en       | SF      | USA     | … | 15    | 62
2011-01-01T00:04:51Z | Justin Bieber | en       | SF      | USA     | … | 32    | 45
2011-01-01T00:05:35Z | Ke$ha         | en       | Calgary | CA      | … | 17    | 87
2011-01-01T00:06:41Z | Ke$ha         | en       | Calgary | CA      | … | 43    | 99
2011-01-02T00:08:35Z | Selena Gomes  | en       | Calgary | CA      | … | 12    | 53
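A sketch of these bitmap operations (not from the deck) using the JDK's BitSet; Druid itself stores the bitmaps compressed with Concise or Roaring encoding rather than as raw bitsets:

```java
import java.util.BitSet;

public class BitmapIndexSketch {
    // Helper: a bitmap over `size` rows with the given row numbers set.
    static BitSet bits(int size, int... rows) {
        BitSet b = new BitSet(size);
        for (int r : rows) b.set(r);
        return b;
    }

    public static void main(String[] args) {
        BitSet justinBieber = bits(6, 0, 1, 2);          // page = Justin Bieber -> rows 0,1,2
        BitSet kesha        = bits(6, 3, 4);             // page = Ke$ha         -> rows 3,4
        BitSet countryCa    = bits(6, 3, 4, 5);          // country = CA         -> rows 3,4,5
        BitSet langEn       = bits(6, 0, 1, 2, 3, 4, 5); // language = en        -> all rows

        BitSet either = (BitSet) justinBieber.clone();
        either.or(kesha);      // Justin Bieber OR Ke$ha

        BitSet enAndCa = (BitSet) langEn.clone();
        enAndCa.and(countryCa); // language = en AND country = CA

        System.out.println(either);  // {0, 1, 2, 3, 4}
        System.out.println(enAndCa); // {3, 4, 5}
    }
}
```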
Approximate Sketch Columns
Raw events:
timestamp            | page          | userid      | language | city    | country | … | added | deleted
2011-01-01T00:01:35Z | Justin Bieber | user1111111 | en       | SF      | USA     | … | 10    | 65
2011-01-01T00:03:63Z | Justin Bieber | user1111111 | en       | SF      | USA     | … | 15    | 62
2011-01-01T00:04:51Z | Justin Bieber | user2222222 | en       | SF      | USA     | … | 32    | 45
2011-01-01T00:05:35Z | Ke$ha         | user3333333 | en       | Calgary | CA      | … | 17    | 87
2011-01-01T00:06:41Z | Ke$ha         | user4444444 | en       | Calgary | CA      | … | 43    | 99
2011-01-02T00:08:35Z | Selena Gomes  | user1111111 | en       | Calgary | CA      | … | 12    | 53

Rolled up by hour:
timestamp            | page          | language | city    | country | count | sum_added | sum_deleted | min_added | userid_sketch | …
2011-01-01T00:00:00Z | Justin Bieber | en       | SF      | USA     | 3     | 57        | 172         | 10        | {sketch}      |
2011-01-01T00:00:00Z | Ke$ha         | en       | Calgary | CA      | 2     | 60        | 186         | 17        | {sketch}      |
2011-01-02T00:00:00Z | Selena Gomes  | en       | Calgary | CA      | 1     | 12        | 53          | 12        | {sketch}      |
Approximate Algorithms
• Store sketch objects instead of raw column values
• Better rollup for high-cardinality columns, e.g. userid
• Reduced storage size
• Use cases
  • Fast approximate distinct counts
  • Approximate histograms
  • Funnel/retention analysis
• Limitations
  • Not possible to do exact counts
  • Not possible to filter on individual row values
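A sketch of how a userid_sketch column behaves, assuming the Apache DataSketches Java library (org.apache.datasketches:datasketches-java); the userids come from the table above, and the exact estimates are illustrative:

```java
import org.apache.datasketches.theta.SetOperation;
import org.apache.datasketches.theta.Sketch;
import org.apache.datasketches.theta.Union;
import org.apache.datasketches.theta.UpdateSketch;

public class UseridSketchDemo {
    public static void main(String[] args) {
        // One sketch per rollup row, fed the raw userids that fall into that row.
        UpdateSketch bieberHour = UpdateSketch.builder().build();
        bieberHour.update("user1111111");
        bieberHour.update("user1111111"); // duplicates do not change the estimate
        bieberHour.update("user2222222");
        System.out.println(bieberHour.getEstimate()); // ~2.0 distinct editors

        UpdateSketch keshaHour = UpdateSketch.builder().build();
        keshaHour.update("user3333333");
        keshaHour.update("user4444444");

        // Sketches merge: distinct editors across both pages, without re-reading raw rows.
        Union union = SetOperation.builder().buildUnion();
        union.update(bieberHour);
        union.update(keshaHour);
        Sketch merged = union.getResult();
        System.out.println(merged.getEstimate()); // ~4.0
    }
}
```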
Druid Architecture
[Architecture diagram: Streaming Data → Realtime Index Tasks → (segment handoff) → Historical Nodes; Batch Data → Historical Nodes; Broker Nodes scatter queries across Realtime Index Tasks and Historical Nodes]
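From a client's point of view, the whole cluster is reached through the brokers. Here is a sketch of posting a native timeseries query to a broker, assuming Druid's default broker port 8082 and a datasource named "wikipedia" with the rolled-up columns from the earlier slides (Java 11+ HttpClient, Java 15+ text block):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DruidQuery {
    public static void main(String[] args) throws Exception {
        // Native Druid query: per-hour totals of "sum_added" over the example interval.
        String query = """
            {
              "queryType": "timeseries",
              "dataSource": "wikipedia",
              "granularity": "hour",
              "intervals": ["2011-01-01/2011-01-03"],
              "aggregations": [
                {"type": "longSum", "name": "added", "fieldName": "sum_added"}
              ]
            }""";

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:8082/druid/v2")) // broker's native query endpoint
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(query))
            .build();

        HttpResponse<String> response =
            HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // per-hour sums, straight off the rolled-up segments
    }
}
```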
Performance and Scalability : Fast Facts
• Most events per day: 300 billion events/day (Metamarkets)
• Most computed metrics: 1 billion metrics/min (Jolata)
• Largest cluster: 200 nodes (Metamarkets)
• Largest hourly ingestion: 2 TB/hour (Netflix)
Companies Using Druid
Visualization Layer
Visualization Layer : Requirements
• Rich dashboarding capabilities
• Works with multiple datasources
• Security / access control
• Allows for extension
  • Add custom visualizations
Data Store → Visualization Layer → User Dashboards
Superset
• Python backend
  • Flask-AppBuilder
  • Authentication
  • Pandas for rich analytics
  • SQLAlchemy as the SQL toolkit
• JavaScript frontend
  • React, NVD3
• Deep integration with Druid
Superset Rich Dashboarding Capabilities: Treemaps
Superset Rich Dashboarding Capabilities: Sunburst
Superset UI Provides Powerful Visualizations
Rich library of dashboard visualizations:
Basic:
• Bar Charts
• Pie Charts
• Line Charts
Advanced:
• Sankey Diagrams
• Treemaps
• Sunburst
• Heatmaps
And More!
Wikipedia Real-Time Dashboard
[Pipeline diagram: Kafka Connect → wikipedia-raw topic → IP-to-Geolocation Processor → wikipedia-enriched topic]
Demo: Wikipedia Real-Time Dashboard (Accelerated 30x)
Project Websites
• Kafka - http://kafka.apache.org
• Druid - http://druid.io
• Superset - http://superset.incubator.apache.org
Thank you
Twitter - @NishantBangarwa
Email - nbangarwa@hortonworks.com
LinkedIn - https://www.linkedin.com/in/nishant-bangarwa
Questions?

Editor's Notes

  • #3 (Sample Data Stream: Wikipedia Edits) In this talk we will discuss an end-to-end stack using open-source technologies and build a dashboard on top of streaming data. We will discuss the challenges involved and how each component in the stack addresses them. As a sample problem, we will look at the edit stream provided by Wikipedia: whenever any page is edited on Wikipedia, an edit event is generated that contains details about which page was edited, by which user, and how many characters were added or deleted.
  • #4 (Step by Step Breakdown) Let's break this problem down further. A sample event from the Wikipedia edit stream is formatted as follows: title, URL of the page edited, IP address of the user, and number of characters added/deleted. First, we consume the events as they arrive from Wikipedia. Second, we enrich each event by doing an IP lookup to add geolocation info about the user, adding fields like the city and country from where the edit is being made. Third, we store these streaming events in a data store from where they can be queried and, finally, visualized on a dashboard.
  • #5 (Required Components) So to solve the Wikipedia problem we need four components. First, a solution that can move events from one place to another in a reliable and guaranteed manner. Second, an event-processing layer that can process events and transform/enrich them (also termed ETL). Third, a data-storage layer that can provide sub-second queries on incoming data streams. Finally, a visualization layer that allows creating dashboards on top of the data store, which users can interact with to gain insights from the data.
  • #7 (Event Flow: Requirements) Producers publish events to a message queue, from which consumers fetch them.
  • #8 (Apache Kafka) In Apache Kafka, each topic is divided into a set of partitions; ordering of events is guaranteed within a single partition. Each producer can produce to multiple partitions. Each message in the queue is identifiable by an offset. Consumers consume messages from partitions sequentially and are responsible for keeping track of their own offsets, which helps minimize broker overhead.
  • #12 (Kafka Streams) Local state: data is stored locally in RocksDB; for recovery, each change to the local state is also propagated as an event to Kafka. In case of failure or topology restart, local state is restored from the changelog topic in Kafka. The changelog topic is periodically compacted to reduce its size.
  • #14 Druid Architecture
  • #17 (Druid) Column-oriented distributed datastore: data is stored in columnar format. Many datasets have a large number of dimensions, e.g. hundreds or thousands, but most queries only need 5-10 columns; the column-oriented format lets Druid scan only the required columns. Sub-second query times: Druid uses techniques like bitmap indexes for fast filtering, memory-mapped files to serve data from memory, data summarization and compression, and query caching, with highly optimized algorithms for different query types, achieving sub-second query times. Realtime streaming ingestion from almost any ETL pipeline. Arbitrary slicing and dicing of data: no need to create pre-canned drill-downs. Automatic data summarization: during ingestion Druid can summarize your data; e.g. if a dashboard only shows events aggregated by hour, Druid can optionally be configured to pre-aggregate at ingestion time. Approximate algorithms (HyperLogLog, theta sketches) for fast approximate answers. Scalable to petabytes of data. Highly available.
  • #18 Retention analysis
  • #19 (Druid: Segments) Data in Druid is stored in segment files, partitioned by time. Ideally, segment files are each smaller than 1GB; if files are large, smaller time partitions are needed.
  • #26 (Druid Architecture) Druid has the concept of different node types, where each node is designed and optimized to perform a specific set of tasks. Realtime index tasks / realtime nodes handle real-time ingestion, supporting both pull- and push-based ingestion, and can serve queries as soon as data is ingested: they store data in a write-optimized structure on the heap, periodically convert it into read-optimized, time-partitioned immutable segments, and persist those to deep storage. If you need to do any ETL, like data enrichment or joining multiple streams of data, you do it in a separate ETL step and send the massaged data to Druid. Deep storage can be any distributed filesystem and acts as a permanent backup of the data. Historical nodes are the main workhorses of a Druid cluster: they use memory-mapped files to load immutable segments and respond to user queries. Now let's see how data can be queried. Broker nodes keep track of which data chunks are loaded by each node in the cluster, can scatter a query across multiple historical and realtime nodes, and act as a caching layer. Finally, consider the case where you do not have streaming data but want to ingest batch data into Druid: batch ingestion can be done using either a Hadoop MR or Spark job, which converts the data into time-partitioned segments and persists them to deep storage.