Interactive Realtime
Dashboards on Data Streams
Nishant Bangarwa
Hortonworks
Druid Committer, PMC
June 2017
© Hortonworks Inc. 2011 – 2016. All Rights Reserved

Interactive Realtime Dashboards on Data Streams using Kafka, Druid and Superset

  1. Interactive Realtime Dashboards on Data Streams
     Nishant Bangarwa, Hortonworks (Druid Committer, PMC), June 2017
  2. Sample Data Stream: Wikipedia Edits
  3. Demo: Wikipedia Real-Time Dashboard (Accelerated 30x)
  4. Step by Step Breakdown
     • Consume Events
     • Enrich / Transform (Add Geolocation from IP Address)
     • Store Events
     • Visualize Events
     Sample Event: [[Eoghan Harris]] https://en.wikipedia.org/w/index.php?diff=792474242&oldid=787592607 * 7.114.169.238 * (+167) Added fact
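The sample event above is a single line of text, so the first step of "Consume Events" is turning it into fields. A minimal sketch in Python, assuming every event follows the same `[[page]] url * user * (+delta) comment` layout as the one sample; the field names and the `WikiEdit` type are hypothetical, not part of the talk:

```python
import re
from typing import NamedTuple


class WikiEdit(NamedTuple):
    """Hypothetical record type for one parsed edit event."""
    page: str
    diff_url: str
    user: str      # an IP address for anonymous edits
    delta: int     # characters added (+) or removed (-)
    comment: str


# Pattern inferred from the single sample event on the slide (an assumption).
EDIT_RE = re.compile(
    r"\[\[(?P<page>.+?)\]\]\s+"
    r"(?P<url>\S+)\s+\*\s+"
    r"(?P<user>.+?)\s+\*\s+"
    r"\((?P<delta>[+-]\d+)\)\s*"
    r"(?P<comment>.*)"
)


def parse_edit(line: str) -> WikiEdit:
    """Parse one raw edit line, raising ValueError if it does not match."""
    m = EDIT_RE.match(line)
    if m is None:
        raise ValueError(f"unrecognized edit event: {line!r}")
    return WikiEdit(m["page"], m["url"], m["user"], int(m["delta"]), m["comment"])


sample = (
    "[[Eoghan Harris]] "
    "https://en.wikipedia.org/w/index.php?diff=792474242&oldid=787592607 "
    "* 7.114.169.238 * (+167) Added fact"
)
edit = parse_edit(sample)
```

The parsed `user` field is what the later enrichment step would feed into an IP-to-geolocation lookup.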
  5. Required Components
     • Event Flow
     • Event Processing
     • Data Store
     • Visualization Layer
  6. Event Flow
  7. Event Flow: Requirements
     (Diagram: Event Producers → Queue → Event Consumers)
     • Low latency
     • High throughput
     • Failure handling
     • Message delivery guarantees: at-least-once, exactly-once, at-most-once
     • Message ordering
     • Scalability
     • Fault tolerance
  8. Apache Kafka
  9. Apache Kafka
     • Low latency
     • High throughput
     • Message delivery guarantees
       • At-least once
       • Exactly once (fully introduced in Apache Kafka v0.11.0, June 2017)
     • Reliable design to handle failures
       • Message acks between producers and brokers
       • Data replication on brokers
       • Consumers can read from any desired offset
     • Handles multiple producers/consumers
     • Scalable
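Two of the points above, reading from any desired offset and at-least-once delivery, can be illustrated with a toy model. This is a sketch of the idea only: `ToyLog` and `consume` are hypothetical in-memory stand-ins and do not use or mirror the Kafka client API.

```python
class ToyLog:
    """In-memory stand-in for one partition: an append-only list of records."""

    def __init__(self):
        self._records = []

    def append(self, value):
        """Append a record and return its offset."""
        self._records.append(value)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return (offset, value) pairs starting at any requested offset."""
        chunk = self._records[offset:offset + max_records]
        return list(enumerate(chunk, start=offset))


def consume(log, committed, handler, crash_after=None):
    """Process records from the committed offset, committing after each one.

    If crash_after is set, the record at that offset is handled but the
    consumer "crashes" before committing it, so it is delivered again later.
    """
    for offset, value in log.read(committed):
        handler(value)
        if offset == crash_after:
            return committed          # simulated crash before commit
        committed = offset + 1
    return committed


log = ToyLog()
for value in ["edit-a", "edit-b", "edit-c"]:
    log.append(value)

seen = []
committed = consume(log, 0, seen.append, crash_after=1)   # crashes on edit-b
committed = consume(log, committed, seen.append)           # restart and resume
```

After the restart `seen` contains `edit-b` twice: nothing is lost, but duplicates are possible, which is what at-least-once means. Exactly-once needs additional machinery on top of this.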
  10. Event Processing
  11. Event Processing: Requirements
      (Diagram: Source → Consume → Process → Produce → Sink)
      • Consume-Process-Produce pattern
      • Enrich and transform event streams
      • Windowing
      • Apply business logic
      • Consume and join multiple streams into a single stream
      • Failure handling
      • Scalability
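Of the requirements above, windowing is the least self-explanatory. A minimal sketch of a tumbling (fixed, non-overlapping) window count in plain Python; representing events as `(epoch_seconds, key)` pairs is an assumption made for the example:

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Count events per (window_start, key) using fixed-size windows.

    events: iterable of (epoch_seconds, key) pairs (hypothetical shape).
    """
    counts = Counter()
    for epoch_seconds, key in events:
        # Round the timestamp down to the start of its window.
        window_start = epoch_seconds - epoch_seconds % window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


events = [(0, "en"), (59, "en"), (60, "en"), (61, "de")]
per_minute = tumbling_window_counts(events, window_seconds=60)
```

A real stream processor also has to decide when a window is complete and how to treat late events; this sketch ignores both.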
  12. Kafka Streams
      • Rich, lightweight stream processing library
      • Event-at-a-time
      • Stateful processing: windowing, joining, aggregation operators
        • Local state using RocksDB
        • Backed by changelog in Kafka
      • Highly scalable, distributed, fault tolerant
      • Compared to a standard Kafka consumer:
        • Higher level: faster to build a sophisticated app
        • Less control for very fine-grained consumption
  13. Kafka Streams: Wikipedia Data Enrichment
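The enrichment code on this slide did not survive in the transcript. As a stand-in, here is a sketch of the consume-process-produce shape in plain Python. It is not the Kafka Streams API (which is a Java library), and `GEO_TABLE`, its address ranges (reserved documentation ranges) and its place names are made-up placeholders.

```python
import ipaddress
from typing import Iterable, Iterator, Optional

# Hypothetical lookup table; a real job would query a geolocation database.
GEO_TABLE = [
    (ipaddress.ip_network("203.0.113.0/24"), {"city": "Example City", "country": "EX"}),
    (ipaddress.ip_network("198.51.100.0/24"), {"city": "Sample Town", "country": "SM"}),
]


def lookup_geo(user: str) -> Optional[dict]:
    """Return geo fields if `user` is an IP address found in GEO_TABLE."""
    try:
        addr = ipaddress.ip_address(user)
    except ValueError:
        return None  # a registered user name, not an IP address
    for network, geo in GEO_TABLE:
        if addr in network:
            return geo
    return None


def enrich(events: Iterable[dict]) -> Iterator[dict]:
    """Consume raw events one at a time and produce enriched copies."""
    for event in events:
        geo = lookup_geo(event["user"]) or {"city": None, "country": None}
        yield {**event, **geo}


# Lists stand in for the wikipedia-raw and wikipedia-enriched topics.
raw_topic = [
    {"page": "Eoghan Harris", "user": "203.0.113.7", "delta": 167},
    {"page": "Eoghan Harris", "user": "SomeEditor", "delta": -12},
]
enriched_topic = list(enrich(raw_topic))
```

Events are handled one at a time and the input is never mutated, which matches the event-at-a-time model described on the previous slide.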
  14. Data Store
  15. Data Store: Requirements
      (Diagram: Processed Events, Data Store, Queries)
      • Ability to ingest streaming data
      • Power interactive dashboards
        • Sub-second query response time
        • Ad-hoc arbitrary slicing and dicing of data
      • Data freshness
      • Summarized/aggregated data is queried
      • Scalability
      • High availability
  16. Druid
      • Column-oriented distributed datastore
      • Sub-second query times
      • Realtime streaming ingestion
      • Arbitrary slicing and dicing of data
      • Automatic data summarization
      • Approximate algorithms (HyperLogLog, theta)
      • Scalable to petabytes of data
      • Highly available
  17. Suitable Use Cases
      • Powering interactive user-facing applications
      • Arbitrary slicing and dicing of large datasets
      • User behavior analysis
        • Measuring distinct counts
        • Retention analysis
        • Funnel analysis
        • A/B testing
      • Exploratory analytics/root cause analysis
      • Workloads that do not need to dump the entire dataset
  18. Druid: Segments
      • Data in Druid is stored in segment files.
      • Partitioned by time
      • Ideally, segment files are each smaller than 1GB.
      • If files are large, smaller time partitions are needed.
      (Diagram: time axis with Segment 1: Monday, Segment 2: Tuesday, Segment 3: Wednesday, Segment 4: Thursday, Segment 5_1: Friday, Segment 5_2: Friday)
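The Friday example, where one day is split into Segment 5_1 and 5_2, can be sketched as follows. This is an illustration of time partitioning only: the row-count limit and the `day_part` naming are assumptions made for the example, not Druid's actual sizing rule or segment identifiers.

```python
from collections import defaultdict
from datetime import datetime


def assign_segments(rows, max_rows_per_segment):
    """Group rows by day, splitting a day into numbered parts when too large."""
    by_day = defaultdict(list)
    for row in rows:
        parsed = datetime.strptime(row["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
        by_day[parsed.date().isoformat()].append(row)

    segments = {}
    for day in sorted(by_day):
        day_rows = by_day[day]
        starts = range(0, len(day_rows), max_rows_per_segment)
        for part, start in enumerate(starts, start=1):
            segments[f"{day}_{part}"] = day_rows[start:start + max_rows_per_segment]
    return segments


rows = [
    {"timestamp": "2011-01-01T00:01:35Z", "page": "Justin Bieber"},
    {"timestamp": "2011-01-01T00:04:51Z", "page": "Justin Bieber"},
    {"timestamp": "2011-01-01T00:05:35Z", "page": "Ke$ha"},
    {"timestamp": "2011-01-02T00:08:35Z", "page": "Selena Gomes"},
]
segments = assign_segments(rows, max_rows_per_segment=2)
```

Because every segment covers a known time range, a query restricted to one day only has to touch that day's segments.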
  19. Example Wikipedia Edit Dataset
      timestamp             page           language  city     country  …  added  deleted
      2011-01-01T00:01:35Z  Justin Bieber  en        SF       USA         10     65
      2011-01-01T00:03:63Z  Justin Bieber  en        SF       USA         15     62
      2011-01-01T00:04:51Z  Justin Bieber  en        SF       USA         32     45
      2011-01-01T00:05:35Z  Ke$ha          en        Calgary  CA          17     87
      2011-01-01T00:06:41Z  Ke$ha          en        Calgary  CA          43     99
      2011-01-02T00:08:35Z  Selena Gomes   en        Calgary  CA          12     53
      Column roles: timestamp (timestamp); page, language, city, country (dimensions); added, deleted (metrics)
  20. Data Rollup
      Raw rows: the six rows of the example dataset on slide 19.
      Rollup by hour:
      timestamp             page           language  city     country  count  sum_added  sum_deleted  min_added  max_added  …
      2011-01-01T00:00:00Z  Justin Bieber  en        SF       USA      3      57         172          10         32
      2011-01-01T00:00:00Z  Ke$ha          en        Calgary  CA       2      60         186          17         43
      2011-01-02T00:00:00Z  Selena Gomes   en        Calgary  CA       1      12         53           12         12
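The rollup on this slide can be reproduced in a few lines. A minimal sketch, assuming rollup means grouping by the truncated timestamp plus all dimensions and aggregating the metrics; the timestamps are copied from the slide as-is, and only their hour prefix is used:

```python
RAW = [
    ("2011-01-01T00:01:35Z", "Justin Bieber", "en", "SF", "USA", 10, 65),
    ("2011-01-01T00:03:63Z", "Justin Bieber", "en", "SF", "USA", 15, 62),
    ("2011-01-01T00:04:51Z", "Justin Bieber", "en", "SF", "USA", 32, 45),
    ("2011-01-01T00:05:35Z", "Ke$ha", "en", "Calgary", "CA", 17, 87),
    ("2011-01-01T00:06:41Z", "Ke$ha", "en", "Calgary", "CA", 43, 99),
    ("2011-01-02T00:08:35Z", "Selena Gomes", "en", "Calgary", "CA", 12, 53),
]


def rollup_by_hour(rows):
    """Aggregate metrics per (hour, page, language, city, country)."""
    out = {}
    for ts, page, language, city, country, added, deleted in rows:
        hour = ts[:13] + ":00:00Z"          # keep "YYYY-MM-DDTHH", zero the rest
        key = (hour, page, language, city, country)
        agg = out.get(key)
        if agg is None:
            out[key] = {
                "count": 1,
                "sum_added": added,
                "sum_deleted": deleted,
                "min_added": added,
                "max_added": added,
            }
        else:
            agg["count"] += 1
            agg["sum_added"] += added
            agg["sum_deleted"] += deleted
            agg["min_added"] = min(agg["min_added"], added)
            agg["max_added"] = max(agg["max_added"], added)
    return out


rolled = rollup_by_hour(RAW)
```

Six raw rows become three stored rows. The saving grows with the number of events that share the same hour and dimension values, at the cost of no longer being able to query individual events.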
  21. Dictionary Encoding
      • Create and store IDs for each value
      • e.g. page column
        • Values: Justin Bieber, Ke$ha, Selena Gomes
        • Encoding: Justin Bieber: 0, Ke$ha: 1, Selena Gomes: 2
        • Column data: [0 0 0 1 1 2]
      • city column: [0 0 0 1 1 1]
      (Shown next to the six-row example dataset from slide 19.)
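A minimal sketch of the encoding shown on this slide. IDs are assigned in first-seen order here because that reproduces the slide's numbers for both columns; how a real store orders its dictionary is not stated in the slides.

```python
def dictionary_encode(column):
    """Return (dictionary, encoded) where each distinct value gets an integer ID."""
    dictionary = {}
    encoded = []
    for value in column:
        if value not in dictionary:
            dictionary[value] = len(dictionary)   # next unused ID
        encoded.append(dictionary[value])
    return dictionary, encoded


def dictionary_decode(dictionary, encoded):
    """Map integer IDs back to the original values."""
    reverse = {i: value for value, i in dictionary.items()}
    return [reverse[i] for i in encoded]


page = ["Justin Bieber"] * 3 + ["Ke$ha"] * 2 + ["Selena Gomes"]
city = ["SF"] * 3 + ["Calgary"] * 3

page_dict, page_ids = dictionary_encode(page)
city_dict, city_ids = dictionary_encode(city)
```

Each repeated string is stored once, and the column itself becomes a compact array of small integers.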
  22. Bitmap Indices
      • Store bitmap indices for each value
        • Justin Bieber -> [0, 1, 2] -> [1 1 1 0 0 0]
        • Ke$ha -> [3, 4] -> [0 0 0 1 1 0]
        • Selena Gomes -> [5] -> [0 0 0 0 0 1]
      • Queries
        • Justin Bieber or Ke$ha -> [1 1 1 0 0 0] OR [0 0 0 1 1 0] -> [1 1 1 1 1 0]
        • language = en and country = CA -> [1 1 1 1 1 1] AND [0 0 0 1 1 1] -> [0 0 0 1 1 1]
      • Indexes compressed with Concise or Roaring encoding
      timestamp             page           language  city     country  …  added  deleted
      2011-01-01T00:01:35Z  Justin Bieber  en        SF       USA         10     65
      2011-01-01T00:03:63Z  Justin Bieber  en        SF       USA         15     62
      2011-01-01T00:04:51Z  Justin Bieber  en        SF       USA         32     45
      2011-01-01T00:01:35Z  Ke$ha          en        Calgary  CA          17     87
      2011-01-01T00:01:35Z  Ke$ha          en        Calgary  CA          43     99
      2011-01-01T00:01:35Z  Selena Gomes   en        Calgary  CA          12     53
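The two queries on this slide can be replayed with uncompressed bitmaps. A minimal sketch using plain Python lists of 0/1; the Concise/Roaring compression mentioned above is deliberately left out.

```python
def build_bitmaps(column):
    """Return {value: bitmap} where bitmap[row] is 1 if that row has the value."""
    bitmaps = {}
    for row, value in enumerate(column):
        bitmaps.setdefault(value, [0] * len(column))[row] = 1
    return bitmaps


def bitmap_or(a, b):
    return [x | y for x, y in zip(a, b)]


def bitmap_and(a, b):
    return [x & y for x, y in zip(a, b)]


page = ["Justin Bieber"] * 3 + ["Ke$ha"] * 2 + ["Selena Gomes"]
language = ["en"] * 6
country = ["USA"] * 3 + ["CA"] * 3

page_idx = build_bitmaps(page)
language_idx = build_bitmaps(language)
country_idx = build_bitmaps(country)

# page = 'Justin Bieber' OR page = 'Ke$ha'
either = bitmap_or(page_idx["Justin Bieber"], page_idx["Ke$ha"])
# language = 'en' AND country = 'CA'
both = bitmap_and(language_idx["en"], country_idx["CA"])
matching_rows = [row for row, bit in enumerate(both) if bit]
```

Filters become bitwise operations over the indexes, so only the matching rows need to be read from the metric columns.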
  23. Approximate Sketch Columns
      timestamp             page           userid       language  city     country  …  added  deleted
      2011-01-01T00:01:35Z  Justin Bieber  user1111111  en        SF       USA         10     65
      2011-01-01T00:03:63Z  Justin Bieber  user1111111  en        SF       USA         15     62
      2011-01-01T00:04:51Z  Justin Bieber  user2222222  en        SF       USA         32     45
      2011-01-01T00:05:35Z  Ke$ha          user3333333  en        Calgary  CA          17     87
      2011-01-01T00:06:41Z  Ke$ha          user4444444  en        Calgary  CA          43     99
      2011-01-02T00:08:35Z  Selena Gomes   user1111111  en        Calgary  CA          12     53
      Rollup by hour:
      timestamp             page           language  city     country  count  sum_added  sum_deleted  min_added  userid_sketch  …
      2011-01-01T00:00:00Z  Justin Bieber  en        SF       USA      3      57         172          10         {sketch}
      2011-01-01T00:00:00Z  Ke$ha          en        Calgary  CA       2      60         186          17         {sketch}
      2011-01-02T00:00:00Z  Selena Gomes   en        Calgary  CA       1      12         53           12         {sketch}
  24. Approximate Algorithms
      • Store sketch objects instead of raw column values
        • Better rollup for high-cardinality columns, e.g. userid
        • Reduced storage size
      • Use cases
        • Fast approximate distinct counts
        • Approximate histograms
        • Funnel/retention analysis
      • Limitations
        • Not possible to do exact counts
        • Not possible to filter on individual row values
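To make "store sketch objects instead of raw values" concrete, here is a small K-minimum-values (KMV) sketch. It is a simplified illustration chosen for brevity and is not Druid's HyperLogLog or theta implementation; the class name and default `k` are assumptions made for the example.

```python
import hashlib


class KMVSketch:
    """Keeps the k smallest hash values seen; estimates distinct count from them."""

    def __init__(self, k=1024):
        self.k = k
        self.mins = set()

    @staticmethod
    def _hash(value):
        """Map a string to a pseudo-uniform float in [0, 1]."""
        digest = hashlib.sha1(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def add(self, value):
        self.mins.add(self._hash(value))
        if len(self.mins) > self.k:
            self.mins.remove(max(self.mins))

    def merge(self, other):
        """Combine two sketches, as happens when rolled-up rows are aggregated."""
        merged = KMVSketch(self.k)
        merged.mins = set(sorted(self.mins | other.mins)[: self.k])
        return merged

    def estimate(self):
        if len(self.mins) < self.k:
            return float(len(self.mins))      # fewer than k distinct values seen
        return (self.k - 1) / max(self.mins)  # standard KMV estimator


# One sketch per rolled-up row, mirroring the userid_sketch column.
bieber = KMVSketch()
for user in ["user1111111", "user1111111", "user2222222"]:
    bieber.add(user)

gomes = KMVSketch()
gomes.add("user1111111")

combined = bieber.merge(gomes)   # distinct users across both rows

large = KMVSketch()
for i in range(10000):
    large.add(f"user{i}")
```

The sketch never stores more than `k` numbers regardless of how many users it has seen, and sketches merge without double-counting users shared between rows. The raw user IDs are gone, though, which is why exact counts and per-row filters are listed as limitations.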
  25. Druid Architecture
      (Diagram labels: Batch Data, Streaming Data, Event, Realtime Nodes, Realtime Index Tasks, Handoff, Historical Nodes, Broker Nodes)
  26. Performance and Scalability: Fast Facts
      Most Events per Day        300 Billion Events / Day (Metamarkets)
      Most Computed Metrics      1 Billion Metrics / Min (Jolata)
      Largest Cluster            200 Nodes (Metamarkets)
      Largest Hourly Ingestion   2TB per Hour (Netflix)
  27. Companies Using Druid
  28. Visualization Layer
  29. Visualization Layer: Requirements
      (Diagram: Data Store, Visualization Layer, User Dashboards)
      • Rich dashboarding capabilities
      • Work with multiple data sources
      • Security/access control
      • Allow for extension
        • Add custom visualizations
  30. Superset
      • Python backend
        • Flask AppBuilder
        • Authentication
        • Pandas for rich analytics
        • SQLAlchemy for SQL toolkit
      • JavaScript frontend
        • React, NVD3
      • Deep integration with Druid
  31. Superset Rich Dashboarding Capabilities: Treemaps
  32. Superset Rich Dashboarding Capabilities: Sunburst
  33. Superset UI Provides Powerful Visualizations
      Rich library of dashboard visualizations:
      Basic:
      • Bar Charts
      • Pie Charts
      • Line Charts
      Advanced:
      • Sankey Diagrams
      • Treemaps
      • Sunburst
      • Heatmaps
      And more!
  34. Wikipedia Real-Time Dashboard
      (Diagram labels: Kafka Connect, wikipedia-raw topic, IP-to-Geolocation Processor, wikipedia-enriched topic)
  35. Project Websites
      • Kafka - http://kafka.apache.org
      • Druid - http://druid.io
      • Superset - http://superset.incubator.apache.org
  36. Thank you! Questions?
      • Twitter - @NishantBangarwa
      • Email - nbangarwa@hortonworks.com
      • LinkedIn - https://www.linkedin.com/in/nishant-bangarwa
      Off The Record (OTR) session: Experiences and challenges in working with Druid, 03:25 PM - 04:10 PM on 28 July 2017, Room 1, MLR Convention Centre, Whitefield
