Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insights from Log Data

Published in: Technology

  1. Building a Graph Database in Neo4j with Spark & Spark SQL to Gain New Insights from Log Data
     Robert Hryniewicz, Data Scientist/Evangelist, Hortonworks
     Rachel Poulsen, Data Science Director, TiVo
  2. Overview
     • Overview of business problem
     • Introduction to TiVo and TiVo data
     • Challenges with using data to optimize UI navigation
     • Solution
     • Demo
     • Next steps
     • Questions
  3. TiVo Background and Context
     • TiVo is a discovery platform for integrated entertainment
     • Multiple ways to find content (TiVo Roamio demo video)
     • TiVo collects ~2 million logs a day from its boxes
     • User action events, TiVo action events, inventory events
     • These events are “memory-less”: each one carries no record of what happened before it
  4. Motivation
     TiVo business initiatives
     • Help users get to the content they want faster
     • Help users discover new content more easily
     • Is feature X important to discovering and getting to content? (e.g. is the guide still used to find content?)
     Challenge
     • Measuring a KPI for initiatives in a log stream that is “memory-less”
     • Identifying events or patterns of events that impact the KPIs
     Data objective: “Path Analysis”
     • Analysis that answers questions about the navigational paths users take to get to or from defined start and/or end points
  5. Architecture Challenges
     • Traditional data platform that was sample-based and SQL-based
     • Relational databases
  6. Technical Challenges
     Relational database challenges
     • Little flexibility in “click path” definition
     • Decisions about defining “paths” are made during the processing step
     • Many business assumptions have to be made with little insight
  7. Solution
  8. Solution
     • Graph database (Neo4j)
     • Relationships are first-class citizens
     • Simple abstractions
     • Enable sophisticated models
     • “Path Analysis”
  9. Prototype: One-Day Graph Info
     • Edges: 57K relationships
     • Nodes: 135 UI or “screen” nodes, 12 “watch content” nodes
  10. Size Reduction (about 2,000×)
      • 70 GB log data
      • 35 MB Neo4j DB (all nodes & edges)
      • 1 day: Oct 1, 2015
      • Diagram: logs (70+ GB) → UI nodes, watch nodes, edges (35 MB)
  11. Screens, Transitions, and Content
      • Screen events
      • Remote button press events
      • Watch content events
  12. Sample Graph
      • Diagram nodes: Live Movie, My Shows, TiVo Central (Home)
      • Diagram edges: three, each labeled “Switched to”
  13. Architecture Overview
  14. What’s captured in the graph?
      • Node (UI)
        • Name
        • Timestamp
      • Node (Watch)
        • Type, e.g. Recorded
        • Genre, e.g. TV Show
        • Timestamp
      • Edge
        • Average time
        • Total number of keys pressed
        • Key sequence, e.g. Home → Up/Down → Select/Play
        • Total number of times the path was taken
        • Number of unique users taking this path
  15. What’s captured in the graph? (example)
      • Node: TiVo Central, 1/1/2016
      • Edge: 0.4 s average time, 3 keys pressed (Home-Down-Select), path used 50 times, 27 unique users
      • Node: Live Movie, 1/1/2016
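The node and edge properties listed on slides 14 and 15 can be sketched as plain records. This is only an illustration: the class and field names (`UINode`, `WatchNode`, `Edge`, `avg_seconds`, and so on) are assumptions, not names from the deck or from Neo4j, and the direction of the example edge (TiVo Central to Live Movie) is inferred from the order of the slide text.

```python
from dataclasses import dataclass

# Hypothetical record types mirroring the properties listed on the slide.

@dataclass(frozen=True)
class UINode:
    name: str
    timestamp: str

@dataclass(frozen=True)
class WatchNode:
    watch_type: str   # e.g. "Recorded" or "Live"
    genre: str        # e.g. "TV Show" or "Movie"
    timestamp: str

@dataclass(frozen=True)
class Edge:
    source: object        # UINode or WatchNode
    target: object        # UINode or WatchNode
    avg_seconds: float    # average transition time
    keys_pressed: int     # total number of keys pressed
    key_sequence: tuple   # keys in the order they were pressed
    times_taken: int      # total number of times the path was taken
    unique_users: int     # number of unique users taking the path

# The example values from slide 15.
home = UINode("TiVo Central", "1/1/2016")
live_movie = WatchNode("Live", "Movie", "1/1/2016")
edge = Edge(home, live_movie, 0.4, 3, ("Home", "Down", "Select"), 50, 27)
```

Keeping the aggregate figures on the edge rather than on the nodes matches the slide's point that relationships carry the path information.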
  16. Raw Log File Example
      …
      1444809715713072|Watch|live|WBINDT|MV|506|EP019641150097...
      1444809715812909|Key|HOME
      1444880816123454|UI|TivoCentralScreen
      1444809716234553|Key|DOWN
      1444809716354363|Key|SELECT
      1444809716518701|Trick3|PLAY|116|1|100|-1
      1444809719888072|Key|PLAY
      1444809719889072|Trick3|PLAY|119|1|100|-1
      1444809726966880|Watch|rec|WFXTDT|SH|508|...
      …
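A minimal sketch of how lines in this shape might be parsed, in plain Python rather than the Spark job the talk describes. It assumes the first field is a Unix timestamp in microseconds (the deck does not say so) and that the second field is the event type; `LogEvent`, `parse_line`, and `event_time` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class LogEvent:
    ts_us: int      # assumed: microseconds since the Unix epoch
    kind: str       # e.g. "Watch", "Key", "UI", "Trick3"
    fields: tuple   # remaining pipe-separated fields, left uninterpreted

def parse_line(line):
    """Split one pipe-delimited log line; return None if it is not an event."""
    parts = line.strip().split("|")
    if len(parts) < 2 or not parts[0].isdigit():
        return None
    return LogEvent(int(parts[0]), parts[1], tuple(parts[2:]))

def event_time(event):
    """Convert the assumed microsecond timestamp to a UTC datetime."""
    return datetime.fromtimestamp(event.ts_us / 1_000_000, tz=timezone.utc)

sample = [
    "…",
    "1444809715812909|Key|HOME",
    "1444809716518701|Trick3|PLAY|116|1|100|-1",
]
events = [e for e in map(parse_line, sample) if e is not None]
```

Lines that do not start with a numeric timestamp, such as the elision marker, are skipped rather than raising an error.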
  17. Filtered Log (same day)
      ...
      Watch: LIVE MOVIE (Node)
      Key: HOME (Edge)
      UI: TIVO HOME (Node)
      Key: DOWN, Key: SELECT, Key: PLAY (Edge)
      Watch: REC SHOW (Node)
      …
  18. Algorithm Overview (1 of 2)
      1. Filter for desired events
         • Remove non-Screen, non-Watch, non-Key events
      2. Session-ize and order logs to reflect Screen/Watch/Edge events
      3. Define display for Key Press events (two formats)
         • Normal: SELECT & UP x 2 & GUIDE & SELECT
         • Compact: TIVO & 9 KEYS & SELECT
      4. Generate an Edge if the transition time is below the maximum set by stakeholders (e.g. 5 min)
         • For all logs, find the following sequence:
           Node X - timestamp x (start time)
           key A
           key B
           key C
           Node Y - timestamp y (end time)
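Steps 1 to 4 can be sketched in plain Python for a single, already session-ized stream of events. This is not the Spark implementation from the talk. The function names, the `(timestamp, kind, value)` tuple layout, and the sample timestamps are assumptions, as is the reading of the compact format as "first key, count of keys in between, last key".

```python
from itertools import groupby

KEEP = {"UI", "Watch", "Key"}        # step 1: everything else is dropped
MAX_GAP_US = 5 * 60 * 1_000_000      # step 4: e.g. 5 minutes, in microseconds

def build_transitions(events, max_gap_us=MAX_GAP_US):
    """Turn (timestamp_us, kind, value) events into node-keys-node transitions."""
    transitions = []
    prev = None      # (timestamp, name) of the last node seen
    keys = []        # key presses since the last node
    for ts, kind, value in sorted(events):      # step 2: order by time
        if kind not in KEEP:
            continue
        if kind == "Key":
            keys.append(value)
            continue
        # A UI or Watch event is a node; emit an edge from the previous node.
        if prev is not None and ts - prev[0] < max_gap_us:
            transitions.append((prev[1], value, tuple(keys), ts - prev[0]))
        prev, keys = (ts, value), []
    return transitions

def format_normal(keys):
    """Step 3, normal format: collapse repeated keys, e.g. 'UP x 2'."""
    parts = []
    for key, run in groupby(keys):
        n = len(list(run))
        parts.append(key if n == 1 else f"{key} x {n}")
    return " & ".join(parts)

def format_compact(keys):
    """Step 3, compact format (assumed meaning): first & N KEYS & last."""
    if len(keys) <= 2:
        return " & ".join(keys)
    return f"{keys[0]} & {len(keys) - 2} KEYS & {keys[-1]}"

# Illustrative timestamps; the event order follows the filtered log on slide 17.
sample = [
    (0, "Watch", "LIVE MOVIE"), (1_000_000, "Key", "HOME"),
    (1_500_000, "UI", "TIVO HOME"), (2_000_000, "Key", "DOWN"),
    (2_100_000, "Key", "SELECT"), (2_200_000, "Trick3", "PLAY"),
    (2_300_000, "Key", "PLAY"), (3_000_000, "Watch", "REC SHOW"),
]
paths = build_transitions(sample)
```

A transition that takes longer than the threshold produces no edge, but the later node still becomes the start of the next candidate transition.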
  19. Algorithm Overview (2 of 2)
      5. For each unique node-edge-node calculate:
         1. Average transition time
         2. Number of transitions
         3. Number of unique transitions
         4. Number of keys pressed
         5. Key sequence (normal or compact)
      6. Export results to CSV files
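Steps 5 and 6 amount to a group-by followed by a CSV export. The sketch below uses plain Python and the standard `csv` module instead of Spark SQL. It assumes each transition carries a user identifier, that a "unique node-edge-node" is keyed by source, target, and key sequence, and that the unique count is per user as on slide 14. The column names and sample rows are invented for illustration.

```python
import csv
import io
from collections import defaultdict

FIELDS = ["source", "target", "key_sequence", "keys_pressed",
          "times_taken", "unique_users", "avg_seconds"]

def aggregate(transitions):
    """Group (user, source, target, keys, seconds) tuples per unique path."""
    groups = defaultdict(list)
    for user, src, dst, keys, seconds in transitions:
        groups[(src, dst, tuple(keys))].append((user, seconds))
    rows = []
    for (src, dst, keys), hits in sorted(groups.items()):
        rows.append({
            "source": src,
            "target": dst,
            "key_sequence": "-".join(keys),
            "keys_pressed": len(keys),
            "times_taken": len(hits),
            "unique_users": len({user for user, _ in hits}),
            "avg_seconds": sum(s for _, s in hits) / len(hits),
        })
    return rows

def to_csv(rows):
    """Serialize the aggregated rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

# Invented sample data: three transitions along one path by two users.
sample = [
    ("box-a", "TiVo Central", "Live Movie", ("Home", "Down", "Select"), 0.5),
    ("box-a", "TiVo Central", "Live Movie", ("Home", "Down", "Select"), 0.25),
    ("box-b", "TiVo Central", "Live Movie", ("Home", "Down", "Select"), 0.75),
]
rows = aggregate(sample)
csv_text = to_csv(rows)
```

Writing one row per unique path keeps the export small, which is consistent with the size reduction the deck reports for the resulting graph.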
  20. Demo
      • What is the most popular path people take to get to content?
        • Live vs. recorded
      • What percent of total paths are most popular?
      • What path is most popular? Overall? Unique?
      • What app is most popular?
      • What percent of total paths involve the Guide screen?
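Two of these questions can be sketched over aggregated path rows in plain Python. The demo itself ran against the Neo4j graph, so this is a stand-in for illustration, not the queries that were shown. The row layout, function names, and sample numbers are all assumptions.

```python
def most_popular(rows, by="times_taken"):
    """Return the path with the highest count ('times_taken' or 'unique_users')."""
    return max(rows, key=lambda row: row[by])

def share_through(rows, screen):
    """Percent of all recorded transitions that start or end at `screen`."""
    total = sum(row["times_taken"] for row in rows)
    if total == 0:
        return 0.0
    hits = sum(row["times_taken"] for row in rows
               if screen in (row["source"], row["target"]))
    return 100.0 * hits / total

# Invented sample rows, one per unique path.
rows = [
    {"source": "TiVo Central", "target": "My Shows", "times_taken": 50, "unique_users": 27},
    {"source": "Guide", "target": "Live Movie", "times_taken": 30, "unique_users": 29},
    {"source": "My Shows", "target": "Recorded Show", "times_taken": 20, "unique_users": 10},
]
overall = most_popular(rows)
by_users = most_popular(rows, by="unique_users")
guide_share = share_through(rows, "Guide")
```

Ranking by total count and by unique users can give different answers, which is why the slide asks about both.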
  21. Business Advantages
      • Measure KPIs for time to content and content discovery
      • Optimize KPIs (understanding user behavior that impacts the KPIs)
      • Enhance A/B testing by helping to answer “why?”
      • Simplify user experience across products
      • Increase engagement with new content
      • Understand feature usage interactions not only as a mutually exclusive experience
  22. Future Work
      • Deploy to production: multi-day queries
      • Add relationships and nodes for feature usage
      • Classify paths (“discovery” or “known destination”)
      • Exploratory analysis
  23. Thanks!
      • @RobHryniewicz
      • @Bayesbabe
