
What the Bleep is Big Data? A Holistic View of Data and Algorithms

Data structures--the glue that holds data and algorithms together. This talk discusses tables and graphs--what they can do and how they are related.

Published in: Engineering

  1. 1. What the #(&*$ is Big Data? A Holistic View of Data and Algorithms. Alice Zheng, GraphLab. Strata Conference, Santa Clara, February 2014
  2. 2. Background • Machine Learning • Enable machines to understand the world • Play with data • GraphLab • Unleash data science! • Enable non-ML experts to play with data • This talk: a look at Big Data and Machine Learning from a tool builder’s perspective
  3. 3. DATA
  4. 4. What is Data? • Data is an extension of ourselves • Pictures, texts, messages, logs • Sensors and devices • Measurements and experiments • Data is organic; it is wild and messy • Data proliferates
  5. 5. Producers of Big Data • Tech industry • Google, Microsoft, Facebook, Amazon, Twitter, … • Consumer/Retail • Walmart, Target, Amazon, Netflix, … • Telecom • Verizon, AT&T, Telefonica, … • Finance • Thomson Reuters, Dow Jones, … • Health care and monitoring • Personal health metrics, health care records, … • Science • Genome research, high energy physics, astronomy, NASA, … • Etc.
  6. 6. Facebook • 1.11 billion active users [March 2013] • 665 million daily users on average [March 2013] • Daily data amount: [Aug 2012] • 500+ TB data • 2.5 billion pieces of content • 2.7 billion “Like” actions • 300 million photos • Scans 105 TB data every ½ hour • 100+ PB data stored on a single Hadoop cluster [Aug 2012] Data Sources: [Yahoo! news] [TechCrunch]
  7. 7. System Event Logs: ETW (Event Tracing for Windows) • Logs of kernel and application events • Up to 100K events per second • Binary log size: ~200 MB every 2-5 minutes • 20-50 TB/year from one machine • ~50 PB/year from 1000 machines Data source: http://msdn.microsoft.com/en-us/library/windows/desktop/bb968803%28v=vs.85%29.aspx
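The yearly volumes on the ETW slide follow from the per-file figures. The sketch below redoes that arithmetic; it assumes decimal units (1 TB = 1,000,000 MB) and a machine that logs nonstop all year, neither of which the slide states.

```python
# Back-of-the-envelope check of the log volumes quoted on the ETW slide.
# Assumptions (not stated on the slide): decimal units, logging 24/7 all year.
MINUTES_PER_YEAR = 365 * 24 * 60
MB_PER_TB = 1_000_000
TB_PER_PB = 1_000


def tb_per_year(mb_per_file: float, minutes_per_file: float) -> float:
    """Yearly log volume in TB for one machine."""
    files_per_year = MINUTES_PER_YEAR / minutes_per_file
    return files_per_year * mb_per_file / MB_PER_TB


low = tb_per_year(200, 5)   # one 200 MB file every 5 minutes
high = tb_per_year(200, 2)  # one 200 MB file every 2 minutes

print(f"one machine:   {low:.0f} to {high:.0f} TB/year")
print(f"1000 machines: {low * 1000 / TB_PER_PB:.0f} to {high * 1000 / TB_PER_PB:.0f} PB/year")
```

Under these assumptions one machine produces roughly 21 to 53 TB per year, which matches the slide's 20-50 TB range; the ~50 PB figure for 1000 machines corresponds to the upper end.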
  8. 8. A Picture of Big Data [Bubble chart. Axes: Total Size / Year (GB, TB, PB, EB) and Structure. Size of bubble = size of a single record (log-scale). Categories: Science, Tech, Other. Datasets shown: Wikipedia, WebSpam, Sys Logs, Walmart, LHC, Whole Genome Scans, SDSS, Flickr, Cellphone CDRs, Facebook, Twitter]
  9. 9. TAKING THE LEAP
  10. 10. ALGORITHMS
  11. 11. The Way to Insight • What do people do with Big Data? • Myriad algorithms for myriad tasks • Two disparate examples • What movies would Bob like? – discovering recommendations from a crowd • Why is my machine so slow? – diagnosing systems using event logs
  12. 12. Algorithm Example 1: A Recommender System
  13. 13. What Movies Would Bob Like? • Bob watched “Silver Linings Playbook” and “Twin Peaks.” What else might Bob like? • Given movie selections of many users, make recommendations for individuals
  14. 14. User-Movie Interaction Matrix • Movies: Silver Linings Playbook, Hunger Games, Twin Peaks, Iron Man 3, Mulholland Drive • Users: Bob, Anna, David, Ethan
  15. 15. Finding Similar Movies • Jaccard similarity between a pair of movies = (number of users who watched both) / (number of users who watched either) • If every user who watched either movie ends up watching both, then the two movies must be very similar.
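A minimal sketch of the Jaccard similarity described on this slide, with each movie represented by the set of users who watched it. The movie names and user IDs are hypothetical, since the cell values of the deck's interaction matrix did not survive in the transcript.

```python
def jaccard(watchers_a: set, watchers_b: set) -> float:
    """Users who watched both, divided by users who watched either."""
    either = watchers_a | watchers_b
    if not either:
        return 0.0  # convention chosen here: no viewers at all means no evidence
    return len(watchers_a & watchers_b) / len(either)


# Hypothetical viewing data: movie -> set of users who watched it.
watched = {
    "Movie A": {"u1", "u2"},
    "Movie B": {"u2", "u3", "u4"},
    "Movie C": {"u1", "u2"},
}

sim_ab = jaccard(watched["Movie A"], watched["Movie B"])  # 1 shared of 4 total
sim_ac = jaccard(watched["Movie A"], watched["Movie C"])  # identical audiences
print(sim_ab, sim_ac)
```

Movies A and C have identical audiences, so their similarity is 1, which is the "every user who watched either ends up watching both" case.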
  16. 16. User-Movie Interaction Matrix • Movies: Silver Linings Playbook, Hunger Games, Twin Peaks, Iron Man 3, Mulholland Drive • Users: Bob, Anna, David, Ethan • Sim(“Silver Linings Playbook”, “Hunger Games”) = ?
  17. 17. User-Movie Interaction Matrix • Movies: Silver Linings Playbook, Hunger Games, Twin Peaks, Iron Man 3, Mulholland Drive • Users: Bob, Anna, David, Ethan • Sim(“Silver Linings Playbook”, “Hunger Games”) = ?
  18. 18. User-Movie Interaction Matrix • Movies: Silver Linings Playbook, Hunger Games, Twin Peaks, Iron Man 3, Mulholland Drive • Users: Bob, Anna, David, Ethan • Sim(“Silver Linings Playbook”, “Hunger Games”) = ?
  19. 19. User-Movie Interaction Matrix • Movies: Silver Linings Playbook, Hunger Games, Twin Peaks, Iron Man 3, Mulholland Drive • Users: Bob, Anna, David, Ethan • Sim(“Silver Linings Playbook”, “Hunger Games”) = 1/3
  20. 20. Movie Similarity Matrix

       | | Silver Linings Playbook | Hunger Games | Twin Peaks | Iron Man 3 | Mulholland Drive |
       |---|---|---|---|---|---|
       | Silver Linings Playbook | 1 | 1/3 | 2/3 | 0 | 1/3 |
       | Hunger Games | 1/3 | 1 | 1/4 | 0 | 1/3 |
       | Twin Peaks | 2/3 | 1/4 | 1 | 0 | 2/3 |
       | Iron Man 3 | 0 | 0 | 0 | 1 | 0 |
       | Mulholland Drive | 1/3 | 1/3 | 2/3 | 0 | 1 |

  21. 21. Making New Recommendations

         recs = []
         for movie in user.preferences:
             new_movies = Sim[movie, :].topk()
             recs.extend(new_movies)
         recs.sort()

     • Equivalently, take the vector-matrix product • vector = the user’s preferences • matrix = movie similarity matrix
  22. 22. Key Ideas • During training: compute item-item similarity matrix • Making recommendations: take vector-matrix product
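The vector-matrix product can be sketched with the similarity matrix from slide 20 and Bob's two watched movies. Dropping movies the user has already seen and returning the top `k` are assumptions of this sketch, not steps the slides spell out; `recommend` is an illustrative name.

```python
from fractions import Fraction as F

MOVIES = ["Silver Linings Playbook", "Hunger Games", "Twin Peaks",
          "Iron Man 3", "Mulholland Drive"]

# Movie similarity matrix from slide 20 (rows and columns follow MOVIES).
SIM = [
    [F(1),    F(1, 3), F(2, 3), F(0), F(1, 3)],
    [F(1, 3), F(1),    F(1, 4), F(0), F(1, 3)],
    [F(2, 3), F(1, 4), F(1),    F(0), F(2, 3)],
    [F(0),    F(0),    F(0),    F(1), F(0)],
    [F(1, 3), F(1, 3), F(2, 3), F(0), F(1)],
]


def recommend(watched, k=2):
    """Score every movie as (preference vector) x (similarity matrix)."""
    n = len(MOVIES)
    prefs = [1 if title in watched else 0 for title in MOVIES]
    scores = [sum(prefs[i] * SIM[i][j] for i in range(n)) for j in range(n)]
    # Assumption: never recommend something the user has already watched.
    unseen = [(MOVIES[j], scores[j]) for j in range(n) if not prefs[j]]
    unseen.sort(key=lambda pair: pair[1], reverse=True)
    return unseen[:k]


bob_recs = recommend({"Silver Linings Playbook", "Twin Peaks"})
for title, score in bob_recs:
    print(title, score)
```

With this matrix, "Mulholland Drive" scores 1/3 + 2/3 = 1 and "Hunger Games" scores 1/3 + 1/4 = 7/12, so "Mulholland Drive" comes first for Bob.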
  23. 23. Algorithm Example 2: Diagnosing a slow computer
  24. 24. Why is My Machine So Slow? • Slow machines are frustrating! • Diagnose slowness via event logs
  25. 25. ETW – Event Tracing for Windows • Fine-grained event tracing • Up to 100,000 events per second [Figure: excerpt of sample ETW log]
  26. 26. Diagnosing Slowness • Start from slow thread • Walk backwards to construct wait graph [Figure: wait graph along a time axis; labels: Firefox, Network Stack, TCP/IP packet, Search Indexer, File Lock, Anti-Virus Checker, File Lock]
  27. 27. Key Algorithm Ideas • The insight is a wait graph • Constructing the graph involves repeated queries into a large set of events • Iterate: • What was the current thread waiting on? • Go to the source of the wait
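The iteration above can be sketched as a walk over wait records. The event schema `(waiting thread, resource, holding thread)` and the specific chain of waits are hypothetical: they borrow the labels from the slide 26 figure but are not the real ETW format, and the sketch ignores timestamps for brevity.

```python
from collections import defaultdict

# Hypothetical wait records: (waiting thread, resource, thread holding it).
EVENTS = [
    ("Firefox", "TCP/IP packet", "Network Stack"),
    ("Network Stack", "File Lock", "Search Indexer"),
    ("Search Indexer", "File Lock", "Anti-Virus Checker"),
    ("Media Player", "Audio Device", "Audio Service"),  # unrelated to Firefox
]


def build_wait_graph(events, slow_thread):
    """Collect wait edges reachable from the slow thread."""
    # Index the event table by waiting thread so each step is a lookup,
    # not a scan over every event.
    waits_by_thread = defaultdict(list)
    for waiter, resource, holder in events:
        waits_by_thread[waiter].append((resource, holder))

    edges, seen, frontier = [], {slow_thread}, [slow_thread]
    while frontier:
        thread = frontier.pop()
        # What was the current thread waiting on?
        for resource, holder in waits_by_thread.get(thread, []):
            edges.append((thread, resource, holder))
            # Go to the source of the wait, once per thread.
            if holder not in seen:
                seen.add(holder)
                frontier.append(holder)
    return edges


wait_graph = build_wait_graph(EVENTS, "Firefox")
for waiter, resource, holder in wait_graph:
    print(f"{waiter} waited on {resource} held by {holder}")
```

The index on the waiting thread is what keeps the "repeated queries into a large set of events" cheap; without it every hop would rescan the whole log.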
  28. 28. What links these algorithms and data?
  29. 29. DATA STRUCTURES – THE BRIDGE
  30. 30. Between Data and Algorithms • Data structures • Organized data • Optimized for certain computations • The key to efficient analysis • Algorithms prefer certain data structures • Raw data is amenable to certain data structures [Diagram: Data Structures sit between Data (“Amenable”) and Algorithms (“Preference”)]
  31. 31. The Disconnect • Machine Learning research – largely disconnected from implementation • Some recent advances in large-scale ML are rediscovering known data structures • Next-gen ML tools need well-tailored data structures [Diagram: Machine Learning (Statistics, optimization, linear algebra, …) and Data Structures (Lists, trees, tables, graphs, …)]
  32. 32. Two Useful Data Structures • Flat tables • Graphs
  33. 33. Data Structure 1: Flat Table
  34. 34. Flat Tables • Rows and columns • Rows = records • Columns can be typed • A lot of raw data looks like flat tables!
  35. 35. Example 1: User-Item interaction data

       | User | Item | Rating | Time |
       |---|---|---|---|
       | Alice | Breaking Bad, Season 1 | 3 | … |
       | Charlie | Twilight | 2 | |
       | Bob | Silver Linings Playbook | 4 | |
       | Frank | American Hustle | 2 | |
       | Tina | Plan 9 From Outer Space | 4 | |
       | Bob | Twin Peaks | 2 | |
       | Diana | Dr. Strangelove | 5 | |
       | … | | | |

  36. 36. Example 2: Event log data

       | Timestamp | Name | PID | CPU | Stack | … |
       |---|---|---|---|---|---|
       | 447590409 | audiodg.exe | 1848 | 1 | ntkrnlpa.exe!KeSetEvent ntkrnlpa.exe!WaitForLock | |
       | 447590411 | csrss.exe | 460 | 0 | … | |
       | 447590415 | iexplore.exe | 2478 | 1 | kernel64.exe!WaitForMultipleObjects | |
       | … | | | | | |

  37. 37. Variations of Flat Tables • Query vs. computation • Random access (in-memory) vs. sequential access (on-disk) • Column vs. row-wise representation • Indexed or not • Distributed or not • Key-value stores (hash tables)
  38. 38. Data Structure 1.5: Indexed Flat Table
  39. 39. Example of Indexed Flat Table

       | User | Item | Rating |
       |---|---|---|
       | Alice | Breaking Bad, Season 1 | 3 |
       | Charlie | Twilight | 2 |
       | Bob | Silver Linings Playbook | 4 |
       | Frank | American Hustle | 2 |
       | Tina | Plan 9 From Outer Space | 4 |
       | Bob | Twin Peaks | 2 |
       | Diana | Dr. Strangelove | 5 |
       | … | | |

  40. 40. Example of Indexed Flat Table (table as in slide 39, with an index) • Query: What items did Bob rate?
  41. 41. Example of Indexed Flat Table (table as in slide 39, with an index) • Query: What items did Bob rate? • Index of “Bob” points to rows 3 and 6
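A minimal sketch of the index from these slides: a dictionary from each user to the row numbers where that user appears. Rows are the ones shown on slide 39 and are numbered from 1 so the result lines up with "rows 3 and 6"; `build_index` is an illustrative helper, not an API from any particular library.

```python
from collections import defaultdict

# Flat table from slide 39: (User, Item, Rating).
ROWS = [
    ("Alice", "Breaking Bad, Season 1", 3),
    ("Charlie", "Twilight", 2),
    ("Bob", "Silver Linings Playbook", 4),
    ("Frank", "American Hustle", 2),
    ("Tina", "Plan 9 From Outer Space", 4),
    ("Bob", "Twin Peaks", 2),
    ("Diana", "Dr. Strangelove", 5),
]


def build_index(rows, column):
    """Map each value in one column to the 1-based row numbers holding it."""
    index = defaultdict(list)
    for row_number, row in enumerate(rows, start=1):
        index[row[column]].append(row_number)
    return dict(index)


user_index = build_index(ROWS, column=0)

# Query: what items did Bob rate? One dictionary lookup, no table scan.
bob_rows = user_index["Bob"]
bob_items = [ROWS[n - 1][1] for n in bob_rows]
print(bob_rows, bob_items)
```

Building the index costs one pass over the table; after that each "what did this user rate?" query touches only the matching rows.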
  42. 42. Back to the Recommender • Training: compute a matrix • Recommending: vector-matrix product • Raw data: user-item interaction log • Load in as flat table • Build index (user-item matrix) • Iterate through the users to train
  43. 43. ML on Flat Tables • Anything where data is represented as feature vectors • Computations operate on rows • Stochastic gradient descent • K-means clustering • … or columns • Decision tree family
  44. 44. Data Structure 2: Graph
  45. 45. Example [Figure: graph with vertices Anna, Diana, Charlie, Frank, Tina, Bob, Sam]
  46. 46. Implementation 1: Edge List • A simple flat table! • Additional columns = edge attributes (e.g., user rating of movie, time watched, etc.)

       | User | Item |
       |---|---|
       | Alice | Breaking Bad, Season 1 |
       | Charlie | Twilight |
       | Bob | Silver Linings Playbook |
       | Frank | American Hustle |
       | Tina | Plan 9 From Outer Space |
       | Bob | Twin Peaks |
       | Diana | Dr. Strangelove |
       | … | |

  47. 47. Implementation 2: Edge List + Vertex List • Two flat tables • Pre-computed join on VertexID

       | VertexID | Name | Age | Genre |
       |---|---|---|---|
       | 1 | Alice | 50 | |
       | 2 | Charlie | 26 | |
       | 3 | Bob | 33 | |
       | … | | | |
       | 100001 | Silver Linings Playbook | | Romance |
       | 100002 | Iron Man 3 | | Action |
       | 100003 | Twin Peaks | | Thriller |

       | SrcVertex | DstVertex |
       |---|---|
       | 1 | 389944 |
       | 2 | 136782 |
       | 3 | 100001 |
       | 4 | 572639 |
       | 5 | 200835 |
       | 3 | 100003 |
       | … | |

  48. 48. Graph Operations • get_neighbors(): 1. Query indexed flat table
  49. 49. Example of Indexed Flat Table (table as in slide 39, with an index) • Query: What items did Bob rate? • Index of “Bob” points to rows 3 and 6
  50. 50. Graph Operations • get_neighbors(): 1. Query indexed flat table 2. Join with vertex table on VertexID or Name

       | User | Movie | Rating |
       |---|---|---|
       | Bob | Silver Linings Playbook | 4 |
       | Bob | Twin Peaks | 2 |

       | VertexID | Name | Age | Genre |
       |---|---|---|---|
       | 3 | Bob | 33 | |
       | 100001 | Silver Linings Playbook | | Romance |
       | 100003 | Twin Peaks | | Thriller |

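The two steps of `get_neighbors()` can be sketched over an edge table and a vertex table. The rows come from slides 47 and 50, restricted to vertices that appear in the slides' excerpt; the in-memory layout (a dict keyed by VertexID, an index on SrcVertex) is an assumption of this sketch and only follows outgoing edges.

```python
from collections import defaultdict

# Vertex table keyed by VertexID (rows from the excerpt on slide 47).
VERTICES = {
    3: {"Name": "Bob", "Age": 33},
    100001: {"Name": "Silver Linings Playbook", "Genre": "Romance"},
    100002: {"Name": "Iron Man 3", "Genre": "Action"},
    100003: {"Name": "Twin Peaks", "Genre": "Thriller"},
}

# Edge table: (SrcVertex, DstVertex, Rating), ratings as on slide 50.
EDGES = [
    (3, 100001, 4),
    (3, 100003, 2),
]

# Index on SrcVertex, built once: source ID -> positions in EDGES.
SRC_INDEX = defaultdict(list)
for position, (src, _dst, _rating) in enumerate(EDGES):
    SRC_INDEX[src].append(position)


def get_neighbors(vertex_id):
    """Return (neighbor attributes, edge rating) pairs for one vertex."""
    neighbors = []
    for position in SRC_INDEX.get(vertex_id, []):  # step 1: indexed lookup
        _src, dst, rating = EDGES[position]
        neighbors.append((VERTICES[dst], rating))  # step 2: join on VertexID
    return neighbors


bob_neighbors = get_neighbors(3)
for vertex, rating in bob_neighbors:
    print(vertex["Name"], vertex["Genre"], rating)
```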
  51. 51. Graph Operations • get_subgraph(): • get_neighbors(), instantiate new table with subset of rows of old tables • Find edges/vertices with attribute = x • Filter old tables • Hypergraph – edges span more than 2 vertices • Just add more columns to the edge table
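Both operations on this slide reduce to filtering rows of the vertex and edge tables. In the sketch below the tables are lists of dictionaries with rows taken from slides 47 and 50; the helper names `filter_rows` and `get_subgraph` and their signatures are illustrative, not an existing library API.

```python
VERTICES = [
    {"VertexID": 3, "Name": "Bob", "Age": 33},
    {"VertexID": 100001, "Name": "Silver Linings Playbook", "Genre": "Romance"},
    {"VertexID": 100002, "Name": "Iron Man 3", "Genre": "Action"},
    {"VertexID": 100003, "Name": "Twin Peaks", "Genre": "Thriller"},
]
EDGES = [
    {"SrcVertex": 3, "DstVertex": 100001, "Rating": 4},
    {"SrcVertex": 3, "DstVertex": 100003, "Rating": 2},
]


def filter_rows(table, attribute, value):
    """Find edges or vertices with attribute == value by filtering a table."""
    return [row for row in table if row.get(attribute) == value]


def get_subgraph(vertices, edges, vertex_ids):
    """New tables holding only the chosen vertices and the edges among them."""
    keep = set(vertex_ids)
    sub_vertices = [v for v in vertices if v["VertexID"] in keep]
    sub_edges = [e for e in edges
                 if e["SrcVertex"] in keep and e["DstVertex"] in keep]
    return sub_vertices, sub_edges


thrillers = filter_rows(VERTICES, "Genre", "Thriller")
sub_vertices, sub_edges = get_subgraph(VERTICES, EDGES, [3, 100003])
print([v["Name"] for v in thrillers])
print([v["Name"] for v in sub_vertices], sub_edges)
```

The old tables are left untouched; each call instantiates new, smaller tables, which is the "subset of rows of old tables" idea from the slide.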
  52. 52. Back to Syslog Mining • Wait graph construction = search and filter • Iterate: • get_neighbors() • filter on edge and vertex attribute to find culprits • Sequential process • Underlying event graph is enormous • SLOW
  53. 53. ML on Graphs • Graphical models (Bayes nets) • Belief propagation • Gibbs sampling • Random walk on Markov chains • PageRank • Some algos are implementable on either • Matrix factorization
  54. 54. Graphs vs. Tables [Figure labels: Tables, Graphs]
  55. 55. Graphs vs. Tables • Closely related • Graphs can be implemented on top of tables • … yet different • What key operations to optimize • How much to pre-compute • Indexes • Joins • Filters
  56. 56. Popular Implementations
  57. 57. Flat Tables [Chart. Axes: Random Access (In Memory) vs. Sequential Access (On Disk); Querying (Interactive) vs. Computation (Batch). Labels on the chart: Pandas Spark SQL Hive/Pig GraphLab SFrame]
  58. 58. Graphs [Chart. Axes: Random Access (In-Memory) vs. Sequential Access (On disk); Querying (Interactive) vs. Computation (Batch). Labels on the chart: GraphLab Graph GraphChi Graph GraphDBs: HyperGraphDB, Titan, Neo4j Giraph]
  59. 59. Conclusions • Fast and scalable analysis hinges upon efficient data structures • Match the algo to the data structure • Morph raw data into the data structure [Diagram: Raw Data, Data Structure, Algorithm, Insight]
  60. 60. Advertising • GraphLab Tutorial this afternoon! • “Large Scale Machine Learning Cookbook Using GraphLab” • Ballroom G, 1:30pm to 5pm
