Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

GraphLab Conference 2014 Yucheng Low - Scalable Data Structures: SFrame & SGraph

772 views

Published on

Scalable Data Structures: SFrame & SGraph

Published in: Data & Analytics
  • Attract Abundance Into Your Life - New musical "Angel tone" calls in your angels to help you manifest abundance and miracles into your life... starting in just minutes per day. Go here to listen now. ▲▲▲ http://ishbv.com/manifmagic/pdf
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
  • How to use "The Scrambler" ot get a girl obsessed with BANGING you... ➤➤ http://ishbv.com/unlockher/pdf
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
  • How to Love Yourself: 15 Tips for Developing Self Love ➤➤ https://bit.ly/30Ju5r6
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here

GraphLab Conference 2014 Yucheng Low - Scalable Data Structures: SFrame & SGraph

  1. 1. The Technology Behind Yucheng Low, PhD Chief Architect GraphLab Create
  2. 2. GraphLab Philosophy Users-First Architecture
  3. 3. Architecture Systems User
  4. 4. Architecture Systems User Systems-First Architectures Systems define constraints. Optimize for performance. PowerGraph
  5. 5. Architecture Systems User Users-First Architectures Users define constraints. Optimize for user interaction.
  6. 6. What is a Users-First Architecture for Data Science?
  7. 7. SFrame and SGraph Building on decades of database and systems research. Built by data scientists, for data scientists.
  8. 8. SFrame: Scalable Tabular Data Manipulation SGraph: Scalable Graph Manipulation User Com. Title Body User Disc.
  9. 9. Enabling users to easily and efficiently translate between both representations to get the best of both worlds.
  10. 10. SFrame: Scalable Tabular Data Manipulation SGraph: Scalable Graph Manipulation User Com. Title Body User Disc.
  11. 11. SFrame: Scalable Tabular Data Manipulation SGraph: Scalable Graph Manipulation User Com. Title Body User Disc.
  12. 12. SFrame Design Jobs fail because: • Machine run out of memory • Did not set Java Heap Size correctly • Resource Configuration X needs to be bigger. Pain Point #1: Resource Limits
  13. 13. SFrame Design • Graceful Degradation as 1st principle • Always Works Pain Point #2: Too Strict or Too Weak Schemas We want strong schema types. We also want weak schema types. Missing Values
  14. 14. SFrame Design • Graceful Degradation as 1st principle • Always Works • Rich Datatypes • Strong schema types: int, double, string... • Weak schema types: list, dictionary
  15. 15. SFrame Design • Graceful Degradation as 1st principle • Always Works • Rich Datatypes • Strong schema types: int, double, string... • Weak schema types: list, dictionary Pain Point #3: Feature Manipulation Difficult or costly to inspect existing features and create new features. Hard to perform data exploration.
  16. 16. SFrame Design • Graceful Degradation as 1st principle • Always Works • Rich Datatypes • Strong schema types: int, double, string... • Weak schema types: list, dictionary • Columnar Architecture • Easy feature engineering + Vectorized feature operations. • Immutable columns + Lazy evaluation • Statistics + visualization + sketches
  17. 17. SFrame Python API Example Make a little SFrame of 1 column and 5 values: sf = gl.SFrame({‘x’:[1,2,3,4,5]}) Normalizes the column x: sf[‘x’] = sf[‘x’] / sf[‘x’].sum() Uses a python lambda to create a new column: sf[‘x-squared’] = sf[‘x’].apply(lambda x: x*x) Create a new column using a vectorized operator: sf[‘x-cubed’] = sf[‘x-squared’] * sf[‘x’] Create a new SFrame taking only 2 of the columns: sf2 = sf[[‘x’,’x-squared’]]
  18. 18. SFrame Querying Supports most typical SQL SELECT operations using a Pythonic syntax. SQL SELECT Book.title AS title, COUNT(*) AS authors FROM Book JOIN Book_author ON Book.isbn = Book_author.isbn GROUP BY Book.title; SFrame Python Book.join(Book_author, on=‘isbn’) .groupby(‘title’, {‘authors’:gl.aggregate.COUNT})
  19. 19. SFrame Columnar Encoding user movie rating Netflix Dataset, 99M rows, 3 columns, ints 1.4GB raw 289MB gzip compressed
  20. 20. SFrame Columnar Encoding user movie rating Type aware compression: • Variable Bit length Encode • Frame Of Reference Encode • ZigZag Encode • Delta / Delta ZigZag Encode • Dictionary Encode • General Purpose LZ4 Netflix Dataset, 99M rows, 3 columns, ints 1.4GB raw 289MB gzip compressed SFrame File
  21. 21. SFrame Columnar Encoding user movie rating Type aware compression: • Variable Bit length Encode • Frame Of Reference Encode • ZigZag Encode • Delta / Delta ZigZag Encode • Dictionary Encode • General Purpose LZ4 Netflix Dataset, 99M rows, 3 columns, ints 1.4GB raw 289MB gzip compressed User  176 MB 14.2 bits/int SFrame File 0.02 bits/intMovie  257 KB 3.8 bits/intRating  47 MB ------------------------------- Total  223MB
  22. 22. SFrame Columnar Encoding user movie rating Type aware compression: • Variable Bit length Encode • Frame Of Reference Encode • ZigZag Encode • Delta / Delta ZigZag Encode • Dictionary Encode • General Purpose LZ4 Netflix Dataset, 99M rows, 3 columns, ints 1.4GB raw 289MB gzip compressed User  176 MB 14.2 bits/int SFrame File 0.02 bits/intMovie  257 KB 3.8 bits/intRating  47 MB ------------------------------- Total  223MB 10s
  23. 23. SFrames Distributed • Distributed Dataflow • Columnar Query Optimizations • Communicate columnar compressed blocks rather than row tuples. The choice of distributed or local execution is a question of query optimization.
  24. 24. SFrame: Scalable Tabular Data Manipulation SGraph: Scalable Graph Manipulation User Com. Title Body User Disc.
  25. 25. SFrame: Scalable Tabular Data Manipulation SGraph: Scalable Graph Manipulation User Com. Title Body User Disc.
  26. 26. SGraph • SFrame backed graph representation. Inherits SFrame properties. • Data types, External Memory, Columnar, compression, etc. • Layout optimized for batch external memory computation.
  27. 27. SGraph Layout 1 2 3 4 Vertex SFrames Vertices Partitioned into p = 4 SFrames.
  28. 28. SGraph Layout 1 2 3 4 Vertex SFrames __id Name Address ZipCode 1011 John … 98105 2131 Jack … 98102 Vertices Partitioned into p = 4 SFrames.
  29. 29. SGraph Layout 1 2 3 4 Vertex SFrames (1,2) (2,2) (3,2) (4,2) (1,1) (2,1) (3,1) (4,1) (1,4) (2,4) (3,4) (4,4) (1,3) (2,3) (3,3) (4,3) Edge SFrames Edges partitioned into p^2 = 16 SFrames.
  30. 30. SGraph Layout 1 2 3 4 Vertex SFrames (1,2) (2,2) (3,2) (4,2) (1,1) (2,1) (3,1) (4,1) (1,4) (2,4) (3,4) (4,4) (1,3) (2,3) (3,3) (4,3) Edge SFrames Edges partitioned into p^2 = 16 SFrames.
  31. 31. SGraph Layout 1 2 3 4 Vertex SFrames (1,2) (2,2) (3,2) (4,2) (1,1) (2,1) (3,1) (4,1) (1,4) (2,4) (3,4) (4,4) (1,3) (2,3) (3,3) (4,3) Edge SFrames Edges partitioned into p^2 = 16 SFrames.
  32. 32. SGraph Layout Vertex SFrames Edge SFrames
  33. 33. SGraph Layout Vertex SFrames Edge SFrames
  34. 34. SGraph Layout Vertex SFrames Edge SFrames
  35. 35. SGraph Layout Vertex SFrames Edge SFrames
  36. 36. SGraph Layout Vertex SFrames Edge SFrames
  37. 37. Deep Integration of SFrames and SGraphs • Seamless interaction between graph data and table data. • Queries can be performed easily across graph and tables.
  38. 38. Demo
  39. 39. SFrame: Scalable Tabular Data Manipulation SGraph: Scalable Graph Manipulation User Com. Title Body User Disc. User-first architecture. Built by data scientists, for data scientists.

×