Presenter: Yucheng Low, Chief Architect, Dato

- 1. Scalable Machine Learning: Single Machine to Distributed Yucheng Low Chief Architect
- 2. What is ML scalability?
- 3. Is this scalability? An implementation of Algorithm X runs in 1600s, 800s, 400s, and 200s as machines are added, while the best single-machine implementation takes 300s.
- 4. True scalability: how long does it take to reach a predetermined accuracy? It is not about how well you can implement Algorithm X; it is about understanding the tradeoffs between different algorithms.
- 5. It is not about Scaling Up vs. Scaling Out.
- 6. It is about going as fast as you can, on any hardware, whether scaling up or scaling out.
- 7. The Dato Way: assume bounded resources and optimize for data scalability. The result scales excellently and requires fewer machines to solve a problem in the same runtime as other systems.
- 8. Single-machine scalability: the storage hierarchy. Capacity vs. throughput: ~0.1 TB of RAM at ~1-10 GB/s, ~1 TB of SSD at ~1 GB/s, ~10 TB of disk at ~0.1 GB/s. Random access is very slow! ML needs good external-memory data structures.
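The hierarchy can be made concrete with a little arithmetic. The capacities and throughputs below are illustrative figures matching the slide's rough numbers, not measurements:

```python
# Illustrative capacity/throughput figures per tier (assumed values
# echoing the slide's storage hierarchy).
TIERS = {
    "RAM":  {"capacity_tb": 0.1,  "throughput_gb_s": 10.0},
    "SSD":  {"capacity_tb": 1.0,  "throughput_gb_s": 1.0},
    "Disk": {"capacity_tb": 10.0, "throughput_gb_s": 0.1},
}

def full_scan_seconds(tier):
    """Time for one sequential pass over a full tier."""
    t = TIERS[tier]
    return t["capacity_tb"] * 1024 / t["throughput_gb_s"]
```

Even a purely sequential pass over the 10 TB disk tier takes on the order of a day at 0.1 GB/s, which is why random access, orders of magnitude slower still, has to be designed out.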
- 9. SFrame: scalable tabular data manipulation. SGraph: scalable graph manipulation.
- 10. Data is usually stored as rows (user, movie, rating), but data engineering typically applies column transformations.
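The row/column distinction is easy to see in plain Python. This toy sketch (not SFrame code) stores the same records both ways and applies a normalization to the columnar form, touching one array rather than every record:

```python
# Row-oriented: a list of records.
rows = [
    {"user": "u1", "movie": "m1", "rating": 4.0},
    {"user": "u2", "movie": "m2", "rating": 1.0},
]

# Column-oriented: one contiguous array per column.
cols = {
    "user":   [r["user"] for r in rows],
    "movie":  [r["movie"] for r in rows],
    "rating": [r["rating"] for r in rows],
}

# A column transformation reads and writes a single array.
total = sum(cols["rating"])
cols["rating"] = [x / total for x in cols["rating"]]
```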
- 11. Feature engineering is columnar. Normalize the feature: sf['rating'] = sf['rating'] / sf['rating'].sum(). Create a new feature: sf['rating-squared'] = sf['rating'].apply(lambda rating: rating*rating). Create a new dataset with two of the features: sf2 = sf[['rating', 'rating-squared']].
- 12. SFrame: a scalable, out-of-core table representation. Rich datatypes: strong schema types (int, double, string, image, ...) and weak schema types (list, dictionary; can contain arbitrary JSON). Columnar architecture: easy feature engineering plus vectorized feature operations. Lazy evaluation, statistics and sketches, type-aware compression. Netflix dataset (99M rows, 3 columns of ints): 1.4GB raw, 289MB gzip-compressed, 160MB as an SFrame.
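A toy illustration of why typed columns compress well: packing a low-cardinality integer column into fixed-width binary and handing it to a general-purpose compressor already yields a large ratio. This is a sketch of the idea only; SFrame's actual type-aware codecs are more elaborate:

```python
import struct
import zlib

# A low-cardinality integer column, 8 bytes per value when packed.
ratings = [i % 5 for i in range(100_000)]
raw = struct.pack(f"<{len(ratings)}q", *ratings)

# General-purpose compression over the typed, fixed-width layout.
packed = zlib.compress(raw, 6)
ratio = len(raw) / len(packed)
```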
- 13. Out-of-core machine learning means rethinking all ML algorithms: random access becomes sequential-only; sampling becomes sort/shuffle; and we must understand the statistical/convergence impacts of these algorithm variations.
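One example of this rethinking: uniform sampling, which naively wants random access, can be done in a single sequential pass with reservoir sampling (Algorithm R). A minimal sketch:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Uniform sample of k items from a stream in one sequential pass."""
    rng = random.Random(seed)
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            sample.append(x)          # fill the reservoir
        else:
            j = rng.randrange(i + 1)  # keep x with probability k/(i+1)
            if j < k:
                sample[j] = x
    return sample
```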
- 14. Single-machine scaling: runtime comparison of GraphLab-Create (1 node), MLlib 1.3 (5 nodes), MLlib 1.3 (1 node), and Scikit-Learn. Dataset source: LIBLinear binary classification datasets; KDD Cup data, 8.4M data points, 20M features, 2.4GB compressed. Task: predict student performance on math problems based on interactions with a tutoring system.
- 15. Single-machine scaling: runtime of GraphLab-Create (1 node) vs. BIDMach (1 GPU node) on the Criteo Kaggle click-prediction task: 46M rows, 34M sparse coefficients. Not a compute-bound task.
- 16. Graphs encode the relationships between people, products, interests, and ideas across social media, advertising, science, and the web. Big: trillions of vertices and edges and rich metadata. Facebook (10/2012): 1B users, 144B friendships. Twitter (2011): 15B follower edges.
- 17. SGraph: 1. An immutable, disk-backed graph representation (append only). 2. Vertex/edge attributes. 3. Optimized for bulk access ("get the neighborhoods of 5 million vertices"), not fine-grained queries ("get the neighborhood of 1 vertex").
- 18. Standard graph representations. Edge list, e.g. (1,102), (132,10), (48,999), ...: easy to insert, difficult to query. Sparse matrix / sorted edge list, e.g. (1,10), (1,99), (1,102), (2,5), ...: fast to query, difficult to insert (random writes).
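The query-cost difference can be sketched in a few lines: with edges sorted by (src, dst), all out-edges of a vertex sit in one contiguous slice found by binary search, whereas an unsorted list would need a full scan. The edge values are illustrative:

```python
import bisect

# Sorted edge list: (src, dst) pairs in lexicographic order.
edges = [(1, 10), (1, 99), (1, 102), (2, 5), (2, 10), (2, 120)]

def neighbors(sorted_edges, src):
    """Slice out all out-edges of `src` with two binary searches."""
    lo = bisect.bisect_left(sorted_edges, (src,))
    hi = bisect.bisect_left(sorted_edges, (src + 1,))
    return [d for _, d in sorted_edges[lo:hi]]
```

Inserting into this structure, by contrast, means shifting (or rewriting) everything after the insertion point, which is exactly the random-write cost the slide calls out.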
- 19. SGraph layout: vertices are partitioned into p = 4 vertex SFrames, each holding vertex attributes (__id, Address, ZipCode).
- 20. Edges are partitioned into p^2 = 16 edge SFrames: block (i,j) holds the edges from vertex partition i to vertex partition j, with edge attributes (__src_id, __dst_id, Message).
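A minimal sketch of the block addressing implied by the SGraph layout, assuming vertices are assigned to partitions by hashing (the slides do not specify the actual partitioning function):

```python
P = 4  # number of vertex partitions; edges land in P*P blocks

def vertex_partition(v):
    """Hypothetical assignment of a vertex id to one of P partitions."""
    return hash(v) % P

def edge_block(src, dst):
    """An edge is stored in the block addressed by its endpoints' partitions."""
    return (vertex_partition(src), vertex_partition(dst))
```

A bulk query for the neighborhoods of every vertex in partition i then touches only row i and column i of the block grid, which is what makes "get the neighborhoods of 5 million vertices" a sequential operation.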
- 23. Common Crawl graph: 3.5 billion nodes and 128 billion edges, the largest publicly available graph. 2 TB raw stored in 200GB: a 10:1 compression factor, 12.5 bits per edge, benefiting from SFrame compression methods.
- 25. Common Crawl graph (3.5 billion nodes, 128 billion edges) on 1x r3.8xlarge using 1x SSD. PageRank: 9 min per iteration. Connected components: ~1 hr. No other general-purpose library is capable of this.
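For intuition, PageRank itself needs nothing more than sequential sweeps over the edge list, which is what makes the disk-backed layout viable. A single-machine toy version (not GraphLab code; dangling-vertex mass is ignored for brevity):

```python
def pagerank(edges, n, iters=20, d=0.85):
    """PageRank over an edge list using only sequential sweeps."""
    out_deg = [0] * n
    for s, _ in edges:
        out_deg[s] += 1
    pr = [1.0 / n] * n
    for _ in range(iters):
        nxt = [(1 - d) / n] * n
        for s, t in edges:          # one sequential pass per iteration
            nxt[t] += d * pr[s] / out_deg[s]
        pr = nxt
    return pr
```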
- 26. SFrame & SGraph BSD License (August)
- 27. Distributed
- 28. Goals for going distributed: train on bigger datasets and train faster, with speedup measured relative to the best single-machine implementation.
- 29. Extending single machine to distributed: with one machine's disk, time for one pass over the data = 100s.
- 30. Extending single machine to distributed: with parallel disks across machines, time for one pass = 50s. Good external-memory data structures for ML still help.
- 31. Distributed optimization (Newton, LBFGS, FISTA, etc.): a parallel sweep over the data, then synchronize parameters; repeat. Make sure the sweep is embarrassingly parallel, and make the synchronization fast.
- 32. Distributed optimization: 1. Data begins on HDFS. 2. Every machine takes part of the data onto local disk/SSD. 3. Inter-machine communication uses fast supercomputer-style primitives.
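One synchronous round of this scheme can be simulated in a few lines: each shard stands in for one machine's local data, and the averaging step stands in for the all-reduce. A sketch with a least-squares objective, illustrative only and not the Dato implementation:

```python
def local_gradient(shard, w):
    """Least-squares gradient for y ~ w*x over one machine's shard."""
    g = 0.0
    for x, y in shard:
        g += 2 * (w * x - y) * x
    return g / len(shard)

def sync_round(shards, w, lr=0.01):
    """One parallel sweep followed by one parameter synchronization."""
    grads = [local_gradient(s, w) for s in shards]  # embarrassingly parallel
    g = sum(grads) / len(grads)                     # all-reduce stand-in
    return w - lr * g
```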
- 33. Criteo terabyte click logs. Click-prediction task: did the visitor click on a link or not?
- 34. Criteo terabyte click prediction: 4.4 billion rows, 13 features, ~1/2 TB of data. Runtime falls from 3630s to 225s as the number of machines scales to 16.
- 35. Distributed Graphs
- 36. Graph partitioning minimizes communication, which is linear in the number of machines each vertex spans. Vertex-cut: place edges on machines and let vertices span machines.
- 37. Graph partitioning for communication minimization trades off the time to compute a partition against the quality of the partition.
- 38. Since large natural graphs are difficult to partition anyway: how good a partition can we get while doing almost no work at all?
- 39. Random partitioning: randomly assign edges to machines. It requires almost no work, but is probably the worst partition you can construct. Can we do better?
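The cost of a random vertex-cut is measurable as the replication factor: the average number of machines each vertex spans, which the earlier slide notes is exactly what communication scales with. A toy estimate:

```python
import random

def replication_factor(edges, n_machines, seed=0):
    """Average number of machines each vertex spans when edges are
    placed on machines uniformly at random (a random vertex-cut)."""
    rng = random.Random(seed)
    spans = {}
    for u, v in edges:
        m = rng.randrange(n_machines)   # random edge placement
        spans.setdefault(u, set()).add(m)
        spans.setdefault(v, set()).add(m)
    return sum(len(s) for s in spans.values()) / len(spans)
```

On a star graph the hub quickly spans every machine while the leaves span one each, which is why high-degree vertices dominate the communication cost.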
- 40. SGraph partitioning: the p x p grid of edge SFrame blocks from the SGraph layout.
- 41. Slides from a couple of years ago
- 42. Distributed graphs: new graph-partitioning ideas; mixed in-core / out-of-core computation.
- 43. Common Crawl graph (3.5 billion nodes, 128 billion edges) on 16 machines (c3.8xlarge, 512 vCPUs): 45 sec per iteration, ~3B edges per second.
- 44. In search of performance: understand the memory access patterns of algorithms, single machine and distributed. Sequential? Random? Then optimize the data structures for those access patterns.
- 45. It is not merely about speed or scaling: it is about doing more with what you already have.
- 46. Excess Slides
- 47. Our tools are easy to use: sentiment analysis in five lines.
  import graphlab as gl
  train_data = gl.SFrame.read_csv(traindata_path)
  train_data['1grams'] = gl.text_analytics.count_ngrams(train_data['text'], 1)
  train_data['2grams'] = gl.text_analytics.count_ngrams(train_data['text'], 2)
  cls = gl.classifier.create(train_data, target='sentiment')
  But you may have preexisting code in NumPy, SciPy, and scikit-learn.
- 48. Automatic NumPy scaling: automatic in-memory, type-aware compression using SFrame compression technology. import graphlab.numpy ("Scalable numpy activation successful") scales all numeric NumPy arrays to datasets much larger than memory. Works with SciPy and sklearn. Demo.
- 49. Scikit-learn SGD linear classifier on the Airline Delay dataset: runtime vs. millions of rows, plain NumPy vs. GraphLab + NumPy.
- 50. Automatic NumPy scaling, caveats: sequential access is highly preferred; most memory-bound sklearn algorithms scale by at least 2x, some by more.
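The caveats follow from the design, which can be caricatured in a few lines: values live in compressed chunks, so sequential iteration decompresses each chunk exactly once, while random access would pay a whole-chunk decompression per element. A toy sketch, not the graphlab.numpy implementation:

```python
import struct
import zlib

class CompressedColumn:
    """Toy compressed numeric array: chunks are kept zlib-compressed in
    memory and decompressed one at a time during sequential iteration."""
    CHUNK = 4096

    def __init__(self, values):
        self.n = len(values)
        self.chunks = []
        for i in range(0, self.n, self.CHUNK):
            part = values[i:i + self.CHUNK]
            raw = struct.pack(f"<{len(part)}d", *part)
            self.chunks.append(zlib.compress(raw))

    def __iter__(self):
        for c in self.chunks:                 # one decompression per chunk
            raw = zlib.decompress(c)
            yield from struct.unpack(f"<{len(raw) // 8}d", raw)
```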
- 51. Deep learning throughput (images per second, GPU): GraphLab Create (GPU) vs. H2O at 4, 16, and 63 nodes. Dataset: MNIST, 60K examples, 784 dimensions. Source: H2O deep learning benchmarks using a 4-layer architecture.
