SlideShare a Scribd company logo
The
Technology
Behind
Yucheng Low, PhD
Chief Architect
GraphLab
Create
GraphLab Philosophy
Users-First Architecture
Architecture
Systems
User
Architecture
Systems
User
Systems-First
Architectures
Systems define constraints.
Optimize for performance.
PowerGraph
Architecture
Systems
User
Users-First
Architectures
Users define constraints.
Optimize for user interaction.
What is a Users-First
Architecture for Data Science?
SFrame and SGraph
Building on decades of database
and systems research.
Built by data scientists,
for data scientists.
SFrame: Scalable Tabular
Data Manipulation
SGraph: Scalable Graph
Manipulation
User Com.
Title Body
User Disc.
Enabling users to easily and efficiently
translate between both representations to
get the best of both worlds.
SFrame: Scalable Tabular
Data Manipulation
SGraph: Scalable Graph
Manipulation
User Com.
Title Body
User Disc.
SFrame: Scalable Tabular
Data Manipulation
SGraph: Scalable Graph
Manipulation
User Com.
Title Body
User Disc.
SFrame Design
Jobs fail because:
• Machine run out of memory
• Did not set Java Heap Size correctly
• Resource Configuration X needs to be
bigger.
Pain Point #1: Resource Limits
SFrame Design
• Graceful Degradation as 1st principle
• Always Works
Pain Point #2: Too Strict or Too Weak Schemas
We want strong schema types.
We also want weak schema types.
Missing Values
SFrame Design
• Graceful Degradation as 1st principle
• Always Works
• Rich Datatypes
• Strong schema types: int, double, string...
• Weak schema types: list, dictionary
SFrame Design
• Graceful Degradation as 1st principle
• Always Works
• Rich Datatypes
• Strong schema types: int, double, string...
• Weak schema types: list, dictionary
Pain Point #3: Feature Manipulation
Difficult or costly to inspect existing features and
create new features.
Hard to perform data exploration.
SFrame Design
• Graceful Degradation as 1st principle
• Always Works
• Rich Datatypes
• Strong schema types: int, double, string...
• Weak schema types: list, dictionary
• Columnar Architecture
• Easy feature engineering + Vectorized feature operations.
• Immutable columns + Lazy evaluation
• Statistics + visualization + sketches
SFrame Python API Example
Make a little SFrame of 1 column and 5 values:
sf = gl.SFrame({‘x’:[1,2,3,4,5]})
Normalizes the column x:
sf[‘x’] = sf[‘x’] / sf[‘x’].sum()
Uses a python lambda to create a new column:
sf[‘x-squared’] = sf[‘x’].apply(lambda x: x*x)
Create a new column using a vectorized operator:
sf[‘x-cubed’] = sf[‘x-squared’] * sf[‘x’]
Create a new SFrame taking only 2 of the columns:
sf2 = sf[[‘x’,’x-squared’]]
SFrame Querying
Supports most typical SQL SELECT operations using a
Pythonic syntax.
SQL
SELECT Book.title AS title, COUNT(*) AS authors
FROM Book
JOIN Book_author ON Book.isbn = Book_author.isbn
GROUP BY Book.title;
SFrame Python
Book.join(Book_author, on=‘isbn’)
.groupby(‘title’, {‘authors’:gl.aggregate.COUNT})
SFrame Columnar Encoding
user movie rating
Netflix Dataset,
99M rows, 3 columns, ints
1.4GB raw
289MB gzip compressed
SFrame Columnar Encoding
user movie rating Type aware compression:
• Variable Bit length Encode
• Frame Of Reference Encode
• ZigZag Encode
• Delta / Delta ZigZag Encode
• Dictionary Encode
• General Purpose LZ4
Netflix Dataset,
99M rows, 3 columns, ints
1.4GB raw
289MB gzip compressed
SFrame File
SFrame Columnar Encoding
user movie rating Type aware compression:
• Variable Bit length Encode
• Frame Of Reference Encode
• ZigZag Encode
• Delta / Delta ZigZag Encode
• Dictionary Encode
• General Purpose LZ4
Netflix Dataset,
99M rows, 3 columns, ints
1.4GB raw
289MB gzip compressed
User  176 MB 14.2 bits/int
SFrame File
0.02 bits/intMovie  257 KB
3.8 bits/intRating  47 MB
-------------------------------
Total  223MB
SFrame Columnar Encoding
user movie rating Type aware compression:
• Variable Bit length Encode
• Frame Of Reference Encode
• ZigZag Encode
• Delta / Delta ZigZag Encode
• Dictionary Encode
• General Purpose LZ4
Netflix Dataset,
99M rows, 3 columns, ints
1.4GB raw
289MB gzip compressed
User  176 MB 14.2 bits/int
SFrame File
0.02 bits/intMovie  257 KB
3.8 bits/intRating  47 MB
-------------------------------
Total  223MB
10s
SFrames Distributed
• Distributed Dataflow
• Columnar Query Optimizations
• Communicate columnar compressed blocks
rather than row tuples.
The choice of distributed or local execution is a
question of query optimization.
SFrame: Scalable Tabular
Data Manipulation
SGraph: Scalable Graph
Manipulation
User Com.
Title Body
User Disc.
SFrame: Scalable Tabular
Data Manipulation
SGraph: Scalable Graph
Manipulation
User Com.
Title Body
User Disc.
SGraph
• SFrame backed graph representation.
Inherits SFrame properties.
• Data types, External Memory, Columnar,
compression, etc.
• Layout optimized for batch external
memory computation.
SGraph Layout
1
2
3
4
Vertex
SFrames
Vertices Partitioned
into p = 4 SFrames.
SGraph Layout
1
2
3
4
Vertex
SFrames
__id Name Address ZipCode
1011 John … 98105
2131 Jack … 98102
Vertices Partitioned
into p = 4 SFrames.
SGraph Layout
1
2
3
4
Vertex
SFrames
(1,2)
(2,2)
(3,2)
(4,2)
(1,1)
(2,1)
(3,1)
(4,1)
(1,4)
(2,4)
(3,4)
(4,4)
(1,3)
(2,3)
(3,3)
(4,3)
Edge
SFrames
Edges partitioned into
p^2 = 16 SFrames.
SGraph Layout
1
2
3
4
Vertex
SFrames
(1,2)
(2,2)
(3,2)
(4,2)
(1,1)
(2,1)
(3,1)
(4,1)
(1,4)
(2,4)
(3,4)
(4,4)
(1,3)
(2,3)
(3,3)
(4,3)
Edge
SFrames
Edges partitioned into
p^2 = 16 SFrames.
SGraph Layout
1
2
3
4
Vertex
SFrames
(1,2)
(2,2)
(3,2)
(4,2)
(1,1)
(2,1)
(3,1)
(4,1)
(1,4)
(2,4)
(3,4)
(4,4)
(1,3)
(2,3)
(3,3)
(4,3)
Edge
SFrames
Edges partitioned into
p^2 = 16 SFrames.
SGraph Layout
Vertex
SFrames
Edge
SFrames
SGraph Layout
Vertex
SFrames
Edge
SFrames
SGraph Layout
Vertex
SFrames
Edge
SFrames
SGraph Layout
Vertex
SFrames
Edge
SFrames
SGraph Layout
Vertex
SFrames
Edge
SFrames
Deep Integration of SFrames and
SGraphs
• Seamless interaction between graph data
and table data.
• Queries can be performed easily across
graph and tables.
Demo
SFrame: Scalable Tabular
Data Manipulation
SGraph: Scalable Graph
Manipulation
User Com.
Title Body
User Disc.
User-first architecture.
Built by data scientists,
for data scientists.

More Related Content

What's hot

Better {ML} Together: GraphLab Create + Spark
Better {ML} Together: GraphLab Create + Spark Better {ML} Together: GraphLab Create + Spark
Better {ML} Together: GraphLab Create + Spark
Turi, Inc.
 
What's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareWhat's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You Care
Databricks
 
Declarative Machine Learning: Bring your own Syntax, Algorithm, Data and Infr...
Declarative Machine Learning: Bring your own Syntax, Algorithm, Data and Infr...Declarative Machine Learning: Bring your own Syntax, Algorithm, Data and Infr...
Declarative Machine Learning: Bring your own Syntax, Algorithm, Data and Infr...
Turi, Inc.
 
Brussels Spark Meetup Oct 30, 2015: Spark After Dark 1.5:  Real-time, Advanc...
Brussels Spark Meetup Oct 30, 2015:  Spark After Dark 1.5:  Real-time, Advanc...Brussels Spark Meetup Oct 30, 2015:  Spark After Dark 1.5:  Real-time, Advanc...
Brussels Spark Meetup Oct 30, 2015: Spark After Dark 1.5:  Real-time, Advanc...
Chris Fregly
 
Cascalog
CascalogCascalog
Cascalog
nathanmarz
 
Practical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlibPractical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlib
Databricks
 
Improving Apache Spark Downscaling
 Improving Apache Spark Downscaling Improving Apache Spark Downscaling
Improving Apache Spark Downscaling
Databricks
 
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and RSpark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Databricks
 
Resource-Efficient Deep Learning Model Selection on Apache Spark
Resource-Efficient Deep Learning Model Selection on Apache SparkResource-Efficient Deep Learning Model Selection on Apache Spark
Resource-Efficient Deep Learning Model Selection on Apache Spark
Databricks
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
Databricks
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
Jose Quesada (hiring)
 
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim HunterDeep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter
Databricks
 
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustStructuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Spark Summit
 
Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...
Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...
Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...
Databricks
 
(DAT203) Building Graph Databases on AWS
(DAT203) Building Graph Databases on AWS(DAT203) Building Graph Databases on AWS
(DAT203) Building Graph Databases on AWS
Amazon Web Services
 
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Databricks
 
Apache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
Apache Spark MLlib's Past Trajectory and New Directions with Joseph BradleyApache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
Apache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
Databricks
 
Continuous Evaluation of Deployed Models in Production Many high-tech industr...
Continuous Evaluation of Deployed Models in Production Many high-tech industr...Continuous Evaluation of Deployed Models in Production Many high-tech industr...
Continuous Evaluation of Deployed Models in Production Many high-tech industr...
Databricks
 
SciPy 2011 pandas lightning talk
SciPy 2011 pandas lightning talkSciPy 2011 pandas lightning talk
SciPy 2011 pandas lightning talk
Wes McKinney
 
Meetup tensorframes
Meetup tensorframesMeetup tensorframes
Meetup tensorframes
Paolo Platter
 

What's hot (20)

Better {ML} Together: GraphLab Create + Spark
Better {ML} Together: GraphLab Create + Spark Better {ML} Together: GraphLab Create + Spark
Better {ML} Together: GraphLab Create + Spark
 
What's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareWhat's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You Care
 
Declarative Machine Learning: Bring your own Syntax, Algorithm, Data and Infr...
Declarative Machine Learning: Bring your own Syntax, Algorithm, Data and Infr...Declarative Machine Learning: Bring your own Syntax, Algorithm, Data and Infr...
Declarative Machine Learning: Bring your own Syntax, Algorithm, Data and Infr...
 
Brussels Spark Meetup Oct 30, 2015: Spark After Dark 1.5:  Real-time, Advanc...
Brussels Spark Meetup Oct 30, 2015:  Spark After Dark 1.5:  Real-time, Advanc...Brussels Spark Meetup Oct 30, 2015:  Spark After Dark 1.5:  Real-time, Advanc...
Brussels Spark Meetup Oct 30, 2015: Spark After Dark 1.5:  Real-time, Advanc...
 
Cascalog
CascalogCascalog
Cascalog
 
Practical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlibPractical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlib
 
Improving Apache Spark Downscaling
 Improving Apache Spark Downscaling Improving Apache Spark Downscaling
Improving Apache Spark Downscaling
 
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and RSpark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
 
Resource-Efficient Deep Learning Model Selection on Apache Spark
Resource-Efficient Deep Learning Model Selection on Apache SparkResource-Efficient Deep Learning Model Selection on Apache Spark
Resource-Efficient Deep Learning Model Selection on Apache Spark
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim HunterDeep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter
 
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustStructuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
 
Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...
Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...
Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...
 
(DAT203) Building Graph Databases on AWS
(DAT203) Building Graph Databases on AWS(DAT203) Building Graph Databases on AWS
(DAT203) Building Graph Databases on AWS
 
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
 
Apache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
Apache Spark MLlib's Past Trajectory and New Directions with Joseph BradleyApache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
Apache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
 
Continuous Evaluation of Deployed Models in Production Many high-tech industr...
Continuous Evaluation of Deployed Models in Production Many high-tech industr...Continuous Evaluation of Deployed Models in Production Many high-tech industr...
Continuous Evaluation of Deployed Models in Production Many high-tech industr...
 
SciPy 2011 pandas lightning talk
SciPy 2011 pandas lightning talkSciPy 2011 pandas lightning talk
SciPy 2011 pandas lightning talk
 
Meetup tensorframes
Meetup tensorframesMeetup tensorframes
Meetup tensorframes
 

Viewers also liked

Machine Learning in 2016: Live Q&A with Carlos Guestrin
Machine Learning in 2016: Live Q&A with Carlos GuestrinMachine Learning in 2016: Live Q&A with Carlos Guestrin
Machine Learning in 2016: Live Q&A with Carlos Guestrin
Turi, Inc.
 
Making Machine Learning Scale: Single Machine and Distributed
Making Machine Learning Scale: Single Machine and DistributedMaking Machine Learning Scale: Single Machine and Distributed
Making Machine Learning Scale: Single Machine and Distributed
Turi, Inc.
 
Learn to Build an App to Find Similar Images using Deep Learning- Piotr Teterwak
Learn to Build an App to Find Similar Images using Deep Learning- Piotr TeterwakLearn to Build an App to Find Similar Images using Deep Learning- Piotr Teterwak
Learn to Build an App to Find Similar Images using Deep Learning- Piotr Teterwak
PyData
 
(CMP305) Deep Learning on AWS Made EasyCmp305
(CMP305) Deep Learning on AWS Made EasyCmp305(CMP305) Deep Learning on AWS Made EasyCmp305
(CMP305) Deep Learning on AWS Made EasyCmp305
Amazon Web Services
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
Girish Khanzode
 
Deep Learning and the state of AI / 2016
Deep Learning and the state of AI / 2016Deep Learning and the state of AI / 2016
Deep Learning and the state of AI / 2016
Grigory Sapunov
 

Viewers also liked (6)

Machine Learning in 2016: Live Q&A with Carlos Guestrin
Machine Learning in 2016: Live Q&A with Carlos GuestrinMachine Learning in 2016: Live Q&A with Carlos Guestrin
Machine Learning in 2016: Live Q&A with Carlos Guestrin
 
Making Machine Learning Scale: Single Machine and Distributed
Making Machine Learning Scale: Single Machine and DistributedMaking Machine Learning Scale: Single Machine and Distributed
Making Machine Learning Scale: Single Machine and Distributed
 
Learn to Build an App to Find Similar Images using Deep Learning- Piotr Teterwak
Learn to Build an App to Find Similar Images using Deep Learning- Piotr TeterwakLearn to Build an App to Find Similar Images using Deep Learning- Piotr Teterwak
Learn to Build an App to Find Similar Images using Deep Learning- Piotr Teterwak
 
(CMP305) Deep Learning on AWS Made EasyCmp305
(CMP305) Deep Learning on AWS Made EasyCmp305(CMP305) Deep Learning on AWS Made EasyCmp305
(CMP305) Deep Learning on AWS Made EasyCmp305
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Deep Learning and the state of AI / 2016
Deep Learning and the state of AI / 2016Deep Learning and the state of AI / 2016
Deep Learning and the state of AI / 2016
 

Similar to GraphLab Conference 2014 Yucheng Low - Scalable Data Structures: SFrame & SGraph

On-boarding with JanusGraph Performance
On-boarding with JanusGraph PerformanceOn-boarding with JanusGraph Performance
On-boarding with JanusGraph Performance
Chin Huang
 
Performance Optimization of Recommendation Training Pipeline at Netflix DB Ts...
Performance Optimization of Recommendation Training Pipeline at Netflix DB Ts...Performance Optimization of Recommendation Training Pipeline at Netflix DB Ts...
Performance Optimization of Recommendation Training Pipeline at Netflix DB Ts...
Databricks
 
Pydata talk
Pydata talkPydata talk
Pydata talk
Turi, Inc.
 
Performance & Scalability Improvements in Perforce
Performance & Scalability Improvements in PerforcePerformance & Scalability Improvements in Perforce
Performance & Scalability Improvements in Perforce
Perforce
 
Cassandra's Sweet Spot - an introduction to Apache Cassandra
Cassandra's Sweet Spot - an introduction to Apache CassandraCassandra's Sweet Spot - an introduction to Apache Cassandra
Cassandra's Sweet Spot - an introduction to Apache Cassandra
Dave Gardner
 
Serial-War
Serial-WarSerial-War
Serial-War
Xuechao Wu
 
Architectures, Frameworks and Infrastructure
Architectures, Frameworks and InfrastructureArchitectures, Frameworks and Infrastructure
Architectures, Frameworks and Infrastructure
harendra_pathak
 
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
Chester Chen
 
Deploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph BradleyDeploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Databricks
 
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
DataWorks Summit/Hadoop Summit
 
Low Level CPU Performance Profiling Examples
Low Level CPU Performance Profiling ExamplesLow Level CPU Performance Profiling Examples
Low Level CPU Performance Profiling Examples
Tanel Poder
 
Scaling MongoDB
Scaling MongoDBScaling MongoDB
Scaling MongoDB
MongoDB
 
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...
Flink Forward
 
Chicago Kafka Meetup
Chicago Kafka MeetupChicago Kafka Meetup
Chicago Kafka Meetup
Cliff Gilmore
 
Streaming in Practice - Putting Apache Kafka in Production
Streaming in Practice - Putting Apache Kafka in ProductionStreaming in Practice - Putting Apache Kafka in Production
Streaming in Practice - Putting Apache Kafka in Production
confluent
 
The Why and How of Scala at Twitter
The Why and How of Scala at TwitterThe Why and How of Scala at Twitter
The Why and How of Scala at Twitter
Alex Payne
 
Schema tools-and-trics-and-quick-intro-to-clojure-spec-22.6.2016
Schema tools-and-trics-and-quick-intro-to-clojure-spec-22.6.2016Schema tools-and-trics-and-quick-intro-to-clojure-spec-22.6.2016
Schema tools-and-trics-and-quick-intro-to-clojure-spec-22.6.2016
Metosin Oy
 
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Jen Aman
 
Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...
SignalFx
 
Hadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectHadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log project
Mao Geng
 

Similar to GraphLab Conference 2014 Yucheng Low - Scalable Data Structures: SFrame & SGraph (20)

On-boarding with JanusGraph Performance
On-boarding with JanusGraph PerformanceOn-boarding with JanusGraph Performance
On-boarding with JanusGraph Performance
 
Performance Optimization of Recommendation Training Pipeline at Netflix DB Ts...
Performance Optimization of Recommendation Training Pipeline at Netflix DB Ts...Performance Optimization of Recommendation Training Pipeline at Netflix DB Ts...
Performance Optimization of Recommendation Training Pipeline at Netflix DB Ts...
 
Pydata talk
Pydata talkPydata talk
Pydata talk
 
Performance & Scalability Improvements in Perforce
Performance & Scalability Improvements in PerforcePerformance & Scalability Improvements in Perforce
Performance & Scalability Improvements in Perforce
 
Cassandra's Sweet Spot - an introduction to Apache Cassandra
Cassandra's Sweet Spot - an introduction to Apache CassandraCassandra's Sweet Spot - an introduction to Apache Cassandra
Cassandra's Sweet Spot - an introduction to Apache Cassandra
 
Serial-War
Serial-WarSerial-War
Serial-War
 
Architectures, Frameworks and Infrastructure
Architectures, Frameworks and InfrastructureArchitectures, Frameworks and Infrastructure
Architectures, Frameworks and Infrastructure
 
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
 
Deploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph BradleyDeploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph Bradley
 
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
 
Low Level CPU Performance Profiling Examples
Low Level CPU Performance Profiling ExamplesLow Level CPU Performance Profiling Examples
Low Level CPU Performance Profiling Examples
 
Scaling MongoDB
Scaling MongoDBScaling MongoDB
Scaling MongoDB
 
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...
 
Chicago Kafka Meetup
Chicago Kafka MeetupChicago Kafka Meetup
Chicago Kafka Meetup
 
Streaming in Practice - Putting Apache Kafka in Production
Streaming in Practice - Putting Apache Kafka in ProductionStreaming in Practice - Putting Apache Kafka in Production
Streaming in Practice - Putting Apache Kafka in Production
 
The Why and How of Scala at Twitter
The Why and How of Scala at TwitterThe Why and How of Scala at Twitter
The Why and How of Scala at Twitter
 
Schema tools-and-trics-and-quick-intro-to-clojure-spec-22.6.2016
Schema tools-and-trics-and-quick-intro-to-clojure-spec-22.6.2016Schema tools-and-trics-and-quick-intro-to-clojure-spec-22.6.2016
Schema tools-and-trics-and-quick-intro-to-clojure-spec-22.6.2016
 
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
 
Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...
 
Hadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectHadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log project
 

More from Turi, Inc.

Webinar - Analyzing Video
Webinar - Analyzing VideoWebinar - Analyzing Video
Webinar - Analyzing Video
Turi, Inc.
 
Webinar - Patient Readmission Risk
Webinar - Patient Readmission RiskWebinar - Patient Readmission Risk
Webinar - Patient Readmission Risk
Turi, Inc.
 
Webinar - Know Your Customer - Arya (20160526)
Webinar - Know Your Customer - Arya (20160526)Webinar - Know Your Customer - Arya (20160526)
Webinar - Know Your Customer - Arya (20160526)
Turi, Inc.
 
Webinar - Product Matching - Palombo (20160428)
Webinar - Product Matching - Palombo (20160428)Webinar - Product Matching - Palombo (20160428)
Webinar - Product Matching - Palombo (20160428)
Turi, Inc.
 
Webinar - Pattern Mining Log Data - Vega (20160426)
Webinar - Pattern Mining Log Data - Vega (20160426)Webinar - Pattern Mining Log Data - Vega (20160426)
Webinar - Pattern Mining Log Data - Vega (20160426)
Turi, Inc.
 
Webinar - Fraud Detection - Palombo (20160428)
Webinar - Fraud Detection - Palombo (20160428)Webinar - Fraud Detection - Palombo (20160428)
Webinar - Fraud Detection - Palombo (20160428)
Turi, Inc.
 
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge DatasetsScaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Turi, Inc.
 
Pattern Mining: Extracting Value from Log Data
Pattern Mining: Extracting Value from Log DataPattern Mining: Extracting Value from Log Data
Pattern Mining: Extracting Value from Log Data
Turi, Inc.
 
Intelligent Applications with Machine Learning Toolkits
Intelligent Applications with Machine Learning ToolkitsIntelligent Applications with Machine Learning Toolkits
Intelligent Applications with Machine Learning Toolkits
Turi, Inc.
 
Text Analysis with Machine Learning
Text Analysis with Machine LearningText Analysis with Machine Learning
Text Analysis with Machine Learning
Turi, Inc.
 
Machine Learning with GraphLab Create
Machine Learning with GraphLab CreateMachine Learning with GraphLab Create
Machine Learning with GraphLab Create
Turi, Inc.
 
Machine Learning in Production with Dato Predictive Services
Machine Learning in Production with Dato Predictive ServicesMachine Learning in Production with Dato Predictive Services
Machine Learning in Production with Dato Predictive Services
Turi, Inc.
 
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015
Turi, Inc.
 
Introduction to Recommender Systems
Introduction to Recommender SystemsIntroduction to Recommender Systems
Introduction to Recommender Systems
Turi, Inc.
 
Machine learning in production
Machine learning in productionMachine learning in production
Machine learning in production
Turi, Inc.
 
Overview of Machine Learning and Feature Engineering
Overview of Machine Learning and Feature EngineeringOverview of Machine Learning and Feature Engineering
Overview of Machine Learning and Feature Engineering
Turi, Inc.
 
Building Personalized Data Products with Dato
Building Personalized Data Products with DatoBuilding Personalized Data Products with Dato
Building Personalized Data Products with Dato
Turi, Inc.
 
Getting Started With Dato - August 2015
Getting Started With Dato - August 2015Getting Started With Dato - August 2015
Getting Started With Dato - August 2015
Turi, Inc.
 
Towards a Comprehensive Machine Learning Benchmark
Towards a Comprehensive Machine Learning BenchmarkTowards a Comprehensive Machine Learning Benchmark
Towards a Comprehensive Machine Learning Benchmark
Turi, Inc.
 
Dato Keynote
Dato KeynoteDato Keynote
Dato Keynote
Turi, Inc.
 

More from Turi, Inc. (20)

Webinar - Analyzing Video
Webinar - Analyzing VideoWebinar - Analyzing Video
Webinar - Analyzing Video
 
Webinar - Patient Readmission Risk
Webinar - Patient Readmission RiskWebinar - Patient Readmission Risk
Webinar - Patient Readmission Risk
 
Webinar - Know Your Customer - Arya (20160526)
Webinar - Know Your Customer - Arya (20160526)Webinar - Know Your Customer - Arya (20160526)
Webinar - Know Your Customer - Arya (20160526)
 
Webinar - Product Matching - Palombo (20160428)
Webinar - Product Matching - Palombo (20160428)Webinar - Product Matching - Palombo (20160428)
Webinar - Product Matching - Palombo (20160428)
 
Webinar - Pattern Mining Log Data - Vega (20160426)
Webinar - Pattern Mining Log Data - Vega (20160426)Webinar - Pattern Mining Log Data - Vega (20160426)
Webinar - Pattern Mining Log Data - Vega (20160426)
 
Webinar - Fraud Detection - Palombo (20160428)
Webinar - Fraud Detection - Palombo (20160428)Webinar - Fraud Detection - Palombo (20160428)
Webinar - Fraud Detection - Palombo (20160428)
 
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge DatasetsScaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
 
Pattern Mining: Extracting Value from Log Data
Pattern Mining: Extracting Value from Log DataPattern Mining: Extracting Value from Log Data
Pattern Mining: Extracting Value from Log Data
 
Intelligent Applications with Machine Learning Toolkits
Intelligent Applications with Machine Learning ToolkitsIntelligent Applications with Machine Learning Toolkits
Intelligent Applications with Machine Learning Toolkits
 
Text Analysis with Machine Learning
Text Analysis with Machine LearningText Analysis with Machine Learning
Text Analysis with Machine Learning
 
Machine Learning with GraphLab Create
Machine Learning with GraphLab CreateMachine Learning with GraphLab Create
Machine Learning with GraphLab Create
 
Machine Learning in Production with Dato Predictive Services
Machine Learning in Production with Dato Predictive ServicesMachine Learning in Production with Dato Predictive Services
Machine Learning in Production with Dato Predictive Services
 
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015
 
Introduction to Recommender Systems
Introduction to Recommender SystemsIntroduction to Recommender Systems
Introduction to Recommender Systems
 
Machine learning in production
Machine learning in productionMachine learning in production
Machine learning in production
 
Overview of Machine Learning and Feature Engineering
Overview of Machine Learning and Feature EngineeringOverview of Machine Learning and Feature Engineering
Overview of Machine Learning and Feature Engineering
 
Building Personalized Data Products with Dato
Building Personalized Data Products with DatoBuilding Personalized Data Products with Dato
Building Personalized Data Products with Dato
 
Getting Started With Dato - August 2015
Getting Started With Dato - August 2015Getting Started With Dato - August 2015
Getting Started With Dato - August 2015
 
Towards a Comprehensive Machine Learning Benchmark
Towards a Comprehensive Machine Learning BenchmarkTowards a Comprehensive Machine Learning Benchmark
Towards a Comprehensive Machine Learning Benchmark
 
Dato Keynote
Dato KeynoteDato Keynote
Dato Keynote
 

Recently uploaded

一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
eudsoh
 
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
actyx
 
Build applications with generative AI on Google Cloud
Build applications with generative AI on Google CloudBuild applications with generative AI on Google Cloud
Build applications with generative AI on Google Cloud
Márton Kodok
 
一比一原版(UofT毕业证)多伦多大学毕业证如何办理
一比一原版(UofT毕业证)多伦多大学毕业证如何办理一比一原版(UofT毕业证)多伦多大学毕业证如何办理
一比一原版(UofT毕业证)多伦多大学毕业证如何办理
exukyp
 
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
ywqeos
 
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
hyfjgavov
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
ytypuem
 
社内勉強会資料_Hallucination of LLMs               .
社内勉強会資料_Hallucination of LLMs               .社内勉強会資料_Hallucination of LLMs               .
社内勉強会資料_Hallucination of LLMs               .
NABLAS株式会社
 
Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCAModule 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
yuvarajkumar334
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
Timothy Spann
 
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docxDATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
SaffaIbrahim1
 
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
mkkikqvo
 
一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理
keesa2
 
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
uevausa
 
How To Control IO Usage using Resource Manager
How To Control IO Usage using Resource ManagerHow To Control IO Usage using Resource Manager
How To Control IO Usage using Resource Manager
Alireza Kamrani
 
Cell The Unit of Life for NEET Multiple Choice Questions.docx
Cell The Unit of Life for NEET Multiple Choice Questions.docxCell The Unit of Life for NEET Multiple Choice Questions.docx
Cell The Unit of Life for NEET Multiple Choice Questions.docx
vasanthatpuram
 
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptxREUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
KiriakiENikolaidou
 
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
oaxefes
 
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
hqfek
 
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
eoxhsaa
 

Recently uploaded (20)

一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
 
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
 
Build applications with generative AI on Google Cloud
Build applications with generative AI on Google CloudBuild applications with generative AI on Google Cloud
Build applications with generative AI on Google Cloud
 
一比一原版(UofT毕业证)多伦多大学毕业证如何办理
一比一原版(UofT毕业证)多伦多大学毕业证如何办理一比一原版(UofT毕业证)多伦多大学毕业证如何办理
一比一原版(UofT毕业证)多伦多大学毕业证如何办理
 
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
 
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
 
社内勉強会資料_Hallucination of LLMs               .
社内勉強会資料_Hallucination of LLMs               .社内勉強会資料_Hallucination of LLMs               .
社内勉強会資料_Hallucination of LLMs               .
 
Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCAModule 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS NOTES FOR MCA
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
 
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docxDATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
 
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
 
一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理
 
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
 
How To Control IO Usage using Resource Manager
How To Control IO Usage using Resource ManagerHow To Control IO Usage using Resource Manager
How To Control IO Usage using Resource Manager
 
Cell The Unit of Life for NEET Multiple Choice Questions.docx
Cell The Unit of Life for NEET Multiple Choice Questions.docxCell The Unit of Life for NEET Multiple Choice Questions.docx
Cell The Unit of Life for NEET Multiple Choice Questions.docx
 
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptxREUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
REUSE-SCHOOL-DATA-INTEGRATED-SYSTEMS.pptx
 
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
 
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
 
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
 

GraphLab Conference 2014 Yucheng Low - Scalable Data Structures: SFrame & SGraph

Editor's Notes

  1. My name is ... I am one of the co-founders and currently the chief architect at GraphLab. Talk about - architecture philosophy @ graphlab - and what we have built into GraphLab Create.
  2. The basic architecture philosophy at GraphLab is what I like to call “Users-First” architecture. What do I mean by that?
  3. The objective of a systems Architecture is to connect systems [] to users []. To provide users with a means to access data, compute resources, and so on.
  4. And of course, there are many ways to achieve this. [] Some architectures are designed with a bottom up nature: - Given particular systems constraints, what can we perform most efficiently? - Users can then try to develop whatever applications they need around the system.
  5. Other architectures are designed top down: - We begin by defining an interaction model with users. SQL for instance is a great example. - Then we figure out how to design an architecture that supports all these capabilities efficiently.
  6. The question we are trying to answer here at GraphLab is: “what is the Users-First Architecture for Data Science? For Machine Learning?” We think we have an answer. ----- Meeting Notes (7/21/14 03:15) ----- 1 min
  7. We designed two core datastructures we call... <T> Now what are SFrames and SGraphs?
  8. - SFrame is our scalable datastructure for table manipulation - SGraph is our scalable datastructure for graph manipulation.
  9. And one design objective is to enable ....
  10. I am going to first talk about the design of SFrames, our datastructure for table manipulation
  11. The SFrame design is governed by a series of common pain points we have observed. The first is fundamental. It is *extremely* annoying when we start a job / some task, have it run for a while, then have it fail. [] because... ----- Meeting Notes (7/21/14 03:15) ----- 2 min
  12. - graceful degradation as the first core principle in the SFrame design - instead of demanding more resources, we try to ensure that things always work even if there is insufficient memory. So we spent a lot of work on developing bounded memory algorithms to ensure that when there is insufficient memory, we might run slower, but we will always work. - We want strong types, because sometimes... It is *very* useful to have a priori guarantees that my “rating column” always contains an integer even in the 1billionth row. - We want weak types because sometime data is unstructured and I need some way to manage it.
  13. Our SFrames support a rich family of datatypes. From some strong types like int, double, string to weak types like arbitrary lists and dictionaries. Our lists and dictionaries are self-recursive and hence are expressive enough to hold arbitrary JSON. ----- Meeting Notes (7/21/14 03:15) ----- 4 min
  14. Key aspects of doing ML. I need to create new features, or delete new features. I need to be able to deeply explore a single feature or small sets of features at a time. ----- Meeting Notes (7/21/14 03:22) ----- efficiently create.
  15. This leads to the SFrame use of a columnar architecture. - By representing the data column-wise, we can support easy feature engineering. - By further using immutable columns and lazy evaluation, we can push through a large number of pipelining optimizations. - easily visualize or sketch statistics about single features
  16. The API looks somewhat Pandas like, and carry very similar ideas. For instance: It is as easy to load an SFrame with a billion rows as it is to construct a tiny SFrame here of 1 column and 5 values. 2) We can easily reassign the value of an entire column: here we normalize the column by the sum of the values. 3) We can easily create new columns by applying a python lambda operation to each entry. This lambda operation is parallelized behind the scenes 4) We can also create new columns by using the vectorized operators. 5) We can easily subselect columns to create new SFrames, And due to the immutable nature of columns, this operation is essentially free. So The SFrame can be used like a general purpose table, but having a carefully curated set of scalable operations.
  17. Further more, SFrames supports most of the key important query operators like groupby, join, etc. Most typical SQL Select operations will have a natural conversion to a pythonic syntax using SFrames
  18. I mentioned columnar architecture briefly in an earlier slide but I did not mention one of the fundamental benefits of columnar encoding: - aggressive compression is possible. But it is hard to understand how much compression can do without an actual demonstration. Data is 99M rows, 3 columns of integers of user, movie and ratings. This is ... Size ... ----- Meeting Notes (7/21/14 03:22) ----- 7 min
  19. - The SFrame encoder while ingressing the data, adaptively slicing the table up into blocks, targeted at a fixed block size after compression. - For compression, a collection of very high throughput columnar compression techniques are used to reduce the size of the data. Integers are particularly heavily compressed with a family of encoding algorithms. - The on disk SFrame representation ends up as ... (movie is sorted) - The total size is smaller than the gzip compression. - This allows us to do more on one machine. - Aggressive compression allows us to keep more in application cache / file system cache - faster, more throughput even though its external memory. - Oh, how long did it take to read the CSV and encode the entire dataset into an external memory, efficiently queryable SFrame? - 10s on a 4 core desktop. (i73770K )
  20. - The SFrame encoder while ingressing the data, adaptively slicing the table up into blocks, targeted at a fixed block size after compression. - For compression, a collection of very high throughput columnar compression techniques are used to reduce the size of the data. Integers are particularly heavily compressed with a family of encoding algorithms. - The on disk SFrame representation ends up as ... (movie is sorted) - The total size is smaller than the gzip compression. - This allows us to do more on one machine. - Aggressive compression allows us to keep more in application cache / file system cache - faster, more throughput even though its external memory. - Oh, how long did it take to read the CSV and encode the entire dataset into an external memory, efficiently queryable SFrame? - 10s on a 4 core desktop. (i73770K )
  21. - The SFrame encoder while ingressing the data, adaptively slicing the table up into blocks, targeted at a fixed block size after compression. - For compression, a collection of very high throughput columnar compression techniques are used to reduce the size of the data. Integers are particularly heavily compressed with a family of encoding algorithms. - The on disk SFrame representation ends up as ... (movie is sorted) - The total size is smaller than the gzip compression. - This allows us to do more on one machine. - Aggressive compression allows us to keep more in application cache / file system cache - faster, more throughput even though its external memory. - Oh, how long did it take to read the CSV and encode the entire dataset into an external memory, efficiently queryable SFrame? - 10s on a 4 core desktop. (i73770K )
  22. Now a natural question to bring up then is how about going Distributed? We take a somewhat novel perspective on this. The user facing Python API never changes. Instead, - In other words, the decision to go distributed is simply a question of query optimization. - If you are doing something small. It is going to be more efficient locally than to pay the distributed cost. - We are building a distributed dataflow system, taking advantage of columnar query optimizations. And when communication is necessary ,we can get reduce comms by communicating columnar compressed blocks rather than row tuples. ----- Meeting Notes (7/21/14 03:22) ----- 9 min – 10 min
  23. Next I will talk about SGraphs 10 min
  24. Our scalable datastructure for graph manipulation
  25. The layout works this way. Firstly, We partition the set of vertices into a collection of SFrames. This partitioning can be arbitrary, we use a simple hash function. Each vertex Sframe then contains the vertex ID and all the properties associated with the vertices. Note that this is an Sframe and hence the vertex attributes are stored column-wise.
  26. The layout works this way. Firstly, We partition the set of vertices into a collection of SFrames. This partitioning can be arbitrary, we use a simple hash function. Each vertex Sframe then contains the vertex ID and all the properties associated with the vertices. Note that this is an Sframe and hence the vertex attributes are stored column-wise.
  27. We next partition the edges into 16 Sframes, The layout is based on the adjacency matrix. For instance, edge partition (2,4) contains all the edges that connect between vertices in partition 2 and vertices in partition 4. This allows for instance, if computation is to be performed on edge partition (2,4), I only need vertex set 2 and vertex 4 in memory.
  28. We next partition the edges into 16 Sframes, The layout is based on the adjacency matrix. For instance, edge partition (2,4) contains all the edges that connect between vertices in partition 2 and vertices in partition 4. This allows for instance, if computation is to be performed on edge partition (2,4), I only need vertex set 2 and vertex 4 in memory.
  29. We next partition the edges into 16 Sframes, The layout is based on the adjacency matrix. For instance, edge partition (2,4) contains all the edges that connect between vertices in partition 2 and vertices in partition 4. This allows for instance, if computation is to be performed on edge partition (2,4), I only need vertex set 2 and vertex 4 in memory.
  30. The magic behind this layout is that ultimately, underlying the Sgraph is Sframes, which is stored in columnar fashion. This representation hence allows us to easily separate structure and data. For instance, we can without any copying, create new Graphs which shares the same structure, but with none of the data.
  31. The magic behind this layout is that ultimately, underlying the Sgraph is Sframes, which is stored in columnar fashion. This representation hence allows us to easily separate structure and data. For instance, we can without any copying, create new Graphs which shares the same structure, but with none of the data.
  32. The magic behind this layout is that ultimately, underlying the Sgraph is Sframes, which is stored in columnar fashion. This representation hence allows us to easily separate structure and data. For instance, we can without any copying, create new Graphs which shares the same structure, but with none of the data.
  33. Or introduce new features without rewriting anything. Finally, since the vertex and edge attributes are all simply Sframes, through a trick of lazy evaluation, we can make the both the vertices and edges appear as a single Sframe to the user.
  34. Or introduce new features without rewriting anything. Finally, since the vertex and edge attributes are all simply Sframes, through a trick of lazy evaluation, we can make the both the vertices and edges appear as a single Sframe to the user.
  35. This deep integration of SFrames and SGraphs allow seamless interaction between graph data and table data, and queries can be performed easily across both graphs and tables. Now, this is not so easy to understand just from slides, so I will give a quick demo.
  36. 13 min
  37. Now Lets briefly recap what we have just done during the demo. We have taken tabular data, converted it to graph and run a basic graph algorithm on it, and ask graph questions about it. And next we take the graph, and directly interpret it as a table: joining it with other tables to get some intelligence with very little code and very little friction.. All within a few lines of Python. This is what we have achieved by trying to understand data science from a user-first perspective. Ending up with a datastructure that is easy to use, powerful and enables our machine learning algorithms in GraphLab Create to scale easily to terabyte datasets on a single machine. Thank you.