SFrames
Yucheng Low
Chief Architect @ Dato
SFrame

Scalable tabular (SFrame, SArray) and graph (SGraph) data-structures built for out-of-core data analysis.

The SFrame package provides the complete implementation of:

SFrame
SArray
SGraph
The C++ SDK surface area (gl_sframe, gl_sarray, gl_sgraph)
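
To make "out-of-core" concrete, here is a minimal sketch in plain Python. It is not SFrame's implementation and does not use the SFrame package: it only illustrates the general idea of streaming rows in fixed-size chunks and keeping just a running aggregate in memory. The function name `chunked_mean`, the chunk size, and the sample rows are all hypothetical.

```python
import csv
import io
from itertools import islice


def chunked_mean(lines, column, chunk_size=2):
    """Mean of one numeric CSV column, reading chunk_size rows at a time."""
    reader = csv.DictReader(lines)
    total = 0.0
    count = 0
    while True:
        # Pull at most chunk_size rows; earlier chunks are no longer referenced.
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        total += sum(float(row[column]) for row in chunk)
        count += len(chunk)
    if count == 0:
        raise ValueError("no rows to aggregate")
    return total / count


# Hypothetical sample data standing in for a file too large to fit in memory.
sample = io.StringIO(
    "user,movie,rating\n1,10,4\n1,11,2\n2,10,5\n3,12,3\n3,10,1\n"
)
mean_rating = chunked_mean(sample, "rating")
print(mean_rating)
```

Passing an open file object instead of the `io.StringIO` stand-in would work the same way, since `csv.DictReader` accepts any iterable of lines.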

SFrame

  1. SFrames. Yucheng Low, Chief Architect @ Dato
  2. Scalable Machine Learning: recommenders, other task-oriented ML, boosted decision trees, deep learning, pattern mining, many others. Stack: GraphLab Create; SGraph, SFrame; Local, HDFS, S3. Compressed in-core or out-of-core scalable data structures. C++11.
  3. SGraph, SFrame; Local, HDFS, S3. Compressed in-core or out-of-core scalable data structures. https://github.com/dato-code/sframe
  4. Python API. Diagram: netflix_tr.frame and netflix_norm.frame, each with columns user, movie, rating.
       sf = gl.SFrame.read_csv('netflix.csv')
       sf2 = gl.SFrame('netflix_norm.frame')
       sf['nrating'] = sf2['rating']
     sf2 has columns user, item, rating; sf now has columns user, item, rating, nrating.
  5. Python API. Same diagram and statements as the previous slide, plus:
       diff = sf['rating'] - sf2['rating']
     diff is an anonymous (unnamed) column.
  6. Python API. Same as the previous slide, plus:
       sf['diff'] = diff
     Not a SQL frontend:
     • Filtering: sf[sf['rating'] >= 3]
     • Joins: sf.join(user_table, on='user_id')
     • Random/array indexing: row10 = sf[10]; table_with_every_other_row = sf[::2]
     • Rather fast parallelized UDFs (interprocess shared memory): sf['rating'].apply(lambda x: x*x)
  7. Column Types Supported
     • Boring scalar types: int64, double, string
     • Interesting scalar types: datetime.datetime, image
     • For the mathematician: array('d')
     • For the "all real data is ugly" types: list, dict (arbitrary union types; e.g., a list can contain anything, including other lists and dicts)
  8. What Are SFrames
     • Physical Storage Layer: compressed column store (with some interesting properties)
     • Lazy Query Optimization / Execution: C++ coroutine exec pipeline
     • Python API: heavily Pandas inspired (+ immutable data considerations)
     • File System Abstraction: Local, HDFS, S3, Cache
     Type-aware compression methods. Very aggressive numeric compression. Netflix dataset, 99M rows, 3 columns, ints: 1.4GB raw, 289MB gzip compressed, 160MB as an SFrame.
  9. Query Planning
     • Physical Storage Layer: compressed column store (with some interesting properties)
     • Lazy Query Optimization / Execution: C++ coroutine exec pipeline
     • Python API: heavily Pandas inspired (+ immutable data considerations)
     • File System Abstraction: Local, HDFS, S3, Cache
       p['X4'] = p['X3'] + p['X2']
       g = p[p['X1'] < 10]
  10. Language Binding
     • Python bindings: our oldest binding. Via Cython + interprocess communication to a C++ binary.
     • R bindings: via our Rcpp to C++11 bindings (exported in SDK).
     • C++11 bindings:
         auto g = gl_sframe();
         g["hello"] = gl_sarray::from_sequence(0,1000);
         g["world"] = 2;
         g["hello"] = (g["hello"] / 2).astype(flex_type_enum::INTEGER);
         auto ret = g.groupby({"hello"}, {{"sum of world", aggregate::SUM("world")}});
         ret = ret.sort({"hello"});
         cout << ret;
       Output:
         Columns:
             hello           integer
             sum of world    integer
         Rows: 500
         Data:
         +-------+--------------+
         | hello | sum of world |
         +-------+--------------+
         | 0     | 4            |
         | 1     | 4            |
         | 2     | 4            |
         | 3     | 4            |
         | 4     | 4            |
         | 5     | 4            |
         | 6     | 4            |
         | 7     | 4            |
         | 8     | 4            |
         | 9     | 4            |
         +-------+--------------+
         [500 rows x 2 columns]
  11. Common Crawl Graph. 1x r3.8xlarge using 1x SSD. 3.5 billion nodes and 128 billion edges. PageRank: 9 min per iteration. Connected Components: ~1 hr. There isn’t any general purpose library out there capable of this.
  12. https://github.com/dato-code/sframe
       pip install sframe
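
The C++11 group-by example on slide 10 can be cross-checked without the SFrame package. The sketch below redoes the same arithmetic in plain Python, under two assumptions inferred from the slide's output rather than from the library documentation: that `from_sequence(0, 1000)` yields the integers 0 through 999, and that dividing by 2 and casting to integer truncates. It is a reference calculation only, not the SFrame API.

```python
from collections import defaultdict

# Assumed equivalent of from_sequence(0, 1000) followed by "/ 2" and an
# integer cast: each key 0..499 appears exactly twice.
hello = [i // 2 for i in range(1000)]
# The "world" column holds the constant 2 on every row.
world = [2] * len(hello)

# Group by "hello" and sum "world" within each group.
sums = defaultdict(int)
for key, value in zip(hello, world):
    sums[key] += value

# Sort by key, matching ret.sort({"hello"}) on the slide.
result = sorted(sums.items())
print(len(result))   # 500
print(result[:3])    # [(0, 4), (1, 4), (2, 4)]
```

Each key collects two rows with value 2, which is why every group sums to 4 and the result has 500 rows, matching the table on the slide.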