Scalable data structures for data science

Scalable, Out-of-Core Data
Structures for Data Science
Krishna Sridhar
Data Scientist, Dato Inc.
@krishna_srd

About Me
• Background
- Machine Learning (ML) Research.
- Ph.D. in Numerical Optimization @Wisconsin
• Now
- Build ML tools for data scientists & developers @Dato.
- Help deploy ML algorithms.
@krishna_srd, @DatoInc

Collaborators
45+ and growing fast!

Dato Architecture
Scalable Machine Learning: recommenders, other task-oriented ML, boosted decision trees, deep learning, pattern mining, many others
GraphLab Create
SGraph, SFrame
Compressed in-core or out-of-core scalable data structures
Local, HDFS, S3
C++11
pip install graphlab-create

Dato (Open Source) Architecture
SGraph, SFrame
Compressed in-core or out-of-core scalable data structures
https://github.com/dato-code/sframe

What can you do with a
single machine?

Build a Collaborative Filtering Model on 20 Billion
User-Item Ratings
Do PageRank on a 128 Billion edge graph.

Data Structures!
SFrame, SGraph, TimeSeries
(Figure labels: User, Com., Title, Body, User, Disc.)

SFrame Python API
Import GraphLab Create under the alias used on these slides:
>> import graphlab as gl
Make a little SFrame of 1 column and 5 values:
>> sf = gl.SFrame({'x':[1,2,3,4,5]})
Normalize the column x:
>> sf['x'] = sf['x'] / sf['x'].sum()
Use a Python lambda to create a new column:
>> sf['x-squared'] = sf['x'].apply(lambda x: x*x if x > 0 else 0)
Create a new column using a vectorized operator:
>> sf['x-cubed'] = sf['x-squared'] * sf['x']
Create a new SFrame taking only 2 of the columns:
>> sf2 = sf[['x','x-squared']]

SFrame Design Principles
Graceful Degradation as 1st principle
- Always works
- High performance when in-memory, scales to disk.
Rich Datatypes
- Strong schema types: int, double, string, image.
- Weak schema types: list, dictionary (arbitrary JSON!)
Columnar Architecture
- Easy feature engineering + Vectorized feature operations
- Immutable columns + Lazy Evaluation
- Statistics + Sketching + Visualization
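The "scales to disk" principle can be illustrated with a toy, standard-library-only sketch: a column is written to disk in fixed-size chunks and processed one chunk at a time, so memory use is bounded by the chunk size rather than the column length. The class name `ChunkedColumn` and the pickle-based file layout are assumptions made for this illustration; they are not SFrame's actual storage format.

```python
import os
import pickle
import tempfile


class ChunkedColumn:
    """Toy immutable column stored on disk as a sequence of pickled chunks."""

    def __init__(self, values, chunk_size=1000):
        self._dir = tempfile.mkdtemp(prefix="toy_column_")
        self._paths = []
        chunk = []
        for v in values:  # accepts any iterable, including generators
            chunk.append(v)
            if len(chunk) == chunk_size:
                self._flush(chunk)
                chunk = []
        if chunk:
            self._flush(chunk)

    def _flush(self, chunk):
        path = os.path.join(self._dir, "chunk_%06d.pkl" % len(self._paths))
        with open(path, "wb") as f:
            pickle.dump(chunk, f)
        self._paths.append(path)

    def chunks(self):
        """Yield one chunk at a time; only one chunk is loaded at once."""
        for path in self._paths:
            with open(path, "rb") as f:
                yield pickle.load(f)

    def sum(self):
        return sum(sum(chunk) for chunk in self.chunks())

    def apply(self, fn, chunk_size=1000):
        """Return a new column; the original column is never modified."""
        return ChunkedColumn(
            (fn(v) for chunk in self.chunks() for v in chunk),
            chunk_size=chunk_size,
        )


col = ChunkedColumn(range(1, 10001), chunk_size=1000)
squared = col.apply(lambda x: x * x)
total = col.sum()
total_squared = squared.sum()
```

The same code path handles a column of ten values or ten billion; only the number of chunk files changes, which is one way to read "always works".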

What is the SFrame?
sf = gl.SFrame('netflix_tr.frame')
- File netflix_tr.frame: columns user, movie, rating
- sf: columns user, item, rating
sf2 = gl.SFrame('netflix_norm.frame')
- File netflix_norm.frame: columns user, movie, rating
- sf2: columns user, item, rating
sf['nrating'] = sf2['rating']
- sf gains the column nrating
diff = sf['rating'] - sf2['rating']
- diff: anonymous column
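One reading of this slide is that a frame is a set of named references to immutable columns, so assigning a column from another frame shares it rather than copying it. The sketch below models that idea in plain Python. `ToyFrame` and `subtract` are hypothetical names used only for illustration, and the values are made up; this is not the SFrame implementation.

```python
class ToyFrame:
    """Toy frame: a mapping from column name to an immutable column (tuple)."""

    def __init__(self, columns=None):
        self._columns = dict(columns or {})

    def __getitem__(self, name):
        return self._columns[name]

    def __setitem__(self, name, column):
        # Store a reference. Tuples are immutable, so sharing is safe
        # and no element needs to be copied.
        if not isinstance(column, tuple):
            column = tuple(column)
        self._columns[name] = column

    def column_names(self):
        return list(self._columns)


def subtract(a, b):
    """Element-wise difference producing a new, unnamed column."""
    return tuple(x - y for x, y in zip(a, b))


sf = ToyFrame({"user": (1, 2, 3), "item": (10, 20, 30), "rating": (5, 3, 4)})
sf2 = ToyFrame({"user": (1, 2, 3), "item": (10, 20, 30), "rating": (4.5, 3.0, 3.5)})

sf["nrating"] = sf2["rating"]                 # shares sf2's column by reference
diff = subtract(sf["rating"], sf2["rating"])  # a column that belongs to no frame
```

Because columns never change after creation, the two frames can point at the same data without either one observing surprises from the other.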

What is the SFrame?
Filtering
sf[sf['rating'] >= 3]
Joins
sf.join(user_table, on='user_id')
Random/Array indexing
row10 = sf[10]
table_with_every_other_row = sf[::2]
Rather Fast Parallelized UDFs (Interproc SHM)
sf['rating'].apply(lambda x: x*x)
Not a SQL Frontend

SArray Column Types
Boring Scalar Types
- int64, double, string
Interesting Scalar Types
- Datetime, image
Mathematician Type
- array('d')
Industrial Data Scientist Type
- list, dict
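For readers unfamiliar with the notation, `array('d')` presumably refers to the type code `'d'` (C double) of Python's standard `array` module. A small sketch of how the "mathematician" and "industrial data scientist" values differ in plain Python follows; the sample values are made up for illustration.

```python
from array import array

# A fixed-type vector of C doubles: compact and homogeneous,
# a natural fit for numeric feature vectors.
features = array("d", [0.5, 1.25, 3.0])

# Loosely typed values: a list may mix types, and a dict can hold
# an arbitrary JSON-like record.
tags = ["action", 1999, 4.5]
record = {"title": "example", "genres": ["action", "sci-fi"]}

bytes_per_value = features.itemsize
```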

SFrame Architecture
Python API
- Heavily Pandas inspired (+ immutable data considerations)
Lazy Query Optimization / Execution
- C++ Coroutine Exec Pipeline
Physical Storage Layer
- Compressed Column Store (with some interesting properties)
File System Abstraction
- Local, HDFS, S3
- Cache

Compression!
Type-aware compression methods. Very aggressive numeric compression.
Netflix Dataset: 99M rows, 3 columns, ints
- 1.4GB raw
- 289MB gzip compressed
- 160MB
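To make "type-aware" concrete, here is a minimal sketch of one well-known integer technique: delta encoding followed by zigzag and base-128 varint encoding, so that slowly changing integer columns need about one byte per value. This is an assumed stand-in chosen for illustration; the slides do not say which codecs SFrame uses, and the sample column is synthetic.

```python
import struct
import zlib


def zigzag(n):
    """Map signed ints to unsigned so small magnitudes stay small."""
    return (n << 1) if n >= 0 else ((-n << 1) - 1)


def unzigzag(z):
    return (z >> 1) if (z & 1) == 0 else -((z + 1) >> 1)


def encode_varint(z, out):
    """Append z to the bytearray as a little-endian base-128 varint."""
    while z >= 0x80:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)


def compress_ints(values):
    """Delta-encode, then zigzag + varint each difference."""
    out = bytearray()
    prev = 0
    for v in values:
        encode_varint(zigzag(v - prev), out)
        prev = v
    return bytes(out)


def decompress_ints(data):
    values, prev, z, shift = [], 0, 0, 0
    for byte in data:
        z |= (byte & 0x7F) << shift
        if byte & 0x80:          # continuation bit set: more bytes follow
            shift += 7
        else:                    # last byte of this varint
            prev += unzigzag(z)
            values.append(prev)
            z, shift = 0, 0
    return values


# Synthetic "user id" column: sorted ids, 50 rows per user.
user_ids = [i // 50 for i in range(100000)]
raw_int64 = struct.pack("<%dq" % len(user_ids), *user_ids)
type_aware = compress_ints(user_ids)

sizes = {
    "raw int64": len(raw_int64),
    "zlib on raw": len(zlib.compress(raw_int64)),
    "delta + varint": len(type_aware),
    "zlib on delta + varint": len(zlib.compress(type_aware)),
}
```

Inspecting `sizes` shows how a type-aware transform and a generic compressor can be combined; the actual ratios depend heavily on the data.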

Query Evaluation
p['X4'] = p['X3'] + p['X2']
g = p[p['X1'] < 10]
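The two statements above are the kind of query that lazy evaluation can defer and fuse. As a rough sketch of the idea, the toy `LazyColumn` below records a recipe instead of computing values, and a single pass over the inputs happens only when the result is materialized. `LazyColumn`, `filter_by`, and `counted` are hypothetical names; the toy filters a single column rather than a whole frame, and SFrame's real engine is a C++ pipeline that this does not attempt to reproduce.

```python
import operator


class LazyColumn:
    """Toy lazy column: holds a recipe (a zero-argument function that
    returns an iterator) instead of materialized values."""

    def __init__(self, make_iter):
        self._make_iter = make_iter

    @classmethod
    def from_values(cls, values):
        data = tuple(values)
        return cls(lambda: iter(data))

    def _binary(self, other, op):
        if isinstance(other, LazyColumn):
            return LazyColumn(
                lambda: map(op, self._make_iter(), other._make_iter())
            )
        return LazyColumn(lambda: (op(x, other) for x in self._make_iter()))

    def __add__(self, other):
        return self._binary(other, operator.add)

    def __lt__(self, other):
        return self._binary(other, operator.lt)

    def filter_by(self, mask):
        return LazyColumn(
            lambda: (
                x
                for x, keep in zip(self._make_iter(), mask._make_iter())
                if keep
            )
        )

    def materialize(self):
        return list(self._make_iter())


reads = []  # records every value actually read from the X1 source


def counted(values):
    data = tuple(values)

    def make_iter():
        for v in data:
            reads.append(v)
            yield v

    return LazyColumn(make_iter)


x1 = counted([5, 12, 7, 30])
x2 = LazyColumn.from_values([1, 2, 3, 4])
x3 = LazyColumn.from_values([10, 20, 30, 40])

x4 = x3 + x2                   # builds a plan, computes nothing
g = x4.filter_by(x1 < 10)      # still nothing computed
reads_before = len(reads)
result = g.materialize()       # one fused pass over the inputs
reads_after = len(reads)
```

No intermediate column for `x4` or for the mask is ever stored; each row flows through the addition and the filter in one pass.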

Cross Platform?
Python Bindings
- Our oldest binding
- Via Cython + interprocess communication to a C++ binary
R Bindings
- Via Rcpp
- In Beta. Soon to be released.
C++ Bindings
- Used for internal development of
Julia Bindings
- “Hackathon” mock project mature

SGraph: Common Crawl
3.5 billion nodes and 128 billion edges
1x r3.8xlarge using 1x SSD.
PageRank: 9 min per iteration.
Connected Components: ~1 hr.
There isn't any general purpose library out there capable of this.
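For reference, a PageRank iteration only needs per-node state in memory while the edges are streamed, which is what makes a disk-backed edge list workable. The following is a minimal textbook power-iteration sketch under that assumption, not SGraph's implementation; the `edges` callable, the damping factor of 0.85, and the four-node example graph are illustrative choices.

```python
def pagerank(num_nodes, edges, damping=0.85, iterations=20):
    """Power-iteration PageRank over an edge list.

    `edges` is a zero-argument callable returning a fresh iterator of
    (src, dst) pairs, so every pass can stream the edges from disk.
    """
    out_degree = [0] * num_nodes
    for src, _ in edges():
        out_degree[src] += 1

    rank = [1.0 / num_nodes] * num_nodes
    for _ in range(iterations):
        # Rank held by nodes without out-edges is spread uniformly.
        dangling = sum(r for r, d in zip(rank, out_degree) if d == 0)
        base = (1.0 - damping) / num_nodes + damping * dangling / num_nodes
        new_rank = [base] * num_nodes
        for src, dst in edges():  # one sequential pass over all edges
            new_rank[dst] += damping * rank[src] / out_degree[src]
        rank = new_rank
    return rank


edge_list = [(0, 1), (1, 2), (2, 0), (3, 0)]
ranks = pagerank(4, lambda: iter(edge_list))
```

In an out-of-core setting the lambda would be replaced by a function that reopens the edge file and yields pairs, leaving the rest of the loop unchanged.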

Applications
- Log data mining.
- Sensor data mining.
- Churn Prediction.
- Transactional data processing.
- Financial data.

Thanks!
https://github.com/dato-code/sframe
pip install sframe
pip install graphlab-create
