Scalable data structures for data science

Scalable data structures for data science (https://github.com/dato-code/SFrame)


  1. Scalable, Out-of-Core Data Structures for Data Science. Krishna Sridhar, Data Scientist, Dato Inc. (@krishna_srd)
  2. About Me
     • Background
       - Machine Learning (ML) research.
       - Ph.D., Numerical Optimization @Wisconsin.
     • Now
       - Build ML tools for data scientists & developers @Dato.
       - Help deploy ML algorithms.
     @krishna_srd, @DatoInc
  3. Collaborators: 45+ and growing fast!
  4. Dato Architecture
     - GraphLab Create (pip install graphlab-create)
     - Scalable Machine Learning: recommenders, other task-oriented ML, boosted decision trees, deep learning, pattern mining, many others
     - SGraph, SFrame: compressed in-core or out-of-core scalable data structures
     - Local, HDFS, S3
     - C++11
  5. Dato (Open Source) Architecture
     - SGraph, SFrame: compressed in-core or out-of-core scalable data structures
     - https://github.com/dato-code/sframe
  6. Single Machine? Scalable??
  7. Yes!
  8. What can you do with a single machine?
  9. Build a Collaborative Filtering Model on 20 Billion User-Item Ratings. Do PageRank on a 128 Billion edge graph.
  10. How?
  11. Data Structures! SFrame, SGraph, TimeSeries (diagram labels: User, Com., Title, Body, User, Disc.)
  12. SFrame Python API
      Make a little SFrame of 1 column and 5 values:
      >> sf = gl.SFrame({'x': [1, 2, 3, 4, 5]})
      Normalize the column x:
      >> sf['x'] = sf['x'] / sf['x'].sum()
      Use a Python lambda to create a new column:
      >> sf['x-squared'] = sf['x'].apply(lambda x: x*x if x > 0 else 0)
      Create a new column using a vectorized operator:
      >> sf['x-cubed'] = sf['x-squared'] * sf['x']
      Create a new SFrame taking only 2 of the columns:
      >> sf2 = sf[['x', 'x-squared']]
  13. SFrame Design Principles
      Graceful Degradation as 1st principle
      - Always works
      - High performance when in-memory, scales to disk.
      Rich Datatypes
      - Strong schema types: int, double, string, image.
      - Weak schema types: list, dictionary (arbitrary JSON!)
      Columnar Architecture
      - Easy feature engineering + Vectorized feature operation
      - Immutable columns + Lazy Evaluation
      - Statistics + Sketching + Visualization
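
The "scales to disk" and "immutable columns" ideas on this slide can be illustrated with a small standard-library sketch. It is not SFrame code: the on-disk layout (packed 8-byte doubles), the tiny chunk size, and every function name here are assumptions made for the example. It normalizes a column in two streaming passes, holds only one chunk in memory at a time, and writes the result to a new file instead of modifying the source column.

```python
import os
import struct
import tempfile

CHUNK_VALUES = 4  # tiny on purpose; a real system would use far larger chunks


def write_column(path, values):
    """Store a column of floats on disk as packed little-endian doubles."""
    with open(path, "wb") as f:
        for v in values:
            f.write(struct.pack("<d", v))


def read_chunks(path, chunk_values=CHUNK_VALUES):
    """Yield the column a few values at a time, never the whole column."""
    with open(path, "rb") as f:
        while True:
            raw = f.read(8 * chunk_values)
            if not raw:
                break
            yield list(struct.unpack("<%dd" % (len(raw) // 8), raw))


def normalized_column(src, dst):
    """Two streaming passes: compute the total, then write src / total to dst."""
    total = sum(sum(chunk) for chunk in read_chunks(src))
    with open(dst, "wb") as out:
        for chunk in read_chunks(src):
            scaled = [v / total for v in chunk]
            out.write(struct.pack("<%dd" % len(scaled), *scaled))
    return total


def demo():
    """Normalize the column [1, 2, 3, 4, 5] without loading it all at once."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "x.col")        # hypothetical file names
        dst = os.path.join(tmp, "x_norm.col")
        write_column(src, [1.0, 2.0, 3.0, 4.0, 5.0])
        total = normalized_column(src, dst)
        result = [v for chunk in read_chunks(dst) for v in chunk]
    return total, result
```

Because the source file is never rewritten, the original column stays valid for any other computation that still refers to it, which is the property the "immutable columns" bullet is pointing at.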
  14. What is the SFrame?
      sf = gl.SFrame('netflix_tr.frame')
      sf2 = gl.SFrame('netflix_norm.frame')
      sf['nrating'] = sf2['rating']
      Diagram labels: netflix_tr.frame (user, movie, rating); sf (user, item, rating, nrating); netflix_norm.frame (user, movie, rating); sf2 (user, item, rating).
  15. What is the SFrame? (continued)
      diff = sf['rating'] - sf2['rating']
      Diagram labels: as on the previous slide, plus an anonymous column, diff.
  16. What is the SFrame?
      Filtering
        sf[sf['rating'] >= 3]
      Joins
        sf.join(user_table, on='user_id')
      Random/Array indexing
        row10 = sf[10]
        table_with_every_other_row = sf[::2]
      Rather Fast Parallelized UDFs (Interproc SHM)
        sf['rating'].apply(lambda x: x*x)
      Not a SQL Frontend
  17. SArray Column Types
      Boring Scalar Types
      - int64, double, string
      Interesting Scalar Types
      - Datetime, image
      Mathematician Type
      - array('d')
      Industrial Data Scientist Type
      - list, dict
  18. SFrame Architecture
      - Physical Storage Layer: compressed column store (with some interesting properties)
      - Lazy Query Optimization / Execution: C++ Coroutine Exec Pipeline
      - Python API: heavily Pandas inspired (+ immutable data considerations)
      - File System Abstraction: Local, HDFS, S3, Cache
  19. Compression!
      - Type aware compression methods.
      - Very aggressive numeric compression.
      - Netflix Dataset, 99M rows, 3 columns, ints: 1.4 GB raw, 289 MB gzip compressed, 160 MB.
      Architecture layers: Physical Storage Layer, Lazy Query Optimization / Execution, Python API, File System Abstraction.
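
The slide does not say which encodings SFrame uses, so the following is only a sketch of one common type-aware technique for integer columns: delta encoding followed by zigzag and variable-length (varint) bytes. The function names and the sample column are made up for illustration. The point is that knowing a column holds slowly changing integers lets an encoder do better than treating it as opaque bytes.

```python
import struct


def zigzag(n):
    """Map signed ints to unsigned so that small magnitudes stay small."""
    return 2 * n if n >= 0 else -2 * n - 1


def unzigzag(z):
    """Inverse of zigzag()."""
    return z // 2 if z % 2 == 0 else -(z + 1) // 2


def encode_varint(u):
    """Encode a non-negative int in 7-bit groups, low group first."""
    out = bytearray()
    while True:
        byte = u & 0x7F
        u >>= 7
        if u:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def compress(values):
    """Delta-encode an integer column, then varint-encode each delta."""
    out = bytearray()
    prev = 0
    for v in values:
        out += encode_varint(zigzag(v - prev))
        prev = v
    return bytes(out)


def decompress(data):
    """Rebuild the original column from compress() output."""
    values, prev, shift, acc = [], 0, 0, 0
    for byte in data:
        acc |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += unzigzag(acc)
            values.append(prev)
            shift, acc = 0, 0
    return values


# Made-up sorted id column: each id repeats three times.
ids = [1000000 + i // 3 for i in range(3000)]
raw = struct.pack("<%dq" % len(ids), *ids)  # fixed 8 bytes per value
packed = compress(ids)
print(len(raw), len(packed))
```

On a sorted or slowly varying column almost every delta fits in a single byte, which is where the saving over the fixed-width layout comes from.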
  20. Query Evaluation
      p['X4'] = p['X3'] + p['X2']
      g = p[p['X1'] < 10]
      Architecture layers: Physical Storage Layer, Lazy Query Optimization / Execution, Python API, File System Abstraction.
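
To make the lazy-evaluation idea behind these two statements concrete, here is a toy sketch in plain Python. `LazyColumn` is a hypothetical class written for this example, not part of SFrame, and the real engine is described on the slides as a C++ pipeline with query optimization, which this does not attempt. Each operation only records how to produce its values; nothing is computed until the result is materialized, and the addition and the filter then run as one streaming pass.

```python
class LazyColumn:
    """An immutable column described by a recipe instead of stored values."""

    def __init__(self, make_iter):
        # make_iter: zero-argument callable returning a fresh iterator
        self._make_iter = make_iter

    @classmethod
    def from_values(cls, values):
        data = tuple(values)  # frozen copy, so the column cannot change
        return cls(lambda: iter(data))

    def __iter__(self):
        return self._make_iter()

    def __add__(self, other):
        # Element-wise sum, evaluated only when someone iterates.
        return LazyColumn(lambda: (a + b for a, b in zip(self, other)))

    def __lt__(self, bound):
        # Element-wise comparison producing a lazy boolean mask.
        return LazyColumn(lambda: (a < bound for a in self))

    def filter(self, mask):
        # Keep the values whose mask entry is true.
        return LazyColumn(lambda: (a for a, keep in zip(self, mask) if keep))

    def materialize(self):
        return list(self)


x1 = LazyColumn.from_values([5, 20, 7, 30])
x2 = LazyColumn.from_values([1, 2, 3, 4])
x3 = LazyColumn.from_values([10, 20, 30, 40])

x4 = x3 + x2            # like p['X4'] = p['X3'] + p['X2']; nothing computed yet
g = x4.filter(x1 < 10)  # like g = p[p['X1'] < 10], applied to one column

print(g.materialize())  # [11, 33]
```

The slide filters a whole frame, while this sketch filters a single column to stay short; the deferred-evaluation pattern is the same.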
  21. Cross Platform?
      Python Bindings
      - Our oldest binding
      - Via Cython + interprocess communication to a C++ binary
      R Bindings
      - Via Rcpp
      - In Beta. Soon to be released.
      C++ Bindings
      - Used for internal development of
      Julia Bindings
      - "Hackathon" mock project mature
  22. SGraph: Common Crawl
      - 3.5 billion Nodes and 128 billion Edges
      - 1x r3.8xlarge using 1x SSD.
      - PageRank: 9 min per iteration.
      - Connected Components: ~1 hr.
      There isn't any general purpose library out there capable of this.
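
For readers unfamiliar with what one PageRank iteration involves, the following is a minimal in-memory sketch of the standard power-iteration formulation. It is not the SGraph implementation and does not use its API; the function name, the damping factor of 0.85, and the toy edge list are assumptions for illustration. Each iteration is a single sequential pass over the edge list, an access pattern that plausibly suits streaming a large graph from disk.

```python
def pagerank(edges, num_iters=20, damping=0.85):
    """Power-iteration PageRank over an iterable of (src, dst) pairs."""
    edges = list(edges)
    nodes = sorted({n for edge in edges for n in edge})
    out_degree = {n: 0 for n in nodes}
    for src, _ in edges:
        out_degree[src] += 1

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(num_iters):
        # Rank held by nodes with no out-edges is spread evenly over all nodes.
        dangling = sum(rank[n] for n in nodes if out_degree[n] == 0)
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new_rank = {n: base for n in nodes}
        for src, dst in edges:  # one sequential pass over the edge list
            new_rank[dst] += damping * rank[src] / out_degree[src]
        rank = new_rank
    return rank


# Toy graph: a three-node cycle plus one extra node pointing into it.
ranks = pagerank([("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")])
for node in sorted(ranks, key=ranks.get, reverse=True):
    print(node, round(ranks[node], 4))
```

The ranks always sum to 1, and node "d", which nothing links to, ends up with the lowest score.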
  23. Time Series!
  24. Applications
      - Log data mining.
      - Sensor data mining.
      - Churn Prediction.
      - Transactional data processing.
      - Financial data.
  25. Log Data Mining
  26. Log Data Mining
  27. Data Structures! SFrame, SGraph, TimeSeries (diagram labels: User, Com., Title, Body, User, Disc.)
  28. Demo!
  29. Thanks!
      - https://github.com/dato-code/sframe
      - pip install sframe
      - pip install graphlab-create