A lot of data scientists use the Python library pandas for quick exploration of data. The most useful construct in pandas (borrowed from R, I think) is the data frame, which is a 2D array (a.k.a. matrix) with the option to “name” the columns (and rows). But pandas is not distributed, so there is a limit on the size of data that can be explored.
Spark is a great map-reduce-like framework that can handle very big data by using a shared-nothing cluster of machines.
This work is an attempt to provide a pandas-like DSL on top of Spark, so that data scientists familiar with pandas have a very gradual learning curve.
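As a quick illustration of the pandas side, here is a minimal sketch of a named-column data frame (the column names and values are invented for the example):

```python
import pandas as pd

# A data frame is a 2D array whose columns (and rows) carry names.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

print(df["a"].tolist())  # select a column by name -> [1, 2, 3]
print(df.shape)          # (rows, columns) -> (3, 2)
```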
9. Day in a data scientist’s life
• Get data
• Need more/something else
• Data wrangling
• Rinse, repeat
• Load into analysis software like Ayasdi Core
• Actual data analysis, model-building etc
10. Data Wrangling Tools
• grep, cut, wc -l, head, tail
• Python Pandas
• Most useful construct: pandas data frame, à la Excel with a CLI
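The Unix one-liners above map onto pandas calls roughly as follows (a sketch with an invented toy frame, not from the slides):

```python
import pandas as pd

df = pd.DataFrame({"name": ["ann", "bob", "cal"], "score": [10, 20, 30]})

# wc -l  -> number of rows
print(len(df))

# head / tail -> first / last rows
print(df.head(2))
print(df.tail(1))

# grep -> boolean filtering; cut -> column selection
print(df[df["score"] > 10][["name"]])
```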
11. Challenges
• Applying data science techniques to data larger than a single machine’s memory
• Easier to procure a cluster of small machines than one big machine
• Processing takes too long
12. Solution: Distribute
• Hadoop ecosystem: Spark is great
• Learning curve: what is this RDD thing? Where is my familiar data frame?
• There is pyspark, but to get the best out of Spark you need Scala, which is another learning curve
13. df: Gentle Incline
“I want to put my projects on hold, and learn several new things simultaneously”
- No One Ever
• Attempts to provide an API on Spark that looks and feels like the pandas data frame
e.g. in pandas
df["a"]
in df
df("a")
• Also intuitive for R programmers
14. Advantages
• Quite transparently runs on Spark: Distributed processing
• Is in Scala: No layering overhead
• Is in Scala: Can directly call cutting-edge Spark libraries like MLlib [pyspark wrappers are usually a bit behind]
• Is an “internal DSL”: Advanced users can augment it with arbitrary Scala code [a Python wrapper is still possible]
• Is an “internal DSL”: Fast without resorting to code-generation
• Fully open sourced, Apache license
15. Real Life Examples
Snippets of data scientist code that were “converted” from pandas to df to make them scale to larger data
Add a column with total
mppu["total"] = mppu["avg"] * mppu["c_line_srvc_cnt"]
—>
mppu("total") = mppu("avg") * mppu("c_line_srvc_cnt")
Remove $ and , from numbers representing money
mppu["de-comma"] = mppu["dollar"].str.replace('$', '')
mppu["de-dollar"] = mppu["de-comma"].str.replace(',', '').astype(float)
—>
mppu("de-dollar") = mppu("dollar").map { x: String => x.replace("$", "").replace(",", "").toDouble }
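For reference, the pandas side of both slide-15 snippets runs end to end; below is a sketch with made-up sample values (the `mppu` frame and its column names come from the slides, the data is invented). `regex=False` keeps the literal `$` from being read as a regex end-of-string anchor:

```python
import pandas as pd

# Sample data; column names are from the slides, values are invented.
mppu = pd.DataFrame({
    "avg": [10.0, 20.0],
    "c_line_srvc_cnt": [3, 5],
    "dollar": ["$1,200.50", "$45.00"],
})

# Add a column with the total.
mppu["total"] = mppu["avg"] * mppu["c_line_srvc_cnt"]

# Remove $ and , from numbers representing money, then parse as float.
mppu["de-dollar"] = (mppu["dollar"]
                     .str.replace("$", "", regex=False)
                     .str.replace(",", "", regex=False)
                     .astype(float))

print(mppu["total"].tolist())      # [30.0, 100.0]
print(mppu["de-dollar"].tolist())  # [1200.5, 45.0]
```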
19. Summary
• pandas is awesome
• df scales to bigger data, looks and feels like pandas
• fully open source
https://github.com/AyasdiOpenSource/df
• Check out our website. We are hiring!
http://engineering.ayasdi.com/
http://www.ayasdi.com/careers/
20. Acknowledgements
• Max Song for introducing me to Pandas
• Jean-Ezra Young for insurance claims example
• Ayasdi for open-sourcing this work
• Hadoop and Spark communities for the awesome platform
• Pandas team for the awesome tool