Introduction to df

Many data scientists use the Python library pandas for quick data exploration. The most useful construct in pandas is the DataFrame (modeled on R's data.frame), a 2D array (aka matrix) whose columns and rows can optionally be named. But pandas is not distributed, so there is a limit on the size of the data that can be explored. Spark is a map-reduce-like framework that handles very big data on a shared-nothing cluster of machines. This work is an attempt to provide a pandas-like DSL on top of Spark, so that data scientists familiar with pandas face only a gradual learning curve.

df: Dataframe on Spark
Mohit Jaggi
Code Ninja and Troublemaker
at

Agenda
• About Ayasdi
• df
• Brief Demo (if time allows)
• Conclusion

Ayasdi Solution
UX
Ayasdi Platform
Distributed Computing Algorithmic Reach
ETL

Day in a data scientist’s life
• Get data
• Need more/something else
• Data wrangling
• Rinse, repeat
• Load into analysis software like Ayasdi Core
• Actual data analysis, model-building etc

Data Wrangling Tools
• grep, cut, wc -l, head, tail
• Python Pandas
• Most useful construct: the pandas data frame, à la Excel with a CLI

Challenges
• Applying data science techniques to data larger
than a single machine’s memory
• Easier to procure a cluster of small machines than one
big machine
• Processing takes too long

Solution: Distribute
• Hadoop ecosystem: Spark is great
• Learning curve: what is this RDD thing? Where is
my familiar data frame?
• There is PySpark, but to get the best out of Spark
you use Scala: another learning curve

df: Gentle Incline
“I want to put my projects on hold, and learn several new things simultaneously”
- No One Ever
• Attempts to provide an API on Spark that looks and feels like
a pandas data frame
e.g. in pandas: df["a"]
in df: df("a")
• Also intuitive for R programmers
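
The df("a") syntax works in Scala because the language rewrites obj(x) into obj.apply(x) and obj(x) = y into obj.update(x, y). The following is only a minimal sketch of that mechanism, not the df library's implementation: ToyFrame and columnNames are hypothetical names, and the columns are held in local memory rather than distributed on Spark. It can be pasted into a Scala REPL or run as a script.

```scala
import scala.collection.mutable

// Hypothetical toy class, NOT the df library: columns live in local memory,
// whereas df keeps its data distributed on Spark.
final class ToyFrame {
  private val columns = mutable.LinkedHashMap.empty[String, Vector[Double]]

  // frame("a") is rewritten by the compiler to frame.apply("a")
  def apply(name: String): Vector[Double] = columns(name)

  // frame("a") = values is rewritten to frame.update("a", values)
  def update(name: String, values: Vector[Double]): Unit =
    columns(name) = values

  // Column names in insertion order
  def columnNames: List[String] = columns.keys.toList
}

val frame = new ToyFrame
frame("a") = Vector(1.0, 2.0, 3.0) // pandas: df["a"] = [1.0, 2.0, 3.0]
val a = frame("a")                 // pandas: df["a"]
```

Because apply and update are ordinary methods, this kind of syntax needs no parser or code generator, which is what makes an "internal DSL" possible.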

Advantages
• Quite transparently runs on Spark: Distributed processing
• Is in Scala: No layering overhead
• Is in Scala: Can directly call cutting-edge Spark libraries like
MLlib [PySpark wrappers are usually a bit behind]
• Is an “internal DSL”: Advanced users can augment with
arbitrary Scala code. [Python wrapper still possible]
• Is an “internal DSL”: Fast without resorting to code-generation
• Fully open sourced, Apache license

Real Life Examples
Snippets of data scientist code that was “converted” from Pandas to df to make it scale to larger data
• Add a column with total
mppu["total"] = mppu["avg"] * mppu['c_line_srvc_cnt']
->
mppu("total") = mppu("avg") * mppu("c_line_srvc_cnt")
• Remove $ and , from numbers representing money
mppu["de-comma"] = mppu["dollar"].str.replace('$','')
mppu["de-dollar"] = mppu["de-comma"].str.replace(',','').astype(float)
->
mppu("de-dollar") = mppu("dollar").map { x: String => x.replace("$", "").replace(",","").toDouble }
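
To see what the two conversions above compute, the same logic can be tried on plain Scala collections, without Spark or df. This is a sketch under stated assumptions: parseMoney is a hypothetical helper name, and the sample values are made up for illustration. The body of parseMoney is the function passed to map on the slide. Scala's String.replace substitutes literal text, so "$" is not treated as a regular expression.

```scala
// Strip "$" and "," from a money string and parse it as a Double.
// String.replace does literal (non-regex) replacement.
def parseMoney(x: String): Double =
  x.replace("$", "").replace(",", "").toDouble

// Made-up sample values standing in for the "dollar" column.
val dollar = Vector("$1,234.50", "$20", "$3,000,000.25")
val deDollar = dollar.map(parseMoney)

// Element-wise product of two columns: the local analogue of
// mppu("avg") * mppu("c_line_srvc_cnt").
val avg = Vector(10.0, 2.5)
val serviceCount = Vector(3.0, 4.0)
val total = avg.zip(serviceCount).map { case (x, y) => x * y }
```

In df the same function runs over a column that is distributed across the cluster; here it runs over a local Vector, which is enough to check the parsing logic before scaling up.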

Future
• pyspark wrapper
• more data sources like SQL, parquet, HDF5 etc
• charts and graphs
• contributors welcome!

Summary
• pandas is awesome
• df scales to bigger data, looks and feels like pandas
• fully open source
https://github.com/AyasdiOpenSource/df
• Check out our website. We are hiring!
http://engineering.ayasdi.com/
http://www.ayasdi.com/careers/

Acknowledgements
• Max Song for introducing me to Pandas
• Jean-Ezra Young for insurance claims example
• Ayasdi for open-sourcing this work
• Hadoop and Spark communities for the awesome
platform
• Pandas team for the awesome tool
