A lot of data scientists use the Python library pandas for quick exploration of data. The most useful construct in pandas (borrowed from R, I think) is the data frame, which is a 2D array (a.k.a. matrix) with the option to “name” the columns (and rows). But pandas is not distributed, so there is a limit on the size of data that can be explored.
Spark is a great map-reduce-like framework that can handle very big data by using a shared-nothing cluster of machines.
This work is an attempt to provide a pandas-like DSL on top of Spark, so that data scientists familiar with pandas have a very gradual learning curve.
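As a quick illustration of the pandas side, here is a minimal sketch of a named-column data frame (the column names and values are invented for the example):

```python
import pandas as pd

# A data frame is a 2D array whose columns (and rows) carry names.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

print(df["a"].tolist())  # select a column by name -> [1, 2, 3]
print(df.shape)          # (rows, columns) -> (3, 2)
```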
9. Day in a data scientist’s life
• Get data
• Need more/something else
• Data wrangling
• Rinse, repeat
• Load into analysis software like Ayasdi Core
• Actual data analysis, model-building etc
10. Data Wrangling Tools
• grep, cut, wc -l, head, tail
• Python Pandas
• Most useful construct: pandas data frame, à la Excel with a CLI
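The Unix one-liners above map onto pandas calls roughly as follows (a sketch with an invented toy frame, not from the slides):

```python
import pandas as pd

df = pd.DataFrame({"name": ["ann", "bob", "cal"], "score": [10, 20, 30]})

# wc -l  -> number of rows
print(len(df))

# head / tail -> first / last rows
print(df.head(2))
print(df.tail(1))

# grep -> boolean filtering; cut -> column selection
print(df[df["score"] > 10][["name"]])
```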
11. Challenges
• Applying data science techniques to data larger than a single machine’s memory
• Easier to procure a cluster of small machines than one big machine
• Processing takes too long
12. Solution: Distribute
• Hadoop ecosystem: Spark is great
• Learning curve: what is this RDD thing? Where is my familiar data frame?
• There is pyspark, but to get the best out of Spark you need Scala, which is another learning curve
13. df: Gentle Incline
“I want to put my projects on hold, and learn several new things simultaneously”
- No One Ever
• Attempts to provide an API on Spark that looks and feels like the pandas data frame
e.g. in pandas
df["a"]
in df
df("a")
• Also intuitive for R programmers
14. Advantages
• Quite transparently runs on Spark: Distributed processing
• Is in Scala: No layering overhead
• Is in Scala: Can directly call cutting-edge Spark libraries like MLlib [pyspark wrappers are usually a bit behind]
• Is an “internal DSL”: Advanced users can augment it with arbitrary Scala code [a Python wrapper is still possible]
• Is an “internal DSL”: Fast without resorting to code-generation
• Fully open sourced, Apache license
15. Real Life Examples
Snippets of data scientist code that were “converted” from pandas to df to make them scale to larger data
Add a column with total
mppu["total"] = mppu["avg"] * mppu["c_line_srvc_cnt"]
—>
mppu("total") = mppu("avg") * mppu("c_line_srvc_cnt")
Remove $ and , from numbers representing money
mppu["de-comma"] = mppu["dollar"].str.replace('$', '')
mppu["de-dollar"] = mppu["de-comma"].str.replace(',', '').astype(float)
—>
mppu("de-dollar") = mppu("dollar").map { x: String => x.replace("$", "").replace(",", "").toDouble }
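For reference, the pandas side of both slide-15 snippets runs end to end; below is a sketch with made-up sample values (the `mppu` frame and its column names come from the slides, the data is invented). `regex=False` keeps the literal `$` from being read as a regex end-of-string anchor:

```python
import pandas as pd

# Sample data; column names are from the slides, values are invented.
mppu = pd.DataFrame({
    "avg": [10.0, 20.0],
    "c_line_srvc_cnt": [3, 5],
    "dollar": ["$1,200.50", "$45.00"],
})

# Add a column with the total.
mppu["total"] = mppu["avg"] * mppu["c_line_srvc_cnt"]

# Remove $ and , from numbers representing money, then parse as float.
mppu["de-dollar"] = (mppu["dollar"]
                     .str.replace("$", "", regex=False)
                     .str.replace(",", "", regex=False)
                     .astype(float))

print(mppu["total"].tolist())      # [30.0, 100.0]
print(mppu["de-dollar"].tolist())  # [1200.5, 45.0]
```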
19. Summary
• pandas is awesome
• df scales to bigger data, looks and feels like pandas
• fully open source
https://github.com/AyasdiOpenSource/df
• Check out our website. We are hiring!
http://engineering.ayasdi.com/
http://www.ayasdi.com/careers/
20. Acknowledgements
• Max Song for introducing me to Pandas
• Jean-Ezra Young for insurance claims example
• Ayasdi for open-sourcing this work
• Hadoop and Spark communities for the awesome platform
• Pandas team for the awesome tool