8thlight.com 8thLightInc 8thLightInc /company/8th-light
8th Light miniConf
Dataframes Showdown
Hana Lee
WHAT ARE THEY?
● Spreadsheet on steroids
● Useful abstraction for data manipulation
● In-memory object
● R, Python library pandas, Apache Spark
WHY NOT SQL?
● Readability
● Exploratory analysis
● Denormalized, “wide” data sets
● Integration with C and Fortran libraries for fast
vector and matrix operations
Dataframes
PYTHON DATA STACK
● numpy : n-dimensional arrays
● scipy : statistical and mathematical algorithms
● pandas : dataframes
● jupyter : notebooks
The rise of Python
COMMON FRUSTRATIONS
● Python dependency management and packaging
● pandas is a notorious memory hog
● Concurrency woes
● Slower than a compiled language
The pitfalls of Python
8th Light
Software is our craft. 8thlight.com
Hypothesis
Is there a viable alternative?
APACHE ARROW
● Columnar memory format
● Language-agnostic
● Rich type system
● Works well with Parquet
A new hope
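To illustrate what a columnar memory format means, here is a minimal pure-Python sketch contrasting row-oriented and column-oriented storage. The field names (repo, stars, forks) are made up for illustration, and the stdlib array module only stands in for a contiguous typed buffer; it is not the Arrow format itself.

```python
from array import array

# Row-oriented: one object per record, so a column's values are scattered.
rows = [
    {"repo": "a/x", "stars": 10, "forks": 2},
    {"repo": "b/y", "stars": 25, "forks": 7},
    {"repo": "c/z", "stars": 5, "forks": 1},
]

# Column-oriented: one contiguous, typed buffer per numeric column.
columns = {
    "repo": [r["repo"] for r in rows],
    "stars": array("q", (r["stars"] for r in rows)),  # signed 64-bit integers
    "forks": array("q", (r["forks"] for r in rows)),
}

# Aggregating a column scans a single buffer instead of visiting every record.
total_stars_rows = sum(r["stars"] for r in rows)
total_stars_cols = sum(columns["stars"])

print(total_stars_rows, total_stars_cols)
print(columns["stars"].itemsize * len(columns["stars"]), "bytes for the stars column")
```

Both totals agree; the difference is only in how the values are laid out, which is what makes column scans and cross-language sharing cheap in a columnar format.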
ADVANTAGES
● Compiled language with explicit memory
management
● Modern language features and syntax
● cargo : ergonomic build and dependency
management tool
CRATES
● datafusion : query engine for Arrow with
dataframe API
● evcxr : Jupyter kernel for Rust
Why Rust?
Experiment
DataFusion vs pandas
SAMPLE DATA SET
● GitHub data set from Google BigQuery
● 199 columns x 6,219,749 rows
DATAFRAMES SHOWDOWN | EXPERIMENT
$ du -h github-timeline-sample.csv
4.3G github-timeline-sample.csv
METHODOLOGY
● Load CSV into dataframe object
● Crude use of GNU time (gtime on Mac OS X) to profile execution
● Caveats:
○ Overhead of Python interpreter and dependencies
○ Measuring memory usage is complicated!
○ Low replication
● Repository: https://github.com/hnlee/dataframe-benchmarking
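The measurement approach above can be approximated from Python itself. This is a minimal sketch, not the script from the repository: it assumes a Unix-like system (the resource module is not available on Windows), and the child command is a placeholder workload rather than the real CSV loader.

```python
import resource
import subprocess
import sys
import time


def profile(cmd):
    """Run cmd as a child process and return (elapsed seconds, peak RSS)."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    elapsed = time.perf_counter() - start
    # ru_maxrss is the largest peak resident set size among the children
    # waited for so far: kilobytes on Linux, bytes on macOS.
    max_rss = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return elapsed, max_rss


if __name__ == "__main__":
    # Placeholder workload; swap in the real loader command.
    cmd = [sys.executable, "-c", "data = list(range(1_000_000))"]
    elapsed, max_rss = profile(cmd)
    print(f"Elapsed time: {elapsed:.2f} s")
    print(f"Max resident set size: {max_rss}")
```

Because RUSAGE_CHILDREN reports a maximum over all finished children, each command is best profiled in a fresh parent process. Repeating the run several times would also address the low-replication caveat.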
RESULTS
$ ./compare.sh
DataFusion
Elapsed time: 0.36 s
Max resident set size: 8292 Kb
Percent CPU: 99%
Pandas
Elapsed time: 266.41 s
Max resident set size: 17728672 Kb
Percent CPU: 91%
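To put the two runs side by side, the ratios can be worked out directly from the figures above. This sketch assumes the "Kb" values are units of 1024 bytes, as GNU time reports them; the variable names are illustrative.

```python
# Figures copied from the compare.sh output.
datafusion = {"elapsed_s": 0.36, "max_rss_kb": 8292}
pandas_run = {"elapsed_s": 266.41, "max_rss_kb": 17728672}

time_ratio = pandas_run["elapsed_s"] / datafusion["elapsed_s"]
rss_ratio = pandas_run["max_rss_kb"] / datafusion["max_rss_kb"]
pandas_rss_gib = pandas_run["max_rss_kb"] / 1024**2  # assumes 1 Kb = 1024 bytes

print(f"pandas took about {time_ratio:.0f}x as long")
print(f"pandas peak RSS was about {rss_ratio:.0f}x larger ({pandas_rss_gib:.1f} GiB)")
```

A peak of roughly 8 MB is far below the 4.3G file size, so the DataFusion run cannot have held the whole CSV in memory at once. The two runs may therefore not be doing equivalent work, which fits the caveats listed in the methodology.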
RESULTS
What pandas says about the DataFrame object in memory…
> dataframe.info(memory_usage="deep")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6219749 entries, 0 to 6219748
Columns: 199 entries, repository_url to type
dtypes: float64(43), object(156)
memory usage: 37.1 GB
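One likely contributor to that figure is the 156 object columns, where every cell is a reference to a separate Python object. The sketch below is a rough illustration using only the standard library: the sample strings are made up, and the 8-byte reference size assumes a 64-bit CPython build.

```python
import sys

# Made-up values standing in for cells of a string column.
values = ["https://github.com/example/repo", "PushEvent", "WatchEvent", "en", ""]

raw_bytes = sum(len(v.encode("utf-8")) for v in values)  # roughly what a CSV stores
object_bytes = sum(sys.getsizeof(v) for v in values)     # size of the Python str objects
pointer_bytes = 8 * len(values)                          # one reference per cell (64-bit)

print(f"raw text:       {raw_bytes} bytes")
print(f"as str objects: {object_bytes + pointer_bytes} bytes")

# Scale of the object columns in the sample data set (figures from the slides).
cells = 156 * 6_219_749
print(f"{cells:,} object cells, {cells * 8 / 1e9:.2f} GB for the references alone")
```

Each str object carries a fixed header on top of its characters, so short strings cost several times their raw size once loaded into an object column.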
FURTHER OBSERVATION
What’s the Jupyter notebook experience like with Rust and DataFusion?
● Async complexity
● Notebook cells vs scope boundaries
● Lack of pretty-printing for output
● Some support for plots:
○ https://github.com/igiagkiozis/plotly
○ https://github.com/procyon-rs/vega_lite_3.rs
NEXT STEPS
● Better controlled performance measurements
● Compare common operations
○ Detecting duplicates
○ Counting null values
○ Calculating mean, max, min, quartiles
○ Filtering
What next?
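For reference, the operations listed above look roughly like this on the pandas side. This is a sketch on a tiny made-up frame: the column names are illustrative and not taken from the GitHub data set, and the DataFusion equivalents are not shown.

```python
import pandas as pd

# Tiny made-up frame with one duplicate row and one null value.
df = pd.DataFrame(
    {
        "repo": ["a/x", "b/y", "b/y", "c/z"],
        "stars": [10.0, 25.0, 25.0, None],
    }
)

duplicate_rows = int(df.duplicated().sum())  # rows identical to an earlier row
null_counts = df.isna().sum()                # null count per column
summary = df["stars"].describe()             # count, mean, std, min, quartiles, max
popular = df[df["stars"] > 10]               # boolean-mask filter

print(duplicate_rows)
print(null_counts)
print(summary)
print(popular)
```

Timing each of these separately on the full data set, with several repetitions, would give a fairer comparison than the load step alone.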
Q + A
REFERENCES
● DataFusion
○ https://github.com/apache/arrow-datafusion
○ https://docs.rs/datafusion/latest/datafusion
○ https://arrow.apache.org/datafusion/
● EvCxR
○ https://github.com/google/evcxr/blob/main/evcxr_jupyter/README.md
○ https://github.com/google/evcxr/blob/main/evcxr_jupyter/samples/evcxr_jupyter_tour.ipynb
Thank you!

Dataframes Showdown (miniConf 2022)
