DASK and Apache Spark

For a Python-driven data science team, DASK presents an obvious logical next step for distributed analysis. However, today the de facto standard choice for the exact same purpose is Apache Spark. DASK is a pure Python framework that does more of the same, i.e., it allows one to run the same Pandas or NumPy code either locally or on a cluster. Apache Spark, on the other hand, brings a learning curve involving a new API and execution model, albeit with a Python wrapper. Given the above, do we even need to compare and contrast to make a choice? Shouldn't DASK be the default choice? Well, that's what this session is about: it explains in detail the various viewpoints and dimensions that need to be considered in order to pick one over the other.

  Slide 1: WiFi SSID: SparkAISummit | Password: UnifiedAnalytics
  Slide 2: DASK and Apache Spark. Gurpreet Singh, Microsoft.
  Slide 3: …this talk is also about scaling Python for data analysis and machine learning! We'll start by briefly reviewing the scalability challenge in the PyData stack before comparing and contrasting the two frameworks.
  Slide 4: Pandas & Scikit-learn: Options to Scale
  - The GIL challenge
  - Multiprocessing
  - concurrent.futures
  - Joblib
  - partial_fit()
  - The hashing trick
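  As a minimal sketch of two of these options used together, here is out-of-core learning in plain scikit-learn, combining partial_fit() with the hashing trick. The read_batches() iterator and the binary label set are hypothetical stand-ins for a real data source.

    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import SGDClassifier

    vectorizer = HashingVectorizer(n_features=2**18)   # stateless, so no fitting pass over the data is needed
    clf = SGDClassifier()

    for texts, labels in read_batches():               # hypothetical batch iterator over a large dataset
        X = vectorizer.transform(texts)                # hashing trick: fixed-width features per batch
        clf.partial_fit(X, labels, classes=[0, 1])     # incremental update, one batch at a time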
  Slide 5: Design Approach
  Apache Spark:
  - High-level APIs: Spark SQL / DataFrames, Spark Streaming, Spark MLlib, GraphFrames (GraphX doesn't support Python)
  - Low-level API: RDD
  - Task scheduler: Local, Spark Standalone, Mesos, YARN
  DASK:
  - High-level APIs: DASK Arrays [parallel NumPy], DASK DataFrames [parallel Pandas], DASK Bag [parallel lists], DASK-ML [parallel Scikit-learn]
  - Low-level APIs for custom algorithms: DASK Delayed, DASK Futures; custom graphs can be submitted as a plain Python dictionary object
  - Pluggable task scheduling system: Synchronous, Threaded, Multiprocessing, Local Scheduler, Distributed
  Both frameworks build a Directed Acyclic Graph (DAG) of tasks and use lazy execution.
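  To make the low-level side concrete, here is a minimal sketch of DASK's custom-graph interface: a task graph written as a plain Python dictionary and run with the synchronous scheduler, plus the same computation built with dask.delayed. The add function is illustrative.

    import dask
    from dask import delayed

    def add(a, b):
        return a + b

    # A custom graph as a plain Python dictionary: a tuple is a task.
    dsk = {'x': 1,
           'y': 2,
           'z': (add, 'x', 'y')}
    print(dask.get(dsk, 'z'))          # 3, executed by the synchronous scheduler

    # The same computation built lazily with dask.delayed:
    z = delayed(add)(1, 2)
    print(z.compute())                 # 3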
  Slide 6: DASK DataFrame & PySpark
  DASK DataFrames [parallel Pandas]:
  - Challenges: the DASK DataFrames API is not identical to the Pandas API; performance concerns with operations involving shuffling; the inefficiencies of Pandas are carried over
  - Recommendations: follow the Pandas performance tips; avoid shuffles, use pre-sorting, and persist intermediate results
  PySpark:
  - Challenges: performance concerns due to the PySpark design
  - Recommendations: use the DataFrames API; use vectorized/Pandas UDFs (Spark v2.3 onwards)
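  As an illustration of the vectorized/Pandas UDF recommendation, here is a sketch in the Spark 2.3-era decorator form; the column name and the times-two logic are just examples.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1.0,), (2.0,), (3.0,)], ['value'])

    @pandas_udf('double', PandasUDFType.SCALAR)
    def times_two(v):
        return v * 2.0                 # operates on a whole Pandas Series per Arrow batch, not row by row

    df.withColumn('doubled', times_two('value')).show()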
  Slide 7: Code Review. [Side-by-side code samples; slide annotations: "Wildcard", "Additional Details".]
  Slide 8: Code Review. While Pandas displays a sample of the data, DASK and PySpark show only metadata of the DataFrame. The npartitions value shows how many partitions the DASK DataFrame is split into; here, DASK created a DAG with 99 nodes to process the data.
  Slide 9: Code Review. In DASK, the .compute() or .head() method tells the scheduler to go ahead, run the computation, and display the results; in PySpark, .show() displays the DataFrame in tabular form.
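  A minimal sketch of the lazy-evaluation behaviour these two slides describe; the file pattern and the fare column are hypothetical.

    import dask.dataframe as dd

    df = dd.read_csv('trips-*.csv')    # lazy: builds a task graph, reads no data yet
    print(df.npartitions)              # metadata only: number of partitions
    print(df.head())                   # runs just enough of the graph to show a few rows
    total = df.fare.sum().compute()    # .compute() executes the whole graph

    # The PySpark side is equally lazy until an action such as .show():
    # sdf = spark.read.csv('trips-*.csv', header=True, inferSchema=True)
    # sdf.show()                       # action: executes and prints a tabular sample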
  Slide 10: Scalable Machine Learning
  How does DASK-ML (Oct '17) work?
  - Parallelize Scikit-learn via distributed Joblib, e.g.
      from sklearn.externals.joblib import _dask, parallel_backend
      from sklearn.utils import register_parallel_backend
      register_parallel_backend('distributed', _dask.DaskDistributedBackend)
  - Re-implement algorithms
  - Partner with existing libraries, e.g.
      from dask_ml.xgboost import XGBRegressor
      est = XGBRegressor(...)
      est.fit(train, train_labels)
      prediction = est.predict(test)
  Spark MLlib: as of Spark 2.0, the primary machine learning API for Spark is the DataFrame-based API in the spark.ml package. It provides:
  - Algorithms: common algorithms, e.g. classification, regression, and clustering
  - Featurization: feature extraction, transformation, dimensionality reduction
  - Pipelines: constructing, evaluating, and tuning ML pipelines
  - Persistence: save/load algorithms, models, and pipelines
  - Utilities: linear algebra, statistics, and data handling
  Key abstractions: a Transformer is an algorithm that transforms one DataFrame into another; an Estimator is an algorithm that trains on a DataFrame and produces a model; a Pipeline chains multiple Transformers and Estimators as per the ML flow.
  Scalable ML approaches with Spark:
  - Spark for feature engineering + Scikit-learn etc. for learning
  - Distributed ML algorithms from Spark MLlib
  - Train/evaluate Scikit-learn models in parallel (spark-sklearn)
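  A hedged sketch of the "parallelize Scikit-learn" approach using the current Dask-Joblib integration: newer Joblib releases ship a 'dask' backend, so the manual backend registration shown on the slide is no longer required. The estimator and parameter grid are illustrative.

    import joblib
    from dask.distributed import Client
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    client = Client()                          # connect to (or start) a Dask cluster
    X, y = make_classification(n_samples=1000)

    grid = GridSearchCV(RandomForestClassifier(),
                        param_grid={'n_estimators': [50, 100], 'max_depth': [5, 10]})

    with joblib.parallel_backend('dask'):      # fan the per-candidate fits out to Dask workers
        grid.fit(X, y)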
  Slide 11: Distributed Deep Learning
  DASK: peer a DASK Distributed cluster with TensorFlow running in distributed mode.
  Spark, with Deep Learning Pipelines:
  - APIs for scalable deep learning in Python, from Databricks
  - Provides a suite of tools covering loading, training, tuning, and deploying
  - Simplifies distributed deep learning training
  - Supports TensorFlow, Keras, and PyTorch
  - Integration with PySpark
  Project Hydrogen: a new scheduling option called the Gang Scheduler.
  Slide 12: Other Dev Considerations
  - Workloads/APIs: custom algorithms (only in DASK); SQL and graph workloads (only in Spark)
  - Debugging challenges: DASK Distributed may not align with normal Python debugging tools and practices; PySpark errors may contain a mix of JVM and Python stack traces
  - Visualization options: down-sample and use Pandas DataFrames (see the sketch below); use open-source libraries, e.g. D3, Seaborn, Datashader (only for DASK); use the Databricks visualization feature
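  A minimal sketch of the down-sampling option above: sample the distributed DataFrame, materialize the sample as a Pandas DataFrame, and plot it locally. Here ddf and sdf are assumed to be existing DASK and PySpark DataFrames, and the column names and 1% fraction are illustrative.

    import seaborn as sns

    # DASK: take a 1% sample and pull it into a local Pandas DataFrame
    sample = ddf.sample(frac=0.01).compute()

    # PySpark equivalent:
    # sample = sdf.sample(fraction=0.01).toPandas()

    sns.scatterplot(data=sample, x='trip_distance', y='fare_amount')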
  Slide 13: Which One to Use? "There are no solutions, there are only trade-offs." (Thomas Sowell)
  Slide 14: Questions?
  Slide 15: Don't forget to rate and review the sessions. Search "Spark + AI Summit".
