
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data


A Databricks webinar presented by Patryk Oleniuk, Lead Data Engineer at Virgin Hyperloop One, and Yifan Cao, Senior Product Manager at Databricks.



  1. 1. Logistics • We can’t hear you… • Recording will be available… • Code samples and notebooks will be available… • Submit your questions… • Bookmark databricks.com/blog
  2. 2. Our Mission: Helping data teams solve the world’s toughest problems • Original creators of popular data and machine learning open source projects • Global company with 5,000+ customers and 450+ partners
  3. 3. Unified Data Analytics Platform: Data Science, ML, and BI on one cloud platform • Access all business and big data in an open data lake • Securely integrates with your cloud ecosystem • Built for data scientists, ML engineers, data analysts, and data engineers • Components: Data Science Workspace (collaboration across the lifecycle), Unified Data Service (high-quality data with great performance), Enterprise Cloud Service (a simple, scalable, and secure managed service), BI Integrations (access all your data)
  4. 4. About our speakers Yifan Cao, Sr. Product Manager, Machine Learning at Databricks • Product Area: ML/DL algorithms and Databricks Runtime for Machine Learning • Built and grew two ML products to multi-million dollars in annual revenue • B.S. Engineering from UC Berkeley; MBA from MIT Patryk Oleniuk, Lead Data Engineer, Virgin Hyperloop One • Wearing many hats @ Hyperloop: Embedded Devices, Back-end Software, Data Science & Machine Learning • Previous experience includes Samsung R&D, CERN • M.S. in Information Technologies from EPFL (Switzerland)
  5. 5. Agenda 1. Virgin Hyperloop One 2. Hyperloop → Databricks Story 3. What is Databricks and Koalas 4. Koalas (Python + notebook, live coding) 5. Short intro to MLflow tracking 6. Koalas & MLflow (Python + notebook, live coding) 7. Koalas tips & tricks
  6. 6. Virgin Hyperloop One (VHO) • Startup with 250+ employees in Los Angeles (hiring!) • New transportation system: vacuum tube + small passenger/cargo vehicles (Pods) • Short travel times • On-demand (“ride hailing”) • Zero direct emission (electric levitation & propulsion) • 500 m test track in Nevada (video) • New exciting tests, tracks, and enhancements in progress right now
  7. 7. VHO – MIA: Machine Intelligence & Analytics • Operational research for Hyperloop • Analytics products for business and tech • Simulation- and data-based • Sample questions: What’s the optimal vehicle capacity? How many passengers can we realistically handle in scenario X between cities Y and Z? How much better are we than other modes? • Answers? Data
  8. 8. Examples of AI, analytics, and data at VHO: demand modelling, trip planning, performance metrics, cost metrics, geospatial analytics, 3D alignment optimizer, test runs, HW & SW test rigs
  9. 9. Hyperloop Data Story • Growing data sizes (from MBs to GBs) • Growing processing times (from minutes to hours) • Python scripts crashing (pandas out of memory) • Need a more enterprise-grade, scalable approach to handling data (tried different solutions) • Spark and its ecosystem are the de facto standard solution to that problem
  10. 10. Hyperloop Data Story • Who’s gonna manage our new Spark infrastructure? Not enough DevOps…
  11. 11. Koalas - why? pandas code: pandas_df.groupby("Destination").sum().nlargest(10, columns="Trip Count") PySpark code: spark_df.groupby("Destination").sum().orderBy("sum(Trip Count)", ascending=False).limit(10) Me after learning I need to redo all our pandas scripts in PySpark (and keep doing it for future DS work)
  12. 12. Introducing Koalas
  13. 13. Typical journey of a data scientist • Education (MOOCs, books, universities) → pandas • Analyze small data sets → pandas • Analyze big data sets → DataFrame in Spark • pandas: standard for single-machine workloads, small data • Apache Spark: standard for distributed workloads, big data
  14. 14. What is Koalas? • Launched on April 24, 2019 by Databricks • Pure open-source Python library • Aims at providing the pandas API on top of Apache Spark: unifies the two ecosystems with a familiar API and allows a seamless transition between small and large data • github.com/databricks/koalas
  15. 15. Koalas - why? pandas code: pandas_df.groupby("Destination").sum().nlargest(10, columns="Trip Count") PySpark code: spark_df.groupby("Destination").sum().orderBy("sum(Trip Count)", ascending=False).limit(10) Koalas code: koalas_df.groupby("Destination").sum().nlargest(10, columns="Trip Count") github.com/databricks/koalas
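A minimal, runnable sketch of the comparison on this slide, assuming a toy trip table with the "Destination" and "Trip Count" columns shown above; the data values are invented for illustration:

    import pandas as pd
    import databricks.koalas as ks

    # Toy trip table standing in for the real data (values are made up).
    pandas_df = pd.DataFrame({
        "Destination": ["LAX", "SFO", "LAS", "LAX", "SFO"],
        "Trip Count": [120, 80, 60, 40, 30],
    })

    # pandas: top destinations by total trip count.
    top_pd = pandas_df.groupby("Destination").sum().nlargest(10, columns="Trip Count")

    # Koalas: the same call chain, executed by Spark under the hood.
    koalas_df = ks.from_pandas(pandas_df)
    top_ks = koalas_df.groupby("Destination").sum().nlargest(10, columns="Trip Count")

    print(top_pd)
    print(top_ks.to_pandas())  # collect the small result back to the driver

The Koalas result stays distributed until to_pandas() is called, so the same code runs on a laptop-sized sample and on the full dataset.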
  16. 16. Koalas Architecture: a lean API layer. Koalas Core sits on top of Spark’s DataFrame APIs and SQL, reusing Catalyst optimization, Tungsten execution, and the data source connectors; pandas remains alongside for single-machine use. [Architecture diagram]
  17. 17. Koalas User Adoption • Better scales the breadth of pandas to big data • Reduces friction by unifying the big-data environment • Quickly adopted: 860+ patches merged since the announcement in April 2019, 20+ major contributors outside of Databricks, 24k+ daily downloads • github.com/databricks/koalas
  18. 18. What is Koalas • Koalas allows a seamless* switch from pandas to Spark, which means scaling the compute power (* a few caveats are covered in DEMO #1) • Databricks can manage our Spark infrastructure (and is also the author of Koalas)
  19. 19. What is Koalas • Koalas allows a seamless* switch from pandas to Spark, which means scaling the compute power (* a few caveats are covered in DEMO #1) • Need to speed up your computation? Scale up your Spark workers: obviously this comes at a $ price, and the speed-up saturates, so you cannot get below a few seconds of processing time. A minimal sketch of the workflow follows.
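A small sketch of what "scaling the compute power" looks like in practice, assuming Koalas is installed on a Spark cluster; the CSV path and column names are hypothetical placeholders:

    import databricks.koalas as ks

    # Read a large dataset with Spark doing the work (the path is hypothetical).
    kdf = ks.read_csv("/data/trips/*.csv")

    # The heavy aggregation runs distributed on the Spark workers.
    hourly = kdf.groupby(["Date", "Hour"]).sum()

    # Only the small aggregated result is pulled back to the driver,
    # e.g. to hand off to existing pandas-based plotting code.
    hourly_pdf = hourly.to_pandas()
    print(hourly_pdf.head())

Speeding this up is then a matter of adding workers rather than rewriting code, with the saturation caveat from the slide.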
  20. 20. DEMO #1 time
  21. 21. MLflow - purpose (github.com/mlflow, databricks.com/mlflow)

    with mlflow.start_run():
        mlflow.log_param("alpha", a)
        mlflow.log_param("l1_ratio", l1)
        rmse, r2, lr = train_score_model(a, l1)
        mlflow.log_metric("rmse", rmse)
        mlflow.log_metric("r2", r2)
        mlflow.sklearn.log_model(lr, "model")
  22. 22. MLflow - purpose: the same logging block is executed N times, once per combination in a parameter sweep. A runnable sketch follows.
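A runnable sketch of the "N times" sweep around the logging block above; train_score_model() is not shown in the deck, so the ElasticNet-on-toy-data version here is only a hypothetical stand-in:

    import mlflow
    import mlflow.sklearn
    import numpy as np
    from sklearn.linear_model import ElasticNet
    from sklearn.metrics import mean_squared_error, r2_score

    # Toy regression data (invented for illustration).
    X = np.random.rand(200, 3)
    y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * np.random.rand(200)

    def train_score_model(a, l1):
        # Hypothetical stand-in for the speakers' helper of the same name.
        lr = ElasticNet(alpha=a, l1_ratio=l1).fit(X, y)
        pred = lr.predict(X)
        return mean_squared_error(y, pred) ** 0.5, r2_score(y, pred), lr

    # One MLflow run per parameter combination.
    for a in [0.01, 0.1, 1.0]:
        for l1 in [0.1, 0.5, 0.9]:
            with mlflow.start_run():
                mlflow.log_param("alpha", a)
                mlflow.log_param("l1_ratio", l1)
                rmse, r2, lr = train_score_model(a, l1)
                mlflow.log_metric("rmse", rmse)
                mlflow.log_metric("r2", r2)
                mlflow.sklearn.log_model(lr, "model")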
  23. 23. MLflow - our model – the demand depends on the hour and the day of the week – let’s create different ML prediction models and save them to an MLflow experiment
  24. 24. MLflow - our model – Koalas can easily be used for pre-processing the train/test data (reusing existing pandas code) – use kdf.apply() to sweep and score models in parallel on the Spark workers (a sketch follows below) – let’s create different models and save them to an MLflow experiment. [Diagram: for each combination of sweep parameter #1 and sweep parameter #2, an ML model (black box) predicts trip count, its output is scored on the test set, and the parameters, metrics, and model are logged to MLflow]
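A minimal sketch of the kdf.apply() pattern described above, assuming a Databricks-style environment where the MLflow tracking server is reachable from the Spark workers; the parameter grid and the placeholder score are illustrative, not the speakers' actual training code:

    import databricks.koalas as ks
    import mlflow

    # One row per parameter combination to sweep (hypothetical grid).
    param_grid = ks.DataFrame({
        "alpha": [0.01, 0.01, 0.1, 0.1],
        "l1_ratio": [0.1, 0.9, 0.1, 0.9],
    })

    def train_and_log(row):
        # Runs on a Spark worker: train/score a model for this combination
        # and log everything to MLflow (training is faked here).
        with mlflow.start_run():
            mlflow.log_param("alpha", row["alpha"])
            mlflow.log_param("l1_ratio", row["l1_ratio"])
            rmse = 1.0 / (1.0 + row["alpha"])  # placeholder for real scoring
            mlflow.log_metric("rmse", rmse)
        return rmse

    # Each row is processed in parallel by the Spark workers.
    scores = param_grid.apply(train_and_log, axis=1)
    print(scores.to_pandas())

Note that without a return-type hint Koalas first evaluates the function on a small sample to infer the output schema, so a few extra MLflow runs can appear; the demo notebook is the authoritative version of this pattern.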
  27. 27. DEMO #2 Koalas & MLflow
  28. 28. pandas to Koalas – tips and tricks • Almost all popular pandas ops are available in Koalas; some parameters are missing (missing something? Open an issue on the Koalas GitHub! Don’t be shy like me!) • Some functionality is deliberately not implemented; the easiest workaround is kdf.to_pandas().do_whatever_you_want().to_koalas() • Example: DataFrame.values is not implemented because all the data would be loaded into the driver’s memory, risking OOM errors • Be aware of different execution principles (ordering, lazy evaluation, the underlying Spark DataFrame): sort after groupby, different structure of groupby.apply, different NaN treatment & ops • I personally really like using kdf.apply(my_func, axis=1) for any distributed row-based job, including web scraping, dict-mapping, MLflow runs, etc. All the function calls (for all rows) are then distributed among the Spark workers; a short sketch follows.
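A short sketch of the two tricks above (the to_pandas() round trip and row-wise apply), using an arbitrary toy frame; the pandas step chosen here is just an example standing in for an unsupported operation:

    import databricks.koalas as ks

    kdf = ks.DataFrame({"Destination": ["LAX", "SFO", "LAS"],
                        "Trip Count": [120, 80, 60]})

    # Workaround for anything Koalas does not implement: drop to pandas,
    # do the work there, and come back. Only safe when the intermediate
    # result is small, because to_pandas() collects it into driver memory.
    pdf = kdf.to_pandas()
    pdf = pdf.sort_values("Trip Count")      # stand-in for the unsupported op
    kdf = ks.from_pandas(pdf)

    # Distributed row-based job: every row is handled on a Spark worker.
    labels = kdf.apply(lambda row: "busy" if row["Trip Count"] > 100 else "quiet",
                       axis=1)
    print(labels.to_pandas())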
  29. 29. pandas to Koalas – tips and tricks #2 • kdf = kdf.cache() avoids recomputing the DataFrame from the beginning every time; it is especially useful for exploratory analysis where the same kdf is used in different cells. For one long script, Spark will optimize the execution tree for you. This behavior is very different from pandas! • Problems with Koalas? Take a look at ks.options: compute.ops_on_diff_frames, compute.ordered_head, plotting.sample_ratio, display.max_rows • Example from the demo (a chart compared cached vs. uncached runtimes):

    ks_with_trend = ks_bart_df.groupby(["Date", "Hour"]).mean()
    ks_trendless = ks_with_trend.copy()
    ks_trendless["Trip Count"] -= trend["Trip Count"]
    ks_trendless = ks_trendless.cache()  # <-- caching here
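A minimal sketch of the caching and options tips, using a toy stand-in for the ks_bart_df frame on the slide (values invented); the option values are only examples of toggling, not recommendations:

    import databricks.koalas as ks

    # Toy stand-in for the ks_bart_df trip data from the slide.
    ks_bart_df = ks.DataFrame({
        "Date": ["2019-01-01", "2019-01-01", "2019-01-02", "2019-01-02"],
        "Hour": [8, 9, 8, 9],
        "Trip Count": [120, 95, 110, 90],
    })

    # Allow operations that combine two different Koalas DataFrames
    # (disabled by default because it implies an expensive join).
    ks.set_option("compute.ops_on_diff_frames", True)

    # Limit how many rows are shown when displaying a frame in a notebook.
    ks.set_option("display.max_rows", 100)

    # Cache an intermediate result that several cells reuse, so Spark
    # does not recompute the whole lineage every time it is referenced.
    hourly_mean = ks_bart_df.groupby(["Date", "Hour"]).mean()
    hourly_mean = hourly_mean.cache()
    print(hourly_mean.to_pandas())

    ks.reset_option("compute.ops_on_diff_frames")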
  30. 30. Koalas roadmap • Expand pandas API coverage with Koalas DataFrames (current: ~70% API coverage with pandas) • Integration with more visualization packages (current: matplotlib support) • Deeper integration with NumPy (current: universal functions are implemented) • More example notebooks (current: a few examples on …)
  31. 31. Conclusions 1. Virgin Hyperloop One – cool stuff – we’re hiring ( https://hyperloop-one.com/careers ) 2. Hyperloop → Databricks Story – lucky coincidence, amazing partnership 3. What is Databricks and Koalas – the pandas API for Spark 4. Koalas DEMO #1 – very convenient, but you can still use PySpark if the situation requires it 5. Short intro to MLflow tracking – excellent for organizing your experiments (not only ML) 6. Koalas & MLflow DEMO #2 – Koalas is also nice for parallel model execution and scoring 7. Next steps: Sparkifying MATLAB
  32. 32. References, links 1. Koalas documentation & GitHub: https://github.com/databricks/koalas 2. Blog post: “How Virgin Hyperloop One reduced processing time from hours to minutes with Koalas” https://databricks.com/blog/2019/08/22/guest-blog-how-virgin-hyperloop-one-reduced-processing-time-from-hours-to-minutes-with-koalas.html 3. How to improve your pandas if you don’t want to move to Spark: “From Pandas-wan to Pandas-master” https://medium.com/unit8-machine-learning-publication/from-pandas-wan-to-pandas-master-4860cf0ce442
  33. 33. Q&A Thank you for joining!
  34. 34. databricks.com/sparkaisummit: expanded technical training. Register by March 31 and get $450 off.
