A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and cons

Founder at Data Science Retreat, launching special track: Fintech. Next, a platform startup in SG, happy to talk about it
Jun. 1, 2016



Editor's Notes

  1. Scala and Spark are very close: if you learn one, you learn the other. Spark is distributed Scala.
  2. Scala and Spark are very close: if you learn one, you learn the other. Spark is distributed Scala. This has been possible for years, but nowadays it's not only possible but pleasant.
  3. You attend a Retreat, not a training
  4. A talk should give you a superpower. - Am I missing out?
  5. redo the diagram
  6. Fault-tolerant: missing partitions can be recomputed by using the lineage graph to rerun operations.
  7. When using Python, the SparkContext in Python is basically a proxy. Py4J is used to launch a JVM and create a native Spark context, and it manages communication between the Python and Java Spark context objects. In the workers, some operations can be executed directly in the JVM. But if, for example, you've implemented a map function in Python, a Python process is forked to execute this user-supplied mapping. Each thread in the Spark worker will have its own Python subprocess. When the Python wrapper calls the underlying Spark code, written in Scala and running on a JVM, the translation between two different environments and languages can be a source of additional bugs and issues.
  8. Scala and Spark are very close: if you learn one, you learn the other. Spark is distributed Scala. This has been possible for years, but nowadays it's not only possible but pleasant.
  9. Just one Map/Reduce step, but many algorithms are iterative. Disk-based → long startup times. Spark is a wholesale replacement for MapReduce that leverages lessons learned from MapReduce. The Hadoop community realized that a replacement for MR was needed. While MR has served the community well, it's a decade old and shows clear limitations and problems, as we've seen. In late 2013, Cloudera, the largest Hadoop vendor, officially embraced Spark as the replacement. Most of the other Hadoop vendors have followed suit. When it comes to one-pass ETL-like jobs, for example data transformation or data integration, MapReduce is still the right tool: this is what it was designed for. Advantages for Hadoop: security, staffing.
  10. sample use case for accumulators: gradient descent
  11. spark.ml departs from scikit-learn quite a bit
  12. Good
  13. from https://databricks.com/blog/2015/07/29/new-features-in-machine-learning-pipelines-in-apache-spark-1-4.html
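
Note 10 names gradient descent as a sample use case for accumulators. The pattern can be sketched in plain Python without a cluster: each "partition" adds its local gradient to an add-only accumulator, and the driver reads the total once per iteration. This is a minimal illustration only. The `Accumulator` class, `partition_gradient`, `fit`, and the toy data are all hypothetical stand-ins and not the Spark API; in Spark the accumulator would be created through the SparkContext and the inner loop would run on the workers.

```python
# Toy stand-in for a Spark accumulator (NOT the Spark API):
# workers may only add to it, and only the driver reads the total.
class Accumulator:
    def __init__(self, zero):
        self.value = list(zero)

    def add(self, update):
        # Element-wise addition, so the order of updates does not matter.
        self.value = [a + b for a, b in zip(self.value, update)]


def partition_gradient(partition, w, b):
    """Summed squared-error gradient of y ~ w*x + b over one partition."""
    gw = sum(2 * (w * x + b - y) * x for x, y in partition)
    gb = sum(2 * (w * x + b - y) for x, y in partition)
    return [gw, gb]


def fit(partitions, steps=1000, lr=0.05):
    n = sum(len(p) for p in partitions)
    w, b = 0.0, 0.0
    for _ in range(steps):
        acc = Accumulator([0.0, 0.0])   # fresh accumulator per iteration
        for part in partitions:         # in Spark this loop runs on the workers
            acc.add(partition_gradient(part, w, b))
        gw, gb = acc.value              # driver reads the summed gradient
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b


# Hypothetical data generated from y = 3x + 1, split across two "partitions".
partitions = [[(0.0, 1.0), (1.0, 4.0)], [(2.0, 7.0), (3.0, 10.0)]]
w, b = fit(partitions)
print(round(w, 3), round(b, 3))
```

Because the accumulator only supports addition, partial gradients can arrive in any order, which is what makes the pattern a natural fit for iterative algorithms that need one aggregated value per pass.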