Pipeline-Oriented Data Analytics
Borys Biletskyy
About Myself
● Ph.D. in Theoretical Cybernetics @ Glushkov Institute of Cybernetics
● Machine Learning Engineering consultant
● Agile advisor for Data Science teams
● Google Scholar: https://scholar.google.com/citations?user=D_0HczYAAAAJ
● GitHub: https://github.com/bbiletskyy
● LinkedIn: https://www.linkedin.com/in/borys-biletskyy-a4aa313/
Why Pipeline-Driven Analytics?
● Bringing ML models to production
● ML model logistics
● Reproducible experiments
● Data Science<->Data Engineering handover
● Maintainability/Scalability/Reusability/Testability
● Unification of ML Workflows (experiment vs production)
Why Pipeline-Oriented Data Analytics
● Bringing ML models to production
● ML model logistics
● Data Science<->Data Engineering handover
● Fast and reproducible experiments
● Maintainability/Scalability/Reusability/Testability
● Unification of ML workflows
What ML Flows Look Like in Theory
What ML Flows Look Like in Practice
from pyspark.sql import Window, functions as F
from pyspark.sql.types import IntegerType

# dataset, contracts_df and orders_df are existing DataFrames.
# Data cleaning
users_df = (dataset
    .sample(True, 0.001, 19)
    .select('user_id', 'name', 'age')
    .filter(F.col('name').isNotNull())
    .withColumn('age', F.col('age').cast(IntegerType()))
    .withColumnRenamed('age', 'user_age')
    .join(contracts_df, on='user_id', how='left')
    .dropDuplicates())
# Enrichment and feature engineering
w = Window.partitionBy('user_id').orderBy(F.col('order_date').asc())
order_features_df = (users_df
    .join(orders_df, on='user_id', how='left')
    .drop('address')
    .where(F.col('is_delivered').isNotNull())
    .where(F.col('price') > 10.0)
    .withColumn('order_week_day', F.dayofweek('order_date'))
    # Row-level and window columns must be computed before the aggregation.
    .withColumn('prev_order_date', F.lag(F.col('order_date')).over(w))
    .withColumn('days_between_orders', F.datediff(F.col('order_date'), F.col('prev_order_date')))
    .drop('prev_order_date')
    .groupBy('user_id').agg(F.count('*').alias('order_count'),
                            F.max(F.col('price')).alias('max_order_price'),
                            F.min(F.col('price')).alias('min_order_price')))
● Where are the logic boundaries?
○ e.g. which lines belong to the same piece of logic (highlighted in green on the original slide)
● How to test?
● How to reuse?
● How to maintain?
Data cleaning
Enrichment and feature engineering
What ML Flows Could Look Like in Practice
user_features_df = PipelineModel([
    Sample(0.01, replacement=False),
    # Rename first, so that later stages can refer to the normalized names.
    RenameColumns({'U_ID': 'user_id', 'AGE': 'age'}),
    SelectColumns('user_id', 'age'),
    NormalizeColumnTypes({'user_id': 'string', 'age': 'int'}),
    DropOutliers(col='age', min=18, max=120),
    Join(ContractDao(2019, 10, 7).get(), on='user_id'),
    MaxOrderCost(),
    MinOrderCost(),
    AvgDaysBetweenOrders(),
    AvgOrdersPerMonth('avg_orders_per_month'),
    DropColumns('address', 'gender'),
]).transform(UserDao(2019, 10, 7).get())
● Narrowly scoped components
● Clear logic boundaries
● Testable
● Modular
● Configurable
● Reusable
● Scalable
● Faster experiments
● Production ready
● Reproducible experiments
● Maintainable
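To make "narrowly scoped" and "testable" concrete, here is a minimal sketch of what one stage from the listing above might look like. It is an illustration under stated assumptions, not the author's implementation: a plain list of dicts stands in for a Spark DataFrame so that the example runs without a cluster, and the bodies of `DropOutliers` and `SelectColumns` as well as the `run_stages` helper are hypothetical.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]
Frame = List[Row]  # assumption: stand-in for a Spark DataFrame


class DropOutliers:
    """Keeps rows whose `col` value lies in [min, max]; rows with nulls are dropped."""

    def __init__(self, col: str, min: int, max: int):
        # `min`/`max` mirror the keyword names used on the slide.
        self.col, self.low, self.high = col, min, max

    def transform(self, frame: Frame) -> Frame:
        return [row for row in frame
                if row[self.col] is not None and self.low <= row[self.col] <= self.high]


class SelectColumns:
    """Keeps only the listed columns."""

    def __init__(self, *cols: str):
        self.cols = cols

    def transform(self, frame: Frame) -> Frame:
        return [{c: row[c] for c in self.cols} for row in frame]


def run_stages(stages, frame: Frame) -> Frame:
    """Hypothetical helper: applies stages in order."""
    for stage in stages:
        frame = stage.transform(frame)
    return frame


users = [
    {'user_id': 'u1', 'age': 17, 'address': 'A'},
    {'user_id': 'u2', 'age': 35, 'address': 'B'},
    {'user_id': 'u3', 'age': None, 'address': 'C'},
    {'user_id': 'u4', 'age': 130, 'address': 'D'},
]

# A stage can be checked on its own ...
adults = DropOutliers(col='age', min=18, max=120).transform(users)
# ... and reused, with the same or another configuration, inside a larger flow.
features = run_stages(
    [SelectColumns('user_id', 'age'), DropOutliers(col='age', min=18, max=120)],
    users,
)
```

Because each stage does one thing and takes its configuration through the constructor, a unit test needs only a handful of rows and one assertion per behaviour.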
Pipeline-Oriented Data Analytics: Basics
● Pipelines consist of Stages
○ Everything is a Pipeline Stage
● Stages come in two types:
○ Transformer: DataFrame->DataFrame
○ Estimator: DataFrame->Transformer
● Pipelines are also Pipeline Stages
○ Allows nested Pipelines
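The two stage types and the nesting rule can be sketched in a few lines. This is a toy model under stated assumptions rather than Spark's implementation: a list of dicts stands in for a DataFrame, the class names only mirror the roles named above (in Spark ML, `Pipeline` is likewise an Estimator and `PipelineModel` a Transformer), and `FillNulls`, `MeanImputer` and `DropColumn` are hypothetical stages.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]
Frame = List[Row]  # assumption: stand-in for a Spark DataFrame


class Transformer:
    """Stage of type Frame -> Frame."""
    def transform(self, frame: Frame) -> Frame:
        raise NotImplementedError


class Estimator:
    """Stage of type Frame -> Transformer."""
    def fit(self, frame: Frame) -> Transformer:
        raise NotImplementedError


class FillNulls(Transformer):
    def __init__(self, col: str, value: Any):
        self.col, self.value = col, value

    def transform(self, frame: Frame) -> Frame:
        return [{**row, self.col: self.value if row[self.col] is None else row[self.col]}
                for row in frame]


class DropColumn(Transformer):
    def __init__(self, col: str):
        self.col = col

    def transform(self, frame: Frame) -> Frame:
        return [{k: v for k, v in row.items() if k != self.col} for row in frame]


class MeanImputer(Estimator):
    """Learns the column mean and returns a Transformer that fills nulls with it."""
    def __init__(self, col: str):
        self.col = col

    def fit(self, frame: Frame) -> Transformer:
        known = [row[self.col] for row in frame if row[self.col] is not None]
        return FillNulls(self.col, sum(known) / len(known))


class PipelineModel(Transformer):
    """A fitted pipeline: a Transformer made of Transformers."""
    def __init__(self, stages):
        self.stages = list(stages)

    def transform(self, frame: Frame) -> Frame:
        for stage in self.stages:
            frame = stage.transform(frame)
        return frame


class Pipeline(Estimator):
    """An Estimator made of stages, so it can itself be used as a stage."""
    def __init__(self, stages):
        self.stages = list(stages)

    def fit(self, frame: Frame) -> PipelineModel:
        fitted = []
        for stage in self.stages:
            # Estimators are fitted on the data produced by the stages before them.
            model = stage.fit(frame) if isinstance(stage, Estimator) else stage
            frame = model.transform(frame)
            fitted.append(model)
        return PipelineModel(fitted)


train = [{'age': 20, 'name': 'a'}, {'age': None, 'name': 'b'}, {'age': 40, 'name': 'c'}]
cleaning = Pipeline([MeanImputer('age')])         # a pipeline ...
full = Pipeline([cleaning, DropColumn('name')])   # ... nested as a stage of another one
model = full.fit(train)
result = model.transform([{'age': None, 'name': 'd'}])
```

Fitting happens once on the training data; the resulting model is a pure Transformer that can be applied to new data without refitting.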
Pipeline-Oriented Data Analytics
● Custom libraries of Pipeline Stages
○ Estimators
○ Business logic transformers
● Utility Pipeline Stages
○ Sampling
○ Repartitioning
○ Checkpointing
● Infrastructural Stages (beware of side effects!)
○ Logging
○ Data validation
○ Metrics collection
● Control structures in ML Pipelines (beware of persistence!)
○ IF(condition=mode.is_train(), then=[AddLabelColumn()], otherwise=[DropHistory()])
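The following sketch illustrates the last two bullet groups: an infrastructural stage with a side effect and a conditional stage. It rests on several assumptions: a list of dicts stands in for a DataFrame, all stage bodies (`LogRowCount`, `AddLabelColumn`, `DropHistory`, `IfElse`) and the `build_flow` and `run` helpers are hypothetical, and the keyword `otherwise` is used because `else` is a reserved word in Python and cannot be an argument name.

```python
from typing import Any, Dict, List, Sequence

Row = Dict[str, Any]
Frame = List[Row]  # assumption: stand-in for a Spark DataFrame


class LogRowCount:
    """Infrastructural stage: passes data through unchanged and records its size."""
    def __init__(self, sink: List[int]):
        self.sink = sink

    def transform(self, frame: Frame) -> Frame:
        self.sink.append(len(frame))  # side effect: happens on every transform call
        return frame


class AddLabelColumn:
    def transform(self, frame: Frame) -> Frame:
        return [{**row, 'label': int(row['orders'] > 0)} for row in frame]


class DropHistory:
    def transform(self, frame: Frame) -> Frame:
        return [{k: v for k, v in row.items() if k != 'history'} for row in frame]


class IfElse:
    """Control stage: runs `then` stages if the condition holds, else `otherwise`."""
    def __init__(self, condition: bool, then: Sequence, otherwise: Sequence = ()):
        # The branch is chosen once, when the pipeline is built.
        self.stages = list(then if condition else otherwise)

    def transform(self, frame: Frame) -> Frame:
        for stage in self.stages:
            frame = stage.transform(frame)
        return frame


def build_flow(is_train: bool, sink: List[int]):
    return [LogRowCount(sink),
            IfElse(is_train, then=[AddLabelColumn()], otherwise=[DropHistory()])]


def run(stages, frame: Frame) -> Frame:
    for stage in stages:
        frame = stage.transform(frame)
    return frame


rows = [{'orders': 2, 'history': 'x'}, {'orders': 0, 'history': 'y'}]
counts: List[int] = []
train_out = run(build_flow(True, counts), rows)
serve_out = run(build_flow(False, counts), rows)
```

In this sketch the condition is evaluated at construction time, so a saved and reloaded flow would contain only the branch that was selected when it was built. That is one possible reading of the persistence warning; the side-effect warning shows up in `counts`, which grows every time the flow is executed.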
Example: Column Type Normalization
from typing import Dict

from pyspark.ml import Transformer
from pyspark.sql import DataFrame


class NormalizeColumnTypes(Transformer):
    """
    Transformer that changes column types using the provided column_types dictionary.
    Standard pyspark column types are supported: int, string, timestamp, float, double...
    Produces null values if the column value can't be cast to the specified type.
    NB this transformer is not persistable.
    """

    def __init__(self, column_types: Dict[str, str]):
        super(NormalizeColumnTypes, self).__init__()
        self.column_types = column_types

    def _transform(self, dataset: DataFrame):
        df = dataset
        for col_name in df.columns:
            if col_name in self.column_types:
                col_type = self.column_types[col_name]
                df = df.withColumn(col_name, df[col_name].cast(col_type))
        return df
Example: Unit Test
import pytest
from routing_challenge.transformer import NormalizeColumnTypes


class TestNormalizeColumnTypes(object):
    # `spark` is expected to be a pytest fixture that provides a SparkSession.
    def test_normalize_column_types(self, spark):
        data = [('Alice', 20.0, '2017-07-09 11:35:00'),
                ('Bob', 0.0, '2017-08-24 19:25:00'),
                ('Nina', 40.0, None)]
        columns = ['driver_id', 'age', 'time']
        df = spark.createDataFrame(data, columns)
        column_types = {'driver_id': 'string', 'age': 'int', 'time': 'timestamp'}
        res_df = NormalizeColumnTypes(column_types).transform(df)
        # Same columns, same number of rows, and every column has the requested type.
        assert set(column_types.keys()) == set(res_df.schema.names)
        assert df.count() == res_df.count()
        for col, dtype in res_df.dtypes:
            assert dtype == column_types[col]
Q&A
