2. About Myself
● Ph.D. in Theoretical Cybernetics @ Glushkov Institute of Cybernetics
● Machine Learning Engineering consultant
● Agile advisor for Data Science teams
● Google Scholar: https://scholar.google.com/citations?user=D_0HczYAAAAJ
● Github: https://github.com/bbiletskyy
● LinkedIn: https://www.linkedin.com/in/borys-biletskyy-a4aa313/
4. Why Pipeline-Oriented Data Analytics?
● Bringing ML models to production
● ML model logistics
● Data Science<->Data Engineering handover
● Fast and reproducible experiments
● Maintainability/Scalability/Reusability/Testability
● Unification of ML workflows
6. How ML Flows Look in Practice
users_df = (dataset
    .sample(True, 0.001, 19)
    .select('user_id', 'name', 'age')
    .filter(F.col('name').isNotNull())
    .withColumn('age', F.col('age').cast(IntegerType()))
    .withColumnRenamed('age', 'user_age')
    .join(contracts_df, on='user_id', how='left')
    .dropDuplicates())
w = Window.partitionBy('user_id').orderBy(F.col('order_date').asc())
order_features_df = (users_df
    .join(orders_df, on='user_id', how='left')
    .withColumn('order_week_day', F.dayofweek('order_date'))
    .drop('address')
    .withColumn('prev_order_date', F.lag(F.col('order_date')).over(w))
    .withColumn('days_between_orders', F.datediff(F.col('order_date'), F.col('prev_order_date')))
    .drop('prev_order_date')
    .where(F.col('is_delivered').isNotNull())
    .where(F.col('price') > 10.0)
    .groupBy('user_id').agg(F.count('*').alias('order_count'),
                            F.max(F.col('price')).alias('max_order_price'),
                            F.min(F.col('price')).alias('min_order_price')))
● Where are logic boundaries?
○ i.e. related logic is scattered across the chain (color-coded on the original slide)
● How to test?
● How to reuse?
● How to maintain?
(Slide callouts: "Data cleaning", "Enrichment and feature engineering")
7. How ML Flows Could Look in Practice
user_features_df = PipelineModel([
    Sample(0.01, replacement=False),
    RenameColumns({'U_ID': 'user_id', 'AGE': 'age'}),
    SelectColumns('user_id', 'age'),
    NormalizeColumnTypes({'user_id': 'string', 'age': 'int'}),
    DropOutliers(col='age', min=18, max=120),
    Join(ContractDao(2019, 10, 7).get(), on='user_id'),
    MaxOrderCost(),
    MinOrderCost(),
    AvgDaysBetweenOrders(),
    AvgOrdersPerMonth('avg_orders_per_month'),
    DropColumns('address', 'gender'),
]).transform(UserDao(2019, 10, 7).get())
● Narrow-scoped components (see the sketch after this list)
● Clear logic boundaries
● Testable
● Modular
● Configurable
● Reusable
● Scalable
● Faster experiments
● Production ready
● Reproducible experiments
● Maintainable
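As an illustration of such a narrow-scoped component, here is a minimal sketch of what a stage like DropOutliers from the pipeline above might look like. The class name and its parameters come from the slide; the implementation below is an assumption, not the original code.

from pyspark.sql import DataFrame
import pyspark.sql.functions as F
from pyspark.ml import Transformer

# Hypothetical implementation of the DropOutliers stage used in the pipeline above.
class DropOutliers(Transformer):
    """Drops rows whose value in `col` falls outside the [min, max] range."""
    def __init__(self, col: str, min: float, max: float):
        super(DropOutliers, self).__init__()
        self.col = col
        self.min = min
        self.max = max

    def _transform(self, dataset: DataFrame) -> DataFrame:
        # Keep only rows within the allowed range; rows with null values are dropped as well.
        return dataset.where(F.col(self.col).between(self.min, self.max))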
8. Pipeline-Oriented Data Analytics: Basics
● Pipelines consist of Stages
○ Everything is a Pipeline Stage
● There are two types of Stages (see the sketch after this list):
○ Transformer: DataFrame->DataFrame
○ Estimator: DataFrame->Transformer
● Pipelines are also Pipeline Stages
○ Allows nested Pipelines
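These are the standard pyspark.ml abstractions. A minimal sketch with stock stages; the toy DataFrame, the column names, and the choice of StringIndexer/VectorAssembler/LogisticRegression below are invented for illustration:

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.master('local[1]').getOrCreate()
train_df = spark.createDataFrame(
    [(34, 'DE', 1.0), (25, 'NL', 0.0), (41, 'DE', 1.0)],
    ['age', 'country', 'label'])

# Estimator: DataFrame -> Transformer (fit() returns a StringIndexerModel)
indexer = StringIndexer(inputCol='country', outputCol='country_idx')
# Transformer: DataFrame -> DataFrame
assembler = VectorAssembler(inputCols=['age', 'country_idx'], outputCol='features')

# A Pipeline is itself a Pipeline Stage, so it can be nested inside another Pipeline.
features_pipeline = Pipeline(stages=[indexer, assembler])
full_pipeline = Pipeline(stages=[features_pipeline, LogisticRegression(labelCol='label')])

model = full_pipeline.fit(train_df)       # fitting an Estimator yields a Transformer (PipelineModel)
predictions = model.transform(train_df)   # DataFrame -> DataFrame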
9. Pipeline-Oriented Data Analytics
● Custom libraries of Pipeline Stages
○ Estimators
○ Business logic transformers
● Utility Pipeline Stages
○ Sampling
○ Repartitioning
○ Checkpointing
● Infrastructural Stages (beware of side effects!)
○ Logging
○ Data validation
○ Metrics collection
● Control structures in ML Pipelines (beware of persistence!), e.g. an IF stage like the one sketched below
○ IF(condition=mode.is_train(), then=[AddLabelColumn()], else_=[DropHistory()])
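Such an IF stage is not part of pyspark.ml; the following is a rough sketch of how it could be written, with the names and behaviour being assumptions. As the slide warns, a stage holding plain Python attributes like this is not persistable with the standard ML writers.

from typing import List, Optional
from pyspark.sql import DataFrame
from pyspark.ml import Transformer

# Hypothetical control-structure stage: applies one of two lists of transformers
# depending on a condition evaluated at pipeline-construction time.
class IF(Transformer):
    def __init__(self, condition: bool, then: List[Transformer],
                 else_: Optional[List[Transformer]] = None):
        super(IF, self).__init__()
        self.condition = condition
        self.then = then
        self.else_ = else_ or []

    def _transform(self, dataset: DataFrame) -> DataFrame:
        df = dataset
        for stage in (self.then if self.condition else self.else_):
            df = stage.transform(df)
        return df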
10. Example: Column Type Normalization
from pyspark.sql import DataFrame
from pyspark.ml import Transformer
from typing import Dict
class NormalizeColumnTypes(Transformer):
    """
    Transformer that changes column types using the provided column_types dictionary.
    Standard pyspark column types are supported: int, string, timestamp, float, double...
    Produces null values if the column value can't be cast to the specified type.
    NB this transformer is not persistable.
    """
    def __init__(self, column_types: Dict[str, str]):
        super(NormalizeColumnTypes, self).__init__()
        self.column_types = column_types

    def _transform(self, dataset: DataFrame) -> DataFrame:
        df = dataset
        for col_name in df.columns:
            if col_name in self.column_types:
                col_type = self.column_types[col_name]
                df = df.withColumn(col_name, df[col_name].cast(col_type))
        return df
11. Example: Unit Test
import pytest
from routing_challenge.transformer import NormalizeColumnTypes
import pyspark.sql.functions as F

class TestNormalizeColumnTypes(object):
    def test_normalize_column_types(self, spark):
        data = [('Alice', 20.0, '2017-07-09 11:35:00'),
                ('Bob', 0.0, '2017-08-24 19:25:00'),
                ('Nina', 40.0, None)]
        columns = ['driver_id', 'age', 'time']
        df = spark.createDataFrame(data, columns)
        column_types = {'driver_id': 'string', 'age': 'int', 'time': 'timestamp'}
        res_df = NormalizeColumnTypes(column_types).transform(df)
        assert set(column_types.keys()) == set(res_df.schema.names)
        assert df.count() == res_df.count()
        for col, dtype in res_df.dtypes:
            assert dtype == column_types[col]
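The test relies on a `spark` pytest fixture that is not shown in the slides; a minimal conftest.py providing one could look like this (an assumption, not part of the original deck):

import pytest
from pyspark.sql import SparkSession

# conftest.py -- session-scoped fixture that gives every test a local SparkSession
@pytest.fixture(scope='session')
def spark():
    session = (SparkSession.builder
               .master('local[2]')
               .appName('pipeline-oriented-analytics-tests')
               .getOrCreate())
    yield session
    session.stop()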