2. About Myself
● Ph.D. in Theoretical Cybernetics @ Glushkov Institute of Cybernetics
● Machine Learning Engineering consultant
● Agile advisor for Data Science teams
● Google Scholar: https://scholar.google.com/citations?user=D_0HczYAAAAJ
● Github: https://github.com/bbiletskyy
● LinkedIn: https://www.linkedin.com/in/borys-biletskyy-a4aa313/
4. Why Pipeline-Oriented Data Analytics?
● Bringing ML models to production
● ML model logistics
● Data Science<->Data Engineering handover
● Fast and reproducible experiments
● Maintainability/Scalability/Reusability/Testability
● Unification of ML workflows
6. How ML Flows Look in Practice
users_df = (dataset
    .sample(True, 0.001, 19)
    .select('user_id', 'name', 'age')
    .filter(F.col('name').isNotNull())
    .withColumn('age', F.col('age').cast(IntegerType()))
    .withColumnRenamed('age', 'user_age')
    .join(contracts_df, on='user_id', how='left')
    .dropDuplicates())
w = Window.partitionBy('user_id').orderBy(F.col('order_date').asc())
order_features_df = (users_df
    .join(orders_df, on='user_id', how='left')
    .withColumn('order_week_day', F.dayofweek('order_date'))
    .drop('address')
    .withColumn('prev_order_date', F.lag(F.col('order_date')).over(w))
    .withColumn('days_between_orders', F.datediff(F.col('order_date'), F.col('prev_order_date')))
    .drop('prev_order_date')
    .where(F.col('is_delivered').isNotNull())
    .where(F.col('price') > 10.0)
    .groupBy('user_id').agg(F.count('*').alias('order_count'),
                            F.max(F.col('price')).alias('max_order_price'),
                            F.min(F.col('price')).alias('min_order_price')))
● Where are logic boundaries?
○ i.e. related logic is scattered across the chain (color-coded on the original slide)
● How to test?
● How to reuse?
● How to maintain?
(Slide callouts: "Data cleaning", "Enrichment and feature engineering")
7. How ML Flows Could Look in Practice
user_features_df = PipelineModel([
    Sample(0.01, replacement=False),
    RenameColumns({'U_ID': 'user_id', 'AGE': 'age'}),
    SelectColumns('user_id', 'age'),
    NormalizeColumnTypes({'user_id': 'string', 'age': 'int'}),
    DropOutliers(col='age', min=18, max=120),
    Join(ContractDao(2019, 10, 7).get(), on='user_id'),
    MaxOrderCost(),
    MinOrderCost(),
    AvgDaysBetweenOrders(),
    AvgOrdersPerMonth('avg_orders_per_month'),
    DropColumns('address', 'gender'),
]).transform(UserDao(2019, 10, 7).get())
● Narrow-scoped components (see the sketch after this list)
● Clear logic boundaries
● Testable
● Modular
● Configurable
● Reusable
● Scalable
● Faster experiments
● Production ready
● Reproducible experiments
● Maintainable
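As an illustration of such a narrow-scoped component, here is a minimal sketch of what a stage like DropOutliers from the pipeline above might look like. The class name and its parameters come from the slide; the implementation below is an assumption, not the original code.

from pyspark.sql import DataFrame
import pyspark.sql.functions as F
from pyspark.ml import Transformer

# Hypothetical implementation of the DropOutliers stage used in the pipeline above.
class DropOutliers(Transformer):
    """Drops rows whose value in `col` falls outside the [min, max] range."""
    def __init__(self, col: str, min: float, max: float):
        super(DropOutliers, self).__init__()
        self.col = col
        self.min = min
        self.max = max

    def _transform(self, dataset: DataFrame) -> DataFrame:
        # Keep only rows within the allowed range; rows with null values are dropped as well.
        return dataset.where(F.col(self.col).between(self.min, self.max))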
8. Pipeline-Oriented Data Analytics: Basics
● Pipelines consist of Stages
○ Everything is a Pipeline Stage
● There are two types of Stages (see the sketch after this list):
○ Transformer: DataFrame->DataFrame
○ Estimator: DataFrame->Transformer
● Pipelines are also Pipeline Stages
○ Allows nested Pipelines
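These are the standard pyspark.ml abstractions. A minimal sketch with stock stages; the toy DataFrame, the column names, and the choice of StringIndexer/VectorAssembler/LogisticRegression below are invented for illustration:

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.master('local[1]').getOrCreate()
train_df = spark.createDataFrame(
    [(34, 'DE', 1.0), (25, 'NL', 0.0), (41, 'DE', 1.0)],
    ['age', 'country', 'label'])

# Estimator: DataFrame -> Transformer (fit() returns a StringIndexerModel)
indexer = StringIndexer(inputCol='country', outputCol='country_idx')
# Transformer: DataFrame -> DataFrame
assembler = VectorAssembler(inputCols=['age', 'country_idx'], outputCol='features')

# A Pipeline is itself a Pipeline Stage, so it can be nested inside another Pipeline.
features_pipeline = Pipeline(stages=[indexer, assembler])
full_pipeline = Pipeline(stages=[features_pipeline, LogisticRegression(labelCol='label')])

model = full_pipeline.fit(train_df)       # fitting an Estimator yields a Transformer (PipelineModel)
predictions = model.transform(train_df)   # DataFrame -> DataFrame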
9. Pipeline-Oriented Data Analytics
● Custom libraries of Pipeline Stages
○ Estimators
○ Business logic transformers
● Utility Pipeline Stages
○ Sampling
○ Repartitioning
○ Checkpointing
● Infrastructural Stages (beware of side effects!)
○ Logging
○ Data validation
○ Metrics collection
● Control structures in ML Pipelines (beware of persistence!), e.g. an IF stage like the one sketched below
○ IF(condition=mode.is_train(), then=[AddLabelColumn()], else_=[DropHistory()])
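Such an IF stage is not part of pyspark.ml; the following is a rough sketch of how it could be written, with the names and behaviour being assumptions. As the slide warns, a stage holding plain Python attributes like this is not persistable with the standard ML writers.

from typing import List, Optional
from pyspark.sql import DataFrame
from pyspark.ml import Transformer

# Hypothetical control-structure stage: applies one of two lists of transformers
# depending on a condition evaluated at pipeline-construction time.
class IF(Transformer):
    def __init__(self, condition: bool, then: List[Transformer],
                 else_: Optional[List[Transformer]] = None):
        super(IF, self).__init__()
        self.condition = condition
        self.then = then
        self.else_ = else_ or []

    def _transform(self, dataset: DataFrame) -> DataFrame:
        df = dataset
        for stage in (self.then if self.condition else self.else_):
            df = stage.transform(df)
        return df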
10. Example: Column Type Normalization
from pyspark.sql import DataFrame
from pyspark.ml import Transformer
from typing import Dict
class NormalizeColumnTypes(Transformer):
    """
    Transformer that changes column types using the provided column_types dictionary.
    Standard pyspark column types are supported: int, string, timestamp, float, double...
    Produces null values if the column value can't be cast to the specified type.
    NB this transformer is not persistable.
    """
    def __init__(self, column_types: Dict[str, str]):
        super(NormalizeColumnTypes, self).__init__()
        self.column_types = column_types

    def _transform(self, dataset: DataFrame) -> DataFrame:
        df = dataset
        for col_name in df.columns:
            if col_name in self.column_types:
                col_type = self.column_types[col_name]
                df = df.withColumn(col_name, df[col_name].cast(col_type))
        return df
11. Example: Unit Test
import pytest
from routing_challenge.transformer import NormalizeColumnTypes
import pyspark.sql.functions as F

class TestNormalizeColumnTypes(object):
    def test_normalize_column_types(self, spark):
        data = [('Alice', 20.0, '2017-07-09 11:35:00'),
                ('Bob', 0.0, '2017-08-24 19:25:00'),
                ('Nina', 40.0, None)]
        columns = ['driver_id', 'age', 'time']
        df = spark.createDataFrame(data, columns)
        column_types = {'driver_id': 'string', 'age': 'int', 'time': 'timestamp'}
        res_df = NormalizeColumnTypes(column_types).transform(df)
        assert set(column_types.keys()) == set(res_df.schema.names)
        assert df.count() == res_df.count()
        for col, dtype in res_df.dtypes:
            assert dtype == column_types[col]
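The test relies on a `spark` pytest fixture that is not shown in the slides; a minimal conftest.py providing one could look like this (an assumption, not part of the original deck):

import pytest
from pyspark.sql import SparkSession

# conftest.py -- session-scoped fixture that gives every test a local SparkSession
@pytest.fixture(scope='session')
def spark():
    session = (SparkSession.builder
               .master('local[2]')
               .appName('pipeline-oriented-analytics-tests')
               .getOrCreate())
    yield session
    session.stop()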