7 Key Recipes for Data Engineering
Scala Matsuri 2017
7 Key Recipes For Data Eng
Introduction
We will explore 7 key recipes for Data Engineering.
If you can only pick one, the 5th, on joins/cogroups, is the essential one.

About Me
Jonathan WINANDY
Scala user (6 years)
Lead Data Engineer:
- Data Lake building,
- Audit/Coaching,
- Spark/Scala/Kafka Trainings.
Founder of Univalence (BI / Big Data)
Co-Founder of CYM (Predictive Maintenance),
and Valwin (Health Care Data).

Bachir AIT MBAREK
Thank you

Outline
1. Organisations
2. Work Optimization
3. Staging
4. RDD/Dataframe
5. Join/Cogroup
6. Data quality
7. Real Programs

1. It’s always about our organizations! (in Europe)
1. Organisations
In Data Engineering, we tend to think that our problems come from, or are solved by, the tools we use.
However, our most difficult problems and our most durable solutions come from organisational contexts.
This is true for IT at large, but it is even more pronounced in Data Engineering.
Because Data Engineering
enables access to Data!
It enables access to Data in very complex organisations.
[Diagram: Your Team, Product, BI, Marketing; data, new data]
[Diagram: Your Team, Global Marketing, Global IT, Holding, Subsidiaries, and three groups of Marketing, IT and BI; data]
It happens to be very frustrating!
As a Data Engineer, you take part in some of the most technically diverse teams, which are:
● Running cutting-edge technologies,
● Solving some of the hardest problems,
while constantly depending on other teams that often don’t share your vision.
Small tips:
● One Hadoop cluster (no Test or QA clusters).
● Document your vision, so it can be shared (and build consensus on it ahead of time).
● What happens between teams matters a lot.

2. Optimizing our work
2. Work Optimization
To optimize our work, three key concerns govern our decisions:
● Lead time,
● Impact,
● Failure management.
Lead time: the period of time between the initial phase and completion.
Impact: positive effects beyond the current context.
Failure management: failure is the nominal case, and failures you did not prepare for will pile up.
Being proactive!
To avoid the “MapReduce then wait” pattern, there are two methods:
● Proactive Task Simulation,
● “What will fail?”
Proactive Task Simulation
The idea, when solving a task, is to:
● map all the possible ways to do it,
● for each way, estimate:
  ○ Lead time and cost,
  ○ Decidability,
  ○ Success rate,
  ○ Generated opportunities,
  ○ and other by-products,
● then choose which way to start with.
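One way to make this comparison concrete is to score each way and pick the best one. The sketch below is only an illustration: the Way class, the example figures and the scoring rule (lead time divided by success rate, which assumes a failed attempt is simply retried from scratch) are assumptions, not something given in the talk.

```scala
object TaskSimulation {
  // Hypothetical model of one "way" to solve a task; names and numbers are illustrative.
  final case class Way(name: String, leadTimeDays: Double, successRate: Double)

  // Assumed scoring rule: a failed attempt is retried from scratch, so the
  // expected lead time is leadTime / successRate (mean of a geometric law).
  def expectedLeadTime(w: Way): Double = w.leadTimeDays / w.successRate

  // Choose the way with the smallest expected lead time (ways must be non-empty).
  def firstToTry(ways: Seq[Way]): Way = ways.minBy(expectedLeadTime)

  val example: Seq[Way] = Seq(
    Way("rewrite the job", 10.0, 0.9),
    Way("patch the existing job", 3.0, 0.5),
    Way("wait for the upstream fix", 20.0, 0.8))
}
```

With these made-up figures, TaskSimulation.firstToTry(TaskSimulation.example) selects "patch the existing job" (3 / 0.5 = 6 expected days). The other criteria from the list (decidability, opportunities, by-products) are harder to quantify and are left out of this sketch.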
“What will fail?”
The idea is to guess what may fail in a given component.
Then you can engage in a discussion on:
● Knowing how likely it is to fail,
● Preventing that failure,
● Planning the recovery ahead of time.

3. Staging Data
Back to technical recipes!
3. Staging
Data is moving around, freeze it!
Staging changed with Big Data. We moved from transient staging (FTP, NFS, etc.) to persistent staging thanks to distributed solutions:
● in Kafka, we can retain logs for months,
● in HDFS, we can retain sources for years.
But there are a lot of staging anti-patterns out there:
● Updating directories,
● Incomplete datasets,
● Short retention.
Staging should be seen as a persistent data structure.
If you liked immutability in Scala, go for it with your Data!
Example with HDFS, writing to unique directories:
/staging
|-- $tablename
    |-- dtint=$dtint
        |-- dsparam.name=$dsparam.value
            |-- ...
                |-- ...
                    |-- uuid=$uuid
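A minimal sketch of how such a path could be built, assuming the levels are nested in the order listed. The stagingDir helper and its parameters are hypothetical, and dtint is assumed here to be a date encoded as an integer. Because every write gets a fresh UUID, a run never updates an existing directory.

```scala
import java.util.UUID

object StagingPath {
  // Builds /staging/<table>/dtint=<dtint>/<param>=<value>/.../uuid=<uuid>.
  // Hypothetical helper: the nesting order is an assumption based on the layout above.
  def stagingDir(table: String,
                 dtint: Int,
                 params: Seq[(String, String)],
                 uuid: UUID): String = {
    val partitions = params.map { case (k, v) => s"$k=$v" }
    val parts = (Seq("/staging", table, s"dtint=$dtint") ++ partitions) :+ s"uuid=$uuid"
    parts.mkString("/")
  }
}
```

A caller would typically pass UUID.randomUUID() so that each run writes to its own directory.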

4. Using RDDs or Dataframes
4. RDD/Dataframe
Dataframes have great performance, but are “untyped” and foreign.
RDDs have a robust Scala API, but are difficult to map from data sources.
SQL is the current lingua franca of Data.
Comparative Advantages
Dataframe              | RDD
Predicate push down    | Types!!
Bare metal / unboxed   | Nested structures
Connectors             | Better unit tests
Pluggable Optimizer    | Fewer stages
SQL + Meta             | Scala * Scala
RDD-based jobs are like marine mammals: fit for their environment starting from a certain size.
RDDs are building blocks for large jobs.
RDDs are very good for ETL workloads:
● Control over shuffles,
● Unit tests are easier to write.
They can leverage the Dataframe API at job boundaries:
● Loading and storing data with Dataframe APIs,
● Mapping Dataframes to case classes,
● Performing type-safe transformations.
Dataframes are perfect for:
● Data Exploration (notebook),
● Light Jobs (SQL + DF),
● Dynamic jobs (xlsx specs => Spark job).
User Defined Functions improve code reuse.
User Defined Aggregate Functions improve performance over Standard SQL.

5. Cogroup all the things
5. Cogroup
The cogroup is the best operation to link data together.
Cogroup API
from (left: RDD[(K, A)], right: RDD[(K, B)])
  ○ join      : RDD[(K, (A, B))]
  ○ outerJoin : RDD[(K, (Option[A], Option[B]))]
  ○ cogroup   : RDD[(K, (Seq[A], Seq[B]))]
from (rdd: RDD[(K, A)])
  ○ groupBy   : RDD[(K, Seq[A])]
With cogroup and groupBy, a given key K appears in exactly one row of the output dataset.
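To illustrate the "one row per key" property without a cluster, here is a small sketch of the same semantics on plain Scala collections. It is not Spark code: CogroupSketch and its methods are stand-ins written for this example (in Spark itself, cogroup lives on pair RDDs and exposes the grouped values as Iterable rather than Seq).

```scala
object CogroupSketch {
  // cogroup on plain Seqs: one output entry per distinct key found on either side.
  def cogroup[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Map[K, (Seq[A], Seq[B])] = {
    val ls = left.groupBy(_._1)
    val rs = right.groupBy(_._1)
    (ls.keySet ++ rs.keySet).map { k =>
      val as = ls.getOrElse(k, Seq.empty).map(_._2)
      val bs = rs.getOrElse(k, Seq.empty).map(_._2)
      (k, (as, bs))
    }.toMap
  }

  // An inner join can be derived from the cogrouped rows:
  // every pairing of a left value with a right value for the same key.
  def join[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Seq[(K, (A, B))] =
    cogroup(left, right).toSeq.flatMap { case (k, (as, bs)) =>
      for (a <- as; b <- bs) yield (k, (a, b))
    }
}
```

Since the joins can be rebuilt from the cogrouped rows but not the other way round, the cogroup is the more general building block.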
// Original job: filter and map each side, then cogroup.
rddL
  .filter(pL)
  .map(mL)
  .keyBy(kL)
  .cogroup(
    rddR
      .filter(pR)
      .map(mR)
      .keyBy(kR))  // key the right side with kR
  .map(mC)
REWRITE
rddL.keyBy(mL.andThen(kL))
  .cogroup(
    rddR.keyBy(mR.andThen(kR)))
  // CHECKPOINT on DISK (save)
  .map({ case (k, (ls, rs)) =>
    (k, (ls.filter(pL).map(mL),
         rs.filter(pR).map(mR))) })
  .map(mC)
CHECKPOINT on DISK
● Lines of code: 3000, duration: 30 min (non-blocking)
● Lines of code: 15, duration: 11 h (blocking)
Moving the code after the checkpoint allows fast feedback loops.
Cogroups allow writing tests on a minimised case.
Test workflow:
● Isolate potential cases,
● Get the smallest cogrouped row
  ○ output the row in test resources,
● Reproduce the bug,
● Write tests and fix code.
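The workflow above can be sketched as follows. Once the logic runs after the cogroup, it is a pure function of one cogrouped row, so a bug can be reproduced with a single hand-written row. The Order and Payment types and the unpaid function are hypothetical stand-ins for real records and for the mC step; nothing here comes from the talk's code base.

```scala
object CogroupRowTest {
  // Hypothetical record types, used only for this illustration.
  final case class Order(id: Int, amount: Double)
  final case class Payment(orderId: Int, amount: Double)

  // Stand-in for the post-cogroup logic: a pure function over one cogrouped row.
  def unpaid(row: (Int, (Seq[Order], Seq[Payment]))): Double = {
    val (_, (orders, payments)) = row
    orders.map(_.amount).sum - payments.map(_.amount).sum
  }

  // Smallest row reproducing a suspected case: a payment without any order.
  val orphanPayment: (Int, (Seq[Order], Seq[Payment])) =
    (42, (Seq.empty, Seq(Payment(42, 10.0))))
}
```

In a real project the minimal row would be saved under the test resources and loaded by the test, rather than written inline.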

6. Inline data quality
Data quality improves resilience to bad data.
However, data quality concerns often come second.
Our solution: integrate Data Quality deep inside jobs, by unifying Data Quality with Data transformation.
We defined a structure, Result, similar to ValidationNel (an applicative).
case class Result[T](value: Option[T],
                     annotations: Seq[Annotation])
case class Annotation(path: String,
                      typeName: String,
                      msg: String,
                      discardedData: Seq[String],
                      entityIdType: Option[String],
                      entityId: Option[String],
                      level: Int,
                      stage: String)
A Result is either:
● Containing a value, with a list of warnings,
● Empty, with a list containing the error and any warnings.
(Serialization and Big Data don’t like sum types, so it is pre-projected onto a product type.)
Then we can use applicatives to combine results.
case class Person(name: String, age: Int)
def build(name: Result[String],
          age: Result[Int]): Result[Person] =
  ...
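The body of build is elided on the slide. As a rough idea of what an applicative-style combination could look like, here is a sketch under stated assumptions: Annotation is reduced to two fields, and map2 is a helper introduced for this example, not the author's implementation.

```scala
object ResultSketch {
  // Simplified stand-in for the Annotation above (only two of its fields).
  final case class Annotation(path: String, msg: String)
  final case class Result[T](value: Option[T], annotations: Seq[Annotation])
  final case class Person(name: String, age: Int)

  // Applicative-style combination: the output has a value only if both inputs
  // have one, while the annotations of both sides are always kept.
  def map2[A, B, C](ra: Result[A], rb: Result[B])(f: (A, B) => C): Result[C] =
    Result(
      for (a <- ra.value; b <- rb.value) yield f(a, b),
      ra.annotations ++ rb.annotations)

  // One possible build, written with the hypothetical map2 helper.
  def build(name: Result[String], age: Result[Int]): Result[Person] =
    map2(name, age)((n, a) => Person(n, a))
}
```

Unlike a monadic chain that stops at the first failure, this keeps the annotations of both fields even when both are empty, which is what makes the quality report complete.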
The annotations are accumulated at the top of the hierarchy and saved with the data.
Annotations can be aggregated on dimensions:
Message:
● EMPTY_STRING
● MULTIPLE_VALUES
● NOT_IN_ENUM
● PARSE_ERROR
● ______________
Levels:
● WARNING
● ERROR
● CRITICAL
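As an illustration of such an aggregation, the sketch below counts annotations per (message, level) pair on a plain collection. The reduced Annotation class is an assumption made for this example: the original stores level as an Int, and since its mapping to WARNING / ERROR / CRITICAL is not given, level names are used directly.

```scala
object AnnotationStats {
  // Simplified annotation: a message code and a level name (assumed representation).
  final case class Annotation(msg: String, level: String)

  // Count annotations for each (message, level) pair.
  def countByMsgAndLevel(as: Seq[Annotation]): Map[(String, String), Int] =
    as.groupBy(a => (a.msg, a.level)).map { case (key, group) => (key, group.size) }
}
```

The same grouping could be done on any other field of the annotation, such as path or stage.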
If you are interested in the approach, you can take a look at this repository, which provides macros based on Shapeless to build Result[T] from case classes:
https://github.com/ahoy-jon/autoBuild (~October 2015)

7. Designing real programs
7. Real programs
Most pipeline parts are designed as stateless computations.
They either require no external state (great) or infer their state from the filesystem state (meh).
Spark allows us to program inside the Driver, so we can create actual programs.
In Scala, we can use:
● Scopt to parse common args and feature flags,
● Typesafe Config to load/override program settings,
● Event Sourcing to read/write app events,
● sbt-pack and Coursier to package and create launchers.
Deterministic effects
We then make sure that our programs are as deterministic as possible, and idempotent (if possible).
Example: storing past executions so as not to recompute something already computed, unless forced.
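A minimal sketch of that example, assuming each execution can be identified by a job id. ExecutionLog is a hypothetical in-memory stand-in for this illustration; a real program would persist past executions (for instance as events) rather than keep them in a mutable map.

```scala
import scala.collection.mutable

object RunOnce {
  // In-memory stand-in for a persisted log of past executions (assumption).
  final class ExecutionLog {
    private val done = mutable.Map.empty[String, String]

    // Runs `compute` only if `jobId` has no stored result, unless `force` is set.
    def runOnce(jobId: String, force: Boolean = false)(compute: => String): String =
      if (!force && done.contains(jobId)) done(jobId)
      else {
        val out = compute
        done(jobId) = out
        out
      }
  }
}
```

Calling runOnce twice with the same id evaluates the computation once and returns the stored result the second time; passing force = true recomputes and overwrites it.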
In progress: project Kerguelen, an API for data jobs.
It enables the creation of coherent jobs, integrating different abstraction levels:
Level 0 Event Sourcing
Level 1 Name resolving
Level 2 Triggered exec (schema capture, deltaQA, …)
Level 3 Scheduling (replay, coherence, ...)
Level 4 “code as data” (=> continuous delivery)

8. More
More recipes:
● Automatic QA,
● Structural Sharing for Datasets,
● Jsonoids mapping generation,
● Advanced UDAF,
● ...
But that’s it for today!

Conclusion
Thank you
for listening!
Questions?
jonathan@univalence.io
@ahoy_jon

PSUG Note
If you happen to visit Paris, don’t hesitate to submit a talk to our Paris Scala User Group.
7 key recipes for data engineering

  • 1. 7 Key Recipes for Data Engineering Scala Matsuri 2017 データ・エンジニアリング 7大レシピ
  • 2. 7 Key Recipes For Data Eng Introduction We will explore 7 key recipes on Data Engineering. If you could only pick one, the 5th on joins/cogroups is essential. 2 文字数制限あり。折りたたみやエンコーディングは無し。 データ・エンジニアリングの 7大レシピ
  • 3. 7 Key Recipes For Data Eng About Me Jonathan WINANDY Scala user (6 years) Lead Data Engineer: - Data Lake building, - Audit/Coaching, - Spark/Scala/Kafka Trainings. Founder of Univalence (BI / Big Data) Co-Founder of CYM (Predictive Maintenance), and Valwin (Health Care Data). 3 データエンジニアとしてデータ基盤構築やトレーニング等を実施 Univalence、CYM、Valwin などのデータ分析ビジネスを創業
  • 4. 7 Key Recipes For Data Eng Bachir AIT MBAREK 4 Thank you
  • 5. 7 Key Recipes For Data Eng Outline 1. Organisations 2. Work Optimization 3. Staging 4. RDD/Dataframe 5. Join/Cogroup 6. Data quality 7. Real Programs 5
  • 6. 1. It’s always about our organizations! (in Europe) 6 一に組織 (ヨーロッパはこればっかり)
  • 7. 7 Key Recipes For Data Eng 7 1. Organisations In Data Engineering, we tend to think our problems come from or are solved by those tools : データエンジニアリングではツールが問題の原因であるとか あるいはツールによって問題を解くのだと思われがち
  • 8. 7 Key Recipes For Data Eng 1. Organisations However our most difficult problems or durable solutions come from organisational contexts. It’s true for IT at large, but it’s much more dominant in Data Engineering. 8 IT において、最も困難な課題や持続的な解決策は組織の文脈からやってくる この点、データエンジニアリングではさらに支配的
  • 9. 7 Key Recipes For Data Eng 1. Organisations 9 Because Data Engineering enables access to Data! 理由はデータ・エンジニアリングはデータへのアクセスを活性化させるから
  • 10. 7 Key Recipes For Data Eng 10 It enables access to Data in very complex organisations. 1. Organisations Product BI Your TeamMarketing data new data 複雑な組織においてデータアクセスを活性化させると…
  • 11. 7 Key Recipes For Data Eng data 11 Your Team Global Marketing 1. Organisations It enables access to Data in very complex organisations. Global IT Marketing IT BI Holding Subsidies Marketing IT BI Marketing IT BI 「超」複雑な組織においてデータアクセスを活性化させると…
  • 12. 7 Key Recipes For Data Eng It happens to be very frustrating! 12 1. Organisations By being a Data Eng, you take part in some of the most technically diverse teams that are: ● Running cutting edge technologies, ● Solving some of the hardest problems, while being constantly dependent on other teams that often don’t share your vision. 先端技術を駆使して難題に取り組みつつ、ビジョンを共有しない他のチームに依存して仕 事を進めざるをえない。とてもフラストレーションが溜まる状況だ
  • 13. 1. Organisations. Small tips: ● One Hadoop cluster (no Test or QA clusters). ● Document your vision, so it can be shared ahead of time. ● What happens between teams matters a lot.
  • 14. 2. Optimizing our work
  • 15. 2. Work Optimization. To optimize our work, 3 key concerns govern our decisions: ● Lead time, ● Impact, ● Failure management.
  • 16. 2. Work Optimization. Lead time: the period of time between the initial (planning) phase and completion. Impact: positive effects beyond the current context. Failure management: failure is the nominal case, and unprepared failures will pile up.
  • 17. 2. Work Optimization. Being Proactive! To avoid the “MapReduce then wait” pattern, there are two methods: ● Proactive Task Simulation, ● “What will fail?”
  • 18. 2. Work Optimization. Proactive Task Simulation. The idea, when solving a task, is to: ● map all the possible ways to solve it, ● for each way, estimate: ○ lead time and cost, ○ decidability, ○ success rate, ○ generated opportunities, ○ and other by-products, ● then choose which way to start with.
  • 19. 2. Work Optimization. What will fail? The idea is to guess what may fail on a given component. Then you can engage in a discussion on: ● how likely it is to fail, ● how to prevent that failure, ● how to plan the recovery ahead.
  • 20. 3. Staging Data. Back to technical recipes!
  • 21. 3. Staging. Data is moving around, freeze it! Staging changed with Big Data. We moved from transient staging (FTP, NFS, etc.) to persistent staging thanks to distributed solutions: ● in Kafka, we can retain logs for months, ● in HDFS, we can retain sources for years.
  • 22. 3. Staging. But there are a lot of staging anti-patterns out there: ● Updating directories, ● Incomplete datasets, ● Short retention. Staging should be seen as a persistent data structure. If you liked immutability in Scala, go for it with your Data!
  • 23. 3. Staging. Example, with HDFS: writing in unique directories (a UUID per write):
      /staging
       |-- $tablename
           |-- dtint=$dtint
               |-- dsparam.name=$dsparam.value
                   |-- ...
                       |-- ...
                           |-- uuid=$uuid
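The directory layout above can be sketched as a small path builder. This is only an illustration in plain Scala: `StagingPath`, `build` and the reading of `dtint` as a date encoded as an integer are assumptions, not something the slides specify.

```scala
import java.util.UUID

object StagingPath {
  // Hypothetical helper: builds a write-once staging directory path.
  // Assumption: dtint is a date encoded as an integer (for example 20170101).
  // params are the dataset parameters, rendered as name=value segments.
  def build(table: String,
            dtint: Int,
            params: Seq[(String, String)],
            uuid: UUID = UUID.randomUUID()): String = {
    val paramParts = params.map { case (name, value) => s"$name=$value" }
    // A fresh uuid segment per write means an existing directory is never updated.
    ((Seq("/staging", table, s"dtint=$dtint") ++ paramParts) :+ s"uuid=$uuid").mkString("/")
  }
}
```

Because every write gets a new `uuid=` segment, a job that is re-run adds a directory next to the previous one instead of updating it, which is what makes staging behave like a persistent data structure.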
  • 24. 4. Using RDDs or Dataframes
  • 25. 4. RDD/Dataframe. Dataframes have great performance, but are “untyped” and foreign. RDDs have a robust Scala API, but are difficult to map from data sources. SQL is the current lingua franca of Data.
  • 26. 4. RDD/Dataframe. Comparative Advantages:
      Dataframe: predicate push down, bare metal / unboxed, connectors, pluggable optimizer, SQL + Meta.
      RDD: types!!, nested structures, better unit tests, fewer stages, Scala * Scala.
  • 27. 4. RDD/Dataframe. RDD-based jobs are like marine mammals: fit for their environment starting from a certain size. RDDs are building blocks for large jobs.
  • 28. 4. RDD/Dataframe. RDDs are very good for ETL workloads: ● Control over shuffles, ● Unit tests are easier to write. They can leverage the Dataframe API at job boundaries: ● Loading and storing data with Dataframe APIs, ● Mapping Dataframes to case classes, ● Performing type-safe transformations.
  • 29. 4. RDD/Dataframe. Dataframes are perfect for: ● Data exploration (notebook), ● Light jobs (SQL + DF), ● Dynamic jobs (xlsx specs => Spark job). User Defined Functions improve code reuse; User Defined Aggregate Functions improve performance over standard SQL.
  • 30. 5. Cogroup all the things
  • 31. 5. Cogroup. The cogroup is the best operation to link data together.
  • 32. 5. Cogroup. Cogroup API:
      from (left:RDD[(K,A)], right:RDD[(K,B)])
        ○ join      : RDD[(K,(A, B))]
        ○ outerJoin : RDD[(K,(Option[A], Option[B]))]
        ○ cogroup   : RDD[(K,(Seq[A], Seq[B]))]
      from (rdd:RDD[(K,A)])
        ○ groupBy   : RDD[(K,Seq[A])]
    On cogroup and groupBy, for a given key:K, there is exactly one row with that key in the output dataset. (In Spark's RDD API the corresponding methods are join, fullOuterJoin, cogroup and groupByKey, and the grouped values are Iterable rather than Seq.)
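To make the relationship between these signatures concrete, here is a minimal sketch on plain Scala collections rather than Spark, so it runs without a cluster. `LocalCogroup` and its methods are illustrative names; the point is only that both joins can be derived from a cogroup, which yields one row per key.

```scala
object LocalCogroup {
  // cogroup on in-memory pairs: exactly one output entry per key.
  def cogroup[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Map[K, (Seq[A], Seq[B])] = {
    val ls = left.groupBy(_._1)
    val rs = right.groupBy(_._1)
    (ls.keySet ++ rs.keySet).map { k =>
      (k, (ls.getOrElse(k, Nil).map(_._2), rs.getOrElse(k, Nil).map(_._2)))
    }.toMap
  }

  // Inner join derived from cogroup: per key, the cross product of both sides.
  def join[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Seq[(K, (A, B))] =
    cogroup(left, right).toSeq.flatMap { case (k, (as, bs)) =>
      for (a <- as; b <- bs) yield (k, (a, b))
    }

  // Full outer join derived from cogroup: an empty side becomes a single None.
  def outerJoin[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Seq[(K, (Option[A], Option[B]))] =
    cogroup(left, right).toSeq.flatMap { case (k, (as, bs)) =>
      val la: Seq[Option[A]] = if (as.isEmpty) Seq(None) else as.map(Some(_))
      val lb: Seq[Option[B]] = if (bs.isEmpty) Seq(None) else bs.map(Some(_))
      for (a <- la; b <- lb) yield (k, (a, b))
    }
}
```

The joins multiply rows when a key repeats on both sides, while the cogroup keeps all values for a key in a single row, which is why it is convenient for inspecting and testing linked data.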
  • 33. 5. Cogroup. A typical job: filter, map and key each side, cogroup, then map the result:
      rddL
        .filter(pL)
        .map(mL)
        .keyBy(kL)
        .cogroup(
          rddR
            .filter(pR)
            .map(mR)
            .keyBy(kR))
        .map(mC)
  • 34. 5. Cogroup. REWRITE, so that the cogrouped data can be saved with a CHECKPOINT on DISK:
      rddL.keyBy(mL.andThen(kL))
        .cogroup(
          rddR.keyBy(mR.andThen(kR)))
        .map({case (k,(ls,rs)) =>
          (k,(ls.filter(pL).map(mL),
              rs.filter(pR).map(mR)))})
        .map(mC)
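The rewrite can be checked on a toy case. The sketch below uses plain Scala collections with a local stand-in for `cogroup`; the inputs and the functions `pL`, `mL`, `kL`, `pR`, `mR`, `kR` are hypothetical. It shows that filtering and mapping inside each group gives the same result as filtering and mapping before the cogroup, with one difference worth knowing about.

```scala
object CogroupRewrite {
  // Local stand-in for RDD.cogroup: one output entry per key.
  def cogroup[K, A, B](l: Seq[(K, A)], r: Seq[(K, B)]): Map[K, (Seq[A], Seq[B])] = {
    val ls = l.groupBy(_._1)
    val rs = r.groupBy(_._1)
    (ls.keySet ++ rs.keySet).map { k =>
      (k, (ls.getOrElse(k, Nil).map(_._2), rs.getOrElse(k, Nil).map(_._2)))
    }.toMap
  }

  // Toy inputs and functions (all hypothetical).
  val rawL = Seq("1,10", "1,-5", "2,-7") // "id,amount"
  val rawR = Seq("1:x", "3:y")           // "id:label"
  val pL: String => Boolean = s => !s.contains("-") // keep non-negative amounts
  val pR: String => Boolean = _ => true
  val mL: String => (Int, Int) = s => { val p = s.split(","); (p(0).toInt, p(1).toInt) }
  val mR: String => (Int, String) = s => { val p = s.split(":"); (p(0).toInt, p(1)) }
  val kL: ((Int, Int)) => Int = _._1
  val kR: ((Int, String)) => Int = _._1

  // Original shape: filter and map first, then key and cogroup.
  val before = cogroup(
    rawL.filter(pL).map(mL).map(x => (kL(x), x)),
    rawR.filter(pR).map(mR).map(x => (kR(x), x)))

  // Rewritten shape: key and cogroup the raw rows (this is what would be saved),
  // then filter and map inside each group.
  val cogroupedRaw = cogroup(
    rawL.map(x => (mL.andThen(kL)(x), x)),
    rawR.map(x => (mR.andThen(kR)(x), x)))
  val after = cogroupedRaw.map { case (k, (ls, rs)) =>
    (k, (ls.filter(pL).map(mL), rs.filter(pR).map(mR)))
  }
}
```

In this toy case key 2 survives in `after` with two empty groups, because its only row is filtered out after grouping; `before` does not contain that key at all. The final `mC` step therefore has to drop or tolerate empty groups. The rewrite also evaluates `mL` and `mR` twice per row, once to compute the key and once inside the group.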
  • 35. 5. Cogroup. The job is split in two parts by the CHECKPOINT on DISK: one part has 15 lines of code and runs for 11h (blocking), the other has 3000 lines of code and runs for 30 min (non-blocking). Moving the code after the checkpoint allows fast feedback loops.
  • 36. 5. Cogroup. Cogroups allow writing tests on a minimised case, which makes bugs easier to reproduce. Test workflow: ● Isolate potential cases, ● Get the smallest cogrouped row ○ output the row in test resources, ● Reproduce the bug, ● Write tests and fix code.
  • 37. 6. Inline data quality
  • 38. 6. Inline data quality. Data quality improves resilience to bad data. However, data quality concerns often come second.
  • 39. 6. Inline data quality. Our solution: integrate Data Quality deep inside jobs, by unifying Data quality with Data transformation. We defined a structure, Result, similar to ValidationNel (Applicatives).
  • 40. 6. Inline data quality.
      case class Result[T](value: Option[T],
                           annotations: Seq[Annotation])

      case class Annotation(path: String,
                            typeName: String,
                            msg: String,
                            discardedData: Seq[String],
                            entityIdType: Option[String],
                            entityId: Option[String],
                            level: Int,
                            stage: String)
  • 41. 6. Inline data quality. case class Result[T](value:Option[T], annotations:Seq[Annotation]). A Result is either: ● containing a value, with a list of warnings, ● empty, with a list containing the error and warnings. (Serialization and Big Data don’t like sum types, so it is pre-projected onto a product type.)
  • 42. 6. Inline data quality. case class Result[T](value:Option[T], annotations:Seq[Annotation]). Then we can use applicatives to combine results:
      case class Person(name:String, age:Int)
      def build(name:Result[String],
                age:Result[Int]):Result[Person] = ...
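One way the elided body of `build` could look is sketched below. It is an assumption, not the authors' implementation: `Annotation` is cut down to three fields, the level encoding (1 for a warning, 2 for an error) is invented for the example, and `ok`, `failed` and `map2` are hypothetical helper names.

```scala
// Simplified Annotation for the example; the Annotation shown on the slides has more fields.
// Assumption: level 1 is a warning, level 2 is an error.
case class Annotation(path: String, msg: String, level: Int)

case class Result[T](value: Option[T], annotations: Seq[Annotation])

object Result {
  // A value, possibly with warnings.
  def ok[T](v: T, warnings: Seq[Annotation] = Nil): Result[T] = Result(Some(v), warnings)

  // No value, with the annotation explaining why.
  def failed[T](error: Annotation): Result[T] = Result(None, Seq(error))

  // Applicative-style combination: the value exists only if both inputs have one,
  // but the annotations of both sides are always kept.
  def map2[A, B, C](ra: Result[A], rb: Result[B])(f: (A, B) => C): Result[C] = {
    val value = for (a <- ra.value; b <- rb.value) yield f(a, b)
    Result(value, ra.annotations ++ rb.annotations)
  }
}

case class Person(name: String, age: Int)

object PersonBuilder {
  def build(name: Result[String], age: Result[Int]): Result[Person] =
    Result.map2(name, age)((n, a) => Person(n, a))
}
```

Unlike a monadic chain that stops at the first failure, this combination reports every problem found on a record in one pass, which is the property that makes it useful for data quality.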
  • 43. 6. Inline data quality. case class Result[T](value:Option[T], annotations:Seq[Annotation]). The annotations are accumulated at the top of the hierarchy, and saved with the data.
  • 44. 6. Inline data quality. Annotations can be aggregated on dimensions. Message: ● EMPTY_STRING ● MULTIPLE_VALUES ● NOT_IN_ENUM ● PARSE_ERROR ● ... Levels: ● WARNING ● ERROR ● CRITICAL
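Aggregating on a dimension amounts to a group-and-count over the saved annotations. A minimal sketch on plain Scala collections follows; the three-field `Annotation`, the integer level encoding and the `AnnotationStats.countBy` helper are assumptions made for the example.

```scala
// Simplified Annotation for the example; the Annotation shown on the slides has more fields.
// Assumption: level 1 is a warning, level 2 is an error.
case class Annotation(path: String, msg: String, level: Int)

object AnnotationStats {
  // Counts annotations per value of the chosen dimension (message, level, path, ...).
  def countBy[D](annotations: Seq[Annotation])(dimension: Annotation => D): Map[D, Int] =
    annotations.groupBy(dimension).map { case (d, as) => (d, as.size) }
}
```

For example, `AnnotationStats.countBy(annotations)(_.msg)` gives the number of occurrences per message, and `AnnotationStats.countBy(annotations)(a => (a.path, a.level))` breaks the counts down by field and severity.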
  • 45. 6. Inline data quality. If you are interested in the approach, you can take a look at this repository: macros based on Shapeless to build Result[T] from case classes. https://github.com/ahoy-jon/autoBuild (~October 2015)
  • 46. 7. Designing real programs
  • 47. 7. Real programs. Most pipeline parts are designed as stateless computations. They either require no external state (great) or infer their state from the filesystem state (meh).
  • 48. 7. Real programs. Spark allows us to program inside the Driver. We can create actual programs. In Scala, we can use: ● scopt to parse common args and feature flips, ● Typesafe Config to load/overload program settings, ● Event Sourcing to read/write app events, ● sbt-pack and Coursier to package and create launchers.
  • 49. 7. Real programs. Deterministic effects. We then make sure that our programs are as deterministic as possible, and idempotent (if possible). Example: storing past executions so as not to recompute something already computed, unless forced.
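The example above (skip what has already been computed, unless forced) can be sketched with a small execution log. `ExecutionLog` and `runOnce` are hypothetical names, and the log is kept in memory only to keep the example short; a real program would persist it, for instance as app events.

```scala
import scala.collection.mutable

// Hypothetical in-memory record of past executions, keyed by a job identifier.
class ExecutionLog {
  private val done = mutable.Map.empty[String, String]

  // Runs compute only if jobKey has no recorded output, or if force is set.
  // The by-name parameter means the computation is not even started otherwise.
  def runOnce(jobKey: String, force: Boolean = false)(compute: => String): String =
    if (!force && done.contains(jobKey)) done(jobKey)
    else {
      val output = compute
      done(jobKey) = output
      output
    }
}
```

Calling `runOnce` twice with the same key returns the recorded output the second time, so replaying a whole pipeline only recomputes the missing parts; `force = true` covers the case where a past result must be rebuilt.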
  • 50. 7. Real programs. In progress: project Kerguelen, an API for data jobs. It enables the creation of coherent jobs, integrating different abstraction levels: Level 0: Event Sourcing; Level 1: Name resolving; Level 2: Triggered exec (schema capture, deltaQA, …); Level 3: Scheduling (replay, coherence, ...); Level 4: “code as data” (=> continuous delivery).
  • 51. 8. More. More recipes: ● Automatic QA, ● Structural Sharing for Datasets, ● Jsonoids mapping generation, ● Advanced UDAF, ● ... But that’s it for today!
  • 54. PSUG Note. If you happen to visit Paris, don’t hesitate to submit a talk at our Paris Scala User Group.