7 Key Recipes for
Data Engineering
Introduction
We will explore 7 key recipes for Data
Engineering.
The 5th is absolutely game-changing!
Thank You
Bachir AIT MBAREK
BI and Big Data Consultant
About Me
Jonathan WINANDY
Lead Data Engineer:
- Data Lake building,
- Audit / Coaching,
- Spark Training.
Founder of Univalence (BI / Big Data)
Co-Founder of CYM (IoT / Predictive Maintenance),
Craft Analytics† (BI / Big Data),
and Valwin (Health Care Data).
2016 has been amazing for Data Engineering!
but ...
1. It’s all about our organisations!
1. It’s all about our Organisations
Data engineering is not about scaling
computation.
1. It’s all about our Organisations
Data engineering is not a support
function
for Data Scientists[1].
[1] whatever they are nowadays
1. It’s all about our Organisations
Instead, Data engineering
enables access to Data!
1. It’s all about our Organisations
access to Data … in complex organisations.
[Diagram: data and new data flowing between You, BI, Ops, Product, and Marketing]
1. It’s all about our Organisations
access to Data … in complex organisations.
[Diagram: the same data flows inside a holding company, replicated across Entity 1 … Entity N, each with its own Marketing and IT]
1. It’s all about our Organisations
access to Data … in complex organisations.
It’s very frustrating!
We run a support group meetup if you are interested: Paris Data Engineers!
1. It’s all about our Organisations
Small tips:
Only one Hadoop cluster (no TEST/REC/INT/PREPROD).
No Air-Data-Eng, it helps no one.
Radical transparency with other teams.
Hack that sh**.
2. Optimising our work
2. Optimising our work
There are 3 key concerns governing our decisions:
Lead time
Impact
Failure management
2. Optimising our work
Lead time (noun):
The period of time between the initial phase of a process and the emergence
of results, as between the planning and completed manufacture of a product.
Short lead times are essential!
The Elastic stack helps a lot in this area.
2. Optimising our work
Impact
To have impact, we have to analyse beyond
immediate needs. That way, we’re able to
provide solutions to entire classes of
problems.
2. Optimising our work
Failure management
Things fail, be prepared!
On the same morning, the RER A commuter line
and our Hadoop job tracker can both fail.
Unprepared failures may pile up and lead to huge waste.
2. Optimising our work
How to mitigate failure in 7 questions:
“What is likely to fail?” $componentName_____
“How? (root cause)”
“Can we know if this will fail?”
“Can we prevent this failure?”
“What are the impacts?”
“How do we fix it when it happens?”
“Can we make that fix easier today?”
2. Optimising our work
Track your work!
3. Staging the Data
3. Staging the data
Data is moving around, freeze it!
Staging changed with Big Data. We moved from transient
staging (FTP, NFS, etc.) to persistent staging in distributed
solutions:
● In Streaming with Kafka, we may retain logs in Kafka
for several months.
● In Batch, staging in HDFS may retain source Data for
years.
3. Staging the data
Modern staging anti-patterns:
Dropping destination places before moving the Data.
Having incomplete data visible (see the sketch after this list).
Short log retention in streams (=> new failure modes).
Modern staging should be seen as a persistent data structure.
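A minimal sketch of one way to avoid the “incomplete data visible” anti-pattern on HDFS, assuming a write-to-temporary-then-rename discipline (the helper is ours, not from the deck):

import org.apache.hadoop.fs.{FileSystem, Path}

// Write the full dataset to a temporary directory first, then publish it
// with a single rename: HDFS renames are atomic, so readers either see
// nothing or the complete data, never a partial copy.
def publish(fs: FileSystem, tmp: Path, dest: Path): Unit = {
  require(fs.exists(tmp), s"$tmp must be fully written before publishing")
  if (fs.exists(dest))
    throw new IllegalStateException(s"refusing to drop existing $dest")
  if (!fs.rename(tmp, dest))
    throw new IllegalStateException(s"could not publish $tmp to $dest")
}

Note the guard against an existing destination: it also rules out the “dropping destination places” anti-pattern above.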
3. Staging the data
HDFS staging:
/staging
|-- $tablename
    |-- dtint=$dtint
        |-- dsparam.name=$dsparam.value
            |-- ...
                |-- ...
                    |-- uuid=$uuid
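As a minimal sketch, that layout could be produced by a path builder like this (the helper is hypothetical; the segment names mirror the tree above):

import java.util.UUID

// Compose a staging path: table, date partition, data-source parameters,
// and a fresh uuid leaf. Each write lands under a new uuid, so previous
// stagings are never overwritten -- the "persistent data structure" view.
def stagingPath(tablename: String, dtint: Int, dsparams: Map[String, String]): String = {
  val paramSegments = dsparams.map { case (name, value) => s"dsparam.$name=$value" }.toSeq
  val segments = Seq(s"/staging/$tablename", s"dtint=$dtint") ++ paramSegments :+ s"uuid=${UUID.randomUUID()}"
  segments.mkString("/")
}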
4. Using RDDs or Dataframes
4. Using RDDs or Dataframes
Dataframes have great performance,
but are untyped and feel foreign in Scala.
RDDs have a robust Scala API,
but are a pain to map from data sources.
btw, SQL is AWESOME
4. Using RDDs or Dataframes
Dataframes           | RDDs
---------------------|-------------------
Predicate push down  | Types!!
Bare metal / unboxed | Nested structures
Connectors           | Better unit tests
Pluggable optimizer  | Fewer stages
SQL + Meta           | Scala * Scala
4. Using RDDs or Dataframes
We should use RDDs in large ETL jobs (sketched below):
Loading the data with dataframe APIs,
Basic case class mapping (or better, Datasets),
Typesafe transformations,
Storing with dataframe APIs.
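A minimal sketch of that shape, assuming a hypothetical Visit schema and the Spark 2.x SparkSession API (paths and names are ours):

import org.apache.spark.sql.SparkSession

// Hypothetical record type for illustration.
case class Visit(visitorId: String, page: String, durationMs: Long)

object VisitEtl {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("visit-etl").getOrCreate()
    import spark.implicits._

    // 1. Load with the dataframe API: connectors and schema handling for free.
    val raw = spark.read.parquet("/staging/visits")

    // 2. Map to a case class early (or stay in Datasets).
    val visits = raw.as[Visit].rdd

    // 3. Typesafe transformations in plain Scala.
    val longVisits = visits.filter(_.durationMs > 30000L)

    // 4. Store with the dataframe API.
    longVisits.toDF().write.parquet("/staging/long_visits")

    spark.stop()
  }
}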
4. Using RDDs or Dataframes
Dataframes are perfect for:
Exploration, drill down,
Light jobs,
Dynamic jobs.
4. Using RDDs or Dataframes
RDD-based jobs are like marine
mammals.
5. Cogroup all the things
5. Cogroup all the things
The cogroup is the best operation
to link data together.
It fundamentally changes the way we work with data.
5. Cogroup all the things
join      (left: RDD[(K,A)], right: RDD[(K,B)]): RDD[(K, (A, B))]
leftJoin  (left: RDD[(K,A)], right: RDD[(K,B)]): RDD[(K, (A, Option[B]))]
rightJoin (left: RDD[(K,A)], right: RDD[(K,B)]): RDD[(K, (Option[A], B))]
outerJoin (left: RDD[(K,A)], right: RDD[(K,B)]): RDD[(K, (Option[A], Option[B]))]
cogroup   (left: RDD[(K,A)], right: RDD[(K,B)]): RDD[(K, (Seq[A], Seq[B]))]
groupBy   (rdd: RDD[(K,A)]): RDD[(K, Seq[A])]
With cogroup and groupBy, for a given key k: K, there is exactly one
row with that key in the output dataset.
5. Cogroup all the things
5. Cogroup all the things
{ case (k, (s1, s2)) =>
    (k, (s1.map(fA).filter(pA),
         s2.map(fB).filter(pB))) }
CHECKPOINT
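Putting the pieces together, a minimal sketch of the pattern with hypothetical Order and Payment types (the filter plays the role of pA in the lambda above):

import org.apache.spark.rdd.RDD

case class Order(customerId: String, amount: Double)
case class Payment(customerId: String, amount: Double)

// One output row per customer id, carrying everything known on each side.
def link(orders: RDD[(String, Order)],
         payments: RDD[(String, Payment)]): RDD[(String, (Seq[Order], Seq[Payment]))] =
  orders.cogroup(payments).map { case (k, (os, ps)) =>
    (k, (os.toSeq.filter(_.amount > 0), ps.toSeq))
  }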
5. Cogroup all the things
3k LoC, 30 minutes to run (non-blocking)
vs.
15 LoC, 11 hours to run (blocking)
5. Cogroup all the things
What about tests? Cogrouping allows us to have “ScalaCheck-like” tests, by
minimising failing examples.
Test workflow (sketched below):
Write a predicate to isolate the bug.
Get the minimal cogrouped row.
Output the row in test resources.
Reproduce the bug.
Write tests and fix code.
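A sketch of the first two steps, reusing the hypothetical link, orders, and payments from the earlier cogroup sketch (the predicate is an example, not the deck's):

// A predicate isolating the suspected bug: orders without any payment.
def exhibitsBug(row: (String, (Seq[Order], Seq[Payment]))): Boolean =
  row match { case (_, (os, ps)) => os.nonEmpty && ps.isEmpty }

// Grab one minimal failing row and save it under src/test/resources,
// so the bug can be reproduced in a plain unit test, without a cluster.
val minimalExample = link(orders, payments).filter(exhibitsBug).take(1)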
6. Inline data quality
6. Inline data quality
Data quality improves resilience to bad data.
But data quality concerns come second.
6. Inline data quality
Example:

case class FixeVisiteur(
  devicetype: String,
  isrobot: Boolean,
  recherche_visitorid: String,
  sessions: List[FixeSession]
) {
  def recherches: List[FixeRecherche] = sessions.flatMap(_.recherches)
}

object FixeVisiteur {
  @autoBuildResult
  def build(
    devicetype: Result[String],
    isrobot: Result[Boolean],
    recherche_visitorid: Result[String],
    sessions: Result[List[FixeSession]]
  ): Result[FixeVisiteur] = MacroMarker.generated_applicative
}

Presumably (the generated_applicative marker suggests it), the @autoBuildResult macro combines the Result fields applicatively, so annotations from every field are accumulated instead of stopping at the first error.
6. Inline data quality
case class Annotation(
  anchor: Anchor,
  message: String,
  badData: Option[String],
  expectedData: List[String],
  remainingData: List[String],
  level: String @@ AnnotationLevel,
  annotationId: Option[AnnotationId],
  stage: String
)

case class Anchor(
  path: String @@ AnchorPath,
  typeName: String
)
6. Inline data quality
Messages:
EMPTY_STRING
MULTIPLE_VALUES
NOT_IN_ENUM
PARSE_ERROR
______________
Levels:
WARNING
ERROR
CRITICAL
6. Inline data quality
Data quality is available within the output rows.
case class HVisiteur(
  visitorId: String,
  atVisitorId: Option[String],
  isRobot: Boolean,
  typeClient: String @@ TypeClient,
  typeSupport: String @@ TypeSupport,
  typeSource: String @@ TypeSource,
  hVisiteurPlus: Option[HVisiteurPlus],
  sessions: List[HSession],
  annotations: Seq[HAnnotation]
)
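The counts on the next slide can presumably be produced by aggregating these inline annotations; a sketch, assuming HAnnotation carries the same stage/anchor/message/level fields as Annotation above:

// Count annotations by (stage, path, message, level) over all output rows,
// given visiteurs: RDD[HVisiteur]. The key shape mirrors the KPI output below.
val kpiCounts = visiteurs
  .flatMap(_.annotations)
  .map(a => ((a.stage, a.anchor.path, a.message, a.level), 1L))
  .reduceByKey(_ + _)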
6. Inline data quality
(KeyPerformanceIndicator(Annotation,annotation,Map(stage -> MappingFixe, path -> lib_source, message -> NOT_IN_ENUM, type -> String @@ LibSource, level -> WARNING)),657366)
(KeyPerformanceIndicator(Annotation,annotation,Map(stage -> MappingFixe, path -> analyseInfos.analyse_typequoi, message -> EMPTY_STRING, type -> String @@ TypeRecherche, level -> WARNING)),201930)
(KeyPerformanceIndicator(Annotation,annotation,Map(stage -> MappingFixe, path -> isrobot, message -> MULTIPLE_VALUE, type -> String, level -> WARNING)),15)
(KeyPerformanceIndicator(Annotation,annotation,Map(stage -> MappingFixe, path -> rechercheInfos, message -> MULTIPLE_VALUE, type -> String, level -> WARNING)),566973)
(KeyPerformanceIndicator(Annotation,annotation,Map(stage -> MappingFixe, path -> reponseInfos.reponse_nbblocs, message -> MULTIPLE_VALUE, type -> String, level -> WARNING)),571313)
(KeyPerformanceIndicator(Annotation,annotation,Map(stage -> MappingFixe, path -> requeteInfos.requete_typerequete, message -> MULTIPLE_VALUE, type -> String, level -> WARNING)),315297)
(KeyPerformanceIndicator(Annotation,annotation,Map(stage -> MappingFixe, path -> analyseInfos.analyse_typequoi_sec, message -> EMPTY_STRING, type -> String @@ TypeRecherche, level -> WARNING)),201930)
(KeyPerformanceIndicator(Annotation,annotation,Map(stage -> MappingFixe, path -> typereponse, message -> EMPTY_STRING, type -> String @@ TypeReponse, level -> WARNING)),323614)
(KeyPerformanceIndicator(Annotation,annotation,Map(stage -> MappingFixe, path -> grp_source, message -> MULTIPLE_VALUE, type -> String, level -> WARNING)),94)
6. Inline data quality
https://github.com/ahoy-jon/autoBuild (presented in October 2015)
There are opportunities to make these approaches more “precepte-like”
(DAG of workflow, provenance of every field, structure tags).
7. Create real programs
7. Create real programs
Most pipelines are designed as “stateless” computations.
They require no state (good),
or
they infer the current state from the filesystem’s state (bad).
7. Create real programs
Solution: allow pipelines to access a commit log, to read about past executions
and to push data for future executions.
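A minimal sketch of the idea, with hypothetical names (this is not Kerguelen's actual API):

// Pipelines read facts from past runs and append facts for future runs,
// instead of re-deriving state from whatever happens to be on the filesystem.
case class ExecutionRecord(runId: String,
                           inputs: Seq[String],
                           outputs: Seq[String],
                           succeeded: Boolean)

trait CommitLog {
  def read(jobName: String): Seq[ExecutionRecord]
  def append(jobName: String, record: ExecutionRecord): Unit
}

// Example: only process inputs that no successful past run has consumed.
def newInputs(log: CommitLog, jobName: String, candidates: Seq[String]): Seq[String] = {
  val done = log.read(jobName).filter(_.succeeded).flatMap(_.inputs).toSet
  candidates.filterNot(done)
}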
7. Create real programs
In progress: project codename Kerguelen.
Multi-level abstractions / commit-log backed / API for jobs.
It allows creating jobs at different levels of concern:
Level 1: name resolving
Level 2: smart intermediaries (schema capture, stats, delta, …)
Level 3: smart high-level scheduler (replay, load management, coherence)
Level 4: “code as data” (=> continuous delivery, auto QA, automated “mise en production” deployment)
Conclusion
Thank you
for listening!
Questions?
jonathan@univalence.io
