
7 key recipes for data engineering

After building several data lakes and large business intelligence pipelines, we now know that the use of Scala and its principles was essential to the success of those large undertakings.

In this talk, we will go through the 7 key Scala-based architectures and methodologies that were used in real-life projects. More specifically, we will see the impact of these recipes on Spark performance, and how they enabled the rapid growth of those projects.



  1. 7 Key Recipes for Data Engineering
  2. Introduction: We will explore 7 key recipes about Data Engineering. The 5th is absolutely game-changing!
  3. Thank You: Bachir AIT MBAREK, BI and Big Data Consultant
  4. About Me: Jonathan WINANDY, Lead Data Engineer: Data Lake building, Audit / Coaching, Spark Training. Founder of Univalence (BI / Big Data); Co-Founder of CYM (IoT / Predictive Maintenance), Craft Analytics† (BI / Big Data), and Valwin (Health Care Data).
  5. 2016 has been amazing for Data Engineering! But…
  6. 1. It’s all about our organisations!
  7. 1. It’s all about our Organisations: Data engineering is not about scaling computation.
  8. 1. It’s all about our Organisations: Data engineering is not a support function for Data Scientists[1]. ([1] whatever they are nowadays)
  9. 1. It’s all about our Organisations: Instead, data engineering enables access to Data!
  10. 1. It’s all about our Organisations: access to Data … in complex organisations. [Diagram: data and new data flowing between you, Product, Ops, BI, and Marketing.]
  11. 1. It’s all about our Organisations: access to Data … in complex organisations. [Diagram: a holding with Entity 1 to Entity N, each with its own IT and Marketing, and you in the middle of the data flows.]
  12. 1. It’s all about our Organisations: access to Data … in complex organisations. It’s very frustrating! We run a support-group meetup if you are interested: Paris Data Engineers!
  13. 1. It’s all about our Organisations: Small tips: only one Hadoop cluster (no TEST/REC/INT/PREPROD); no Air-Data-Eng, it helps no one; radical transparency with other teams; hack that sh**.
  14. 2. Optimising our work
  15. 2. Optimising our work: There are 3 key concerns governing our decisions: lead time, impact, and failure management.
  16. 2. Optimising our work: Lead time (noun): the period of time between the initial phase of a process and the emergence of results, as between the planning and completed manufacture of a product. Short lead times are essential! The Elastic stack helps a lot in this area.
  17. 2. Optimising our work: Impact: To have impact, we have to analyse beyond immediate needs. That way, we’re able to provide solutions to entire kinds of problems.
  18. 2. Optimising our work: Failure management: Things fail, be prepared! On the same morning, the RER A public transport and our Hadoop job tracker can both fail. Unprepared failures may pile up and lead to huge waste.
  19. 2. Optimising our work: How to mitigate failure in 7 questions: “What is likely to fail?” ($componentName_____) “How? (root cause)” “Can we know if this will fail?” “Can we prevent this failure?” “What are the impacts?” “How to fix it when it happens?” “Can we facilitate today?”
  20. 2. Optimising our work: Track your work!
  21. 3. Staging the Data
  22. 3. Staging the data: Data is moving around, freeze it! Staging changed with Big Data. We moved from transient staging (FTP, NFS, etc.) to persistent staging in distributed solutions: in Streaming with Kafka, we may retain logs for several months; in Batch, staging in HDFS may retain source Data for years.
  23. 3. Staging the data: Modern staging anti-patterns: dropping destination places before moving the Data; having incomplete data visible; short log retention in streams (=> new failure modes). Modern staging should be seen as a persistent data structure.
  24. 3. Staging the data: HDFS staging layout:
      /staging
      |-- $tablename
          |-- dtint=$dtint
              |-- dsparam.name=$dsparam.value
                  |-- ...
                      |-- ...
                          |-- uuid=$uuid
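The layout above maps naturally to a small path-building helper. A minimal sketch in plain Scala; the names (`stagingPath`, `paramSegments`) are illustrative, not from the talk:

```scala
// Builds a staging path following the layout above:
// /staging/$tablename/dtint=$dtint/<param=value ...>/uuid=$uuid
// All names here are illustrative.
def stagingPath(
    tableName: String,
    dtint: Int,                    // date partition, e.g. 20161231
    params: Seq[(String, String)], // extra dataset parameters, in order
    uuid: String                   // unique id of this ingestion run
): String = {
  val paramSegments = params.map { case (k, v) => s"$k=$v" }
  (Seq("/staging", tableName, s"dtint=$dtint") ++ paramSegments :+ s"uuid=$uuid")
    .mkString("/")
}
```

Keeping the uuid as the last segment means each run writes to a fresh directory, which is what lets the staging area behave like a persistent data structure instead of a mutable destination.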
  25. 4. Using RDDs or Dataframes
  26. 4. Using RDDs or Dataframes: Dataframes have great performance, but are untyped and foreign. RDDs have a robust Scala API, but are a pain to map from data sources. (btw, SQL is AWESOME)
  27. 4. Using RDDs or Dataframes:
      Dataframes              | RDDs
      Predicate push down     | Types !!
      Bare metal / unboxed    | Nested structures
      Connectors              | Better unit tests
      Pluggable Optimizer     | Less stages
      SQL + Meta              | Scala * Scala
  28. 4. Using RDDs or Dataframes: We should use RDDs in large ETL jobs: loading the data with dataframe APIs; basic case class mapping (or better, Datasets); typesafe transformations; storing with dataframe APIs.
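The load / map-to-case-class / transform / store pattern can be sketched without a Spark cluster by letting untyped rows stand in for the DataFrame side. A minimal illustration; all names (`Visit`, `fromRow`, `heavyVisitors`) are hypothetical:

```scala
// Untyped rows in, case classes for the typesafe middle section.
// In a real job the values would come from DataFrame rows; here we use
// Map[String, String] to stay self-contained. All names are illustrative.
type Row = Map[String, String]

case class Visit(visitorId: String, pages: Int)

// 1. Basic case-class mapping (the "load with dataframe APIs" boundary).
def fromRow(r: Row): Option[Visit] =
  for {
    id    <- r.get("visitor_id")
    pages <- r.get("pages").flatMap(p => scala.util.Try(p.toInt).toOption)
  } yield Visit(id, pages)

// 2. Typesafe transformation: the compiler checks field names and types.
def heavyVisitors(visits: Seq[Visit]): Seq[Visit] =
  visits.filter(_.pages >= 10)

val rows: Seq[Row] = Seq(
  Map("visitor_id" -> "a", "pages" -> "12"),
  Map("visitor_id" -> "b", "pages" -> "3"),
  Map("visitor_id" -> "c", "pages" -> "oops") // unparseable data is dropped
)

val typed = rows.flatMap(fromRow)
val heavy = heavyVisitors(typed)
```

The point of the pattern is that only the two boundaries (loading and storing) deal with untyped data; everything in between is checked by the Scala compiler.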
  29. 4. Using RDDs or Dataframes: Dataframes are perfect for: exploration, drill down, light jobs, dynamic jobs.
  30. 4. Using RDDs or Dataframes: RDD based jobs are like marine mammals.
  31. 5. Cogroup all the things
  32. 5. Cogroup all the things: The cogroup is the best operation to link data together. It fundamentally changes the way we work with data.
  33. 5. Cogroup all the things:
      join      (left: RDD[(K,A)], right: RDD[(K,B)]): RDD[(K, (A, B))]
      leftJoin  (left: RDD[(K,A)], right: RDD[(K,B)]): RDD[(K, (A, Option[B]))]
      rightJoin (left: RDD[(K,A)], right: RDD[(K,B)]): RDD[(K, (Option[A], B))]
      outerJoin (left: RDD[(K,A)], right: RDD[(K,B)]): RDD[(K, (Option[A], Option[B]))]
      cogroup   (left: RDD[(K,A)], right: RDD[(K,B)]): RDD[(K, (Seq[A], Seq[B]))]
      groupBy   (rdd:  RDD[(K,A)]): RDD[(K, Seq[A])]
      With cogroup and groupBy, for a given key K there is only one row with that key in the output dataset.
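The cogroup signature can be modelled on plain Scala collections to make its semantics concrete: one output row per key, carrying all left values and all right values. This sketch mirrors the behaviour of Spark's `RDD.cogroup`, but is not Spark code:

```scala
// Collection-based model of cogroup: for each key present on either
// side, exactly one output entry with all left and all right values.
def cogroup[K, A, B](
    left: Seq[(K, A)],
    right: Seq[(K, B)]
): Map[K, (Seq[A], Seq[B])] = {
  val l = left.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2)) }
  val r = right.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2)) }
  (l.keySet ++ r.keySet).iterator.map { k =>
    (k, (l.getOrElse(k, Seq.empty[A]), r.getOrElse(k, Seq.empty[B])))
  }.toMap
}

// Example: linking orders and payments by client id.
val orders   = Seq(1 -> "order-a", 1 -> "order-b", 2 -> "order-c")
val payments = Seq(1 -> 10.0, 3 -> 5.0)
val byClient = cogroup(orders, payments)
```

Unlike a join, no key is ever dropped and no row is ever duplicated: client 2 appears with an empty payment side and client 3 with an empty order side, which is exactly the property the slide highlights.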
  34. 5. Cogroup all the things
  35. 5. Cogroup all the things: {case (k, (s1, s2)) => (k, (s1.map(fA).filter(pA), s2.map(fB).filter(pB)))} CHECKPOINT
  36. 5. Cogroup all the things: 3k LoC, 30 minutes to run (non-blocking) vs 15 LoC, 11 hours to run (blocking).
  37. 5. Cogroup all the things: What about tests? Cogrouping allows us to have “ScalaCheck-like” tests, by minimising examples. Test workflow: write a predicate to isolate the bug; get the minimal cogrouped row; output the row in test resources; reproduce the bug; write tests and fix the code.
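That workflow can be sketched as follows; `CogroupedRow`, `buggy`, and `minimalFailing` are hypothetical names standing in for the real job's types:

```scala
// Sketch of the "isolate the bug, minimise the example" workflow.
// All names are illustrative.
case class CogroupedRow(key: String, lefts: Seq[String], rights: Seq[Int])

// 1. Predicate that isolates the bug (here: a right value with no left).
def buggy(row: CogroupedRow): Boolean =
  row.lefts.isEmpty && row.rights.nonEmpty

// 2. Among failing rows, keep the one with the fewest values: the
//    smallest reproduction, in the spirit of ScalaCheck shrinking.
//    It can then be saved under test resources as a fixture.
def minimalFailing(rows: Seq[CogroupedRow]): Option[CogroupedRow] =
  rows.filter(buggy).sortBy(r => r.lefts.size + r.rights.size).headOption

val sample = Seq(
  CogroupedRow("a", Seq("x"), Seq(1)),
  CogroupedRow("b", Seq.empty, Seq(1, 2, 3)),
  CogroupedRow("c", Seq.empty, Seq(7))
)

val repro = minimalFailing(sample)
```

Because cogroup puts everything about one key in one row, a single minimal row is a complete, self-contained reproduction of the bug.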
  38. 6. Inline data quality
  39. 6. Inline data quality: Data quality improves resilience to bad data. But data quality concerns come second.
  40. 6. Inline data quality: Example:
      case class FixeVisiteur(
        devicetype: String,
        isrobot: Boolean,
        recherche_visitorid: String,
        sessions: List[FixeSession]
      ) {
        def recherches: List[FixeRecherche] = sessions.flatMap(_.recherches)
      }

      object FixeVisiteur {
        @autoBuildResult
        def build(
          devicetype: Result[String],
          isrobot: Result[Boolean],
          recherche_visitorid: Result[String],
          sessions: Result[List[FixeSession]]
        ): Result[FixeVisiteur] = MacroMarker.generated_applicative
      }
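A hand-written model of what the generated applicative can look like: a `Result` that carries either a value or accumulated annotation messages, and a `map2` that keeps messages from both sides. This is an illustration of the idea, not the actual `autoBuildResult` implementation:

```scala
// Minimal model of a Result applicative for inline data quality:
// each field either has a value or error annotations, and combining
// fields accumulates all annotations instead of stopping at the first.
// Illustration only; not the real autoBuild implementation.
case class Result[+A](value: Option[A], annotations: List[String])

object Result {
  def ok[A](a: A): Result[A]        = Result(Some(a), Nil)
  def ko[A](msg: String): Result[A] = Result(None, List(msg))

  // Applicative-style map2: the output value exists only if both inputs
  // do, but annotations from both sides are always kept.
  def map2[A, B, C](ra: Result[A], rb: Result[B])(f: (A, B) => C): Result[C] =
    Result(
      for { a <- ra.value; b <- rb.value } yield f(a, b),
      ra.annotations ++ rb.annotations
    )
}

case class Visiteur(devicetype: String, isrobot: Boolean)

def build(devicetype: Result[String], isrobot: Result[Boolean]): Result[Visiteur] =
  Result.map2(devicetype, isrobot)(Visiteur.apply)

val good = build(Result.ok("mobile"), Result.ko("PARSE_ERROR").copy(value = Some(false)))
val bad  = build(Result.ko("EMPTY_STRING"), Result.ko("PARSE_ERROR"))
```

The accumulation is what makes the quality "inline": a row with bad fields still flows through the pipeline, carrying all of its annotations rather than failing on the first one.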
  41. 6. Inline data quality:
      case class Annotation(
        anchor: Anchor,
        message: String,
        badData: Option[String],
        expectedData: List[String],
        remainingData: List[String],
        level: String @@ AnnotationLevel,
        annotationId: Option[AnnotationId],
        stage: String
      )

      case class Anchor(path: String @@ AnchorPath, typeName: String)
  42. 6. Inline data quality: Messages: EMPTY_STRING, MULTIPLE_VALUES, NOT_IN_ENUM, PARSE_ERROR, … Levels: WARNING, ERROR, CRITICAL.
  43. 6. Inline data quality: Data quality is available within the output rows.
      case class HVisiteur(
        visitorId: String,
        atVisitorId: Option[String],
        isRobot: Boolean,
        typeClient: String @@ TypeClient,
        typeSupport: String @@ TypeSupport,
        typeSource: String @@ TypeSource,
        hVisiteurPlus: Option[HVisiteurPlus],
        sessions: List[HSession],
        annotations: Seq[HAnnotation]
      )
  44. 6. Inline data quality:
      (KeyPerformanceIndicator(Annotation,annotation,Map(stage -> MappingFixe, path -> lib_source, message -> NOT_IN_ENUM, type -> String @@ LibSource, level -> WARNING)),657366)
      (KeyPerformanceIndicator(Annotation,annotation,Map(stage -> MappingFixe, path -> analyseInfos.analyse_typequoi, message -> EMPTY_STRING, type -> String @@ TypeRecherche, level -> WARNING)),201930)
      (KeyPerformanceIndicator(Annotation,annotation,Map(stage -> MappingFixe, path -> isrobot, message -> MULTIPLE_VALUE, type -> String, level -> WARNING)),15)
      (KeyPerformanceIndicator(Annotation,annotation,Map(stage -> MappingFixe, path -> rechercheInfos, message -> MULTIPLE_VALUE, type -> String, level -> WARNING)),566973)
      (KeyPerformanceIndicator(Annotation,annotation,Map(stage -> MappingFixe, path -> reponseInfos.reponse_nbblocs, message -> MULTIPLE_VALUE, type -> String, level -> WARNING)),571313)
      (KeyPerformanceIndicator(Annotation,annotation,Map(stage -> MappingFixe, path -> requeteInfos.requete_typerequete, message -> MULTIPLE_VALUE, type -> String, level -> WARNING)),315297)
      (KeyPerformanceIndicator(Annotation,annotation,Map(stage -> MappingFixe, path -> analyseInfos.analyse_typequoi_sec, message -> EMPTY_STRING, type -> String @@ TypeRecherche, level -> WARNING)),201930)
      (KeyPerformanceIndicator(Annotation,annotation,Map(stage -> MappingFixe, path -> typereponse, message -> EMPTY_STRING, type -> String @@ TypeReponse, level -> WARNING)),323614)
      (KeyPerformanceIndicator(Annotation,annotation,Map(stage -> MappingFixe, path -> grp_source, message -> MULTIPLE_VALUE, type -> String, level -> WARNING)),94)
  45. 6. Inline data quality: https://github.com/ahoy-jon/autoBuild (presented in October 2015). There are opportunities to make these approaches more “precepte-like” (DAG of workflow, provenance of every field, structure tags).
  46. 7. Create real programs
  47. 7. Create real programs: Most pipelines are designed as “stateless” computations. They require no state (good), or they infer the current state from the filesystem’s state (bad).
  48. 7. Create real programs: Solution: allow pipelines to access a commit log, to read about past executions and to push data for future executions.
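A minimal in-memory sketch of that idea, with all names (`CommitLog`, `Commit`, `pending`) hypothetical; a real version would back the log with Kafka or a database:

```scala
// Pipelines append a record about each run to a commit log, and read
// the log back to decide what to do next, instead of inferring state
// from the filesystem. All names are illustrative.
case class Commit(jobName: String, partition: String, status: String)

class CommitLog {
  private var entries: Vector[Commit] = Vector.empty

  def append(c: Commit): Unit = entries = entries :+ c

  // Partitions this job has already processed successfully.
  def done(jobName: String): Set[String] =
    entries.collect { case Commit(`jobName`, p, "success") => p }.toSet

  // What remains to process: everything not committed as a success,
  // so failed partitions are naturally retried on the next run.
  def pending(jobName: String, all: Seq[String]): Seq[String] =
    all.filterNot(done(jobName))
}

val log = new CommitLog
log.append(Commit("daily-agg", "dtint=20170101", "success"))
log.append(Commit("daily-agg", "dtint=20170102", "failure"))

val todo = log.pending("daily-agg",
  Seq("dtint=20170101", "dtint=20170102", "dtint=20170103"))
```

With this in place, replays and recovery become log queries rather than guesses from directory listings.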
  49. 7. Create real programs: In progress: project codename Kerguelen. Multi-level abstractions / commit-log backed / API for jobs. Allows creation of jobs that have different concern levels: Level 1: name resolving. Level 2: smart intermediaries (schema capture, stats, delta, …). Level 3: smart high-level scheduler (replay, load management, coherence). Level 4: “code as data” (=> continuous delivery, auto QA, auto mep).
  50. Conclusion
  51. Thank you for listening! Questions? jonathan@univalence.io
