This document provides an overview of 7 key recipes for data engineering:
2. Focus on organizational contexts: the most difficult problems, and the most durable solutions, both come from them.
2. Optimize work by considering lead time, impact, and failure management when making decisions.
3. Stage data persistently using solutions like Kafka and HDFS to retain data for long periods of time.
4. Use RDDs for ETL workloads and dataframes for exploration, lightweight jobs, and dynamic jobs.
5. Leverage cogroups to efficiently link different data sources together.
6. Integrate data quality checks directly into jobs to improve resilience to bad data.
7. Design real programs using stateless, deterministic computations.
After building several data lakes and large business-intelligence pipelines, we now know that the use of Scala and its principles was essential to the success of those large undertakings.
In this talk, we will go through the 7 key Scala-based architectures and methodologies that were used in real-life projects. More specifically, we will see the impact of these recipes on Spark performance, and how they enabled the rapid growth of those projects.
2. 7 Key Recipes For Data Eng

Introduction
We will explore 7 key recipes for Data Engineering. If you could only pick one, the 5th, on joins/cogroups, is essential.
3. About Me

Jonathan WINANDY
Scala user (6 years).
Lead Data Engineer:
- Data lake building,
- Audit/coaching,
- Spark/Scala/Kafka trainings.
Founder of Univalence (BI / Big Data).
Co-founder of CYM (predictive maintenance) and Valwin (health-care data).
4. Bachir AIT MBAREK

Thank you!
5. Outline

1. Organisations
2. Work Optimization
3. Staging
4. RDD/Dataframe
5. Join/Cogroup
6. Data Quality
7. Real Programs
6. 1. It’s always about our organizations! (in Europe)
7. 1. Organisations

In Data Engineering, we tend to think our problems come from, or are solved by, these tools:
[Slide shows logos of popular Big Data tools.]
8. 1. Organisations

However, our most difficult problems and most durable solutions come from organisational contexts. That is true for IT at large, but it is much more dominant in Data Engineering.
9. 1. Organisations

Because Data Engineering enables access to data!
10. 1. Organisations

It enables access to data in very complex organisations.
[Diagram: your team exchanging data and new data with Product, Marketing, and BI teams.]
11. 1. Organisations

It enables access to data in very complex organisations.
[Diagram: your team inside a holding with Global IT and Global Marketing, plus subsidiaries, each with their own Marketing, IT, and BI teams.]
12. 1. Organisations

It happens to be very frustrating! As a Data Engineer, you take part in some of the most technically diverse teams, teams that are:
● running cutting-edge technologies,
● solving some of the hardest problems,
while being constantly dependent on other teams that often don’t share your vision.
13. 1. Organisations

Small tips:
● One Hadoop cluster (no test or QA clusters).
● Document your vision, so it can be shared.
● What happens between teams matters a lot.
15. 2. Work Optimization

To optimize our work, there are three key concerns governing our decisions:
● lead time,
● impact,
● failure management.
16. 2. Work Optimization

Lead time: the period of time between the initial phase and completion.
Impact: positive effects beyond the current context.
Failure management: failure is the nominal case; unprepared failures will pile up.
17. 2. Work Optimization

Be proactive! To avoid “MapReduce, then wait”, two methods:
● proactive task simulation,
● “What will fail?”
18. 2. Work Optimization

Proactive task simulation. The idea, when solving a task:
● map all the possible ways,
● for each way, estimate:
  ○ lead time and cost,
  ○ decidability,
  ○ success rate,
  ○ generated opportunities,
  ○ and other by-products,
● then choose which way to start with.
19. 2. Work Optimization

What will fail? The idea is to guess what may fail in a given component. Then you can engage in a discussion on:
● how likely it is to fail,
● preventing that failure,
● planning the recovery ahead.
21. 3. Staging

Data is moving around: freeze it! Staging changed with Big Data. We moved from transient staging (FTP, NFS, etc.) to persistent staging, thanks to distributed solutions:
● in Kafka, we can retain logs for months,
● in HDFS, we can retain sources for years.
22. 3. Staging

But there are a lot of staging anti-patterns out there:
● updating directories,
● incomplete datasets,
● short retention.
Staging should be seen as a persistent data structure. If you liked immutability in Scala, go for it with your data!
23. 3. Staging

Example, with HDFS: write into unique directories:

/staging
 |-- $tablename
   |-- dtint=$dtint
     |-- dsparam.name=$dsparam.value
       |-- ...
         |-- uuid=$uuid
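The layout above can be sketched in a few lines of plain Scala (names like stagingPath are illustrative, not from the talk): each batch writes into a fresh, never-updated directory whose last segment is a UUID.

```scala
import java.util.UUID

// Build a unique, immutable staging directory for one batch.
// Params are sorted so the layout is deterministic up to the UUID segment.
def stagingPath(tablename: String,
                dtint: Int,
                params: Map[String, String]): String = {
  val paramSegments =
    params.toSeq.sortBy(_._1).map { case (k, v) => s"$k=$v" }
  val segments =
    Seq("staging", tablename, s"dtint=$dtint") ++
      paramSegments :+ s"uuid=${UUID.randomUUID()}"
  segments.mkString("/", "/", "")
}
```

Because the UUID changes on every call, a retried job never overwrites a previous attempt: the staging area grows like a persistent data structure.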
25. 4. RDD/Dataframe

Dataframes have great performance, but are “untyped” and foreign. RDDs have a robust Scala API, but are difficult to map from data sources. SQL is the current lingua franca of data.
26. 4. RDD/Dataframe

Comparative advantages:

Dataframe             | RDD
----------------------|-------------------
Predicate push-down   | Types!!
Bare metal / unboxed  | Nested structures
Connectors            | Better unit tests
Pluggable optimizer   | Fewer stages
SQL + meta            | Scala * Scala
27. 4. RDD/Dataframe

RDD-based jobs are like marine mammals: fit for their environment starting from a certain size. RDDs are building blocks for large jobs.
28. 4. RDD/Dataframe

RDDs are very good for ETL workloads:
● control over shuffles,
● unit tests are easier to write.
They can leverage the Dataframe API at job boundaries:
● load and store data with Dataframe APIs,
● map Dataframes into case classes,
● perform type-safe transformations.
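The boundary pattern can be sketched without Spark (the Map[String, String] rows below stand in for untyped Dataframe rows; Sale, parseSale, and totalByCountry are illustrative names): parse once into a case class at the edge, then keep every transformation type-safe.

```scala
// A typed record, built at the job boundary.
case class Sale(country: String, amount: Double)

// Parse an untyped row; rows that fail to parse are discarded here.
def parseSale(row: Map[String, String]): Option[Sale] =
  for {
    country <- row.get("country")
    amount  <- row.get("amount")
                  .flatMap(s => scala.util.Try(s.toDouble).toOption)
  } yield Sale(country, amount)

// From here on, everything is type-safe: the compiler checks field access.
def totalByCountry(rows: Seq[Map[String, String]]): Map[String, Double] =
  rows.flatMap(parseSale)
    .groupBy(_.country)
    .map { case (c, sales) => c -> sales.map(_.amount).sum }
```

In a real job, the parsing step would sit right after a Dataframe load, and the typed part would run on an RDD (or be unit-tested on plain collections, as here).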
29. 4. RDD/Dataframe

Dataframes are perfect for:
● data exploration (notebooks),
● light jobs (SQL + DF),
● dynamic jobs (xlsx specs => Spark job).
User-defined functions (UDFs) improve code reuse; user-defined aggregate functions (UDAFs) improve performance over standard SQL.
31. 5. Cogroup

The cogroup is the best operation to link data together.
32. 5. Cogroup

Cogroup API, from (left: RDD[(K,A)], right: RDD[(K,B)]):
○ join      : RDD[(K, (A, B))]
○ outerJoin : RDD[(K, (Option[A], Option[B]))]
○ cogroup   : RDD[(K, (Seq[A], Seq[B]))]
From (rdd: RDD[(K,A)]):
○ groupBy   : RDD[(K, Seq[A])]
With cogroup and groupBy, for a given key k: K, there is only one row with that key in the output dataset.
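The semantics above can be modelled on plain Scala collections (cogroupLocal is an illustrative name, not a Spark API): for every key present on either side, exactly one output row pairs all left values with all right values.

```scala
// Local model of cogroup: one output row per key, grouping both sides.
def cogroupLocal[K, A, B](left: Seq[(K, A)],
                          right: Seq[(K, B)]): Map[K, (Seq[A], Seq[B])] = {
  val ls = left.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
  val rs = right.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
  // Keys missing on one side yield an empty Seq, unlike an inner join
  // which would drop the row entirely.
  (ls.keySet ++ rs.keySet).map { k =>
    k -> ((ls.getOrElse(k, Seq.empty), rs.getOrElse(k, Seq.empty)))
  }.toMap
}
```

Note how a key present only on one side still produces a row; join and outerJoin can both be derived from this shape.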
33. 5. Cogroup

rddL
  .filter(pL)
  .map(mL)
  .keyBy(kL)
  .cogroup(
    rddR
      .filter(pR)
      .map(mR)
      .keyBy(kR))
  .map(mC)
34. 5. Cogroup

Rewrite, with a CHECKPOINT on DISK (save) after the cogroup:

rddL.keyBy(mL.andThen(kL))
  .cogroup(
    rddR.keyBy(mR.andThen(kR)))
  .map({ case (k, (ls, rs)) =>
    (k, (ls.filter(pL).map(mL),
         rs.filter(pR).map(mR))) })
  .map(mC)
35. 5. Cogroup

Lines of code: 3000; duration: 30 min (non-blocking).
Lines of code: 15; duration: 11 h (blocking).
CHECKPOINT on DISK: moving the code after the checkpoint allows fast feedback loops.
36. 5. Cogroup

Cogroups allow writing tests on a minimised case. Test workflow:
● isolate potential cases,
● get the smallest cogrouped row,
  ○ output the row in test resources,
● reproduce the bug,
● write tests and fix the code.
38. 6. Inline data quality

Data quality improves resilience to bad data. However, data-quality concerns often come second.
39. 6. Inline data quality

Our solution: integrate data quality deep inside jobs, by unifying data quality with data transformation. We defined a structure, Result, similar to ValidationNel (applicatives).
40. 6. Inline data quality

case class Result[T](value: Option[T],
                     annotations: Seq[Annotation])

case class Annotation(path: String,
                      typeName: String,
                      msg: String,
                      discardedData: Seq[String],
                      entityIdType: Option[String],
                      entityId: Option[String],
                      level: Int,
                      stage: String)
41. 6. Inline data quality

case class Result[T](value: Option[T],
                     annotations: Seq[Annotation])

Result is either:
● containing a value, with a list of warnings,
● empty, with a list containing the error and warnings.
(Serialization and Big Data don’t like sum types, so it is pre-projected onto a product type.)
42. 6. Inline data quality

Then we can use applicatives to combine Results:

case class Person(name: String, age: Int)

def build(name: Result[String],
          age: Result[Int]): Result[Person] =
  ...
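A minimal, self-contained sketch of that combination (the real Annotation has more fields, as shown earlier; this one is trimmed down, and map2 is an assumed helper name): the value exists only if both inputs have one, while annotations accumulate in every case, like ValidationNel.

```scala
// Simplified versions of the structures from the slides.
case class Annotation(path: String, msg: String)
case class Result[T](value: Option[T], annotations: Seq[Annotation])

// Applicative combination: value only if both sides succeed,
// annotations always concatenated.
def map2[A, B, C](ra: Result[A], rb: Result[B])(f: (A, B) => C): Result[C] =
  Result(
    for { a <- ra.value; b <- rb.value } yield f(a, b),
    ra.annotations ++ rb.annotations)

case class Person(name: String, age: Int)

def build(name: Result[String], age: Result[Int]): Result[Person] =
  map2(name, age)(Person(_, _))
```

A failed field produces an empty Person but keeps the error annotation, so bad data flows through the job as data instead of aborting it.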
43. 6. Inline data quality

case class Result[T](value: Option[T],
                     annotations: Seq[Annotation])

The annotations are accumulated at the top of the hierarchy, and saved alongside the data.
44. 6. Inline data quality

Annotations can be aggregated on dimensions:
Messages:
● EMPTY_STRING
● MULTIPLE_VALUES
● NOT_IN_ENUM
● PARSE_ERROR
● ______________
Levels:
● WARNING
● ERROR
● CRITICAL
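Such an aggregation is a plain groupBy over the annotation dimensions (a hypothetical sketch; field names are simplified from the Annotation shown earlier): counting rows per (message, level) pair gives a small data-quality report.

```scala
// Simplified annotation carrying only the two aggregation dimensions.
case class Annotation(msg: String, level: String)

// Count annotations per (message, level) dimension.
def aggregate(as: Seq[Annotation]): Map[(String, String), Int] =
  as.groupBy(a => (a.msg, a.level))
    .map { case (dim, group) => dim -> group.size }
```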
45. 6. Inline data quality

If you are interested in the approach, you can take a look at this repository: macros based on Shapeless to build Result[T] from case classes.
https://github.com/ahoy-jon/autoBuild (~October 2015)
47. 7. Real programs

Most pipeline parts are designed as stateless computations. They either require no external state (great) or infer their state from the filesystem state (meh).
48. 7. Real programs

Spark lets us program inside the driver: we can create actual programs. In Scala, we can use:
● Scopt to parse common args and feature flips,
● Typesafe Config to load/overload program settings,
● event sourcing to read/write app events,
● sbt-pack and Coursier to package and create launchers.
49. 7. Real programs

Deterministic effects. We make sure that our programs are as deterministic as possible, and idempotent (if possible).
Example: storing past executions so as not to recompute something already computed, unless forced.
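The "store past executions" idea can be sketched with a marker file per job id (runOnce and the .done convention are made-up names, not from the talk): a completed run leaves a marker, so a rerun is skipped unless forced.

```scala
import java.nio.file.{Files, Path}

// Run a job at most once per jobId, unless force = true.
// Returns true if the job actually ran.
def runOnce(markerDir: Path, jobId: String, force: Boolean = false)
           (job: () => Unit): Boolean = {
  val marker = markerDir.resolve(s"$jobId.done")
  if (Files.exists(marker) && !force) {
    false // already computed: skip, keeping the program idempotent
  } else {
    job()
    // Write the marker only after the job succeeds, so a crashed
    // run is retried on the next launch.
    Files.createDirectories(markerDir)
    Files.write(marker, Array.emptyByteArray)
    true
  }
}
```

In a pipeline the marker would typically live next to the staged output (e.g. a _SUCCESS-style file), so the filesystem itself records what has been computed.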
50. 7. Real programs

In progress: project Kerguelen, an API for data jobs. It enables the creation of coherent jobs, integrating different abstraction levels:
Level 0: event sourcing
Level 1: name resolving
Level 2: triggered exec (schema capture, deltaQA, …)
Level 3: scheduling (replay, coherence, ...)
Level 4: “code as data” (=> continuous delivery)
51. 8. More

More recipes:
● automatic QA,
● structural sharing for datasets,
● Jsonoids mapping generation,
● advanced UDAFs,
● ...
But that’s it for today!
54. PSUG Note

If you happen to visit Paris, don’t hesitate to submit a talk at our Paris Scala User Group.