Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Data Engineering
Patterns and Principles
Valdas Maksimavičius
Software Development Data Projects
Software Development Data Projects
Would you be
confident in a
self-driving car ...
… knowing that
there is your
software running
it?
Standardize and increase the descriptive power
of engineering processes
by applying patterns
Or in other words
stand on th...
Source: https://www.health.harvard.edu/blog/right-brainleft-brain-right-2017082512222
● Left side of your brain is respons...
About me
● IT Architect at Cognizant
● Data Engineering, Data Science,
Cloud Computing, Agile teams
● Financial, Manufactu...
Biological and Physiological needs
Basic life needs - air, food, drink, shelter, warmth, sex, sleep, etc.
Safety needs
sec...
X
Culture
Core values, way of working
Enterprise architecture
Buy vs build, cloud readiness
Data strategy & architecture
Def...
Culture
Core values, way of working
Data architecture
Ingestion, storage consumption, how data is collected,
stored, trans...
Culture, way of working, values
DevOps culture
1. Foster a Collaborative Environment
2. Impose End-to-End Responsibility - you build it you ship it
3. Enc...
Data architecture
If you are building a data platform in the
cloud, remember that ...
low barrier-to-entry overshadows
complexity
Big Data cloud architecture references
Source: https://azure.microsoft.com/en-in/solutions/architecture/modern-data-wareho...
CRM
Social
LOB
Graph
IoT
Image
CRM
Cloud
INGEST STORE PREP &
TRAIN
DEPLOY &
SERVE
Data
orchestration
and monitoring
Big da...
Social
LOB
Graph
IoT
Image
CRM
Cloud
INGEST STORE PREP &
TRAIN
DEPLOY &
SERVE
Data
orchestration
and monitoring
Big data s...
Application integration approaches
File Transfer
Have each application produce files of shared data for others to consume, ...
Ingestion challenges
● Multiple data source load and prioritization -> push vs pull strategy
● Ingested data indexing and ...
Choose privacy protection patterns
Privacy protection at the ingress
Source: https://www.valdas.blog/2019/08/06/privacy-gd...
Social
LOB
Graph
IoT
Image
CRM
Cloud
INGEST STORE PREP &
TRAIN
DEPLOY &
SERVE
Data
orchestration
and monitoring
Big data s...
Use cloud storage offerings instead of Hadoop
Data Warehouse vs Data Lake
Data Warehouse Data Lake
Requirements Relational requirements Diverse data, scalability, low c...
Data Warehouse vs Data Lake
Source: Microsoft
Data Warehouse vs Data Lake
Source: Microsoft
Data Warehouse vs Data Lake
Source: Microsoft
Social
LOB
Graph
IoT
Image
CRM
Cloud
INGEST STORE PREP &
TRAIN
DEPLOY &
SERVE
Data
orchestration
and monitoring
Big data s...
Offer self-service tools
Self service exploration
Automated pipeline
Collect raw
data
Curate data
Train &
Score
Take Insig...
Use on-demand resources
Social
LOB
Graph
IoT
Image
CRM
Cloud
INGEST STORE PREP &
TRAIN
DEPLOY &
SERVE
Data
orchestration
and monitoring
Big data s...
Apply domain and product thinking
● Model to describe a domain
● Unified language
● Raw or transformed datasets
● Domain te...
Principles, best practices, tools
Get familiar with DataOps
Get familiar with DataOps
Get familiar with DataOps
Get familiar with DataOps
Get familiar with DataOps
Get familiar with DataOps
Get familiar with DataOps
Get familiar with DataOps
Get familiar with DataOps
Get familiar with DataOps
Get familiar with DataOps
Get familiar with DataOps - Examples
Delay commitments and keep important
decisions open
● The principle of Last Responsible
Moment originates from Lean
Softwa...
Why Last Responsible
Moment is important in
cloud analytics?
Expect new improvements and
upgrades all the time
valdas@maksimavicius.eu
https://www.linkedin.com/in/valdasm/
Twitter: @VMaksimavicius
Data engineering design patterns
Data engineering design patterns
Data engineering design patterns
Data engineering design patterns
Data engineering design patterns
Upcoming SlideShare
Loading in …5
×

Data engineering design patterns

There are patterns for things such as domain-driven design, enterprise architectures, continuous delivery, microservices, and many others.

But where are the data science and data engineering patterns?

Sometimes, data engineering reminds me of cowboy coding - many workarounds, immature technologies and lack of market best practices.

  • Be the first to comment

  • Be the first to like this

Data engineering design patterns

  1. 1. Data Engineering Patterns and Principles Valdas Maksimavičius
  2. 2. Software Development Data Projects
  3. 3. Software Development Data Projects
  4. 4. Would you be confident in a self-driving car ... … knowing that there is your software running it?
  5. 5. Standardize and increase the descriptive power of engineering processes by applying patterns Or in other words stand on the shoulders of giants and stop reinventing the wheel
  6. 6. Source: https://www.health.harvard.edu/blog/right-brainleft-brain-right-2017082512222 ● Left side of your brain is responsible for analytical thinking, science, math, etc. ● It uses known building blocks to model the surrounding world ● If you like table representation of data, you will try to model everything as a table ● As an engineer, expand your tool belt by learning new patterns and new building blocks to solve business problems better. Why does my brain need patterns?
  7. 7. About me ● IT Architect at Cognizant ● Data Engineering, Data Science, Cloud Computing, Agile teams ● Financial, Manufacturing, Logistics, Retail industries ● Organizer of Vilnius Microsoft Data Platform Meetup & Hack4Vilnius Hackathon ● Blogging on www.valdas.blog
  8. 8. Biological and Physiological needs Basic life needs - air, food, drink, shelter, warmth, sex, sleep, etc. Safety needs security, employment, protection against hunger and violence Love and belonging needs Receive and give love, appreciation, friendship Esteem need Unique individual, self-respect, etc. Experience purpose and meaning Realising all inner potentials Self-actualization Personal growth and fulfillment Maslow’s hierarchy of needs
  9. 9. X
  10. 10. Culture Core values, way of working Enterprise architecture Buy vs build, cloud readiness Data strategy & architecture Defensive vs offensive strategy, use cases Existing team skillset Databases, programming, etc Design patterns, tools & principles Business drivers Business goals and objectives Maslow’s hierarchy of needs for data projects
  11. 11. Culture Core values, way of working Data architecture Ingestion, storage consumption, how data is collected, stored, transformed, distributed, and consumed Tools & principles Best practices, naming, patterns Maslow’s hierarchy of needs for data projects - simplified view for today’s presentation
  12. 12. Culture, way of working, values
  13. 13. DevOps culture 1. Foster a Collaborative Environment 2. Impose End-to-End Responsibility - you build it you ship it 3. Encourage Continuous Improvement 4. Automate (Almost) Everything 5. Focus on the Customer’s Needs 6. Embrace Failure, and Learn From it 7. Unite Teams — and Expertise Source: https://www.cmswire.com/information-management/7-key-principles-for-a-successful-devops-culture/
  14. 14. Data architecture
  15. 15. If you are building a data platform in the cloud, remember that ... low barrier-to-entry overshadows complexity
  16. 16. Big Data cloud architecture references Source: https://azure.microsoft.com/en-in/solutions/architecture/modern-data-warehouse/
  17. 17. CRM Social LOB Graph IoT Image CRM Cloud INGEST STORE PREP & TRAIN DEPLOY & SERVE Data orchestration and monitoring Big data store Transform, Clean & Train Results External systems Digital portals Architecture example Reporting Core systems
  18. 18. Social LOB Graph IoT Image CRM Cloud INGEST STORE PREP & TRAIN DEPLOY & SERVE Data orchestration and monitoring Big data store Transform, Clean & Train Results Data ingestion CRM External systems Digital portals Reporting Core systems
  19. 19. Application integration approaches File Transfer Have each application produce files of shared data for others to consume, and consume files that others have produced. Shared Database Have the applications store the data they wish to share in a common database. Remote Procedure Invocation Have each application expose some of its procedures so that they can be invoked remotely, and have applications invoke those to run behavior and exchange data. Messaging Have each application connect to a common messaging system, and exchange data and invoke behavior using messages.
  20. 20. Ingestion challenges ● Multiple data source load and prioritization -> push vs pull strategy ● Ingested data indexing and tagging -> metadata collection is mandatory ● Data validation and cleansing -> separate business from processing logic ● Data transformation and compression -> different compression and file types
  21. 21. Choose privacy protection patterns Privacy protection at the ingress Source: https://www.valdas.blog/2019/08/06/privacy-gdpr-implementation-in-azure/ Privacy protection at the egress
  22. 22. Social LOB Graph IoT Image CRM Cloud INGEST STORE PREP & TRAIN DEPLOY & SERVE Data orchestration and monitoring Big data store Transform, Clean & Train Results Data storage CRM External systems Digital portals Reporting Core systems
  23. 23. Use cloud storage offerings instead of Hadoop
  24. 24. Data Warehouse vs Data Lake Data Warehouse Data Lake Requirements Relational requirements Diverse data, scalability, low cost Data Value Data of recognised high value Candidate data of potential value Data Processing Mostly refined calculated data Mostly detailed source data Business Entities Known entities, tracked over time Raw material for discovering entities and facts Data Standards Data conforms to enterprise standards Fidelity to original format and condition Data Integration Data integration upfront Data prep on demand Transformation Data transformed, in principle Data repurposed later, as needs arise Schema Definition Schema-on-write Schema-on-read Metadata Management Metadata improvement Metadata developed on read
  25. 25. Data Warehouse vs Data Lake Source: Microsoft
  26. 26. Data Warehouse vs Data Lake Source: Microsoft
  27. 27. Data Warehouse vs Data Lake Source: Microsoft
  28. 28. Social LOB Graph IoT Image CRM Cloud INGEST STORE PREP & TRAIN DEPLOY & SERVE Data orchestration and monitoring Big data store Transform, Clean & Train Results Data preparation & training CRM External systems Digital portals Reporting Core systems
  29. 29. Offer self-service tools Self service exploration Automated pipeline Collect raw data Curate data Train & Score Take Insights Into Actions Make hypothesis Identify variables Split data Build model Validate model SQL
  30. 30. Use on-demand resources
  31. 31. Social LOB Graph IoT Image CRM Cloud INGEST STORE PREP & TRAIN DEPLOY & SERVE Data orchestration and monitoring Big data store Transform, Clean & Train Results Serve results to end consumers CRM External systems Digital portals Reporting Core systems
  32. 32. Apply domain and product thinking ● Model to describe a domain ● Unified language ● Raw or transformed datasets ● Domain team is responsible for its lifecycle, SLA ● Discoverable, addressable, trustworthy, self-describing, interoperable, secure ● Each producer is responsible of sharing data products to organization
  33. 33. Principles, best practices, tools
  34. 34. Get familiar with DataOps
  35. 35. Get familiar with DataOps
  36. 36. Get familiar with DataOps
  37. 37. Get familiar with DataOps
  38. 38. Get familiar with DataOps
  39. 39. Get familiar with DataOps
  40. 40. Get familiar with DataOps
  41. 41. Get familiar with DataOps
  42. 42. Get familiar with DataOps
  43. 43. Get familiar with DataOps
  44. 44. Get familiar with DataOps
  45. 45. Get familiar with DataOps - Examples
  46. 46. Delay commitments and keep important decisions open ● The principle of Last Responsible Moment originates from Lean Software Development ● It emphasises holding on taking important actions and crucial decisions for as long as possible.
  47. 47. Why Last Responsible Moment is important in cloud analytics? Expect new improvements and upgrades all the time
  48. 48. valdas@maksimavicius.eu https://www.linkedin.com/in/valdasm/ Twitter: @VMaksimavicius

×