The Role of Data Engineering in Data Science and Analytics Practice
Joseph Benjamin ILAGAN
Ateneo de Manila University
jbilagan@ateneo.edu
Hello!
I am Joben Ilagan
Thank you for having me here.
Twitter: @jilagan
LinkedIn: https://www.linkedin.com/in/jobenilagan/
1. Larger Context
Transition from previous talk
(Anaconda, 2020, p.11)
(Anaconda, 2020, p.12)
A data engineering team is NOT a collection of Data Engineers
A data engineering team isn’t made up of a single type of person or title
A data engineering team is multidisciplinary
Data Engineering Team
(Anderson, 2017)
● Creates data pipelines
● Brings together 10-30 different big data technologies
● Understands and chooses the right tools for the job
● Understands the various technologies and frameworks in-depth
● Combines them to create solutions to enable a company’s business processes with data pipelines
Data Engineering Team
(Anderson, 2018)
Data science is an interdisciplinary field aiming to turn data into real value.
Data may be structured or unstructured, big or small, static or streaming.
Value may be provided in the form of predictions, automated decisions, models
learned from data, or any type of data visualization delivering insights.
Data science includes data extraction, data preparation, data exploration, data
transformation, storage and retrieval, computing infrastructures, various types of
mining and learning, presentation of explanations and predictions, and the
exploitation of results taking into account ethical, social, legal, and business aspects.
Where does data engineering fit in the
context of data analytics?
(Van Der Aalst, 2016)
The ingredients contributing to data science (Van Der Aalst, 2016)
The Internet of Events (Van Der Aalst, 2016)
Alluvial diagram of Big Data job families vs. Big Data skill sets
(De Mauro, Greco, Grimaldi, & Ritala, 2018)
Word cloud showing the top 50 words recurring in the Job Title of posts related to Big Data.
The font size of each word is proportional to the number of occurrences of each word.
(De Mauro, Greco, Grimaldi, & Ritala, 2018).
(Anaconda, 2020, p.13)
(Anaconda, 2020, p.26)
2. How do I get started in Data Engineering?
Process, Skills, Tools
“A Data Engineer is someone who has specialized their skills in creating software solutions around data.”
(Anderson, 2017)
Jesse Anderson
Data Engineer
Managing Director, Big Data Institute
“I started getting into Data about 5 years ago when everyone started talking about it and was becoming the new buzz word. People were actually starting to realize how much you can do with data. Everyone wanted to learn Data Science and all the companies wanted to get into Machine Learning or Artificial Intelligence, but there was still a missing piece - that's when I learned about Data Engineering.”
Miles Ong
Data Engineer
Kumu
“There was always the problem of collecting the data, processing the analysis and actually implementing the insights. It was overwhelming at first because I had no clue on where to begin. What helped is to focus on things one at a time and actually trying things out. For Data Engineering, the only way to learn is by doing.”
Miles Ong
Data Engineer
Kumu
“The more I learned, the more I realized how powerful Data Engineering is. It would give me the capability to simply come up with an idea and actually implement it. It's a very underrated field, but I love the challenge of conceptualizing, building and implementing concrete solutions that make a difference.”
Miles Ong
Data Engineer
Kumu
Steps to Data Engineering
(Anderson, 2017)
● PRE-PROJECT
● FORM TEAM
● USE CASE
● KNOW GAPS
● TRAIN, MENTOR
● CHOOSE TECH
● WRITE CODE
● EVALUATE ITERATION
● REPEAT
3. Big Data
Concepts and Applications
Big Data (Working Definition)
Big data is a field that treats ways to
analyze, systematically extract information
from, or otherwise deal with data sets that
are too large or complex to be dealt with
by traditional data-processing
application software.
THE 5Vs OF BIG DATA
● VOLUME: Terabytes; Records; Architecture; Transactions; Tables, Files
● VARIETY: Structured; Unstructured; Text, image, video, social relations; Multi-factor; Probabilistic
● VELOCITY: Batch; Real/near-time; Processes; Streams
● VERACITY: Trustworthiness; Authenticity; Origin, Reputation; Availability; Accountability
● VALUE: Statistical; Events; Correlations; Hypothetical; Fresh? Old?
Author's reinterpretation of The 5Vs of Big Data
(Ishwarappa & Anuradha, 2015; Yin & Kaynak, 2015)
Workload Management
Servers
Queues (or Partitions)
Service vs Wait Times
Concurrency
Serial vs Parallel
Synchronous vs Asynchronous
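The relationship between servers, queues, and service vs wait times can be illustrated with a small deterministic simulation. This is only a sketch under stated assumptions: the function name `average_wait`, the arrival times, and the service times are all hypothetical, and jobs are assumed to be served first-come, first-served from a single shared queue.

```python
import heapq

def average_wait(arrivals, service_times, servers):
    """Average time jobs wait in one FIFO queue served by `servers` workers."""
    # Min-heap holding the time at which each server next becomes free.
    free_at = [0.0] * servers
    heapq.heapify(free_at)
    total_wait = 0.0
    for arrival, service in zip(arrivals, service_times):
        earliest_free = heapq.heappop(free_at)
        start = max(arrival, earliest_free)   # wait only if every server is busy
        total_wait += start - arrival
        heapq.heappush(free_at, start + service)
    return total_wait / len(arrivals)

# Hypothetical workload: one job arrives per time unit, each needs 3 units of service.
arrivals = [0, 1, 2, 3, 4, 5]
service_times = [3, 3, 3, 3, 3, 3]

serial = average_wait(arrivals, service_times, servers=1)
parallel = average_wait(arrivals, service_times, servers=3)
print(serial, parallel)
```

With one server the queue keeps growing because jobs arrive faster than they are served; with three servers working in parallel, capacity matches the arrival rate and no job waits.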
Flynn's Taxonomy
● SISD: Single Instruction, Single Data systems
● SIMD: Single Instruction, Multiple Data systems
● MIMD: Multiple Instruction, Multiple Data systems (multi-processors)
● MISD: Multiple Instruction, Single Data systems (pipeline?)
Traditional Way
(Edureka! https://www.edureka.co/blog/mapreduce-tutorial/)
Map-Reduce
(Edureka! https://www.edureka.co/blog/mapreduce-tutorial/)
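As a rough, single-process illustration of the Map-Reduce idea, the sketch below counts words in three phases: map, shuffle, and reduce. It is not Hadoop code; the function names and the input lines are assumptions made for this example, and a real cluster would run the map and reduce steps in parallel on different machines.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

# Hypothetical input splits, one string per split.
lines = ["Dear Bear River", "Car Car River", "Deer Car Bear"]

mapped = [pair for line in lines for pair in map_phase(line)]
word_counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(word_counts)
```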
Adapted from Image by Cloudera
Hadoop Ecosystem (Edureka!, https://www.edureka.co/blog/hadoop-ecosystem )
Sample AWS Data Lake Platform
(AWS, https://docs.aws.amazon.com/whitepapers/latest/building-data-lakes/building-data-lake-aws.html)
Data Integration and Big Data Analytics Framework (Jimenez-Marquez, Gonzalez-Carrasco, Lopez-Cuadrado, & Ruiz-Mezcua, 2019)
First Stage phases in detail using Yelp as example
(Jimenez-Marquez, Gonzalez-Carrasco, Lopez-Cuadrado, & Ruiz-Mezcua, 2019)
Deep Learning Modeling Lifecycle (Miao, Li, Davis, & Deshpande, 2017)
Data Lake Concept
“If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”
(Dixon, 2010; Miloslavskaya & Tolstoy, 2016)
James Dixon
Chief Technology Officer
Pentaho
A data lake refers to a massively scalable storage
repository that holds a vast amount of raw data in its
native format («as is») until it is needed plus processing
systems (engine) that can ingest data without
compromising the data structure
(Laskowski, 2016, as cited by Miloslavskaya & Tolstoy, 2016)
Three types of big data processing
● Batch Processing
● Stream Processing (Kappa Architecture)
● Hybrid Processing (Lambda Architecture)
(Marz & Warren, 2015)
(Miloslavskaya & Tolstoy, 2016; Samizadeh, 2018, March 15)
Kappa Architecture
(Miloslavskaya & Tolstoy, 2016; Samizadeh, 2018, March 15)
Lambda Architecture
(Miloslavskaya & Tolstoy, 2016; Samizadeh, 2018, March 15)
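As a minimal sketch of the hybrid (Lambda) idea, the example below keeps a batch view recomputed from the full history and a speed view covering only events that the last batch run has not yet seen; a query merges the two. The page-view data and the names `batch_view`, `speed_view`, and `query` are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical master dataset: an append-only list of (page, views) events.
master_dataset = [("home", 1), ("about", 1), ("home", 1)]
# Events that arrived after the last batch run.
recent_events = [("home", 1), ("pricing", 1)]

def batch_view(events):
    """Batch layer: recompute totals from the complete history."""
    totals = Counter()
    for page, views in events:
        totals[page] += views
    return totals

def speed_view(events):
    """Speed layer: the same aggregation, applied only to not-yet-batched events."""
    return batch_view(events)

def query(page, batch, speed):
    """Serving side: merge the precomputed batch view with the real-time view."""
    return batch[page] + speed[page]

batch = batch_view(master_dataset)
speed = speed_view(recent_events)
print(query("home", batch, speed))
print(query("pricing", batch, speed))
```

In a Kappa-style design, by contrast, the separate batch path would be dropped and all data, historical or new, would flow through the same stream-processing code.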
Fast Data Concept
What is Fast Data?
Fast data corresponds to the application of
big data analytics to smaller data sets in
near-real or real-time in order to solve a
particular problem.
(Laskowski, 2016, as cited by Miloslavskaya & Tolstoy, 2016)
The combination of in-memory databases and data
grid on top of flash devices will allow an increase in
the capacity of stream processing.
Fast data is a complementary approach to big data
for managing large quantities of «in-flight» data
(Laskowski, 2016, as cited by Miloslavskaya & Tolstoy, 2016)
Fast Data requires two technologies
(Laskowski, 2016, as cited by Miloslavskaya & Tolstoy, 2016)
● A streaming system capable of delivering events as fast as they come in
● A data store capable of processing each item as fast as it arrives
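The two pieces can be mimicked in a few lines: a queue stands in for the streaming system, and a small in-memory object stands in for a data store that updates its state on every arriving item. This is a toy sketch; `RunningAverageStore` and the sensor readings are hypothetical, and a production setup would use a real streaming platform and database.

```python
from collections import deque

class RunningAverageStore:
    """Toy in-memory store that updates an aggregate as each item arrives."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def process(self, reading):
        # Constant-time update per event: no re-scan of earlier data is needed.
        self.count += 1
        self.total += reading
        return self.total / self.count

# Hypothetical stream of sensor readings delivered in arrival order.
stream = deque([21.0, 23.0, 22.0])
store = RunningAverageStore()

averages = []
while stream:
    averages.append(store.process(stream.popleft()))
print(averages)
```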
Data Quality
(Anaconda, 2020, p.12)
Data Cleaning
Data cleaning, also called data cleansing or scrubbing, deals with
detecting and removing errors and inconsistencies from data in order
to improve the quality of data.
Sample inconsistencies:
● misspellings during data entry
● missing information
● other invalid data
(Wang, Kon, & Madnick, 1993)
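The three sample inconsistencies above can each be detected with a simple rule. The sketch below is one possible approach, not a standard algorithm: the field names, the `KNOWN_CITIES` and `CORRECTIONS` lookups, and the age range are all assumptions made for this example.

```python
# Hypothetical reference data for this example.
KNOWN_CITIES = {"manila", "quezon city", "cebu"}
CORRECTIONS = {"manilla": "manila"}  # known misspelling -> canonical spelling

def clean(record):
    """Return a cleaned copy of the record plus a list of detected issues."""
    issues = []

    # Misspellings: normalize case and whitespace, then apply known corrections.
    city = (record.get("city") or "").strip().lower()
    city = CORRECTIONS.get(city, city)

    # Missing information and other invalid values.
    if not city:
        issues.append("missing city")
    elif city not in KNOWN_CITIES:
        issues.append("unknown city")

    age = record.get("age")
    if age is None:
        issues.append("missing age")
    elif not (0 <= age <= 120):
        issues.append("invalid age")

    return {"city": city or None, "age": age}, issues

print(clean({"city": " Manilla ", "age": 30}))
print(clean({"city": "", "age": 150}))
```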
Data Cleaning Approaches
● Data Analysis
● Definition of Transformation Workflow
● Data Verification
● Data Transformation
● Backflow of Cleaned Data
● Special domain cleaning
● Specialized cleaning tools
● ETL Tools
Web Scraping
Data Wrangling
Event Streaming
Event streaming is the digital equivalent of the human
body's central nervous system. It is the technological
foundation for the 'always-on' world where businesses
are increasingly software-defined and automated, and
where the user of software is more software.
(Apache Software Foundation, 2017)
Event streaming involves:
● Capturing data in real-time from event sources
● Storing these event streams durably for later retrieval and manipulation
● Routing the event streams to different destination technologies as needed
(Apache Software Foundation, 2017)
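The capture, store, and route steps can be sketched with a tiny in-memory event log. This is not the Kafka API; the `EventLog` class, its methods, and the topic names are hypothetical, and durability here is only simulated by keeping events in an append-only list.

```python
from collections import defaultdict

class EventLog:
    """Toy append-only event log with topic-based routing."""

    def __init__(self):
        self._events = []                      # stands in for durable storage
        self._subscribers = defaultdict(list)  # topic -> handlers

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, payload):
        # Capture: wrap the incoming data as an event with a position in the log.
        event = {"offset": len(self._events), "topic": topic, "payload": payload}
        # Store: append-only, so earlier events stay available for later retrieval.
        self._events.append(event)
        # Route: hand the event to every destination registered for its topic.
        for handler in self._subscribers[topic]:
            handler(event)
        return event["offset"]

    def replay(self, topic):
        """Later retrieval: return all stored events for a topic, in order."""
        return [event for event in self._events if event["topic"] == topic]

log = EventLog()
click_sink = []
log.subscribe("clicks", click_sink.append)
log.publish("clicks", {"page": "home"})
log.publish("payments", {"amount": 100})
print(len(click_sink), log.replay("payments"))
```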
Data Streams (examples)
● Time Series Data
● Network Traffic
● Telecommunications
● Video Surveillance
● Website Clickstreams
● Sensor Networks
(Miloslavskaya & Tolstoy, 2016)
Illustration of Data Capture Equipment (Chen, Mao, & Liu, 2014)
Big Data Concept
CAP Theorem
Consistency, Availability, Partition Tolerance Trade-offs
Visualization of Eric Brewer's CAP Theorem (Brewer, 2012; Khazaei, 2016)
Big Data characteristics and NoSQL features (Khazaei, 2016)
“When a system processes trillions and trillions of requests, events that normally have a low probability of occurrence are now guaranteed to happen and must be accounted for upfront in the design and architecture of the system.”
(Vogels, 2009)
Werner Vogels
Vice President & Chief Technology Officer
Amazon.com
Source: Wikipedia
ACID vs BASE Comparison
ACID vs BASE
ACID
● Atomicity
● Consistency
● Isolation
● Durability
BASE
● Basically Available
● Soft state
● Eventual Consistency
(Brewer, 2012; Tudorica & Bucur, 2011)
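Atomicity, the "A" in ACID, can be demonstrated with Python's built-in `sqlite3` module. In this sketch, a transfer that would overdraw an account fails partway through and the whole transaction is rolled back, so no partial update survives. The `accounts` table, the names, and the amounts are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, source, target, amount):
    """Move money between accounts; either both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, target),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, source),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected a negative balance; the credit is undone too.
        return False

ok = transfer(conn, "alice", "bob", 30)
failed = transfer(conn, "alice", "bob", 500)
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

A BASE-style system would typically accept such writes on whichever replica is reachable and reconcile the copies later, trading this immediate guarantee for availability.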
NoSQL
Survey, Comparisons, Taxonomies
No official NoSQL Taxonomy exists:
Core NoSQL, with examples
● Wide Column Store: Hadoop / HBase, Cassandra, Hypertable, Cloudata, Amazon SimpleDB, SciDB
● Document Store: CouchDB, MongoDB, Terrastore, ThruDB, OrientDB, RavenDB, Citrusleaf, SisoDB, CloudKit, Persevere, Jackrabbit
● Key Value/Tuple Store: Azure Table Storage, MEMBASE, Riak, Redis, Chordless, GenieDB, Scalaris, Tokyo Cabinet / Tyrant, GT.M, Keyspace, Berkeley DB, MemcacheDB, HamsterDB, Faircom CTree, Mnesia, LightCloud, Pincaster, Hibari, Scality
● Eventually-Consistent Key Value Store: Amazon Dynamo, Voldemort, Dynomite, KAI, SubRecord, Mo8onDb, Dovetaildb
● Graph Database: Neo4J, Infinite Graph, Sones, InfoGrid, HyperGraphDB, Trinity, AllegroGraph, Bigdata, DEX, OpenLink Virtuoso, VertexDB, FlockDB, Java Universal Network / Graph Framework, Sesame, Filament, OWLim, NetworkX, iGraph
(Tudorica & Bucur, 2011)
No official NoSQL Taxonomy exists:
Soft NoSQL, with examples
● Object Databases: db4o, Versant, Objectivity, Gemstone, Progress, Starcounter, Perst, ZODB, NEO, PicoLisp, Sterling, StupidDB, KiokuDB, Durus
● Grid and Cloud Database Solutions: GigaSpaces, Queplix, Hazelcast, Joafip, GridGain, Infinispan, Coherence, eXtremeScale
● XML Databases: Mark Logic Server, EMC Documentum xDB, Tamino, eXist, Sedna, BaseX, Xindice, Qizx, Berkeley DB XML
● Multivalue Databases: U2, OpenInsight, OpenQM, Globals
● Other NoSQL related databases: IBM Lotus/Domino, Intersystems Cache, eXtremeDB, ISIS Family, Prevayler, Yserial
(Tudorica & Bucur, 2011)
Key Value
● Simplest form of database management systems
● Can only store pairs of keys and values, as well as retrieve values when a key is known
● Normally not adequate for complex applications
● Simplicity makes these attractive in certain circumstances
(Khazaei, 2016)
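The key-value model is small enough to sketch in full. The class below is a hypothetical in-memory stand-in, not the API of any particular product; note that the only way to read data is by an exact, known key.

```python
class KeyValueStore:
    """Toy key-value store: values can only be reached through their key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

kv = KeyValueStore()
kv.put("session:42", {"user": "joben", "cart": ["book"]})
print(kv.get("session:42"))
print(kv.get("session:99"))  # unknown key: nothing to retrieve
```

There is no way to ask "which sessions contain a book?" without scanning every value, which hints at why this model is normally not adequate for complex applications.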
Column-Oriented
● Stores data in records with an ability to hold very large numbers of dynamic columns
● Can be seen as two-dimensional key-value stores
● Schema-free like document stores, however the implementation is significantly different
(Khazaei, 2016)
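The "two-dimensional key-value store" view can be pictured as a mapping from row key to a second mapping from column name to value. The sketch below only illustrates that logical model with hypothetical rows and columns; it says nothing about how real column-oriented systems lay data out on disk.

```python
from collections import defaultdict

# Row key -> {column name -> value}; each row may carry a different set of columns.
table = defaultdict(dict)
table["user:1"]["name"] = "Ana"
table["user:1"]["email"] = "ana@example.com"
table["user:2"]["name"] = "Ben"
table["user:2"]["last_login"] = "2020-09-14"  # a column that user:1 does not have

def get_cell(table, row_key, column):
    """Look up one value by its two keys; returns None if either key is absent."""
    return table.get(row_key, {}).get(column)

print(get_cell(table, "user:2", "last_login"))
print(get_cell(table, "user:1", "last_login"))
```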
Document Stores
● Also known as document-oriented database systems
● Schema-free organization
● Records (or "documents") do not need to have a uniform structure
● The types of the values of individual columns can be different
● Columns can have more than one value (arrays); records can have a nested structure
● Document stores often use internal notations, usually JSON.
(Khazaei, 2016)
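These properties are easy to see with plain JSON-style documents. In the sketch below, two hypothetical records in the same collection have different fields, different value types for the same field, an array, and a nested structure; the `find` helper is an assumption made for this example, not a real query API.

```python
import json

# Two documents in one collection with non-uniform structure.
documents = [
    {"_id": 1, "name": "Ana", "skills": ["SQL", "Python"]},                       # array value
    {"_id": 2, "name": "Ben", "skills": "Spark", "address": {"city": "Manila"}},  # nested record
]

def find(docs, field, value):
    """Return ids of documents whose field equals the value or contains it in a list."""
    matches = []
    for doc in docs:
        current = doc.get(field)
        if current == value or (isinstance(current, list) and value in current):
            matches.append(doc["_id"])
    return matches

# Documents round-trip through JSON, the notation many document stores use.
restored = json.loads(json.dumps(documents))
print(find(restored, "skills", "Python"), find(restored, "skills", "Spark"))
```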
Graph Oriented
● Represent data in graph structures as nodes and edges
● Edges represent relationships between nodes
● Allow easy processing of data in that form (graphs)
● Simple calculation of specific properties of the graph, such as the number of steps needed to get from one node to another node
(Khazaei, 2016)
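The "number of steps" property mentioned above can be computed with a breadth-first search over an adjacency list. The graph below is hypothetical, and this is a generic textbook technique rather than the query mechanism of any specific graph database.

```python
from collections import deque

# Hypothetical directed graph: node -> list of nodes it has an edge to.
graph = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": ["E"],
    "E": [],
}

def steps_between(graph, start, goal):
    """Fewest edges needed to get from start to goal, or None if unreachable."""
    if start == goal:
        return 0
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        node, distance = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor == goal:
                return distance + 1
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, distance + 1))
    return None

print(steps_between(graph, "A", "E"))
print(steps_between(graph, "E", "A"))
```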
NoSQL Solutions (Khazaei, 2016)
Interrelation between Big Data, Fast Data,
and Data Lake Concepts
(Laskowski, 2016, as cited by Miloslavskaya & Tolstoy, 2016)
Takeaways
● Upskilling is not impossible
● Understand workload management and trade-offs when making architecture decisions
● Don't be afraid to work with other people
● Experiment, experiment, experiment…
● Be a wide reader and hungry learner
Thanks!
Any questions?
You can email me at:
jbilagan@ateneo.edu
Credits
Special thanks to all the people who made
and released these awesome resources for
free:
⬡ Presentation template by SlidesCarnival
⬡ Photographs by Unsplash
References
Anaconda (2020). 2020 State of Data Science: Moving from hype toward maturity.
Anderson, J. (2017). Data Engineering Teams: Creating Successful Big Data Teams and Products.
Anderson, J. (2018, April 11). Data engineers vs. data scientists. O'Reilly Media. https://www.oreilly.com/radar/data-engineers-vs-data-scientists/
Apache Software Foundation. (2017). Introduction: Everything you need to know about Kafka in 10 minutes. Apache Kafka. https://kafka.apache.org/intro
Brewer, E. (2001). Lessons from giant-scale services. IEEE Internet Computing, 5(4), 46-55. https://dx.doi.org/10.1109/4236.939450
Brewer, E. (2012). CAP Twelve Years Later: How the “Rules” Have Changed. Computer, 45(2), 23-29. https://dx.doi.org/10.1109/mc.2012.37
Chen, M., Mao, S., & Liu, Y. (2014). Big data: A survey. Mobile Networks and Applications, 19(2), 171-209.
Dean, J., Ghemawat, S., Mehta, B. (2008). MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1), 107-113. https://dx.doi.org/10.1145/1327452.1327492
De Mauro, A., Greco, M., Grimaldi, M., & Ritala, P. (2018). Human resources for Big Data professions: A systematic classification of job roles and required skill sets. Information Processing & Management, 54(5), 807-817.
Devopedia. (2020). CAP Theorem. Version 4, April 30. Accessed 2020-09-14. https://devopedia.org/cap-theorem
Dixon, J. (2010, October 14). Pentaho, Hadoop, and Data Lakes. James Dixon’s Blog. https://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes/
Feng, Z., Hui-Feng, X., Dong-Sheng, X., Yong-Heng, Z., & Fei, Y. (2013). Big data cleaning algorithms in cloud computing. International Journal of Online Engineering, 9(3), 77–81. https://doi.org/10.3991/ijoe.v9i3.2765
Gray, J., & Shenoy, P. (2000). Rules of thumb in data engineering. Proceedings - International Conference on Data Engineering, 3–10. https://doi.org/10.1109/icde.2000.839382
Ishwarappa, & Anuradha, J. (2015). A brief introduction on big data 5Vs characteristics and hadoop technology. Procedia Computer Science, 48(C), 319–324. https://doi.org/10.1016/j.procs.2015.04.188
Jimenez-Marquez, J., Gonzalez-Carrasco, I., Lopez-Cuadrado, J., Ruiz-Mezcua, B. (2019). Towards a big data framework for analyzing social media content. International Journal of Information Management, 44, 1-12. https://dx.doi.org/10.1016/j.ijinfomgt.2018.09.003
Khazaei, H. (2016). How do I choose the right NoSQL solution? Big Data, X(0), 1–33.
Laskowski, N. (2016). Data lake governance: A big data do or die. URL: http://searchcio.techtarget.com/feature/Data-lake-governance-A-big-data-do-or-die (access date 28/05/2016)
Marz, N., & Warren, J. (2015). Big Data: Principles and best practices of scalable real-time data systems. New York: Manning Publications Co.
Miao, H., Li, A., Davis, L. S., & Deshpande, A. (2017, April). Towards unified data and lifecycle management for deep learning. In 2017 IEEE 33rd International Conference on Data Engineering (ICDE) (pp. 571-582). IEEE.
Miloslavskaya, N., & Tolstoy, A. (2016). Big Data, Fast Data and Data Lake Concepts. Procedia Computer Science, 88, 300–305. https://doi.org/10.1016/j.procs.2016.07.439
Najafabadi, M. M., Villanustre, F., Khoshgoftaar, T. M., Seliya, N., Wald, R., & Muharemagic, E. (2015). Deep learning applications and challenges in big data analytics. Journal of Big Data, 2(1), 1.
Rahm, E., & Do, H. H. (2000). Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 23(4), 3-13.
Samizadeh, I. (2018, March 15). A brief introduction to two data processing architectures - Lambda and Kappa for Big Data. https://towardsdatascience.com/a-brief-introduction-to-two-data-processing-architectures-lambda-and-kappa-for-big-data-4f35c28005bb
Tudorica, B., Bucur, C. (2011). A comparison between several NoSQL databases with comments and notes. 2011 RoEduNet International Conference 10th Edition: Networking in Education and Research, 1-5. https://dx.doi.org/10.1109/roedunet.2011.5993686
Van Der Aalst, W. (2016). Data science in action. In Process mining (pp. 3-23). Springer, Berlin, Heidelberg.
Vogels, W. (2009). Eventually consistent. Communications of the ACM, 52(1), 40-44. https://dx.doi.org/10.1145/1435417.1435432
Wang, R. Y., Kon, H. B., & Madnick, S. E. (1993). Data quality requirements analysis and modeling. Proceedings - International Conference on Data Engineering, 670–677. https://doi.org/10.1109/icde.1993.344012
Yin, S., & Kaynak, O. (2015). Big data for modern industry: challenges and trends [point of view]. Proceedings of the IEEE, 103(2), 143-146.
Big data analytics 1
Big data analytics 1Big data analytics 1
Big data analytics 1
 
big data
big databig data
big data
 
BigData Analytics
BigData AnalyticsBigData Analytics
BigData Analytics
 
The book of elephant tattoo
The book of elephant tattooThe book of elephant tattoo
The book of elephant tattoo
 
Smart Data Webinar: Choosing the Right Data Management Architecture for Cogni...
Smart Data Webinar: Choosing the Right Data Management Architecture for Cogni...Smart Data Webinar: Choosing the Right Data Management Architecture for Cogni...
Smart Data Webinar: Choosing the Right Data Management Architecture for Cogni...
 
Python's Role in the Future of Data Analysis
Python's Role in the Future of Data AnalysisPython's Role in the Future of Data Analysis
Python's Role in the Future of Data Analysis
 
From Volume to Value - A Guide to Data Engineering
From Volume to Value - A Guide to Data EngineeringFrom Volume to Value - A Guide to Data Engineering
From Volume to Value - A Guide to Data Engineering
 
Cisco event 6 05 2014v3 wwt only
Cisco event 6 05 2014v3 wwt onlyCisco event 6 05 2014v3 wwt only
Cisco event 6 05 2014v3 wwt only
 
Big data analysis concepts and references
Big data analysis concepts and referencesBig data analysis concepts and references
Big data analysis concepts and references
 
Accelerate Digital Transformation with an Enterprise Big Data Fabric
Accelerate Digital Transformation with an Enterprise Big Data FabricAccelerate Digital Transformation with an Enterprise Big Data Fabric
Accelerate Digital Transformation with an Enterprise Big Data Fabric
 
BIAM 410 Final Paper - Beyond the Buzzwords: Big Data, Machine Learning, What...
BIAM 410 Final Paper - Beyond the Buzzwords: Big Data, Machine Learning, What...BIAM 410 Final Paper - Beyond the Buzzwords: Big Data, Machine Learning, What...
BIAM 410 Final Paper - Beyond the Buzzwords: Big Data, Machine Learning, What...
 
The future of big data analytics
The future of big data analyticsThe future of big data analytics
The future of big data analytics
 
Nikita rajbhoj(a 50)
Nikita rajbhoj(a 50)Nikita rajbhoj(a 50)
Nikita rajbhoj(a 50)
 
Challenges in Analytics for BIG Data
Challenges in Analytics for BIG DataChallenges in Analytics for BIG Data
Challenges in Analytics for BIG Data
 

Recently uploaded

一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
ewymefz
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
ArpitMalhotra16
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
u86oixdj
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
mbawufebxi
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
ahzuo
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
Opendatabay
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
haila53
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
MaleehaSheikh2
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
balafet
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
yhkoc
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
nscud
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
benishzehra469
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Linda486226
 

Recently uploaded (20)

一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
 

The role of data engineering in data science and analytics practice

  • 1. The Role of Data Engineering in Data Science and Analytics Practice Joseph Benjamin ILAGAN Ateneo de Manila University jbilagan@ateneo.edu
  • 2. Hello! I am Joben Ilagan. Thank you for having me here. Twitter: @jilagan LinkedIn: https://www.linkedin.com/in/jobenilagan/
  • 6. Data Engineering Team: A data engineering team is NOT a collection of data engineers. A data engineering team isn’t made up of a single type of person or title. A data engineering team is multidisciplinary. (Anderson, 2017)
  • 7. Data Engineering Team: Creates data pipelines. Brings together 10-30 different big data technologies. Understands and chooses the right tools for the job. Understands the various technologies and frameworks in depth. Combines them to create solutions that enable a company’s business processes with data pipelines. (Anderson, 2018)
  • 8. Where does data engineering fit in the context of data analytics? Data science is an interdisciplinary field aiming to turn data into real value. Data may be structured or unstructured, big or small, static or streaming. Value may be provided in the form of predictions, automated decisions, models learned from data, or any type of data visualization delivering insights. Data science includes data extraction, data preparation, data exploration, data transformation, storage and retrieval, computing infrastructures, various types of mining and learning, presentation of explanations and predictions, and the exploitation of results taking into account ethical, social, legal, and business aspects. (Van Der Aalst, 2016)
  • 9. Where does data engineering fit in the context of data analytics? The ingredients contributing to data science (Van Der Aalst, 2016)
  • 10. Where does data engineering fit in the context of data analytics? The Internet of Events (Van Der Aalst, 2016)
  • 11. Alluvial diagram of Big Data job families vs. Big Data skill sets (De Mauro, Greco, Grimaldi, & Ritala, 2018)
  • 12. Word cloud showing the top 50 words recurring in the job titles of posts related to Big Data. The font size of each word is proportional to its number of occurrences. (De Mauro, Greco, Grimaldi, & Ritala, 2018)
  • 15. 2. How do I get started in Data Engineering? Process, Skills, Tools
  • 16. “A Data Engineer is someone who has specialized their skills in creating software solutions around data.” Jesse Anderson, Data Engineer, Managing Director, Big Data Institute (Anderson, 2017)
  • 17. “I started getting into Data about 5 years ago when everyone started talking about it and was becoming the new buzz word. People were actually starting to realize how much you can do with data. Everyone wanted to learn Data Science and all the companies wanted to get into Machine Learning or Artificial Intelligence, but there was still a missing piece - that's when I learned about Data Engineering.” Miles Ong, Data Engineer, Kumu
  • 18. “There was always the problem of collecting the data, processing the analysis and actually implementing the insights. It was overwhelming at first because I had no clue on where to begin. What helped is to focus on things one at a time and actually trying things out. For Data Engineering, the only way to learn is by doing.” Miles Ong, Data Engineer, Kumu
  • 19. “The more I learned, the more I realized how powerful Data Engineering is. It would give me the capability to simply come up with an idea and actually implement it. It's a very underrated field, but I love the challenge of conceptualizing, building and implementing concrete solutions that make a difference.” Miles Ong, Data Engineer, Kumu
  • 20. Steps to Data Engineering PRE-PROJECT (Anderson, 2017)
  • 21. Steps to Data Engineering FORM TEAM (Anderson, 2017)
  • 22. Steps to Data Engineering USE CASE (Anderson, 2017)
  • 23. Steps to Data Engineering KNOW GAPS (Anderson, 2017)
  • 24. Steps to Data Engineering TRAIN, MENTOR (Anderson, 2017)
  • 25. Steps to Data Engineering CHOOSE TECH (Anderson, 2017)
  • 26. Steps to Data Engineering WRITE CODE (Anderson, 2017)
  • 27. Steps to Data Engineering EVALUATE ITERATION (Anderson, 2017)
  • 28. Steps to Data Engineering REPEAT (Anderson, 2017)
  • 29. 3. Big Data Concepts and Applications
  • 30. Big Data (Working Definition): Big data is a field that treats ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by traditional data-processing application software.
  • 31. Big Data (Working Definition): Big data is a field that treats ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by traditional data-processing application software. VOLUME, VARIETY, VELOCITY … (more on this in the next slide...)
  • 32. The 5Vs of Big Data (VOLUME, VARIETY, VERACITY, VELOCITY, VALUE). Author's reinterpretation of The 5Vs of Big Data (Ishwarappa & Anuradha, 2015; Yin & Kaynak, 2015)
  • 33. The 5Vs of Big Data (VOLUME, VARIETY, VERACITY, VELOCITY, VALUE). Highlighted on this slide: Structured, Unstructured, Text, Image, video, social relations, Multi-factor, Probabilistic. Author's reinterpretation of The 5Vs of Big Data (Ishwarappa & Anuradha, 2015; Yin & Kaynak, 2015)
  • 34. The 5Vs of Big Data (VOLUME, VARIETY, VERACITY, VELOCITY, VALUE). Highlighted on this slide: Terabytes, Records, Architecture, Transactions, Tables, Files. Author's reinterpretation of The 5Vs of Big Data (Ishwarappa & Anuradha, 2015; Yin & Kaynak, 2015)
  • 35. The 5Vs of Big Data (VOLUME, VARIETY, VERACITY, VELOCITY, VALUE). Highlighted on this slide: Statistical, Events, Correlations, Hypothetical, Fresh? Old? Author's reinterpretation of The 5Vs of Big Data (Ishwarappa & Anuradha, 2015; Yin & Kaynak, 2015)
  • 36. The 5Vs of Big Data (VOLUME, VARIETY, VERACITY, VELOCITY, VALUE). Highlighted on this slide: Batch, Real/near-time, Processes, Streams. Author's reinterpretation of The 5Vs of Big Data (Ishwarappa & Anuradha, 2015; Yin & Kaynak, 2015)
  • 37. The 5Vs of Big Data (VOLUME, VARIETY, VERACITY, VELOCITY, VALUE). Highlighted on this slide: Trustworthiness, Authenticity, Origin, Reputation, Availability, Accountability. Author's reinterpretation of The 5Vs of Big Data (Ishwarappa & Anuradha, 2015; Yin & Kaynak, 2015)
  • 38. Workload Management: Servers; Queues (or Partitions); Service vs Wait Times; Concurrency; Serial vs Parallel; Synchronous vs Asynchronous
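The workload-management terms on this slide can be made concrete with a small sketch. It assumes a toy model that is not from the talk: jobs arrive at fixed times, a single server works through a first-in-first-out queue, and a second function hands queued jobs to whichever server frees up first. The function names and the sample numbers are hypothetical.

```python
def fifo_wait_times(arrivals, service_times):
    """Wait time of each job on one server with a first-in-first-out queue."""
    waits = []
    server_free_at = 0.0
    for arrival, service in zip(arrivals, service_times):
        start = max(arrival, server_free_at)  # wait if the server is still busy
        waits.append(start - arrival)
        server_free_at = start + service
    return waits


def makespan(service_times, servers):
    """Finish time when queued jobs go, in order, to the first free server."""
    free_at = [0.0] * servers
    for service in service_times:
        idx = free_at.index(min(free_at))  # pick the server that frees up first
        free_at[idx] += service
    return max(free_at)


# Hypothetical workload: four jobs that each need 3 time units of service.
jobs = [3.0, 3.0, 3.0, 3.0]
serial = makespan(jobs, servers=1)      # one server: jobs run one after another
parallel = makespan(jobs, servers=2)    # two servers: jobs run two at a time
waits = fifo_wait_times([0.0, 1.0, 2.0, 3.0], jobs)
print(serial, parallel, waits)
```

In this toy run the service time per job never changes, yet the wait time grows for each later arrival because jobs arrive faster than one server can finish them. Adding a server shortens the total finish time without changing the service time of any single job.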
  • 39. Flynn's Taxonomy: SISD, Single Instruction Single Data systems
  • 40. Flynn's Taxonomy: SIMD, Single Instruction Multiple Data systems
  • 41. Flynn's Taxonomy: MIMD, Multiple Instruction Multiple Data systems (multi-processors)
  • 42. Flynn's Taxonomy: MISD, Multiple Instruction Single Data systems (Pipeline?)
  • 46. Adapted from Image by Cloudera
  • 47. Hadoop Ecosystem (Edureka!, https://www.edureka.co/blog/hadoop-ecosystem)
  • 48. Sample AWS Data Lake Platform (AWS, https://docs.aws.amazon.com/whitepapers/latest/building-data-lakes/building-data-lake-aws.html)
  • 49. Data Integration and Big Data Analytics Framework (Jimenez-Marquez, Gonzalez-Carrasco, Lopez-Cuadrado, & Ruiz-Mezcua, 2019)
  • 50. First Stage phases in detail using Yelp as example (Jimenez-Marquez, Gonzalez-Carrasco, Lopez-Cuadrado, & Ruiz-Mezcua, 2019)
  • 51. Deep Learning Modeling Lifecycle (Miao, Li, Davis, & Deshpande, 2017)
  • 53. “If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.” James Dixon, Chief Technology Officer, Pentaho (Dixon, 2010; Miloslavskaya & Tolstoy, 2016)
  • 54. A data lake refers to a massively scalable storage repository that holds a vast amount of raw data in its native format («as is») until it is needed, plus processing systems (engine) that can ingest data without compromising the data structure. (Laskowski, 2016, as cited by Miloslavskaya & Tolstoy, 2016)
  • 55. Three types of big data processing: Batch Processing; Stream Processing (Kappa Architecture); Hybrid Processing (Lambda Architecture) (Marz & Warren, 2015; Miloslavskaya & Tolstoy, 2016; Samizadeh, 2018, March 15)
  • 56. Kappa Architecture (Miloslavskaya & Tolstoy, 2016; Samizadeh, 2018, March 15)
  • 57. Lambda Architecture (Miloslavskaya & Tolstoy, 2016; Samizadeh, 2018, March 15)
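As a rough illustration of the hybrid (Lambda) idea named on these slides, the sketch below keeps a batch view computed from all stored events and a speed view computed from events that arrived after the last batch run, then merges the two at query time. The page-view data, function names, and the use of in-memory counters are assumptions made for the example, not details from the talk.

```python
from collections import Counter


def batch_view(master_dataset):
    """Batch layer: recompute page-view counts from all stored events."""
    return Counter(event["page"] for event in master_dataset)


def speed_view(recent_events):
    """Speed layer: count only events that arrived after the last batch run."""
    return Counter(event["page"] for event in recent_events)


def query(page, batch, speed):
    """Serving step: merge the batch view and the real-time view."""
    return batch[page] + speed[page]


# Hypothetical events: three already stored, one not yet seen by the batch job.
master = [{"page": "/home"}, {"page": "/home"}, {"page": "/docs"}]
recent = [{"page": "/home"}]

home_views = query("/home", batch_view(master), speed_view(recent))
print(home_views)
```

A Kappa-style design would drop the separate batch path and treat everything as a stream, recomputing views by replaying the stored stream when needed.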
  • 59. What is Fast Data? Fast data corresponds to the application of big data analytics to smaller data sets in near-real or real-time in order to solve a particular problem. (Laskowski, 2016, as cited by Miloslavskaya & Tolstoy, 2016)
  • 60. What is Fast Data? The combination of in-memory databases and data grid on top of flash devices will allow an increase in the capacity of stream processing. Fast data is a complementary approach to big data for managing large quantities of «in-flight» data. (Laskowski, 2016, as cited by Miloslavskaya & Tolstoy, 2016)
  • 61. Fast Data requires two technologies: a streaming system capable of delivering events as fast as they come in, and a data store capable of processing each item as fast as it arrives. (Laskowski, 2016, as cited by Miloslavskaya & Tolstoy, 2016)
  • 64. Data Cleaning: Data cleaning, also called data cleansing or scrubbing, deals with detecting and removing errors and inconsistencies from data in order to improve the quality of data. Sample inconsistencies: misspellings during data entry; missing information; other invalid data. (Wang, Kon, & Madnick, 1993)
  • 65. Data Cleaning Approaches: Data Analysis; Definition of Transformation Workflow; Data Verification; Data Transformation; Backflow of Cleaned Data; Special domain cleaning; Specialized cleaning tools; ETL Tools
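The three sample inconsistencies listed for data cleaning (misspellings, missing information, other invalid data) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the record fields, the `KNOWN_CITIES` reference list, the similarity cutoff, and the age range are all hypothetical, and real pipelines would typically rely on dedicated cleaning or ETL tools.

```python
import difflib

# Hypothetical reference list used to detect and repair misspelled city names.
KNOWN_CITIES = ["Manila", "Cebu", "Davao"]


def clean_record(record):
    """Return (cleaned_record, problems) for one hypothetical customer record."""
    cleaned = dict(record)
    problems = []

    # Missing information: an empty or absent name is flagged, not guessed.
    if not cleaned.get("name"):
        problems.append("missing name")

    # Misspellings: replace a near match with the known spelling.
    city = cleaned.get("city", "")
    if city and city not in KNOWN_CITIES:
        match = difflib.get_close_matches(city, KNOWN_CITIES, n=1, cutoff=0.8)
        if match:
            cleaned["city"] = match[0]
            problems.append(f"corrected city {city!r}")
        else:
            problems.append(f"unknown city {city!r}")

    # Other invalid data: ages outside an assumed plausible range are cleared.
    age = cleaned.get("age")
    if not isinstance(age, int) or not 0 <= age <= 120:
        cleaned["age"] = None
        problems.append("invalid age")

    return cleaned, problems


record, problems = clean_record({"name": "", "city": "Manilla", "age": 250})
print(record, problems)
```

Returning the list of problems alongside the cleaned record keeps a trace of what was changed, which supports the verification step named among the cleaning approaches.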
  • 71. Event streaming is the digital equivalent of the human body's central nervous system. It is the technological foundation for the 'always-on' world where businesses are increasingly software-defined and automated, and where the user of software is more software. (Apache Software Foundation, 2017)
  • 72. Capture data in real-time from event sources. (Apache Software Foundation, 2017)
  • 73. Capture data in real-time from event sources. Store these event streams durably for later retrieval and manipulation. (Apache Software Foundation, 2017)
  • 74. Capture data in real-time from event sources. Store these event streams durably for later retrieval and manipulation. Route the event streams to different destination technologies as needed. (Apache Software Foundation, 2017)
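The capture, store, and route steps above can be sketched with a toy in-memory event log. This is only an illustration of the idea: the `EventLog` class and its methods are hypothetical and are not Apache Kafka's API, and a Python list is of course not durable storage.

```python
from collections import defaultdict


class EventLog:
    """Toy in-memory stand-in for an event streaming platform (not Kafka's API)."""

    def __init__(self):
        self._topics = defaultdict(list)       # topic -> append-only event list
        self._subscribers = defaultdict(list)  # topic -> callbacks to route to

    def subscribe(self, topic, callback):
        """Register a destination that receives every new event on a topic."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, event):
        """Capture an event, store it in the log, then route it to subscribers."""
        log = self._topics[topic]
        log.append(event)
        for callback in self._subscribers[topic]:
            callback(event)
        return len(log) - 1  # position (offset) of the stored event

    def replay(self, topic, from_offset=0):
        """Read stored events later, starting at a chosen offset."""
        return list(self._topics[topic][from_offset:])


log = EventLog()
dashboard = []                                  # hypothetical destination
log.subscribe("clicks", dashboard.append)
log.publish("clicks", {"user": "a", "page": "/home"})
log.publish("clicks", {"user": "b", "page": "/docs"})
late_reader = log.replay("clicks", from_offset=1)
print(dashboard, late_reader)
```

Keeping the stored log separate from the routing step is what lets a consumer that joins late read earlier events instead of missing them.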
  • 75. Data Streams (examples): Time Series Data; Network Traffic; Telecommunications; Video Surveillance; Website Clickstreams; Sensor Networks (Miloslavskaya & Tolstoy, 2016)
  • 76. Illustration of Data Capture Equipment (Chen, Mao, & Liu, 2014)
  • 79. Visualization of Eric Brewer's CAP Theorem (Brewer, 2012; Khazaei, 2016)
  • 80. Big Data characteristics and NoSQL features (Khazaei, 2016)
  • 81. “When a system processes trillions and trillions of requests, events that normally have a low probability of occurrence are now guaranteed to happen and must be accounted for upfront in the design and architecture of the system.” Werner Vogels, Vice President & Chief Technology Officer, Amazon.com (Vogels, 2009). Source: Wikipedia
  • 83. ACID vs BASE. ACID: Atomicity, Consistency, Isolation, Durability. BASE: Basically Available, Soft state, Eventual consistency. (Brewer, 2012; Tudorica & Bucur, 2011)
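To make "eventual consistency" in BASE more tangible, here is a minimal sketch with two replicas that accept writes independently and converge after exchanging state. The last-write-wins rule based on a version number is an assumption chosen for simplicity; real systems use a range of conflict-resolution strategies, and the `Replica` class is hypothetical.

```python
class Replica:
    """One copy of the data; each key stores a (version, value) pair."""

    def __init__(self):
        self.data = {}

    def write(self, key, value, version):
        """Keep the write only if it is newer than what this replica holds."""
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


def sync(a, b):
    """Exchange entries in both directions so each replica keeps the newest."""
    for key, (version, value) in list(a.data.items()):
        b.write(key, value, version)
    for key, (version, value) in list(b.data.items()):
        a.write(key, value, version)


r1, r2 = Replica(), Replica()
r1.write("cart", ["book"], version=1)
stale = r2.read("cart")                        # r2 has not seen the write yet
r2.write("cart", ["book", "pen"], version=2)   # a newer write lands on r2
sync(r1, r2)
converged = r1.read("cart") == r2.read("cart") == ["book", "pen"]
print(stale, converged)
```

Between the first write and the sync, the two replicas disagree (the soft state); after the sync, both return the same value.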
  • 85. No official NoSQL taxonomy exists. Core NoSQL, with examples (Tudorica & Bucur, 2011):
      Wide Column Store: Hadoop / HBase, Cassandra, Hypertable, Cloudata, Amazon SimpleDB, SciDB
      Document Store: CouchDB, MongoDB, Terrastore, ThruDB, OrientDB, RavenDB, Citrusleaf, SisoDB, CloudKit, Persevere, Jackrabbit
      Key Value/Tuple Store: Azure Table Storage, MEMBASE, Riak, Redis, Chordless, GenieDB, Scalaris, Tokyo Cabinet / Tyrant, GT.M, Keyspace, Berkeley DB, MemcacheDB, HamsterDB, Faircom CTree, Mnesia, LightCloud, Pincaster, Hibari, Scality
      Eventually-Consistent Key Value Store: Amazon Dynamo, Voldemort, Dynomite, KAI, SubRecord, Mo8onDb, Dovetaildb
      Graph Database: Neo4J, Infinite Graph, Sones, InfoGrid, HyperGraphDB, Trinity, AllegroGraph, Bigdata, DEX, OpenLink Virtuoso, VertexDB, FlockDB, Java Universal Network / Graph Framework, Sesame, Filament, OWLim, NetworkX, iGraph
  • 86. No official NoSQL taxonomy exists. Soft NoSQL, with examples (Tudorica & Bucur, 2011):
      Object Databases: db4o, Versant, Objectivity, Gemstone, Progress, Starcounter, Perst, ZODB, NEO, PicoLisp, Sterling, StupidDB, KiokuDB, Durus
      Grid and Cloud Database Solutions: GigaSpaces, Queplix, Hazelcast, Joafip, GridGain, Infinispan, Coherence, eXtremeScale
      XML Databases: Mark Logic Server, EMC Documentum xDB, Tamino, eXist, Sedna, BaseX, Xindice, Qizx, Berkeley DB XML
      Multivalue Databases: U2, OpenInsight, OpenQM, Globals
      Other NoSQL related databases: IBM Lotus/Domino, Intersystems Cache, eXtremeDB, ISIS Family, Prevayler, Yserial
  • 87. Key Value: Simplest form of database management systems. Can only store pairs of keys and values, as well as retrieve values when a key is known. Normally not adequate for complex applications. Simplicity makes these attractive in certain circumstances. (Khazaei, 2016)
  • 88. Column-Oriented: Stores data in records with an ability to hold very large numbers of dynamic columns. Can be seen as two-dimensional key-value stores. Schema-free like document stores; however, the implementation is significantly different. (Khazaei, 2016)
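The key-value and column-oriented descriptions can be compared side by side in a short sketch. Both classes below are hypothetical in-memory toys built on Python dictionaries; they only illustrate the access patterns (one key versus a row key plus a column name) and say nothing about how real stores are implemented.

```python
class KeyValueStore:
    """Minimal key-value store: put a pair, get a value when the key is known."""

    def __init__(self):
        self._items = {}

    def put(self, key, value):
        self._items[key] = value

    def get(self, key, default=None):
        return self._items.get(key, default)


class WideColumnStore:
    """Rows addressed by a row key; each row holds its own dynamic columns."""

    def __init__(self):
        self._rows = {}

    def put(self, row_key, column, value):
        self._rows.setdefault(row_key, {})[column] = value

    def get(self, row_key, column, default=None):
        return self._rows.get(row_key, {}).get(column, default)

    def columns(self, row_key):
        return sorted(self._rows.get(row_key, {}))


kv = KeyValueStore()
kv.put("session:42", "joben")
session_user = kv.get("session:42")

wide = WideColumnStore()
wide.put("user:1", "name", "Ana")
wide.put("user:1", "email", "ana@example.com")
wide.put("user:2", "name", "Ben")          # this row has no email column
user1_columns = wide.columns("user:1")
missing = wide.get("user:2", "email")
print(session_user, user1_columns, missing)
```

The second class is the "two-dimensional key-value" view in miniature: a value is located by two keys, and rows are free to carry different columns.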
  • 89. Document Stores: Also known as document-oriented database systems. Schema-free organization. Records (or "documents") do not need to have a uniform structure. The types of the values of individual columns can be different. Columns can have more than one value (arrays); records can have a nested structure. Document stores often use internal notations, usually JSON. (Khazaei, 2016)
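A short sketch can show what "schema-free" and "nested structure" mean in practice. The two sample documents, the `find` helper, and the dotted-path convention are hypothetical and not tied to any particular document database.

```python
import json

# Two hypothetical documents with different shapes: one has an array field,
# the other a nested object. Neither follows a shared schema.
documents = [
    {"_id": 1, "name": "Ana", "tags": ["admin", "analyst"]},
    {"_id": 2, "name": "Ben", "address": {"city": "Manila"}},
]


def find(docs, path, value):
    """Return documents whose (possibly nested) field at `path` equals `value`."""
    results = []
    for doc in docs:
        node = doc
        for part in path.split("."):
            # Walk one level down; a missing field simply yields None.
            node = node.get(part) if isinstance(node, dict) else None
        if node == value:
            results.append(doc)
    return results


in_manila = find(documents, "address.city", "Manila")
as_json = json.dumps(documents[1], sort_keys=True)
print(in_manila, as_json)
```

The query skips the first document without error even though it has no `address` field, which is the practical consequence of records not needing a uniform structure.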
  • 90. Graph Oriented: Represent data in graph structures as nodes and edges. Edges represent relationships between nodes. Allow easy processing of data in that form (graphs). Simple calculation of specific properties of the graph, such as the number of steps needed to get from one node to another node. (Khazaei, 2016)
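The "number of steps needed to get from one node to another" can be computed with a breadth-first search. The sketch below assumes an undirected graph given as a list of edge pairs; the names and the sample relationships are hypothetical, and a graph database would expose this kind of query through its own query language.

```python
from collections import deque


def steps_between(edges, start, goal):
    """Breadth-first search: fewest edges from start to goal, or None."""
    neighbors = {}
    for a, b in edges:
        # Treat every edge as a two-way relationship (an assumption).
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # goal is not reachable from start


# Hypothetical "knows" relationships between people.
follows = [("ana", "ben"), ("ben", "carl"), ("carl", "dina"), ("eli", "fay")]
hops = steps_between(follows, "ana", "dina")
unreachable = steps_between(follows, "ana", "fay")
print(hops, unreachable)
```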
  • 92. Interrelation between Big Data, Fast Data, and Data Lake Concepts (Laskowski, 2016, as cited by Miloslavskaya & Tolstoy, 2016)
  • 93. Takeaways: Upskilling is not impossible. Understand workload management and trade-offs when making architecture decisions. Don't be afraid to work with other people. Experiment, experiment, experiment… Be a wide reader and hungry learner.
  • 94. Thanks! Any questions? You can email me at: jbilagan@ateneo.edu
  • 95. Credits: Special thanks to all the people who made and released these awesome resources for free: presentation template by SlidesCarnival; photographs by Unsplash.
  • 96. References
      Anaconda (2020). 2020 State of Data Science: Moving from hype toward maturity.
      Anderson, J. (2017). Data Engineering Teams: Creating Successful Big Data Teams and Products.
      Anderson, J. (2018, April 11). Data engineers vs. data scientists. O'Reilly Media. https://www.oreilly.com/radar/data-engineers-vs-data-scientists/
      Apache Software Foundation. (2017). Introduction: Everything you need to know about Kafka in 10 minutes. Apache Kafka. https://kafka.apache.org/intro
      Brewer, E. (2001). Lessons from giant-scale services. IEEE Internet Computing, 5(4), 46-55. https://dx.doi.org/10.1109/4236.939450
      Brewer, E. (2012). CAP Twelve Years Later: How the “Rules” Have Changed. Computer, 45(2), 23-29. https://dx.doi.org/10.1109/mc.2012.37
      Chen, M., Mao, S., & Liu, Y. (2014). Big data: A survey. Mobile Networks and Applications, 19(2), 171-209.
      Dean, J., Ghemawat, S., Mehta, B. (2008). MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1), 107-113. https://dx.doi.org/10.1145/1327452.1327492
      De Mauro, A., Greco, M., Grimaldi, M., & Ritala, P. (2018). Human resources for Big Data professions: A systematic classification of job roles and required skill sets. Information Processing & Management, 54(5), 807-817.
      Devopedia. (2020). "CAP Theorem." Version 4, April 30. Accessed 2020-09-14. https://devopedia.org/cap-theorem
      Dixon, J. (2010, October 14). Pentaho, Hadoop, and Data Lakes. James Dixon’s Blog. https://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes/
      Feng, Z., Hui-Feng, X., Dong-Sheng, X., Yong-Heng, Z., & Fei, Y. (2013). Big data cleaning algorithms in cloud computing. International Journal of Online Engineering, 9(3), 77–81. https://doi.org/10.3991/ijoe.v9i3.2765
      Gray, J., & Shenoy, P. (2000). Rules of thumb in data engineering. Proceedings - International Conference on Data Engineering, 3–10. https://doi.org/10.1109/icde.2000.839382
  • 97. References
      Ishwarappa, & Anuradha, J. (2015). A brief introduction on big data 5Vs characteristics and hadoop technology. Procedia Computer Science, 48(C), 319–324. https://doi.org/10.1016/j.procs.2015.04.188
      Jimenez-Marquez, J., Gonzalez-Carrasco, I., Lopez-Cuadrado, J., Ruiz-Mezcua, B. (2019). Towards a big data framework for analyzing social media content. International Journal of Information Management, 44, 1-12. https://dx.doi.org/10.1016/j.ijinfomgt.2018.09.003
      Khazaei, H. (2016). How do I choose the right NoSQL solution? Big Data, X(0), 1–33.
      Laskowski, N. (2016). Data lake governance: A big data do or die. URL: http://searchcio.techtarget.com/feature/Data-lake-governance-A-big-data-do-or-die (access date 28/05/2016)
      Marz, N., & Warren, J. (2015). Big Data: Principles and best practices of scalable real-time data systems. New York: Manning Publications Co.
      Miao, H., Li, A., Davis, L. S., & Deshpande, A. (2017, April). Towards unified data and lifecycle management for deep learning. In 2017 IEEE 33rd International Conference on Data Engineering (ICDE) (pp. 571-582). IEEE.
      Miloslavskaya, N., & Tolstoy, A. (2016). Big Data, Fast Data and Data Lake Concepts. Procedia Computer Science, 88, 300–305. https://doi.org/10.1016/j.procs.2016.07.439
      Najafabadi, M. M., Villanustre, F., Khoshgoftaar, T. M., Seliya, N., Wald, R., & Muharemagic, E. (2015). Deep learning applications and challenges in big data analytics. Journal of Big Data, 2(1), 1.
      Rahm, E., & Do, H. H. (2000). Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 23(4), 3-13.
      Samizadeh, I. (2018, March 15). A brief introduction to two data processing architectures - Lambda and Kappa for Big Data. https://towardsdatascience.com/a-brief-introduction-to-two-data-processing-architectures-lambda-and-kappa-for-big-data-4f35c28005bb
  • 98. References
      Tudorica, B., Bucur, C. (2011). A comparison between several NoSQL databases with comments and notes. 2011 RoEduNet International Conference 10th Edition: Networking in Education and Research, 1, 1-5. https://dx.doi.org/10.1109/roedunet.2011.5993686
      Van Der Aalst, W. (2016). Data science in action. In Process mining (pp. 3-23). Springer, Berlin, Heidelberg.
      Vogels, W. (2009). Eventually consistent. Communications of the ACM, 52(1), 40-44. https://dx.doi.org/10.1145/1435417.1435432
      Wang, R. Y., Kon, H. B., & Madnick, S. E. (1993). Data quality requirements analysis and modeling. Proceedings - International Conference on Data Engineering, 670–677. https://doi.org/10.1109/icde.1993.344012
      Yin, S., & Kaynak, O. (2015). Big data for modern industry: challenges and trends [point of view]. Proceedings of the IEEE, 103(2), 143-146.