The role of data engineering in data science and analytics practice
The document discusses the role of data engineering in the context of data science and analytics, highlighting the multidisciplinary nature of data engineering teams and the significance of creating data pipelines. It outlines essential skills, processes, and big data concepts, including the importance of data cleaning, event streaming, and the distinctions between big data and fast data. The document emphasizes continuous learning and collaboration within the field of data engineering.
A data engineering team is NOT a
collection of Data Engineers
A data engineering team isn’t made up of a
single type of person or title
A data engineering team is
multidisciplinary
Data Engineering Team
(Anderson, 2017)
Creates data pipelines
Brings together 10-30 different big data technologies
Understands and chooses the right tools for the job
Understands the various technologies and
frameworks in-depth
Combines them to create solutions to enable a
company’s business processes with data pipelines
Data Engineering Team
(Anderson, 2018)
Data science is an interdisciplinary field aiming to turn data into real value.
Data may be structured or unstructured, big or small, static or streaming.
Value may be provided in the form of predictions, automated decisions, models
learned from data, or any type of data visualization delivering insights.
Data science includes data extraction, data preparation, data exploration, data
transformation, storage and retrieval, computing infrastructures, various types of
mining and learning, presentation of explanations and predictions, and the
exploitation of results taking into account ethical, social, legal, and business aspects.
Where does data engineering fit in the
context of data analytics?
(Van Der Aalst, 2016)
The ingredients contributing to data science (Van Der Aalst, 2016)
The Internet of Events (Van Der Aalst, 2016)
Alluvial diagram of Big Data job families vs. Big Data skill sets
(De Mauro, Greco, Grimaldi, & Ritala, 2018)
Word cloud showing the top 50 words recurring in the Job Title of posts related to Big Data.
The font size of each word is proportional to the number of occurrences of each word.
(De Mauro, Greco, Grimaldi, & Ritala, 2018).
2. How do I get started in Data Engineering?
Process, Skills, Tools
“A Data Engineer is someone who has specialized their skills in creating software solutions around data.”
(Anderson, 2017)
Jesse Anderson
Data Engineer
Managing Director, Big Data Institute
“I started getting into Data about 5 years ago when everyone started talking about it and it was becoming the new buzzword. People were actually starting to realize how much you can do with data. Everyone wanted to learn Data Science and all the companies wanted to get into Machine Learning or Artificial Intelligence, but there was still a missing piece - that's when I learned about Data Engineering.”
Miles Ong
Data Engineer
Kumu
“There was always the problem of collecting the data, processing the analysis and actually implementing the insights. It was overwhelming at first because I had no clue on where to begin. What helped is to focus on things one at a time and actually trying things out. For Data Engineering, the only way to learn is by doing.”
Miles Ong
Data Engineer
Kumu
“The more I learned, the more I realized how powerful Data Engineering is. It would give me the capability to simply come up with an idea and actually implement it. It's a very underrated field, but I love the challenge of conceptualizing, building and implementing concrete solutions that make a difference.”
Miles Ong
Data Engineer
Kumu
Steps to Data Engineering
PRE-PROJECT
(Anderson, 2017)
Big Data (Working Definition)
Big data is a field that treats ways to
analyze, systematically extract information
from, or otherwise deal with data sets that
are too large or complex to be dealt with
by traditional data-processing
application software.
The 5Vs of Big Data (author's reinterpretation; Ishwarappa & Anuradha, 2015; Yin & Kaynak, 2015):
● VOLUME: Terabytes; Records; Architecture; Transactions; Tables, Files
● VARIETY: Structured; Unstructured; Text, image, video, social relations; Multi-factor; Probabilistic
● VELOCITY: Batch; Real/near-time; Processes; Streams
● VERACITY: Trustworthiness; Authenticity; Origin, Reputation; Availability; Accountability
● VALUE: Statistical; Events; Correlations; Hypothetical; Fresh? Old?
“If you think of a data mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”
(Dixon, 2010; Miloslavskaya & Tolstoy, 2016)
James Dixon
Chief Technology Officer
Pentaho
A data lake refers to a massively scalable storage repository that holds a vast amount of raw data in its native format («as is») until it is needed, plus processing systems (engines) that can ingest data without compromising the data structure.
(Laskowski, 2016, as cited by Miloslavskaya & Tolstoy, 2016)
Three types of big data processing:
● Batch Processing
● Stream Processing (Kappa Architecture)
● Hybrid Processing (Lambda Architecture)
(Marz & Warren, 2015)
(Miloslavskaya & Tolstoy, 2016; Samizadeh, 2018, March 15)
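The contrast between batch and stream processing can be sketched in a few lines of Python. This is purely illustrative (all names and data are invented, not how real batch or streaming engines are built); the closing assertion hints at how a hybrid (Lambda) design reconciles the two views of the same data:

```python
# Minimal sketch: the same word count computed in batch vs. as a stream.
from collections import Counter

def batch_count(events):
    """Batch: process the complete, bounded data set in one pass."""
    return Counter(events)

class StreamCounter:
    """Stream: update incrementally as each event of an unbounded feed arrives."""
    def __init__(self):
        self.counts = Counter()

    def on_event(self, event):
        self.counts[event] += 1
        return self.counts[event]  # a result is available immediately, per event

events = ["click", "view", "click"]

# Batch answer, computed after all data has landed:
assert batch_count(events)["click"] == 2

# Streaming answer, available while data is still arriving:
sc = StreamCounter()
for e in events:
    sc.on_event(e)

# A hybrid (Lambda-style) design checks that both views agree:
assert sc.counts == batch_count(events)
```

The design point: batch favors completeness and simplicity, streaming favors latency; Lambda runs both, Kappa keeps only the streaming path.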
What is Fast Data?
Fast data corresponds to the application of big data analytics to smaller data sets in near-real or real-time in order to solve a particular problem.
(Laskowski, 2016, as cited by Miloslavskaya & Tolstoy, 2016)
The combination of in-memory databases and data grids on top of flash devices will allow an increase in stream-processing capacity. Fast data is a complementary approach to big data for managing large quantities of «in-flight» data.
(Laskowski, 2016, as cited by Miloslavskaya & Tolstoy, 2016)
Fast Data requires two technologies:
● A streaming system capable of delivering events as fast as they come in
● A data store capable of processing each item as fast as it arrives
(Laskowski, 2016, as cited by Miloslavskaya & Tolstoy, 2016)
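A toy sketch of those two components working together; the queue and dict here are invented stand-ins for a real streaming system and a real fast-ingest store, and the "enrichment" is deliberately trivial:

```python
# Fast Data in miniature: events are delivered one at a time (streaming system)
# and each item is processed the moment it arrives (fast-ingest store).
from collections import deque

stream = deque()   # stand-in for a streaming system
store = {}         # stand-in for a store that processes on ingest

def ingest(event_id, payload):
    # the store processes each item as it arrives (here: a trivial enrichment)
    store[event_id] = {"payload": payload, "length": len(payload)}

# events are delivered and processed item by item, never batched up
for i, payload in enumerate(["temp=21", "temp=22"]):
    stream.append((i, payload))
    ingest(*stream.popleft())

assert store[1]["length"] == 7   # each item was processed immediately on arrival
assert len(stream) == 0          # nothing waits in the pipe
```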
Data Cleaning
Data cleaning, also called data cleansing or scrubbing, deals with detecting and removing errors and inconsistencies from data in order to improve the quality of data.
Sample inconsistencies:
● misspellings during data entry
● missing information
● other invalid data
(Wang, Kon, & Madnick, 1993)
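As a minimal illustration of the three sample inconsistencies above, a hypothetical cleaning pass might look like the following; the column name and the canonical-value table are invented for the example:

```python
# Toy cleaning pass: map misspellings to canonical values, drop records with
# missing or otherwise invalid data.
CANONICAL_CITIES = {"manila": "Manila", "manlia": "Manila", "quezon": "Quezon City"}

def clean(records):
    cleaned = []
    for rec in records:
        city = rec.get("city", "").strip().lower()
        if city not in CANONICAL_CITIES:       # missing or invalid data: drop
            continue
        rec["city"] = CANONICAL_CITIES[city]   # misspellings: map to canonical form
        cleaned.append(rec)
    return cleaned

raw = [{"city": "Manlia"},   # misspelling during data entry
       {"city": ""},         # missing information
       {"city": "quezon"}]   # inconsistent casing
assert [r["city"] for r in clean(raw)] == ["Manila", "Quezon City"]
```

Real pipelines would of course use profiling and rule engines rather than a hand-written lookup, but the detect-then-repair-or-drop shape is the same.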
Data Cleaning Approaches
Phases:
● Data analysis
● Definition of transformation workflow
● Verification
● Data transformation
● Backflow of cleaned data
Tool categories:
● Special domain cleaning
● Specialized cleaning tools
● ETL tools
Event streaming is the digital equivalent of the human
body's central nervous system. It is the technological
foundation for the 'always-on' world where businesses
are increasingly software-defined and automated, and
where the user of software is more software.
(Apache Software Foundation, 2017)
Event streaming capabilities (Apache Software Foundation, 2017):
● Capturing data in real-time from event sources
● Storing these event streams durably for later retrieval and manipulation
● Routing the event streams to different destination technologies as needed
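These three capabilities can be sketched with a toy in-memory event log. Everything here is invented for illustration (a production system would be something like Apache Kafka, with durable on-disk storage rather than a Python list):

```python
# Toy event log: capture events, keep them for later retrieval, route them
# to subscribed destinations.
class EventLog:
    def __init__(self):
        self.log = []       # append-only log: events kept for later retrieval
        self.routes = {}    # topic -> list of destination handlers

    def subscribe(self, topic, handler):
        self.routes.setdefault(topic, []).append(handler)

    def publish(self, topic, event):
        self.log.append((topic, event))              # store (in-memory here)
        for handler in self.routes.get(topic, []):   # route to destinations
            handler(event)

    def replay(self, topic):
        return [e for t, e in self.log if t == topic]

seen = []
bus = EventLog()
bus.subscribe("clicks", seen.append)
bus.publish("clicks", {"user": 1})      # captured and routed in real time
bus.publish("pageviews", {"user": 2})   # different topic, different route

assert seen == [{"user": 1}]                  # routed as it arrived
assert bus.replay("clicks") == [{"user": 1}]  # retrievable later from the log
```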
Data Streams (examples):
● Time Series Data
● Network Traffic
● Telecommunications
● Video Surveillance
● Website Clickstreams
● Sensor Networks
(Miloslavskaya & Tolstoy, 2016)
“When a system processes trillions and trillions of requests, events that normally have a low probability of occurrence are now guaranteed to happen and must be accounted for upfront in the design and architecture of the system.”
(Vogels, 2009)
Werner Vogels
Vice President &
Chief Technology Officer
Amazon.com
No official NoSQL taxonomy exists. "Soft NoSQL" categories with example systems (Tudorica & Bucur, 2011):
● Object Databases: db4o, Versant, Objectivity, Gemstone, Progress, Starcounter, Perst, ZODB, NEO, PicoLisp, Sterling, StupidDB, KiokuDB, Durus
● Grid and Cloud Database Solutions: GigaSpaces, Queplix, Hazelcast, Joafip, GridGain, Infinispan, Coherence, eXtremeScale
● XML Databases: Mark Logic Server, EMC Documentum xDB, Tamino, eXist, Sedna, BaseX, Xindice, Qizx, Berkeley DB XML
● Multivalue Databases: U2, OpenInsight, OpenQM, Globals
● Other NoSQL-related databases: IBM Lotus/Domino, Intersystems Cache, eXtremeDB, ISIS Family, Prevayler, Yserial
Key Value
● Simplest form of database management systems
● Can only store pairs of keys and values, and retrieve values when a key is known
● Normally not adequate for complex applications
● Simplicity makes these attractive in certain circumstances
(Khazaei, 2016)
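The key-value model is small enough to sketch directly; this toy store (all names invented) shows both the operations it supports and, by omission, the ones it doesn't:

```python
# Key-value model in miniature: values are opaque blobs addressed only by key,
# with no query capability beyond get/put/delete.
store = {}

def put(key, value):
    store[key] = value

def get(key, default=None):
    return store.get(key, default)

put("session:42", b'{"user": "ana"}')    # value is just bytes to the store
assert get("session:42") == b'{"user": "ana"}'
assert get("session:99") is None         # lookup by anything other than key is impossible
```

The absence of any "find all sessions for user ana" operation is exactly why the model is simple, fast, and inadequate for complex applications.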
Column-Oriented
● Stores data in records with an ability to hold very large numbers of dynamic columns
● Can be seen as two-dimensional key-value stores
● Schema-free like document stores, although the implementation is significantly different
(Khazaei, 2016)
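The "two-dimensional key-value store" view can be made concrete with a toy sketch (row keys and column names are invented): each cell is addressed by a (row, column) pair, and rows are free to carry different columns.

```python
# Column-family model in miniature: (row key, column name) -> value,
# with dynamic, per-row column sets.
table = {}

def put(row, column, value):
    table[(row, column)] = value

def row(row_key):
    """Assemble a row from whatever columns it happens to have."""
    return {c: v for (r, c), v in table.items() if r == row_key}

put("user:1", "name", "Ana")
put("user:1", "city", "Manila")
put("user:2", "name", "Ben")
put("user:2", "last_login", "2020-09-14")  # dynamic column user:1 doesn't have

assert row("user:1") == {"name": "Ana", "city": "Manila"}
assert row("user:2") == {"name": "Ben", "last_login": "2020-09-14"}
```

Real column stores additionally keep each column family physically together on disk, which is where they diverge sharply from this dict-based picture.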
Document Stores
● Also known as document-oriented database systems
● Schema-free organization: records (or "documents") do not need to have a uniform structure
● The types of the values of individual columns can be different
● Columns can have more than one value (arrays); records can have a nested structure
● Document stores often use internal notations, usually JSON
(Khazaei, 2016)
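The document model's traits (non-uniform records, array values, nesting, JSON notation) fit in a short sketch; the documents and the query helper are invented for illustration:

```python
# Document model in miniature: schema-free, JSON-like records.
import json

docs = [
    {"_id": 1, "name": "Ana",
     "tags": ["admin", "editor"],        # array-valued field
     "address": {"city": "Manila"}},     # nested structure
    {"_id": 2, "name": "Ben"},           # different shape: no uniform schema required
]

def find(pred):
    """Filter documents with an arbitrary predicate (no fixed schema to query)."""
    return [d for d in docs if pred(d)]

hits = find(lambda d: "admin" in d.get("tags", []))
assert [d["_id"] for d in hits] == [1]

# documents round-trip naturally through the store's internal notation (JSON)
assert json.loads(json.dumps(docs[0]))["address"]["city"] == "Manila"
```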
Graph Oriented
● Represent data in graph structures as nodes and edges
● Edges represent relationships between nodes
● Allow easy processing of data in that form (graphs)
● Simple calculation of specific properties of the graph, such as the number of steps needed to get from one node to another
(Khazaei, 2016)
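A toy sketch of the graph model, using the exact property the text mentions: the number of steps between two nodes, computed with a breadth-first search over an invented adjacency list (real graph databases expose this as a query primitive rather than hand-written BFS):

```python
# Graph model in miniature: nodes and directed edges as an adjacency list.
from collections import deque

edges = {"ana": ["ben"], "ben": ["cara"], "cara": ["ana", "dan"], "dan": []}

def steps(start, goal):
    """Fewest edges from start to goal (BFS), or None if unreachable."""
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

assert steps("ana", "dan") == 3      # ana -> ben -> cara -> dan
assert steps("dan", "ana") is None   # edges are directed; dan has no way back
```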
Interrelation between Big Data, Fast Data,
and Data Lake Concepts
(Laskowski, 2016, as cited by Miloslavskaya & Tolstoy, 2016)
Takeaways
● Upskilling is not impossible
● Understand workload management and trade-offs when making architecture decisions
● Don't be afraid to work with other people
● Experiment, experiment, experiment…
● Be a wide reader and a hungry learner
Credits
Special thanks to all the people who made
and released these awesome resources for
free:
⬡ Presentation template by SlidesCarnival
⬡ Photographs by Unsplash
References
Anaconda (2020). 2020 State of Data Science: Moving from hype toward maturity.
Anderson, J. (2017). Data Engineering Teams: Creating Successful Big Data Teams and Products.
Anderson, J. (2018, April 11). Data engineers vs. data scientists. O'Reilly Media. https://www.oreilly.com/radar/data-engineers-vs-data-scientists/
Apache Software Foundation. (2017). Introduction: Everything you need to know about Kafka in 10 minutes. Apache Kafka. https://kafka.apache.org/intro
Brewer, E. (2001). Lessons from giant-scale services. IEEE Internet Computing, 5(4), 46-55. https://dx.doi.org/10.1109/4236.939450
Brewer, E. (2012). CAP twelve years later: How the "rules" have changed. Computer, 45(2), 23-29. https://dx.doi.org/10.1109/mc.2012.37
Chen, M., Mao, S., & Liu, Y. (2014). Big data: A survey. Mobile Networks and Applications, 19(2), 171-209.
Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1), 107-113. https://dx.doi.org/10.1145/1327452.1327492
De Mauro, A., Greco, M., Grimaldi, M., & Ritala, P. (2018). Human resources for Big Data professions: A systematic classification of job roles and required skill sets. Information Processing & Management, 54(5), 807-817.
Devopedia. (2020). "CAP Theorem." Version 4, April 30. Accessed 2020-09-14. https://devopedia.org/cap-theorem
Dixon, J. (2010, October 14). Pentaho, Hadoop, and Data Lakes. James Dixon's Blog. https://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes/
Feng, Z., Hui-Feng, X., Dong-Sheng, X., Yong-Heng, Z., & Fei, Y. (2013). Big data cleaning algorithms in cloud computing. International Journal of Online Engineering, 9(3), 77-81. https://doi.org/10.3991/ijoe.v9i3.2765
Gray, J., & Shenoy, P. (2000). Rules of thumb in data engineering. Proceedings - International Conference on Data Engineering, 3-10. https://doi.org/10.1109/icde.2000.839382
Ishwarappa, & Anuradha, J. (2015). A brief introduction on big data 5Vs characteristics and Hadoop technology. Procedia Computer Science, 48(C), 319-324. https://doi.org/10.1016/j.procs.2015.04.188
Jimenez-Marquez, J., Gonzalez-Carrasco, I., Lopez-Cuadrado, J., & Ruiz-Mezcua, B. (2019). Towards a big data framework for analyzing social media content. International Journal of Information Management, 44, 1-12. https://dx.doi.org/10.1016/j.ijinfomgt.2018.09.003
Khazaei, H. (2016). How do I choose the right NoSQL solution? Big Data, X(0), 1–33.
Laskowski, N. (2016). Data lake governance: A big data do or die. http://searchcio.techtarget.com/feature/Data-lake-governance-A-big-data-do-or-die (access date 28/05/2016)
Marz, N., & Warren, J. (2015). Big Data: Principles and best practices of scalable real-time data systems. New York: Manning Publications Co.
Miao, H., Li, A., Davis, L. S., & Deshpande, A. (2017, April). Towards unified data and lifecycle management for deep learning. In 2017 IEEE 33rd International Conference on Data Engineering (ICDE) (pp. 571-582). IEEE.
Miloslavskaya, N., & Tolstoy, A. (2016). Big Data, Fast Data and Data Lake concepts. Procedia Computer Science, 88, 300-305. https://doi.org/10.1016/j.procs.2016.07.439
Najafabadi, M. M., Villanustre, F., Khoshgoftaar, T. M., Seliya, N., Wald, R., & Muharemagic, E. (2015). Deep learning applications and challenges in big data analytics. Journal of Big Data, 2(1), 1.
Rahm, E., & Do, H. H. (2000). Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 23(4), 3-13.
Samizadeh, I. (2018, March 15). A brief introduction to two data processing architectures - Lambda and Kappa for Big Data. https://towardsdatascience.com/a-brief-introduction-to-two-data-processing-architectures-lambda-and-kappa-for-big-data-4f35c28005bb
Tudorica, B., & Bucur, C. (2011). A comparison between several NoSQL databases with comments and notes. 2011 RoEduNet International Conference 10th Edition: Networking in Education and Research, 1-5. https://dx.doi.org/10.1109/roedunet.2011.5993686
Van Der Aalst, W. (2016). Data science in action. In Process Mining (pp. 3-23). Springer, Berlin, Heidelberg.
Vogels, W. (2009). Eventually consistent. Communications of the ACM, 52(1), 40-44. https://dx.doi.org/10.1145/1435417.1435432
Wang, R. Y., Kon, H. B., & Madnick, S. E. (1993). Data quality requirements analysis and modeling. Proceedings - International Conference on Data Engineering, 670-677. https://doi.org/10.1109/icde.1993.344012
Yin, S., & Kaynak, O. (2015). Big data for modern industry: Challenges and trends [point of view]. Proceedings of the IEEE, 103(2), 143-146.