The role of data engineering in data science and analytics practice

Joseph Benjamin ILAGAN
Ateneo de Manila University
jbilagan@ateneo.edu

Hello!
I am Joben Ilagan
Thank you for having me here.
Twitter: @jilagan
LinkedIn: https://www.linkedin.com/in/jobenilagan/

1. Larger Context
Transition from previous talk

A data engineering team is NOT a collection of Data Engineers
A data engineering team isn’t made up of a single type of person or title
A data engineering team is multidisciplinary
Data Engineering Team
(Anderson, 2017)

Creates data pipelines
Brings together 10-30 different big data technologies
Understands and chooses the right tools for the job
Understands the various technologies and frameworks in depth
Combines them to create solutions to enable a company’s business processes with data pipelines
Data Engineering Team
(Anderson, 2018)

Data science is an interdisciplinary field aiming to turn data into real value.
Data may be structured or unstructured, big or small, static or streaming.
Value may be provided in the form of predictions, automated decisions, models learned from data, or any type of data visualization delivering insights.
Data science includes data extraction, data preparation, data exploration, data transformation, storage and retrieval, computing infrastructures, various types of mining and learning, presentation of explanations and predictions, and the exploitation of results taking into account ethical, social, legal, and business aspects.
Where does data engineering fit in the context of data analytics?
(Van Der Aalst, 2016)

The ingredients contributing to data science (Van Der Aalst, 2016)

The Internet of Events (Van Der Aalst, 2016)

Alluvial diagram of Big Data job families vs. Big Data skill sets
(De Mauro, Greco, Grimaldi, & Ritala, 2018)

Word cloud showing the top 50 words recurring in the Job Title of posts related to Big Data.
The font size of each word is proportional to the number of occurrences of each word.
(De Mauro, Greco, Grimaldi, & Ritala, 2018).

2. How do I get started in Data Engineering?
Process, Skills, Tools

“A Data Engineer is someone who has specialized their skills in creating software solutions around data.”
(Anderson, 2017)
Jesse Anderson
Data Engineer
Managing Director, Big Data Institute

“I started getting into Data about 5 years ago when everyone started talking about it and was becoming the new buzz word. People were actually starting to realize how much you can do with data. Everyone wanted to learn Data Science and all the companies wanted to get into Machine Learning or Artificial Intelligence, but there was still a missing piece - that's when I learned about Data Engineering.”
Miles Ong
Data Engineer
Kumu

“There was always the problem of collecting the data, processing the analysis and actually implementing the insights. It was overwhelming at first because I had no clue on where to begin. What helped is to focus on things one at a time and actually trying things out. For Data Engineering, the only way to learn is by doing.”
Miles Ong
Data Engineer
Kumu

“The more I learned, the more I realized how powerful Data Engineering is. It would give me the capability to simply come up with an idea and actually implement it. It's a very underrated field, but I love the challenge of conceptualizing, building and implementing concrete solutions that make a difference.”
Miles Ong
Data Engineer
Kumu

Steps to Data Engineering (Anderson, 2017)
PRE-PROJECT
FORM TEAM
USE CASE
KNOW GAPS
TRAIN, MENTOR
CHOOSE TECH
WRITE CODE
EVALUATE ITERATION
REPEAT

3. Big Data
Concepts and Applications

Big Data (Working Definition)
Big data is a field that treats ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by traditional data-processing application software.

VOLUME
VARIETY
VELOCITY
…
(more on this in the
next slide...)

Author's reinterpretation of The 5Vs of Big Data
(Ishwarappa & Anuradha, 2015; Yin & Kaynak, 2015)
THE 5Vs OF BIG DATA
VOLUME: Terabytes; Records; Architecture; Transactions; Tables, Files
VARIETY: Structured; Unstructured; Text, Image, video, social relations; Multi-factor; Probabilistic
VELOCITY: Batch; Real/near-time; Processes; Streams
VERACITY: Trustworthiness; Authenticity; Origin, Reputation; Availability; Accountability
VALUE: Statistical; Events; Correlations; Hypothetical; Fresh? Old?

Workload Management
Servers
Queues (or Partitions)
Service vs Wait Times
Concurrency
Serial vs Parallel
Synchronous vs Asynchronous
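
To make the serial vs parallel distinction concrete, here is a minimal sketch that pushes the same queue of jobs through one server and then through a pool of four. It is not from the talk: `service`, the job list, the simulated service time, and the worker count are all hypothetical values chosen for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def service(job: int) -> int:
    """Hypothetical job: wait for a fixed service time, then return a result."""
    time.sleep(0.05)  # simulated service time per job
    return job * job


jobs = list(range(8))  # the queue of waiting work

# Serial: a single server handles the jobs one after another.
start = time.perf_counter()
serial_results = [service(j) for j in jobs]
serial_seconds = time.perf_counter() - start

# Parallel: four servers drain the same queue concurrently.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel_results = list(pool.map(service, jobs))  # map keeps input order
parallel_seconds = time.perf_counter() - start

print(f"serial:   {serial_seconds:.2f}s")
print(f"parallel: {parallel_seconds:.2f}s")
```

Both runs produce the same results; only the wait time changes. The trade-off is that the parallel version needs more servers and assumes the jobs do not depend on each other.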

Flynn's Taxonomy
SISD: Single Instruction, Single Data systems
SIMD: Single Instruction, Multiple Data systems
MIMD: Multiple Instruction, Multiple Data systems (multi-processors)
MISD: Multiple Instruction, Single Data systems (pipeline?)

Traditional Way
(Edureka! https://www.edureka.co/blog/mapreduce-tutorial/)

Map Reduce
(Edureka! https://www.edureka.co/blog/mapreduce-tutorial/)
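
The MapReduce idea can be sketched in a few lines of plain Python using the classic word-count example. This is a single-process illustration of the map, shuffle, and reduce phases only, not Hadoop code; the function names and the input lines are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(line: str):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


# Illustrative input; in a real cluster each line could live on a different node.
lines = ["Deer Bear River", "Car Car River", "Deer Car Bear"]

mapped = [pair for line in lines for pair in map_phase(line)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(word_counts)
```

Because each call to `map_phase` looks at one line only and each call to `reduce_phase` looks at one key only, both phases can be spread across many machines.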

Adapted from Image by Cloudera

Hadoop Ecosystem (Edureka!, https://www.edureka.co/blog/hadoop-ecosystem )

Sample AWS Data Lake Platform
(AWS, https://docs.aws.amazon.com/whitepapers/latest/building-data-lakes/building-data-lake-aws.html)

Data Integration and Big Data Analytics Framework (Jimenez-Marquez, Gonzalez-Carrasco, Lopez-Cuadrado, & Ruiz-Mezcua, 2019)

First Stage phases in detail using Yelp as example
(Jimenez-Marquez, Gonzalez-Carrasco, Lopez-Cuadrado, & Ruiz-Mezcua, 2019)

Deep Learning Modeling Lifecycle (Miao, Li, Davis, & Deshpande, 2017)

“If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”
(Dixon, 2010; Miloslavskaya & Tolstoy, 2016)
James Dixon
Chief Technology Officer
Pentaho

A data lake refers to a massively scalable storage repository that holds a vast amount of raw data in its native format («as is») until it is needed, plus processing systems (engine) that can ingest data without compromising the data structure
(Laskowski, 2016, as cited by Miloslavskaya & Tolstoy, 2016)
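
One way to picture "raw data in its native format until it is needed" is the sketch below, where records are stored untouched and structure is applied only at read time (often called schema-on-read, a term not used on the slide). The records, field names, and `read_purchases` helper are hypothetical.

```python
import json

# Hypothetical raw events, kept exactly as the sources produced them.
raw_lake = [
    '{"user": "ana", "action": "view", "ms": 120}',
    '{"user": "ben", "action": "buy", "amount": 250.0, "currency": "PHP"}',
    '{"sensor": "t-01", "celsius": 31.5}',
]


def read_purchases(lake):
    """Impose a structure only when a consumer asks for purchase data."""
    for line in lake:
        record = json.loads(line)
        if record.get("action") == "buy":
            yield {"user": record["user"], "amount": record["amount"]}


purchases = list(read_purchases(raw_lake))
print(purchases)
```

The sensor reading and the page view stay in the lake unchanged, so a different consumer can later read them with a different structure in mind.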

Three types of big data processing
Batch Processing
Stream Processing (Kappa Architecture)
Hybrid Processing (Lambda Architecture)
(Marz & Warren, 2015)
(Miloslavskaya & Tolstoy, 2016; Samizadeh, 2018, March 15)

Kappa Architecture
(Miloslavskaya & Tolstoy, 2016; Samizadeh, 2018, March 15)

Lambda Architecture
(Miloslavskaya & Tolstoy, 2016; Samizadeh, 2018, March 15)
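
A rough sketch of the hybrid idea behind the Lambda Architecture: a batch view computed over the full history, a real-time view over events that arrived since the last batch run, and a query that merges the two. The event data and the names `build_view` and `query` are assumptions for illustration, not part of any specific framework.

```python
from collections import Counter

# Hypothetical page-view events as (page, count) pairs.
master_dataset = [("home", 1), ("about", 1), ("home", 1)]  # covered by the last batch run
recent_events = [("home", 1), ("pricing", 1)]              # arrived after that run


def build_view(events):
    """Aggregate events into a page -> total views lookup."""
    view = Counter()
    for page, count in events:
        view[page] += count
    return view


batch_view = build_view(master_dataset)    # batch layer: complete but stale
realtime_view = build_view(recent_events)  # speed layer: fresh but partial


def query(page: str) -> int:
    """Serving side: merge the batch and real-time views at query time."""
    return batch_view[page] + realtime_view[page]


print(query("home"), query("pricing"))
```

In a Kappa-style design the batch path is dropped and everything, including reprocessing of history, goes through the streaming path.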

What is Fast Data?
Fast data corresponds to the application of big data analytics to smaller data sets in near-real time or real time in order to solve a particular problem.
The combination of in-memory databases and data grids on top of flash devices will allow an increase in the capacity of stream processing.
Fast data is a complementary approach to big data for managing large quantities of «in-flight» data

Fast Data requires two technologies
A streaming system capable of delivering events as fast as they come in
A data store capable of processing each item as fast as it arrives
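
The two pieces can be sketched together: a source that yields events one at a time, and an in-memory store that updates its answer as each item arrives. The sensor readings, the `InMemoryStore` class, and the window size are hypothetical; a production system would use a real streaming platform and an in-memory database instead.

```python
from collections import deque


def event_stream():
    """Hypothetical streaming source: yields (sensor_id, reading) events in arrival order."""
    yield from [("s1", 20.0), ("s2", 30.0), ("s1", 22.0), ("s1", 27.0), ("s2", 31.0)]


class InMemoryStore:
    """Keeps only the latest few readings per key and updates on every event."""

    def __init__(self, window: int = 2):
        self.window = window
        self.recent = {}

    def process(self, key, value):
        readings = self.recent.setdefault(key, deque(maxlen=self.window))
        readings.append(value)  # the oldest reading drops out automatically
        return sum(readings) / len(readings)  # rolling average, available immediately


store = InMemoryStore(window=2)
rolling = [(key, store.process(key, value)) for key, value in event_stream()]
print(rolling)
```

Each event is handled once, on arrival, and only a small window of «in-flight» data is held in memory.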

Data Cleaning
Data cleaning, also called data cleansing or scrubbing, deals with detecting and removing errors and inconsistencies from data in order to improve the quality of data.
Sample inconsistencies:
● misspellings during data entry
● missing information
● other invalid data
(Wang, Kon, & Madnick, 1993)
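
The three sample inconsistencies can be illustrated with a small cleaning pass. This is a minimal sketch under stated assumptions: the records, the `KNOWN_MISSPELLINGS` lookup table, and the valid age range are invented for the example and are not taken from the cited paper.

```python
# Hypothetical records showing the three kinds of problems listed above.
records = [
    {"name": "Ana", "city": "Quezon City", "age": "34"},
    {"name": "Ben", "city": "Quezon Ctiy", "age": "29"},  # misspelling during data entry
    {"name": "Cara", "city": None, "age": "41"},          # missing information
    {"name": "Dan", "city": "Makati", "age": "-5"},       # other invalid data
]

KNOWN_MISSPELLINGS = {"Quezon Ctiy": "Quezon City"}  # assumed lookup table


def clean(record):
    """Return (cleaned_record, issues) for one input record."""
    issues = []
    cleaned = dict(record)

    city = cleaned["city"]
    if city is None:
        issues.append("missing city")
    elif city in KNOWN_MISSPELLINGS:
        cleaned["city"] = KNOWN_MISSPELLINGS[city]
        issues.append("corrected city spelling")

    age = int(cleaned["age"])
    if 0 <= age <= 120:  # assumed valid range
        cleaned["age"] = age
    else:
        cleaned["age"] = None
        issues.append("invalid age")

    return cleaned, issues


results = [clean(r) for r in records]
for cleaned, issues in results:
    print(cleaned, issues)
```

Returning the list of issues alongside the cleaned record keeps the detection step visible, so that errors are not silently removed.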

Data Cleaning Approaches
Data Analysis
Definition of Transformation Workflow
Data Verification
Data Transformation
Backflow of Cleaned Data
Special domain cleaning
Specialized cleaning tools
ETL Tools

Event streaming is the digital equivalent of the human body's central nervous system. It is the technological foundation for the 'always-on' world where businesses are increasingly software-defined and automated, and where the user of software is more software.
(Apache Software Foundation, 2017)

Capture data in real time from event sources
Store these event streams durably for later retrieval and manipulation
Route the event streams to different destination technologies as needed
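
The three activities (capture, store, route) can be sketched with an append-only log held in memory. This is an illustration of the concept only and does not use the Apache Kafka API; `EventLog`, `route`, and the sample events are hypothetical.

```python
class EventLog:
    """Append-only log: every stored event gets a position (offset)."""

    def __init__(self):
        self._events = []

    def append(self, event: dict) -> int:
        """Store one captured event and return its offset."""
        self._events.append(event)
        return len(self._events) - 1

    def read_from(self, offset: int):
        """Retrieve all events from a given offset onward, for later processing."""
        return self._events[offset:]


def route(event: dict, destinations: dict) -> None:
    """Send an event to the destination that matches its type."""
    destinations.setdefault(event["type"], []).append(event)


log = EventLog()
incoming = [
    {"type": "click", "page": "/home"},
    {"type": "payment", "amount": 100},
    {"type": "click", "page": "/cart"},
]
for event in incoming:
    log.append(event)  # capture and store

destinations = {}
for event in log.read_from(0):
    route(event, destinations)  # route to per-type destinations

replayed = log.read_from(1)  # a consumer can re-read from any earlier offset
print(destinations)
```

Because the log is never modified in place, a new destination added later can replay the stream from the beginning.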

Data Streams (examples)
Time Series Data
Network Traffic
Telecommunications
Video Surveillance
Website Clickstreams
Sensor Networks
(Miloslavskaya & Tolstoy, 2016)

Illustration of Data Capture Equipment (Chen, Mao, & Liu, 2014)

CAP Theorem
Consistency, Availability, Partition Tolerance Trade-offs

Visualization of Eric Brewer's CAP Theorem (Brewer, 2012; Khazaei, 2016)

Big Data characteristics and NoSQL features (Khazaei, 2016)

“When a system processes trillions and trillions of requests, events that normally have a low probability of occurrence are now guaranteed to happen and must be accounted for upfront in the design and architecture of the system.”
(Vogels, 2009)
Werner Vogels
Vice President & Chief Technology Officer
Amazon.com
Source: Wikipedia

ACID vs BASE
ACID
● Atomicity
● Consistency
● Isolation
● Durability
BASE
● Basically Available
● Soft state
● Eventual consistency
(Brewer, 2012; Tudorica & Bucur, 2011)
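
To show what atomicity and consistency mean in practice on the ACID side, here is a small sketch using Python's built-in `sqlite3` module. The `accounts` table, the balances, and the `transfer` helper are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("ana", 100), ("ben", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money atomically: both updates apply, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the CHECK constraint rejected a negative balance


ok = transfer(conn, "ana", "ben", 30)
failed = transfer(conn, "ana", "ben", 500)  # would overdraw, so it is rolled back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

In the failed transfer the credit to the destination had already been executed when the debit was rejected, yet the rollback leaves no trace of it. A BASE-style store would instead accept the writes and reconcile replicas later.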

NoSQL
Survey, Comparisons, Taxonomies

No official NoSQL taxonomy exists: Core NoSQL, with examples
Wide Column Store: Hadoop / HBase, Cassandra, Hypertable, Cloudata, Amazon SimpleDB, SciDB
Document Store: CouchDB, MongoDB, Terrastore, ThruDB, OrientDB, RavenDB, Citrusleaf, SisoDB, CloudKit, Persevere, Jackrabbit
Key Value/Tuple Store: Azure Table Storage, MEMBASE, Riak, Redis, Chordless, GenieDB, Scalaris, Tokyo Cabinet / Tyrant, GT.M, Keyspace, Berkeley DB, MemcacheDB, HamsterDB, Faircom CTree, Mnesia, LightCloud, Pincaster, Hibari, Scality
Eventually-Consistent Key Value Store: Amazon Dynamo, Voldemort, Dynomite, KAI, SubRecord, Mo8onDb, Dovetaildb
Graph Database: Neo4J, Infinite Graph, Sones, InfoGrid, HyperGraphDB, Trinity, AllegroGraph, Bigdata, DEX, OpenLink Virtuoso, VertexDB, FlockDB, Java Universal Network / Graph Framework, Sesame, Filament, OWLim, NetworkX, iGraph
(Tudorica & Bucur, 2011)

No official NoSQL taxonomy exists: Soft NoSQL, with examples
Object Databases: db4o, Versant, Objectivity, Gemstone, Progress, Starcounter, Perst, ZODB, NEO, PicoLisp, Sterling, StupidDB, KiokuDB, Durus
Grid and Cloud Database Solutions: GigaSpaces, Queplix, Hazelcast, Joafip, GridGain, Infinispan, Coherence, eXtremeScale
XML Databases: Mark Logic Server, EMC Documentum xDB, Tamino, eXist, Sedna, BaseX, Xindice, Qizx, Berkeley DB XML
Multivalue Databases: U2, OpenInsight, OpenQM, Globals
Other NoSQL related databases: IBM Lotus/Domino, Intersystems Cache, eXtremeDB, ISIS Family, Prevayler, Yserial
(Tudorica & Bucur, 2011)

Key Value
Simplest form of database management systems
Can only store pairs of keys and values, as well as retrieve values when a key is known
Normally not adequate for complex applications
Simplicity makes these attractive in certain circumstances
(Khazaei, 2016)
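
A key-value store can be approximated with a thin wrapper around a dictionary. This minimal sketch only shows the interface (put, get, delete by key); `KeyValueStore` and the sample keys are hypothetical, and real products add persistence, replication, and expiry.

```python
class KeyValueStore:
    """Stores opaque values under keys; lookups are possible only by key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)


kv = KeyValueStore()
kv.put("session:42", {"user": "ana", "cart": ["book"]})
kv.put("session:43", {"user": "ben", "cart": []})

print(kv.get("session:42"))
print(kv.get("session:99", default="not found"))
```

There is no way to ask "which sessions belong to ana?" without scanning every key, which is why this model is usually not adequate for complex applications.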

Column-Oriented
Stores data in records with an ability to hold very large numbers of dynamic columns
Can be seen as two-dimensional key-value stores
Schema-free like document stores, however the implementation is significantly different
(Khazaei, 2016)
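
The "two-dimensional key-value store" view can be sketched as a mapping from row key to a second mapping from column name to value. `WideColumnTable` and the sample rows are assumptions for illustration and say nothing about how real column stores lay out data on disk.

```python
from collections import defaultdict


class WideColumnTable:
    """Row key -> {column name -> value}; each row may have different columns."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        self._rows[row_key][column] = value

    def get(self, row_key, column, default=None):
        return self._rows.get(row_key, {}).get(column, default)

    def columns(self, row_key):
        return sorted(self._rows.get(row_key, {}))


table = WideColumnTable()
table.put("user:ana", "email", "ana@example.com")
table.put("user:ana", "login:2020-09-14", "ok")  # columns can be created on the fly
table.put("user:ben", "phone", "555-0100")

print(table.columns("user:ana"))
print(table.get("user:ben", "email", default="(no value)"))
```

Neither row declares its columns in advance, which is the schema-free property the slide refers to.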

Document Stores
Also known as document-oriented database systems
Schema-free organization
Records (or "documents") do not need to have a uniform structure
The types of the values of individual columns can be different
Columns can have more than one value (arrays); records can have a nested structure
Document stores often use internal notations, usually JSON.
(Khazaei, 2016)
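
The properties above (non-uniform records, differing value types, arrays, nesting, JSON notation) can be seen in a short sketch. The documents and the `find` helper with its dotted-path syntax are hypothetical and do not correspond to the query API of any particular document store.

```python
import json

# Hypothetical documents: no uniform structure, mixed value types, nesting.
documents = [
    {"_id": 1, "name": "Ana", "tags": ["admin", "editor"]},                  # array value
    {"_id": 2, "name": "Ben", "address": {"city": "Makati", "zip": "1200"}},  # nested record
    {"_id": 3, "name": "Cara", "tags": "guest"},                             # same field, different type
]


def find(docs, path, value):
    """Return documents whose field at a dotted path (e.g. 'address.city') equals value."""
    matches = []
    for doc in docs:
        current = doc
        for part in path.split("."):
            if not isinstance(current, dict) or part not in current:
                current = None  # the path does not exist in this document
                break
            current = current[part]
        if current is not None and current == value:
            matches.append(doc)
    return matches


in_makati = find(documents, "address.city", "Makati")
serialized = json.dumps(documents[1])  # documents map directly to JSON text
print(in_makati)
print(serialized)
```

Documents without an `address` field are simply skipped by the query rather than causing an error.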

Graph Oriented
Represent data in graph structures as nodes and edges
Edges represent relationships between nodes
Allow easy processing of data in that form (graphs)
Simple calculation of specific properties of the graph, such as the number of steps needed to get from one node to another node
(Khazaei, 2016)
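
The "number of steps needed to get from one node to another" can be computed with a breadth-first search. This sketch uses a plain adjacency list rather than a graph database; the graph, the node names, and `steps_between` are assumptions for illustration.

```python
from collections import deque

# Hypothetical directed graph: node -> nodes it has an edge to.
graph = {
    "ana": ["ben", "cara"],
    "ben": ["dan"],
    "cara": ["dan"],
    "dan": ["eve"],
    "eve": [],
    "fay": [],
}


def steps_between(graph, start, goal):
    """Breadth-first search: fewest edges from start to goal, or None if unreachable."""
    if start == goal:
        return 0
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        node, distance = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor == goal:
                return distance + 1
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, distance + 1))
    return None


print(steps_between(graph, "ana", "eve"))
print(steps_between(graph, "ana", "fay"))
```

The same question asked of a relational schema would need a chain of self-joins whose length is not known in advance.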

NoSQL Solutions (Khazaei, 2016)

Interrelation between Big Data, Fast Data,
and Data Lake Concepts

Takeaways
Upskilling is not impossible
Understand workload management and trade-offs when making architecture decisions
Don't be afraid to work with other people
Experiment, experiment, experiment…
Be a wide reader and hungry learner

Thanks!
Any questions?
You can email me at:
jbilagan@ateneo.edu

Credits
Special thanks to all the people who made and released these awesome resources for free:
⬡ Presentation template by SlidesCarnival
⬡ Photographs by Unsplash

References
Anaconda. (2020). 2020 State of Data Science: Moving from hype toward maturity.
Anderson, J. (2017). Data Engineering Teams: Creating Successful Big Data Teams and Products.
Anderson, J. (2018, April 11). Data engineers vs. data scientists. O'Reilly Media. https://www.oreilly.com/radar/data-engineers-vs-data-scientists/
Apache Software Foundation. (2017). Introduction: Everything you need to know about Kafka in 10 minutes. Apache Kafka. https://kafka.apache.org/intro
Brewer, E. (2001). Lessons from giant-scale services. IEEE Internet Computing, 5(4), 46-55. https://dx.doi.org/10.1109/4236.939450
Brewer, E. (2012). CAP twelve years later: How the “rules” have changed. Computer, 45(2), 23-29. https://dx.doi.org/10.1109/mc.2012.37
Chen, M., Mao, S., & Liu, Y. (2014). Big data: A survey. Mobile Networks and Applications, 19(2), 171-209.
Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1), 107-113. https://dx.doi.org/10.1145/1327452.1327492
De Mauro, A., Greco, M., Grimaldi, M., & Ritala, P. (2018). Human resources for Big Data professions: A systematic classification of job roles and required skill sets. Information Processing & Management, 54(5), 807-817.
Devopedia. (2020). CAP Theorem. Version 4, April 30. Accessed 2020-09-14. https://devopedia.org/cap-theorem
Dixon, J. (2010, October 14). Pentaho, Hadoop, and Data Lakes. James Dixon’s Blog. https://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes/
Feng, Z., Hui-Feng, X., Dong-Sheng, X., Yong-Heng, Z., & Fei, Y. (2013). Big data cleaning algorithms in cloud computing. International Journal of Online Engineering, 9(3), 77–81. https://doi.org/10.3991/ijoe.v9i3.2765
Gray, J., & Shenoy, P. (2000). Rules of thumb in data engineering. Proceedings - International Conference on Data Engineering, 3–10. https://doi.org/10.1109/icde.2000.839382

Ishwarappa, & Anuradha, J. (2015). A brief introduction on big data 5Vs characteristics and hadoop technology. Procedia Computer Science, 48(C), 319–324. https://doi.org/10.1016/j.procs.2015.04.188
Jimenez-Marquez, J., Gonzalez-Carrasco, I., Lopez-Cuadrado, J., & Ruiz-Mezcua, B. (2019). Towards a big data framework for analyzing social media content. International Journal of Information Management, 44, 1-12. https://dx.doi.org/10.1016/j.ijinfomgt.2018.09.003
Khazaei, H. (2016). How do I choose the right NoSQL solution? Big Data, X(0), 1–33.
Laskowski, N. (2016). Data lake governance: A big data do or die. URL: http://searchcio.techtarget.com/feature/Data-lake-governance-A-big-data-do-or-die (access date 28/05/2016)
Marz, N., & Warren, J. (2015). Big Data: Principles and best practices of scalable real-time data systems. New York: Manning Publications Co.
Miao, H., Li, A., Davis, L. S., & Deshpande, A. (2017, April). Towards unified data and lifecycle management for deep learning. In 2017 IEEE 33rd International Conference on Data Engineering (ICDE) (pp. 571-582). IEEE.
Miloslavskaya, N., & Tolstoy, A. (2016). Big Data, Fast Data and Data Lake Concepts. Procedia Computer Science, 88, 300–305. https://doi.org/10.1016/j.procs.2016.07.439
Najafabadi, M. M., Villanustre, F., Khoshgoftaar, T. M., Seliya, N., Wald, R., & Muharemagic, E. (2015). Deep learning applications and challenges in big data analytics. Journal of Big Data, 2(1), 1.
Rahm, E., & Do, H. H. (2000). Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 23(4), 3-13.
Samizadeh, I. (2018, March 15). A brief introduction to two data processing architectures - Lambda and Kappa for Big Data. https://towardsdatascience.com/a-brief-introduction-to-two-data-processing-architectures-lambda-and-kappa-for-big-data-4f35c28005bb

Tudorica, B., & Bucur, C. (2011). A comparison between several NoSQL databases with comments and notes. 2011 RoEduNet International Conference 10th Edition: Networking in Education and Research, 1-5. https://dx.doi.org/10.1109/roedunet.2011.5993686
Van Der Aalst, W. (2016). Data science in action. In Process mining (pp. 3-23). Springer, Berlin, Heidelberg.
Vogels, W. (2009). Eventually consistent. Communications of the ACM, 52(1), 40-44. https://dx.doi.org/10.1145/1435417.1435432
Wang, R. Y., Kon, H. B., & Madnick, S. E. (1993). Data quality requirements analysis and modeling. Proceedings - International Conference on Data Engineering, 670–677. https://doi.org/10.1109/icde.1993.344012
Yin, S., & Kaynak, O. (2015). Big data for modern industry: challenges and trends [point of view]. Proceedings of the IEEE, 103(2), 143-146.
