The Role of Data Engineering in Data Science and Analytics Practice
Joseph Benjamin ILAGAN
Ateneo de Manila University
jbilagan@ateneo.edu
Hello!
I am Joben Ilagan
Thank you for having me here.
Twitter: @jilagan
LinkedIn: https://www.linkedin.com/in/jobenilagan/
1. Larger Context
Transition from previous talk
(Anaconda, 2020, p.11)
(Anaconda, 2020, p.12)
Data Engineering Team (Anderson, 2017)
● A data engineering team is NOT a collection of data engineers
● A data engineering team isn’t made up of a single type of person or title
● A data engineering team is multidisciplinary
Data Engineering Team (Anderson, 2018)
● Creates data pipelines
● Brings together 10-30 different big data technologies
● Understands and chooses the right tools for the job
● Understands the various technologies and frameworks in depth
● Combines them to create solutions to enable a company’s business processes with data pipelines
Where does data engineering fit in the context of data analytics?
Data science is an interdisciplinary field aiming to turn data into real value. Data may be structured or unstructured, big or small, static or streaming. Value may be provided in the form of predictions, automated decisions, models learned from data, or any type of data visualization delivering insights. Data science includes data extraction, data preparation, data exploration, data transformation, storage and retrieval, computing infrastructures, various types of mining and learning, presentation of explanations and predictions, and the exploitation of results taking into account ethical, social, legal, and business aspects. (Van Der Aalst, 2016)
The ingredients contributing to data science (Van Der Aalst, 2016)
The Internet of Events (Van Der Aalst, 2016)
Alluvial diagram of Big Data job families vs. Big Data skill sets (De Mauro, Greco, Grimaldi, & Ritala, 2018)
Word cloud showing the top 50 words recurring in the Job Title of posts related to Big Data. The font size of each word is proportional to the number of occurrences of each word. (De Mauro, Greco, Grimaldi, & Ritala, 2018)
(Anaconda, 2020, p.13)
(Anaconda, 2020, p.26)
2. How do I get started in Data Engineering?
Process, Skills, Tools
“A Data Engineer is someone who has specialized their skills in creating software solutions around data.” (Anderson, 2017)
Jesse Anderson, Data Engineer, Managing Director, Big Data Institute
“I started getting into Data about 5 years ago when everyone started talking about it and was becoming the new buzz word. People were actually starting to realize how much you can do with data. Everyone wanted to learn Data Science and all the companies wanted to get into Machine Learning or Artificial Intelligence, but there was still a missing piece - that's when I learned about Data Engineering.”
Miles Ong, Data Engineer, Kumu
“There was always the problem of collecting the data, processing the analysis and actually implementing the insights. It was overwhelming at first because I had no clue on where to begin. What helped is to focus on things one at a time and actually trying things out. For Data Engineering, the only way to learn is by doing.”
Miles Ong, Data Engineer, Kumu
“The more I learned, the more I realized how powerful Data Engineering is. It would give me the capability to simply come up with an idea and actually implement it. It's a very underrated field, but I love the challenge of conceptualizing, building and implementing concrete solutions that make a difference.”
Miles Ong, Data Engineer, Kumu
Steps to Data Engineering (Anderson, 2017), in order:
● PRE-PROJECT
● FORM TEAM
● USE CASE
● KNOW GAPS
● TRAIN, MENTOR
● CHOOSE TECH
● WRITE CODE
● EVALUATE ITERATION
● REPEAT
3. Big Data
Concepts and Applications
Big Data (Working Definition)
Big data is a field that treats ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by traditional data-processing application software.
The 5Vs of Big Data
Author's reinterpretation of The 5Vs of Big Data (Ishwarappa & Anuradha, 2015; Yin & Kaynak, 2015)
● VOLUME: Terabytes; Records; Architecture; Transactions; Tables, Files
● VARIETY: Structured; Unstructured; Text, Image, video, social relations; Multi-factor; Probabilistic
● VELOCITY: Batch; Real/near-time; Processes; Streams
● VERACITY: Trustworthiness; Authenticity; Origin, Reputation; Availability; Accountability
● VALUE: Statistical; Events; Correlations; Hypothetical; Fresh? Old?
Workload Management
● Servers
● Queues (or Partitions)
● Service vs Wait Times
● Concurrency
● Serial vs Parallel
● Synchronous vs Asynchronous
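To make the relationship between servers, queues, and wait times concrete, here is a minimal sketch. It assumes every job arrives at time 0 and is served first-come, first-served from a single queue; the function name `average_wait` and the job durations are illustrative, not taken from the talk.

```python
import heapq


def average_wait(service_times, servers):
    """Average time jobs spend waiting in one FIFO queue.

    Simplifying assumption: all jobs arrive at time 0.
    """
    # Each heap entry is the time at which one server becomes free.
    free_at = [0.0] * servers
    heapq.heapify(free_at)
    total_wait = 0.0
    for service in service_times:
        start = heapq.heappop(free_at)  # earliest available server
        total_wait += start             # wait = start time - arrival time (0)
        heapq.heappush(free_at, start + service)
    return total_wait / len(service_times)


jobs = [2.0, 2.0, 2.0, 2.0]                    # service time per job
serial_wait = average_wait(jobs, servers=1)    # one server, serial processing
parallel_wait = average_wait(jobs, servers=2)  # two servers working in parallel
print(serial_wait, parallel_wait)
```

Service time per job stays the same in both runs; only the wait time changes when a second server is added, which is the trade-off an architecture decision has to weigh against the cost of more servers.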
Flynn's Taxonomy
● SISD: Single Instruction, Single Data systems
● SIMD: Single Instruction, Multiple Data systems
● MIMD: Multiple Instruction, Multiple Data systems (multi-processors)
● MISD: Multiple Instruction, Single Data systems (pipeline?)
Traditional Way (Edureka!, https://www.edureka.co/blog/mapreduce-tutorial/)
MapReduce (Edureka!, https://www.edureka.co/blog/mapreduce-tutorial/)
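The MapReduce idea can be sketched on a single machine with a word count. This is only an illustration of the map, shuffle, and reduce phases in plain Python; it is not Hadoop's API, and the function names and sample documents are made up for the example.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into one result."""
    return {key: sum(values) for key, values in groups.items()}


documents = ["big data big ideas", "data pipelines move data"]
# In a real cluster each document (or split) would be mapped on a different node.
mapped = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```

Because each call to `map_phase` depends only on its own document, the map step can be spread across many machines, which is what makes the pattern suitable for large data sets.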
Adapted from Image by Cloudera
Hadoop Ecosystem (Edureka!, https://www.edureka.co/blog/hadoop-ecosystem)
Sample AWS Data Lake Platform (AWS, https://docs.aws.amazon.com/whitepapers/latest/building-data-lakes/building-data-lake-aws.html)
Data Integration and Big Data Analytics Framework (Jimenez-Marquez, Gonzalez-Carrasco, Lopez-Cuadrado, & Ruiz-Mezcua, 2019)
First Stage phases in detail using Yelp as example (Jimenez-Marquez, Gonzalez-Carrasco, Lopez-Cuadrado, & Ruiz-Mezcua, 2019)
Deep Learning Modeling Lifecycle (Miao, Li, Davis, & Deshpande, 2017)
Data Lake Concept
“If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.” (Dixon, 2010; Miloslavskaya & Tolstoy, 2016)
James Dixon, Chief Technology Officer, Pentaho
A data lake refers to a massively scalable storage repository that holds a vast amount of raw data in its native format («as is») until it is needed, plus processing systems (engine) that can ingest data without compromising the data structure. (Laskowski, 2016, as cited by Miloslavskaya & Tolstoy, 2016)
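One way to picture "raw data in its native format until it is needed" is the sketch below. It assumes a toy in-memory list standing in for the lake and applies structure only when data is read (often called schema-on-read, a term not used in the slides). The names `lake`, `ingest`, and `read_records` are hypothetical.

```python
import csv
import io
import json

# Hypothetical in-memory "lake": payloads are stored untouched, tagged by format.
lake = []


def ingest(raw, fmt):
    """Store the payload as is, without parsing or reshaping it."""
    lake.append({"format": fmt, "raw": raw})


def read_records(entry):
    """Apply structure only when the data is actually needed."""
    if entry["format"] == "json":
        return [json.loads(entry["raw"])]
    if entry["format"] == "csv":
        return list(csv.DictReader(io.StringIO(entry["raw"])))
    raise ValueError(f"unsupported format: {entry['format']}")


ingest('{"user": "ana", "clicks": 3}', "json")
ingest("user,clicks\nben,5\n", "csv")
records = [rec for entry in lake for rec in read_records(entry)]
print(records)
```

Note that the CSV value comes back as the string "5" while the JSON value is the integer 3: keeping data raw defers type and schema decisions to the reader, which is both the flexibility and the cost of the approach.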
Three types of big data processing (Marz & Warren, 2015; Miloslavskaya & Tolstoy, 2016; Samizadeh, 2018, March 15)
● Batch Processing
● Stream Processing (Kappa Architecture)
● Hybrid Processing (Lambda Architecture)
Kappa Architecture (Miloslavskaya & Tolstoy, 2016; Samizadeh, 2018, March 15)
Lambda Architecture (Miloslavskaya & Tolstoy, 2016; Samizadeh, 2018, March 15)
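The hybrid (Lambda) idea can be sketched as a batch view recomputed from historical data, a speed view updated per event, and a query that merges the two. This is a toy illustration under those assumptions; the class and function names are invented for the example and do not come from any framework.

```python
from collections import Counter


def batch_view(master_dataset):
    """Batch layer: recompute page counts from the full historical data set."""
    return Counter(event["page"] for event in master_dataset)


class SpeedLayer:
    """Speed layer: incrementally counts events the batch run has not seen yet."""

    def __init__(self):
        self.view = Counter()

    def process(self, event):
        self.view[event["page"]] += 1


def query(page, batch, speed):
    """Serving step: merge the batch view with the real-time view."""
    return batch[page] + speed.view[page]


history = [{"page": "home"}, {"page": "home"}, {"page": "about"}]
batch = batch_view(history)

speed = SpeedLayer()
speed.process({"page": "home"})  # event that arrived after the last batch run

home_views = query("home", batch, speed)
about_views = query("about", batch, speed)
print(home_views, about_views)
```

A Kappa-style design would keep only the streaming path and rebuild views by replaying the event log, trading the second code path for a dependency on a replayable stream.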
Fast Data Concept
What is Fast Data?
Fast data corresponds to the application of big data analytics to smaller data sets in near-real or real-time in order to solve a particular problem. (Laskowski, 2016, as cited by Miloslavskaya & Tolstoy, 2016)
The combination of in-memory databases and data grid on top of flash devices will allow an increase in the capacity of stream processing.
Fast data is a complementary approach to big data for managing large quantities of «in-flight» data. (Laskowski, 2016, as cited by Miloslavskaya & Tolstoy, 2016)
Fast Data requires two technologies (Laskowski, 2016, as cited by Miloslavskaya & Tolstoy, 2016)
● Streaming system capable of delivering events as fast as they come in
● Data store capable of processing each item as fast as it arrives
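The two technologies can be mimicked in a few lines: a generator standing in for the streaming system, and an object that updates its state per item standing in for the data store. Both are stand-ins chosen for illustration; the sensor readings are made up.

```python
def event_stream(readings):
    """Stand-in for a streaming system: hands over events one at a time."""
    for reading in readings:
        yield reading


class RunningStats:
    """Stand-in for a data store that processes each item as it arrives."""

    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.maximum = None

    def process(self, value):
        # Constant work per event: no need to re-scan earlier data.
        self.count += 1
        self.total += value
        if self.maximum is None or value > self.maximum:
            self.maximum = value

    @property
    def mean(self):
        return self.total / self.count if self.count else 0.0


stats = RunningStats()
for value in event_stream([21.5, 22.0, 23.5]):
    stats.process(value)
print(stats.count, stats.maximum, stats.mean)
```

The key property is that the answer is always up to date after each event, rather than after a batch job finishes.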
Data Quality
(Anaconda, 2020, p.12)
Data Cleaning
Data cleaning, also called data cleansing or scrubbing, deals with
detecting and removing errors and inconsistencies from data in order
to improve the quality of data.
Sample inconsistencies:
● misspellings during data entry
● missing information
● other invalid data
(Wang, Kon, & Madnick, 1993)
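The three sample inconsistencies above can be handled with a small cleaning function. This is a sketch only: the records, the `CITY_FIXES` lookup, and the 0 to 120 age range are hypothetical choices made for the example.

```python
# Hypothetical lookup of known misspellings, for illustration only.
CITY_FIXES = {"manla": "Manila", "manila": "Manila", "cebu": "Cebu"}


def clean(record):
    """Return a cleaned copy of the record, or None if it cannot be repaired."""
    city = (record.get("city") or "").strip().lower()
    age = record.get("age")
    if not city:
        return None  # missing information
    if not isinstance(age, int) or not 0 <= age <= 120:
        return None  # other invalid data
    # Misspellings during data entry are mapped to a canonical spelling.
    return {"city": CITY_FIXES.get(city, city.title()), "age": age}


raw = [
    {"city": "Manla", "age": 34},  # misspelled city
    {"city": "", "age": 28},       # missing city
    {"city": "Cebu", "age": -5},   # impossible age
]
cleaned = [rec for rec in (clean(r) for r in raw) if rec is not None]
print(cleaned)
```

Dropping bad records is the simplest policy; a real pipeline might instead quarantine them for review so that no information is silently lost.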
Data Cleaning Approaches
● Data Analysis
● Definition of Transformation Workflow
● Data Verification
● Data Transformation
● Backflow of Cleaned Data
● Special domain cleaning
● Specialized cleaning tools
● ETL Tools
Web Scraping
Data Wrangling
Event Streaming
Event streaming is the digital equivalent of the human body's central nervous system. It is the technological foundation for the 'always-on' world where businesses are increasingly software-defined and automated, and where the user of software is more software. (Apache Software Foundation, 2017)
(Apache Software Foundation, 2017)
● Capturing data in real time from event sources
● Storing these event streams durably for later retrieval and manipulation
● Routing the event streams to different destination technologies as needed
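The capture, store, and route steps can be sketched with an append-only file as the durable log. This is a toy model of the idea and deliberately does not use Kafka's API; `EventLog`, `route`, and the sample events are hypothetical names for the example.

```python
import json
import os
import tempfile


class EventLog:
    """Append-only log file: a stand-in for durable event storage."""

    def __init__(self, path):
        self.path = path

    def append(self, event):
        # One JSON document per line, always added at the end of the file.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")

    def replay(self):
        """Read every stored event back, in the order it was captured."""
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f]


def route(events, destinations):
    """Send each event to the destination registered for its type."""
    for event in events:
        destinations[event["type"]].append(event)


log = EventLog(os.path.join(tempfile.mkdtemp(), "events.log"))
for event in [{"type": "click", "page": "home"}, {"type": "order", "total": 120}]:
    log.append(event)  # capture and store durably

destinations = {"click": [], "order": []}
route(log.replay(), destinations)  # retrieve later and route
print(destinations)
```

Because the log is only ever appended to, consumers can replay it from the beginning at any time, which is what allows new destinations to be added after the events were captured.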
Data Streams (examples) (Miloslavskaya & Tolstoy, 2016)
● Time Series Data
● Network Traffic
● Telecommunications
● Video Surveillance
● Website Clickstreams
● Sensor Networks
Illustration of Data Capture Equipment (Chen, Mao, & Liu, 2014)
Big Data Concept
CAP Theorem
Consistency, Availability, Partition Tolerance Trade-offs
Visualization of Eric Brewer's CAP Theorem (Brewer, 2012; Khazaei, 2016)
Big Data characteristics and NoSQL features (Khazaei, 2016)
“When a system processes trillions and trillions of requests, events that normally have a low probability of occurrence are now guaranteed to happen and must be accounted for upfront in the design and architecture of the system.” (Vogels, 2009)
Werner Vogels, Vice President & Chief Technology Officer, Amazon.com
Source: Wikipedia
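The arithmetic behind this quote can be checked directly. Assuming each request fails independently with the same small probability p (an idealization, and the one-in-a-billion figure is hypothetical), the chance of seeing at least one failure in n requests is 1 - (1 - p)^n.

```python
import math


def prob_at_least_once(p, n):
    """P(at least one occurrence in n independent trials) = 1 - (1 - p)**n."""
    # log1p and expm1 keep precision when p is tiny and n is huge.
    return -math.expm1(n * math.log1p(-p))


rare = 1e-9  # hypothetical one-in-a-billion failure per request
few = prob_at_least_once(rare, 1_000)     # a small system
many = prob_at_least_once(rare, 10**12)   # a trillion requests
print(few, many)
```

At a thousand requests the event is still about one in a million; at a trillion requests it is, for practical purposes, certain, which is why the design has to account for it upfront.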
ACID vs BASE
Comparison
ACID
● Atomicity
● Consistency
● Isolation
● Durability
BASE
● Basic Availability
● Soft state
● Eventual Consistency
(Brewer, 2012; Tudorica & Bucur, 2011)
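Atomicity and consistency, the first two ACID properties, can be seen with Python's built-in `sqlite3` module. The sketch below assumes a made-up `accounts` table with a rule that balances cannot go negative; a transfer that would break the rule is rolled back as a whole.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"  # consistency rule
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("ana", 100), ("ben", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between accounts; either both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "ana", "ben", 500)  # ana only has 100, so this must fail
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, balances)
```

The credit to the second account is undone when the debit fails, so no reader ever sees money created out of nothing. A BASE-style system would instead accept temporary disagreement between replicas and converge later.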
NoSQL
Survey, Comparisons, Taxonomies
No official NoSQL taxonomy exists (Tudorica & Bucur, 2011).
Core NoSQL, with examples:
● Wide Column Store: Hadoop / HBase, Cassandra, Hypertable, Cloudata, Amazon SimpleDB, SciDB
● Document Store: CouchDB, MongoDB, Terrastore, ThruDB, OrientDB, RavenDB, Citrusleaf, SisoDB, CloudKit, Persevere, Jackrabbit
● Key Value/Tuple Store: Azure Table Storage, MEMBASE, Riak, Redis, Chordless, GenieDB, Scalaris, Tokyo Cabinet / Tyrant, GT.M, Keyspace, Berkeley DB, MemcacheDB, HamsterDB, Faircom CTree, Mnesia, LightCloud, Pincaster, Hibari, Scality
● Eventually-Consistent Key Value Store: Amazon Dynamo, Voldemort, Dynomite, KAI, SubRecord, Mo8onDb, Dovetaildb
● Graph Database: Neo4J, Infinite Graph, Sones, InfoGrid, HyperGraphDB, Trinity, AllegroGraph, Bigdata, DEX, OpenLink Virtuoso, VertexDB, FlockDB, Java Universal Network / Graph Framework, Sesame, Filament, OWLim, NetworkX, iGraph
Soft NoSQL, with examples:
● Object Databases: db4o, Versant, Objectivity, Gemstone, Progress, Starcounter, Perst, ZODB, NEO, PicoLisp, Sterling, StupidDB, KiokuDB, Durus
● Grid and Cloud Database Solutions: GigaSpaces, Queplix, Hazelcast, Joafip, GridGain, Infinispan, Coherence, eXtremeScale
● XML Databases: Mark Logic Server, EMC Documentum xDB, Tamino, eXist, Sedna, BaseX, Xindice, Qizx, Berkeley DB XML
● Multivalue Databases: U2, OpenInsight, OpenQM, Globals
● Other NoSQL related databases: IBM Lotus/Domino, Intersystems Cache, eXtremeDB, ISIS Family, Prevayler, Yserial
Key Value (Khazaei, 2016)
● Simplest form of database management systems
● Can only store pairs of keys and values, as well as retrieve values when a key is known
● Normally not adequate for complex applications
● Simplicity makes these attractive in certain circumstances
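A key-value store's whole interface fits in a few lines. The class below is a dictionary-backed sketch written for illustration; it is not the API of any product listed earlier, and the session keys are made up.

```python
class KeyValueStore:
    """Minimal key-value store: values can only be reached through their key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)


store = KeyValueStore()
store.put("session:42", {"user": "ana"})
found = store.get("session:42")
missing = store.get("session:99")  # unknown key, so nothing comes back
print(found, missing)
```

There is no way to ask "which sessions belong to ana?" without scanning every key, which is why the model is normally not adequate for complex applications yet attractive for caches and session data.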
Column-Oriented (Khazaei, 2016)
● Stores data in records with an ability to hold very large numbers of dynamic columns
● Can be seen as two-dimensional key-value stores
● Schema-free like document stores; however, the implementation is significantly different
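The "two-dimensional key-value store" view can be sketched as a mapping from row key to column name to value. This only illustrates the data model, not how real wide-column stores lay data out on disk; the row keys and columns are hypothetical.

```python
from collections import defaultdict

# Row key -> {column name -> value}; each row may carry different columns.
table = defaultdict(dict)


def put(row_key, column, value):
    table[row_key][column] = value


def get(row_key, column, default=None):
    # dict.get avoids creating an empty row for an unknown key.
    return table.get(row_key, {}).get(column, default)


put("user:1", "name", "Ana")
put("user:1", "email", "ana@example.com")
put("user:2", "name", "Ben")
put("user:2", "last_login", "2020-09-14")  # a column only this row has

email_1 = get("user:1", "email")
email_2 = get("user:2", "email")
print(email_1, email_2, sorted(table["user:2"]))
```

Columns are created on write rather than declared in advance, which is what allows a row to hold a very large and changing set of them.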
Document Stores (Khazaei, 2016)
● Also known as document-oriented database systems
● Schema-free organization
● Records (or "documents") do not need to have a uniform structure
● The types of the values of individual columns can be different
● Columns can have more than one value (arrays); records can have a nested structure
● Document stores often use internal notations, usually JSON
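The properties above can be seen with three JSON documents that deliberately differ in structure. The documents, the `find` helper, and its dotted-path syntax are assumptions made for this sketch, not any particular database's query language.

```python
import json

documents = [
    json.loads('{"id": 1, "name": "Ana", "tags": ["admin", "analyst"]}'),  # array value
    json.loads('{"id": 2, "name": "Ben", "address": {"city": "Manila"}}'),  # nested record
    json.loads('{"id": 3, "name": 42}'),  # same field, different type
]


def find(docs, path, expected):
    """Return documents whose (possibly nested) field equals `expected`."""
    matches = []
    for doc in docs:
        value = doc
        for key in path.split("."):
            # Documents lacking the field simply do not match.
            value = value.get(key) if isinstance(value, dict) else None
        if value == expected:
            matches.append(doc)
    return matches


in_manila = find(documents, "address.city", "Manila")
print(in_manila)
```

Documents without an `address` field are skipped rather than rejected, which shows the schema-free organization: the store accepts all three shapes and leaves interpretation to the query.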
Graph Oriented (Khazaei, 2016)
● Represent data in graph structures as nodes and edges
● Edges represent relationships between nodes
● Allow easy processing of data in that form (graphs)
● Simple calculation of specific properties of the graph, such as the number of steps needed to get from one node to another node
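The "number of steps from one node to another" property mentioned above is a breadth-first search. Here is a minimal sketch over a small, made-up directed graph stored as an adjacency list; graph databases provide this kind of traversal natively.

```python
from collections import deque

# Hypothetical "follows" relationships: node -> nodes it points to.
edges = {
    "ana": ["ben", "cara"],
    "ben": ["dan"],
    "cara": ["dan"],
    "dan": ["eve"],
    "eve": [],
}


def steps_between(graph, start, goal):
    """Fewest edges to follow from start to goal, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor == goal:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


ana_to_eve = steps_between(edges, "ana", "eve")
eve_to_ana = steps_between(edges, "eve", "ana")
print(ana_to_eve, eve_to_ana)
```

In a relational database the same question would need one self-join per step, which is why relationship-heavy data is often a better fit for a graph model.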
NoSQL Solutions (Khazaei, 2016)
Interrelation between Big Data, Fast Data, and Data Lake Concepts (Laskowski, 2016, as cited by Miloslavskaya & Tolstoy, 2016)
Takeaways
● Upskilling is not impossible
● Understand workload management and trade-offs when making architecture decisions
● Don't be afraid to work with other people
● Experiment, experiment, experiment…
● Be a wide reader and hungry learner
Thanks!
Any questions?
You can email me at:
jbilagan@ateneo.edu
Credits
Special thanks to all the people who made
and released these awesome resources for
free:
⬡ Presentation template by SlidesCarnival
⬡ Photographs by Unsplash
References
Anaconda (2020). 2020 State of Data Science: Moving from hype toward maturity.
Anderson, J. (2017). Data Engineering Teams: Creating Successful Big Data Teams and Products.
Anderson, J. (2018, April 11). Data engineers vs. data scientists. O'Reilly Media. https://www.oreilly.com/radar/data-engineers-vs-data-scientists/
Apache Software Foundation. (2017). Introduction: Everything you need to know about Kafka in 10 minutes. Apache Kafka. https://kafka.apache.org/intro
Brewer, E. (2001). Lessons from giant-scale services. IEEE Internet Computing, 5(4), 46-55. https://dx.doi.org/10.1109/4236.939450
Brewer, E. (2012). CAP Twelve Years Later: How the “Rules” Have Changed. Computer, 45(2), 23-29. https://dx.doi.org/10.1109/mc.2012.37
Chen, M., Mao, S., & Liu, Y. (2014). Big data: A survey. Mobile networks and applications, 19(2), 171-209.
Dean, J., Ghemawat, S., Mehta, B. (2008). MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1), 107-113. https://dx.doi.org/10.1145/1327452.1327492
De Mauro, A., Greco, M., Grimaldi, M., & Ritala, P. (2018). Human resources for Big Data professions: A systematic classification of job roles and required skill sets. Information Processing & Management, 54(5), 807-817.
Devopedia. (2020). CAP Theorem. Version 4, April 30. Accessed 2020-09-14. https://devopedia.org/cap-theorem
Dixon, J. (2010, October 14). Pentaho, Hadoop, and Data Lakes. James Dixon’s Blog. https://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes/
Feng, Z., Hui-Feng, X., Dong-Sheng, X., Yong-Heng, Z., & Fei, Y. (2013). Big data cleaning algorithms in cloud computing. International Journal of Online Engineering, 9(3), 77–81. https://doi.org/10.3991/ijoe.v9i3.2765
Gray, J., & Shenoy, P. (2000). Rules of thumb in data engineering. Proceedings - International Conference on Data Engineering, 3–10. https://doi.org/10.1109/icde.2000.839382
Ishwarappa, & Anuradha, J. (2015). A brief introduction on big data 5Vs characteristics and hadoop technology. Procedia Computer Science, 48(C), 319–324. https://doi.org/10.1016/j.procs.2015.04.188
Jimenez-Marquez, J., Gonzalez-Carrasco, I., Lopez-Cuadrado, J., Ruiz-Mezcua, B. (2019). Towards a big data framework for analyzing social media content. International Journal of Information Management, 44, 1-12. https://dx.doi.org/10.1016/j.ijinfomgt.2018.09.003
Khazaei, H. (2016). How do I choose the right NoSQL solution? Big Data, X(0), 1–33.
Laskowski, N. (2016). Data lake governance: A big data do or die. URL: http://searchcio.techtarget.com/feature/Data-lake-governance-A-big-data-do-or-die (access date 28/05/2016)
Marz, N., & Warren, J. (2015). Big Data: Principles and best practices of scalable real-time data systems. New York; Manning Publications Co.
Miao, H., Li, A., Davis, L. S., & Deshpande, A. (2017, April). Towards unified data and lifecycle management for deep learning. In 2017 IEEE 33rd International Conference on Data Engineering (ICDE) (pp. 571-582). IEEE.
Miloslavskaya, N., & Tolstoy, A. (2016). Big Data, Fast Data and Data Lake Concepts. Procedia Computer Science, 88, 300–305. https://doi.org/10.1016/j.procs.2016.07.439
Najafabadi, M. M., Villanustre, F., Khoshgoftaar, T. M., Seliya, N., Wald, R., & Muharemagic, E. (2015). Deep learning applications and challenges in big data analytics. Journal of Big Data, 2(1), 1.
Rahm, E., & Do, H. H. (2000). Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 23(4), 3-13.
Samizadeh, I. (2018, March 15). A brief introduction to two data processing architectures - Lambda and Kappa for Big Data. https://towardsdatascience.com/a-brief-introduction-to-two-data-processing-architectures-lambda-and-kappa-for-big-data-4f35c28005bb
Tudorica, B., Bucur, C. (2011). A comparison between several NoSQL databases with comments and notes. 2011 RoEduNet International Conference 10th Edition: Networking in Education and Research, 1, 1-5. https://dx.doi.org/10.1109/roedunet.2011.5993686
Van Der Aalst, W. (2016). Data science in action. In Process mining (pp. 3-23). Springer, Berlin, Heidelberg.
Vogels, W. (2009). Eventually consistent. Communications of the ACM, 52(1), 40-44. https://dx.doi.org/10.1145/1435417.1435432
Wang, R. Y., Kon, H. B., & Madnick, S. E. (1993). Data quality requirements analysis and modeling. Proceedings - International Conference on Data Engineering, 670–677. https://doi.org/10.1109/icde.1993.344012
Yin, S., & Kaynak, O. (2015). Big data for modern industry: challenges and trends [point of view]. Proceedings of the IEEE, 103(2), 143-146.

The role of data engineering in data science and analytics practice

  • 1.
    The Role ofData Engineering in Data Science and Analytics Practice Joseph Benjamin ILAGAN Ateneo de Manila University jbilagan@ateneo.edu
  • 2.
    Hello! I am JobenIlagan Thank you for having me here. Twitter: @jilagan LinkedIn: https://www.linkedin.com/in/jobenilagan/ 2
  • 3.
  • 4.
  • 5.
  • 6.
    A data engineeringTeam is NOT a collection of Data Engineers A data engineering team isn’t made up of a single type of person or title A data engineering team is multidisciplinary 6 Data Engineering Team (Anderson, 2017)
  • 7.
    Creates data pipelines Bringstogether 10-30 different big data technologies Understands and chooses the right tools for the job Understands the various technologies and frameworks in-depth Combines them to create solutions to enable a company’s business processes with data pipelines 7 Data Engineering Team (Anderson, 2018)
  • 8.
    Data science isan interdisciplinary field aiming to turn data into real value. Data may be structured or unstructured, big or small, static or streaming. Value may be provided in the form of predictions, automated decisions, models learned from data, or any type of data visualization delivering insights. Data science includes data extraction, data preparation, data exploration, data transformation, storage and retrieval, computing infrastructures, various types of mining and learning, presentation of explanations and predictions, and the exploitation of results taking into account ethical, social, legal, and business aspects. 9 Where does data engineering fit in the context of data analytics? (Van Der Aalst, 2016)
  • 9.
    Where does dataengineering fit in the context of data analytics? 10 The ingredients contributing to data science (Van Der Aalst, 2016)
  • 10.
    Where does dataengineering fit in the context of data analytics? 11 The Internet of Events (Van Der Aalst, 2016)
  • 11.
    12 Alluvial diagram ofBig Data job families vs. Big Data skill sets (De Mauro, Greco, Grimaldi, & Ritala, 2018)
  • 12.
    13 Word cloud showingthe top 50 words recurring in the Job Title of posts related to Big Data. The font size of each word is proportional to the number of occurrences of each word. (De Mauro, Greco, Grimaldi, & Ritala, 2018).
  • 13.
  • 14.
  • 15.
    2. How do Iget started in Data Engineering? Process, Skills, Tools
  • 16.
    “A Data Engineeris someone who has specialized their skills in creating software solutions around data. 17 (Anderson, 2017) Jesse Anderson Data Engineer Managing Director, Big Data Institute
  • 17.
    “I started gettinginto Data about 5 years ago when everyone started talking about it and was becoming the new buzz word. People were actually starting to realize how much you can do with data. Everyone wanted to learn Data Science and all the companies wanted to get into Machine Learning or Artificial Intelligence, but there was still a missing piece - that's when I learned about Data Engineering. 18 Miles Ong Data Engineer Kumu
  • 18.
    “There was alwaysthe problem of collecting the data, processing the analysis and actually implementing the insights. It was overwhelming at first because I had no clue on where to begin. What helped is to focus on things one at a time and actually trying things out. For Data Engineering, the only way to learn is by doing. 19 Miles Ong Data Engineer Kumu
  • 19.
    “The more Ilearned, the more I realized how powerful Data Engineering is. It would give me the capability to simply come up with an idea and actually implement it. It's a very underrated field, but I love the challenge of conceptualizing, building and implementing concrete solutions that make a difference. 20 Miles Ong Data Engineer Kumu
  • 20.
    Steps to DataEngineering PRE-PROJECT (Anderson, 2017)
  • 21.
    Steps to DataEngineering FORM TEAM (Anderson, 2017)
  • 22.
    Steps to DataEngineering USE CASE (Anderson, 2017)
  • 23.
    Steps to DataEngineering KNOW GAPS (Anderson, 2017)
  • 24.
    Steps to DataEngineering TRAIN, MENTOR (Anderson, 2017)
  • 25.
    Steps to DataEngineering CHOOSE TECH (Anderson, 2017)
  • 26.
    Steps to DataEngineering WRITE CODE (Anderson, 2017)
  • 27.
    Steps to DataEngineering EVALUATE ITERATION (Anderson, 2017)
  • 28.
    Steps to DataEngineering REPEAT (Anderson, 2017)
  • 29.
  • 30.
    Big Data (WorkingDefinition) Big data is a field that treats ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by traditional data-processing application software. 31
  • 31.
    Big Data (WorkingDefinition) Big data is a field that treats ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by traditional data-processing application software. 32 VOLUME VARIETY VELOCITY … (more on this in the next slide...)
  • 32.
    Author's reinterpretation ofThe 5Vs of Big Data (Ishwarappa & Anuradha, 2015; Yin & Kaynak, 2015) 33 VOLUME VARIETY VERACITY VELOCITY VALUE THE 5Vs OF BIG DATA
  • 33.
    Author's reinterpretation ofThe 5Vs of Big Data (Ishwarappa & Anuradha, 2015; Yin & Kaynak, 2015) 34 VOLUME VARIETY VERACITY VELOCITY VALUE THE 5Vs OF BIG DATAStructured Unstructured Text, Image, video, social relations Multi-factor Probabilistic
  • 34.
    Author's reinterpretation ofThe 5Vs of Big Data (Ishwarappa & Anuradha, 2015; Yin & Kaynak, 2015) 35 VOLUME VARIETY VERACITY VELOCITY VALUE THE 5Vs OF BIG DATA Terabytes Records Architecture Transactions Tables, Files
  • 35.
    Author's reinterpretation ofThe 5Vs of Big Data (Ishwarappa & Anuradha, 2015; Yin & Kaynak, 2015) 36 VOLUME VARIETY VERACITY VELOCITY VALUE THE 5Vs OF BIG DATA Statistical Events Correlations Hypothetical Fresh? Old?
  • 36.
    Author's reinterpretation ofThe 5Vs of Big Data (Ishwarappa & Anuradha, 2015; Yin & Kaynak, 2015) 37 VOLUME VARIETY VERACITY VELOCITY VALUE THE 5Vs OF BIG DATA Batch Real/near-time Processes Streams
  • 37.
    Author's reinterpretation ofThe 5Vs of Big Data (Ishwarappa & Anuradha, 2015; Yin & Kaynak, 2015) 38 VOLUME VARIETY VERACITY VELOCITY VALUE THE 5Vs OF BIG DATA Trustworthiness Authenticity Origin, Reputation Availability Accountability
  • 38.
    Workload Management 39 Servers Queues (or Partitions) Servicevs Wait Times Concurrency Serial vs Parallel Synchronous vs Asynchronous
  • 39.
    40 SI SISD SD SingleInstruction Single Data Systems Flynn's Taxonomy
  • 40.
    41 SI SIMD SD Single Instruction MultipleData Systems SIMD SD SIMD SD Flynn's Taxonomy
  • 41.
    42 SI SIMD SD Multiple Instruction MultipleData Systems SIMD SD SIMD SD SI SI Multiple Instructions Multiple DataMULTI-PROCESSORS Flynn's Taxonomy
  • 42.
    43 SD SIMD SI Multiple Instruction SingleData Systems (Pipeline?) SIMD SI SIMD SI Single Data Multiple Instructions Flynn's Taxonomy
  • 43.
  • 44.
  • 45.
  • 46.
    Adapted from Imageby Cloudera 47
  • 47.
    Hadoop Ecosystem (Edureka!,https://www.edureka.co/blog/hadoop-ecosystem ) 48
  • 48.
    Sample AWS DataLake Platform (AWS, https://docs.aws.amazon.com/whitepapers/latest/building-data-lakes/building-data-lake-aws.html) 49
  • 49.
    Data Integration andBig Data Analytics Framework (Jimenez-Marquez, Gonzalez-Carrasco, Lopez-Cuadrado, Ruiz-Mezcua 2019) 50
  • 50.
    First Stage phasesin detail using Yelp as example (Jimenez-Marquez, Gonzalez-Carrasco, Lopez-Cuadrado, Ruiz-Mezcua 2019) 51
  • 51.
    Deep Learning ModelingLifecycle (Miao, Li, Davis, & Deshpande, 2017) 52
  • 52.
  • 53.
    “If you thinkof a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples. 54 (Dixon, 2010; Miloslavskaya & Tolstoy, 2016) James Dixon Chief Technology Officer Pentaho
  • 54.
    55 A data lakerefers to a massively scalable storage repository that holds a vast amount of raw data in its native format («as is») until it is needed plus processing systems (engine) that can ingest data without compromising the data structure (Laskowski, 2016, as cited by Miloslavskaya & Tolstoy, 2016)
  • 55.
    Three types ofbig data processing Batch Processing Stream Processing (Kappa Architecture) Hybrid Processing (Lambda Architecture) 56 (Marz & Warren, 2015) (Miloslavskaya & Tolstoy, 2016; Samizadeh, 2018, March 15)
  • 56.
    Kappa Architecture 57(Miloslavskaya &Tolstoy, 2016; Samizadeh, 2018, March 15)
  • 57.
    Lambda Architecture 58(Miloslavskaya &Tolstoy, 2016; Samizadeh, 2018, March 15)
  • 58.
  • 59.
    60 What is FastData? Fast data corresponds to the application of big data analytics to smaller data sets in near-real or real-time in order to solve a particular problem. (Laskowski, 2016, as cited by Miloslavskaya & Tolstoy, 2016)
  • 60.
    61 What is FastData? The combination of in-memory databases and data grid on top of flash devices will allow an increase in the capacity of stream processing. Fast data is a complementary approach to big data for managing large quantities of «in-flight» data (Laskowski, 2016, as cited by Miloslavskaya & Tolstoy, 2016)
  • 61.
    62 Fast Data requirestwo technologies (Laskowski, 2016, as cited by Miloslavskaya & Tolstoy, 2016) Streaming system capable of of delivering events as fast as they come in Data store capable of processing each item as fast as it arrives
  • 62.
  • 63.
  • 64.
    65 Data Cleaning Data cleaning,also called data cleansing or scrubbing, deals with detecting and removing errors and inconsistencies from data in order to improve the quality of data. Sample inconsistencies: ● misspellings during data entry ● missing information ● other invalid data (Wang, Kon, & Madnick, 1993)
  • 65.
    Data Cleaning Approaches 66 DataAnalysis Definition of Transformation Workflow Data Verification Data Transformation Backflow of Cleaned Data Special domain cleaning Specialized cleaning tools ETL Tools
  • 66.
  • 67.
  • 68.
  • 69.
  • 70.
  • 71.
    72 Event streaming isthe digital equivalent of the human body's central nervous system. It is the technological foundation for the 'always-on' world where businesses are increasingly software-defined and automated, and where the user of software is more software. (Apache Software Foundation, 2017)
  • 72.
    73 (Apache Software Foundation,2017) Capture data in real-time from event sources.
  • 73.
    74 (Apache Software Foundation,2017) Capture data in real-time from event sources. storing these event streams durably for later retrieval and manipulation
  • 74.
    75 (Apache Software Foundation,2017) Capture data in real-time from event sources. storing these event streams durably for later retrieval and manipulation routing the event streams to different destination technologies as needed
  • 75.
    Data Streams (examples) TimeSeries Data Network Traffic Telecommunications 76 Video Surveillance Website Clickstreams Sensor Networks (Miloslavskaya & Tolstoy, 2016)
  • 76.
    77 Illustration of DataCapture Equipment (Chen, Mao, & Liu, 2014)
  • 77.
  • 78.
  • 79.
    80 Visualization of EricBrewer's CAP Theorem (Brewer, 2012; Khazaei, 2016)
  • 80.
    81 Big Data characteristicsand NoSQL features (Khazaei, 2016)
  • 81.
    “When a systemprocesses trillions and trillions of requests, events that normally have a low probability of occurrence are now guaranteed to happen and must be accounted for upfront in the design and architecture of the system. 82 (Vogels, 2009) Werner Vogels Vice President & Chief Technology Officer Amazon.com, Source: Wikipedia
  • 82.
  • 83.
    ACID vs BASE. ACID: ● Atomicity ● Consistency ● Isolation ● Durability. BASE: ● Basic Availability ● Soft state ● Eventual Consistency (Brewer, 2012; Tudorica & Bucur, 2011)
  • 84.
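To make "soft state" and "eventual consistency" concrete, here is a minimal sketch with two replicas that are synchronized only on demand. The `Replica` class and its per-key version counter are assumptions made for this illustration; real systems use more careful conflict resolution.

```python
class Replica:
    """Toy replica holding key -> (version, value); highest version wins."""

    def __init__(self):
        self.data = {}

    def write(self, key, value):
        """Accept a write locally without waiting for other replicas."""
        version = self.data.get(key, (0, None))[0] + 1
        self.data[key] = (version, value)

    def read(self, key):
        """Return the locally known value, which may be stale."""
        return self.data.get(key, (0, None))[1]

    def sync_from(self, other):
        """Pull newer versions from another replica (assumed anti-entropy step)."""
        for key, (version, value) in other.data.items():
            if version > self.data.get(key, (0, None))[0]:
                self.data[key] = (version, value)


a, b = Replica(), Replica()
a.write("cart", ["book"])
print(b.read("cart"))   # stale read: replica b has not seen the write yet
b.sync_from(a)
print(b.read("cart"))   # after synchronization both replicas agree
```

An ACID system would instead make the write visible everywhere before acknowledging it; the trade-off shown here is availability of each replica in exchange for temporarily stale reads.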
  • 85.
    No official NoSQL taxonomy exists: Core NoSQL, with examples (Tudorica & Bucur, 2011)
    ● Wide Column Store: Hadoop / HBase, Cassandra, Hypertable, Cloudata, Amazon SimpleDB, SciDB
    ● Document Store: CouchDB, MongoDB, Terrastore, ThruDB, OrientDB, RavenDB, Citrusleaf, SisoDB, CloudKit, Persevere, Jackrabbit
    ● Key Value/Tuple Store: Azure Table Storage, MEMBASE, Riak, Redis, Chordless, GenieDB, Scalaris, Tokyo Cabinet / Tyrant, GT.M, Keyspace, Berkeley DB, MemcacheDB, HamsterDB, Faircom CTree, Mnesia, LightCloud, Pincaster, Hibari, Scality
    ● Eventually-Consistent Key Value Store: Amazon Dynamo, Voldemort, Dynomite, KAI, SubRecord, Mo8onDb, Dovetaildb
    ● Graph Database: Neo4J, Infinite Graph, Sones, InfoGrid, HyperGraphDB, Trinity, AllegroGraph, Bigdata, DEX, OpenLink Virtuoso, VertexDB, FlockDB, Java Universal Network / Graph Framework, Sesame, Filament, OWLim, NetworkX, iGraph
  • 86.
    No official NoSQL taxonomy exists: Soft NoSQL, with examples (Tudorica & Bucur, 2011)
    ● Object Databases: db4o, Versant, Objectivity, Gemstone, Progress, Starcounter, Perst, ZODB, NEO, PicoLisp, Sterling, StupidDB, KiokuDB, Durus
    ● Grid and Cloud Database Solutions: GigaSpaces, Queplix, Hazelcast, Joafip, GridGain, Infinispan, Coherence, eXtremeScale
    ● XML Databases: Mark Logic Server, EMC Documentum xDB, Tamino, eXist, Sedna, BaseX, Xindice, Qizx, Berkeley DB XML
    ● Multivalue Databases: U2, OpenInsight, OpenQM, Globals
    ● Other NoSQL related databases: IBM Lotus/Domino, Intersystems Cache, eXtremeDB, ISIS Family, Prevayler, Yserial
  • 87.
    Key Value: ● Simplest form of database management systems ● Can only store pairs of keys and values, as well as retrieve values when a key is known ● Normally not adequate for complex applications ● Simplicity makes these attractive in certain circumstances (Khazaei, 2016)
  • 88.
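The key-value model above is small enough to sketch in a few lines. The `KeyValueStore` class and the sample keys are hypothetical; the point is that the only access path is the key.

```python
class KeyValueStore:
    """Toy key-value store: values can only be reached through their key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)


store = KeyValueStore()
store.put("session:42", {"user": "ana", "cart_items": 3})
print(store.get("session:42"))   # known key: value is returned
print(store.get("session:99"))   # unknown key: nothing to return
```

There is deliberately no way to ask "which sessions have three cart items?" without scanning every value, which is one reason this model is usually not adequate for complex applications.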
    Column-Oriented: ● Stores data in records with an ability to hold very large numbers of dynamic columns ● Can be seen as two-dimensional key-value stores ● Schema-free like document stores, however the implementation is significantly different (Khazaei, 2016)
  • 89.
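The "two-dimensional key-value store" view can be sketched as a mapping from row key to column name to value. `WideColumnTable` and the sample rows are assumptions for illustration and say nothing about how real wide-column stores lay out data on disk.

```python
from collections import defaultdict


class WideColumnTable:
    """Toy wide-column table: row key -> {column name -> value}."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        """Set one cell; columns are created on the fly for each row."""
        self._rows[row_key][column] = value

    def get(self, row_key, column, default=None):
        return self._rows.get(row_key, {}).get(column, default)

    def columns(self, row_key):
        """Return the column names that this particular row actually has."""
        return sorted(self._rows.get(row_key, {}))


table = WideColumnTable()
table.put("user:1", "name", "Ana")
table.put("user:1", "email", "ana@example.com")
table.put("user:2", "name", "Ben")
table.put("user:2", "last_login", "2020-09-14")
print(table.columns("user:1"))
print(table.columns("user:2"))
```

Each row carries its own set of columns, which is the schema-free, dynamic-column property described on the slide.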
    Document Stores: ● Also known as document-oriented database systems ● Schema-free organization ● Records (or "documents") do not need to have a uniform structure ● The types of the values of individual columns can be different ● Columns can have more than one value (arrays); records can have a nested structure ● Document stores often use internal notations, usually JSON. (Khazaei, 2016)
  • 90.
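A minimal sketch of these properties, using plain Python dictionaries as documents: the sample documents, the dotted-path convention, and the `find` helper are hypothetical and are not the query API of any particular document store.

```python
import json


def get_path(document, path):
    """Follow a dotted path such as 'address.city' into nested dictionaries."""
    current = document
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def find(collection, path, expected):
    """Return documents whose field equals, or whose array contains, expected."""
    matches = []
    for doc in collection:
        value = get_path(doc, path)
        if value == expected or (isinstance(value, list) and expected in value):
            matches.append(doc)
    return matches


# Documents in one collection need not share a structure.
people = [
    {"name": "Ana", "tags": ["admin", "staff"], "address": {"city": "Manila"}},
    {"name": "Ben", "age": 27},
]
print(find(people, "address.city", "Manila"))   # nested field
print(find(people, "tags", "staff"))            # multi-valued field (array)
print(json.dumps(people[0]))                    # JSON notation of a document
```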
    Graph Oriented: ● Represent data in graph structures as nodes and edges ● Edges represent relationships between nodes ● Allow easy processing of data in that form (graphs) ● Simple calculation of specific properties of the graph, such as the number of steps needed to get from one node to another node (Khazaei, 2016)
  • 91.
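The "number of steps" property mentioned above can be computed with a breadth-first search. This sketch assumes an undirected graph given as a list of edges; the function name and the sample "knows" relationships are hypothetical.

```python
from collections import deque


def steps_between(edges, start, goal):
    """Return the fewest edges between two nodes, or None if unreachable."""
    # Build an adjacency map, treating every edge as undirected (an assumption).
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    if start not in graph or goal not in graph:
        return None

    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, distance = queue.popleft()
        if node == goal:
            return distance
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, distance + 1))
    return None


knows = [("Ana", "Ben"), ("Ben", "Cy"), ("Cy", "Dee"), ("Eve", "Fay")]
print(steps_between(knows, "Ana", "Dee"))   # path Ana -> Ben -> Cy -> Dee
print(steps_between(knows, "Ana", "Eve"))   # no path between the two groups
```

A graph database keeps relationships as first-class edges, so a traversal like this does not need the repeated joins that a relational schema would require.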
  • 92.
    Interrelation between Big Data, Fast Data, and Data Lake Concepts (Laskowski, 2016, as cited by Miloslavskaya & Tolstoy, 2016)
  • 93.
    Takeaways: ● Upskilling is not impossible ● Understand workload management and trade-offs when making architecture decisions ● Don't be afraid to work with other people ● Experiment, experiment, experiment… ● Be a wide reader and hungry learner
  • 94.
    Thanks! Any questions? You can email me at: jbilagan@ateneo.edu
  • 95.
    Credits: Special thanks to all the people who made and released these awesome resources for free: ⬡ Presentation template by SlidesCarnival ⬡ Photographs by Unsplash
  • 96.
    References
    Anaconda. (2020). 2020 State of Data Science: Moving from hype toward maturity.
    Anderson, J. (2017). Data Engineering Teams: Creating Successful Big Data Teams and Products.
    Anderson, J. (2018, April 11). Data engineers vs. data scientists. O'Reilly Media. [https://www.oreilly.com/radar/data-engineers-vs-data-scientists/](https://www.oreilly.com/radar/data-engineers-vs-data-scientists/)
    Apache Software Foundation. (2017). Introduction: Everything you need to know about Kafka in 10 minutes. Apache Kafka. [https://kafka.apache.org/intro](https://kafka.apache.org/intro)
    Brewer, E. (2001). Lessons from giant-scale services. IEEE Internet Computing, 5(4), 46-55. [https://dx.doi.org/10.1109/4236.939450](https://dx.doi.org/10.1109/4236.939450)
    Brewer, E. (2012). CAP Twelve Years Later: How the “Rules” Have Changed. Computer, 45(2), 23-29. [https://dx.doi.org/10.1109/mc.2012.37](https://dx.doi.org/10.1109/mc.2012.37)
    Chen, M., Mao, S., & Liu, Y. (2014). Big data: A survey. Mobile Networks and Applications, 19(2), 171-209.
    Dean, J., Ghemawat, S., Mehta, B. (2008). MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1), 107-113. [https://dx.doi.org/10.1145/1327452.1327492](https://dx.doi.org/10.1145/1327452.1327492)
    De Mauro, A., Greco, M., Grimaldi, M., & Ritala, P. (2018). Human resources for Big Data professions: A systematic classification of job roles and required skill sets. Information Processing & Management, 54(5), 807-817.
    Devopedia. (2020). "CAP Theorem." Version 4, April 30. Accessed 2020-09-14. [https://devopedia.org/cap-theorem](https://devopedia.org/cap-theorem)
    Dixon, J. (2010, October 14). Pentaho, Hadoop, and Data Lakes. James Dixon’s Blog. [https://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes/](https://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes/)
    Feng, Z., Hui-Feng, X., Dong-Sheng, X., Yong-Heng, Z., & Fei, Y. (2013). Big data cleaning algorithms in cloud computing. International Journal of Online Engineering, 9(3), 77–81. [https://doi.org/10.3991/ijoe.v9i3.2765](https://doi.org/10.3991/ijoe.v9i3.2765)
    Gray, J., & Shenoy, P. (2000). Rules of thumb in data engineering. Proceedings - International Conference on Data Engineering, 3–10. [https://doi.org/10.1109/icde.2000.839382](https://doi.org/10.1109/icde.2000.839382)
  • 97.
    Ishwarappa, & Anuradha, J. (2015). A brief introduction on big data 5Vs characteristics and hadoop technology. Procedia Computer Science, 48(C), 319–324. [https://doi.org/10.1016/j.procs.2015.04.188](https://doi.org/10.1016/j.procs.2015.04.188)
    Jimenez-Marquez, J., Gonzalez-Carrasco, I., Lopez-Cuadrado, J., Ruiz-Mezcua, B. (2019). Towards a big data framework for analyzing social media content. International Journal of Information Management, 44, 1-12. [https://dx.doi.org/10.1016/j.ijinfomgt.2018.09.003](https://dx.doi.org/10.1016/j.ijinfomgt.2018.09.003)
    Khazaei, H. (2016). How do I choose the right NoSQL solution? Big Data, X(0), 1–33.
    Laskowski, N. (2016). Data lake governance: A big data do or die. URL: [http://searchcio.techtarget.com/feature/Data-lake-governance-A-big-data-do-or-die](http://searchcio.techtarget.com/feature/Data-lake-governance-A-big-data-do-or-die) (access date 28/05/2016)
    Marz, N., & Warren, J. (2015). Big Data: Principles and best practices of scalable real-time data systems. New York: Manning Publications Co.
    Miao, H., Li, A., Davis, L. S., & Deshpande, A. (2017, April). Towards unified data and lifecycle management for deep learning. In 2017 IEEE 33rd International Conference on Data Engineering (ICDE) (pp. 571-582). IEEE.
    Miloslavskaya, N., & Tolstoy, A. (2016). Big Data, Fast Data and Data Lake Concepts. Procedia Computer Science, 88, 300–305. [https://doi.org/10.1016/j.procs.2016.07.439](https://doi.org/10.1016/j.procs.2016.07.439)
    Najafabadi, M. M., Villanustre, F., Khoshgoftaar, T. M., Seliya, N., Wald, R., & Muharemagic, E. (2015). Deep learning applications and challenges in big data analytics. Journal of Big Data, 2(1), 1.
    Rahm, E., & Do, H. H. (2000). Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 23(4), 3-13.
    Samizadeh, I. (2018, March 15). A brief introduction to two data processing architectures - Lambda and Kappa for Big Data. [https://towardsdatascience.com/a-brief-introduction-to-two-data-processing-architectures-lambda-and-kappa-for-big-data-4f35c28005bb](https://towardsdatascience.com/a-brief-introduction-to-two-data-processing-architectures-lambda-and-kappa-for-big-data-4f35c28005bb)
  • 98.
    Tudorica, B., Bucur, C. (2011). A comparison between several NoSQL databases with comments and notes. 2011 RoEduNet International Conference 10th Edition: Networking in Education and Research, 1, 1-5. [https://dx.doi.org/10.1109/roedunet.2011.5993686](https://dx.doi.org/10.1109/roedunet.2011.5993686)
    Van Der Aalst, W. (2016). Data science in action. In Process mining (pp. 3-23). Springer, Berlin, Heidelberg.
    Vogels, W. (2009). Eventually consistent. Communications of the ACM, 52(1), 40-44. [https://dx.doi.org/10.1145/1435417.1435432](https://dx.doi.org/10.1145/1435417.1435432)
    Wang, R. Y., Kon, H. B., & Madnick, S. E. (1993). Data quality requirements analysis and modeling. Proceedings - International Conference on Data Engineering, 670–677. [https://doi.org/10.1109/icde.1993.344012](https://doi.org/10.1109/icde.1993.344012)
    Yin, S., & Kaynak, O. (2015). Big data for modern industry: challenges and trends [point of view]. Proceedings of the IEEE, 103(2), 143-146.