2. Modern Data Ecosystem
“The constant increase in data
processing speeds and
bandwidth, the nonstop
invention of new tools for
creating, sharing, and consuming
data, and the steady addition of
new data creators and consumers
around the world, ensure that
data growth continues
unabated. Data begets more data
in a constant virtuous cycle.”
Forbes 2020 report on data in the coming decade
ENTERPRISE DATA ENVIOREMENT
• Data integrated from disparate sources
• Different types of analysis and skills to generate
insights
• Active stakeholders to collaborate and act on
insights generated
• Tools, applications, and infrastructure to store,
process, and disseminate data as required.
INTERCONNECTED INDEPENDENT CONTINUALLY EVOLVING
S 1 1 1
3. Data sources
text, images, videos,
clickstreams, user
conversations,
social media
platforms, the
Internet of Things (or
IoT) devices, real-
time events that
stream data, legacy
databases, and data
sourced from
professional data
providers and
agencies
Data is available
in a variety of
structured and
unstructured
datasets.
Pull a copy of the data from the
original sources into a data repository.
At this stage, you’re only looking at acquiring the data you
need—working with data formats, sources, and interfaces
through which this data can be pulled in.
Challenges!
• Reliability
• Security
• Integrity
1 2 3
4. Organized Cleaned up
Optimized for
access by end
users
The data will need to conform to compliances and
standards enforced in the organization. For example:
• Conforming to guidelines that regulate the storage and
use of personal data (health, biometrics, or household
data IOT)
• Master data tables within the organization, to ensure
standardization of master data across all applications
and systems.
Challenges!
• Data management
• Data repositories
that provide
high availability,
flexibility,
accessibility,
and security.
Data management
Enterprise data
repositories
1 2 3
5. Data management
Enterprise data
repositories
Business Stakeholders
APPs
Programmers Analysts
Data science use cases
Users
Challenges!
Interfaces, APIs,
and applications
that can get this
data to the end-
users in line with
their specific
needs.
For example, Data Analysts may need the raw data to
work with, business stakeholders may need reports
and dashboards, applications may need custom APIs
to pull this data.
1 2 3
6. Emerging
technologies
Machine
Learning
Data Scientists are creating
predictive models by training
machine learning algorithms
on past data.
Cloud
computing
Every enterprise today has
access to limitless storage, high
performance computing, open
source technologies, machine
learning technologies, and the
latest tools and libraries.
Big data
Is paving the way for new tools
and techniques and also new
knowledge and insights.
7. Key Players in the Data
Ecosystem
Organizations that are using data
to uncover opportunities and are
applying that knowledge to
differentiate themselves are the
ones leading into the future.
• Identifying patterns in financial transactions
to detect fraud
• Using recommendation engines to drive
conversion
• Mining social media posts for customer
voice, or brands
• Personalizing their offers based on
customer behavior analysis, business
leaders realize that data holds the key to
competitive advantage.
S 1 1 2
Data Engineers, Data Analysts, Data Scientists, Business
Analysts, and Business Intelligence (or BI) Analysts
9. Summary
• Data Engineering converts raw data into usable
data.
• Data Analytics uses this data to generate insights.
• Data Scientists use Data Analytics and Data
Engineering to predict the future using data from
the past.
• Business Analysts and Business Intelligence
Analysts use these insights and predictions to drive
decisions that benefit and grow their business.
10. Functions:
• Develop and maintain
data architectures
• Make data available for
business operations and
analysis
Data Engineers
Data Engineers work within the data ecosystem
to:
• Extract, integrate, and organize data
from disparate sources
• Clean, transform, and prepare data
• Design, store, and manage data in
data repositories.
They enable data to be
accessible in formats and
systems that the various
business applications, as
well as stakeholders like
Data Analysts and Data
Scientists, can utilize.
Skills:
• Good knowledge of programming
• Sound knowledge of systems and technology
architectures
• In-depth understanding of relational
databases and non-relational datastores.
11. Functions:
Translates data and numbers
into plain language, so
organizations can make
decisions.
Responsabilities:
• Inspect, and clean data for deriving insights;
• Identify correlations, find patterns, and apply
statistical methods to analyze and
mine data;
• Visualize data to interpret and present the
findings of data analysis.
“Are the users’ search
experiences generally good or
bad with the search
functionality on our site”
“What is the popular
perception of people regarding
our rebranding initiatives”
“Is there a correlation between
sales of one product and
another."
Skills:
• Good knowledge of spreadsheets, writing
queries, and using statistical tools to create
charts and dashboards.
• Modern data analysts also need to have some
programming skills.
• They need strong analytical and story-telling
skills.
Data Analysts
12. Responsabilities:
• Analyze data for actionable insights
• Build Machine Learning or Deep Learning
models that train on past data to create
predictive models
“How many new social
media followers am I
likely to get next month?”
“What percentage of my
customers am I likely
to lose to competition in
the next quarter”
“Is this financial
transaction unusual for
this customer?”.
Skills:
• Knowledge of Mathematics, Statistics
• Understanding of programming languages,
databases, and building data models.
• They also need to have domain knowledge.
Data Scientists
13. Responsabilities:
Leverage the work of Data
Analysts and Data Scientists to
look at possible implications
their business and the actions
they need to take or
recommend.
Business Analysts Business Intelligence Analysts
Responsabilities:
• Focus on market forces and
external influences that shape
their business.
• Organize and monitor data on
different business functions
• Explore that data to extract
insights and actionables that
improve business performance.
14. What is Data Engineering?
S 1 1 3
• The field of Data Engineering concerns itself with the mechanics for the flow and
access of data
• Its goal is to make quality data available for fact-finding and data-driven
decision making
• The world of data has grown to include wide-ranging sources, structures, and
types of data.
15. Collecting
source data
Processing
data
Storing data
Making data
available to
users securely
The field of Data Engineering involves:
Develop tools, workflows, and processes that help you acquire data from
multiple sources.
• Design, build, and maintain scalable data architecture to store data.
• Data could be stored in databases, data warehouses, data lakes, or any
other type of data repository.
16. Collecting
source data
Processing
data
Storing data
Making data
available to
users securely
The field of Data Engineering involves:
Processing data includes cleaning, transforming, and preparing data so that
it is usable.
• Implement and maintain distributed systems for large-scale processing of
data.
• Design pipelines for the extraction, transformation, and loading of data
into data repositories.
• Design or implement solutions for validating and safeguarding quality,
privacy, and security
• of data.
• Optimize tools, systems, and workflows for performance, reliability, and
scalability.
• Ensure data meets all regulatory and compliance guidelines.
17. Collecting
source data
Processing
data
Storing data
Making data
available to
users securely
The field of Data Engineering involves:
• Architect or implement data stores for the storage of processed data.
• Ensure systems are scalable, keeping in mind the evolving nature of data
and business needs.
• Ensure tools and systems are in place that take care of data privacy,
security, compliance,
• monitoring, backup, and recovery.
18. Collecting
source data
Processing
data
Storing data
Making data
available to
users securely
The field of Data Engineering involves:
• APIs, services, and programs that retrieve data on defined parameters for
use by end-users.
• Interfaces and dashboards that present data to users so they can derive
insights from the data.
• Ensure the right measures and checks and balances are in place to keep
data secure and provide rights-based access to users.
19. Architect:
Skills: To architect
any data
management
system, be it for
collating source
data or storing
processed
analysis-ready
data
To ensure data
stores are available
and optimized for
use, you need to
have expertise in
databases.
Proficiency in
database tools,
programming
languages, and
distributed
systems all
come under
data
engineering, but
they may
require different
skill sets.
20. • More than any other data profession, data
engineering is about the tools and
technologies involved in data manipulation.
• But it is also about understanding the complexities
of data and how it is ultimately leveraged for fact-
finding and decision-making.
Summary
21. S 1 1 4
Viewpoints: Defining Data
Engineering
How do you define data engineering?
Data engineering
involves
•Designing
•Building
•Maintaining data
infrastructures
and platforms
Data infrastructures
include
•Databases
•Big Data
Repositories
• DataPipelines for
transforming and
moving data
between data
systems
Data Engineers
•Develop and
optimize data
systems
•Make data
available for
analysis
Data Analysts
•Analyze data in
data systems to
report and derive
insights
Data Scientists
•Perform deeper
analysis on data
•Develop
predective models
to solve more
complex data
problems
22. How do you define data engineering?
Data Engineers ensure
• Data is highly
available
• Consistent
• Secure
• Recoverable
Data Analysts and
Data Scientists
• Make use of the
data that Data
Engineers provide
Data Engineers work
• Ensure data matches
their needs
23. How do you define data engineering?
Data Engineers involves
•Designing
•Maintaining
•Optimizing data systems
•Selecting the right databases, storage systems and
cloud architecture
•Ensuring data flow inside an organization is
seamless
•Ensuring data can be delivered to stakeholders
when they need it
Data Analysts and Data Scientists
•Make use of data provided by Data Engineers for
purposes of análisis and predictions
How do you define data engineering?
24. How do you define data engineering?
Data Engineers
• Extract and collect raw data
from multiple sources
• Transform and store data in
table form
Data Analysts and Data
Scientists
• Perform analysis on data
prepared by Data Engineers
How do you define data engineering?
25. How do you define data engineering?
Data Engineers
• Precursor to Data
Analysis and Data
Science
Data Engineers enable
Data Analysts and
Data Scientists
• Picking the right
databases and tools
• Building the right
data pipelines
Data Analysts and
Data Scientists use
this data to
• Build reports
• Perform statistical
analysis
How do you define data engineering?
How do you define data engineering?
26. S 1 1 5
Viewpoints: Evolution of
Data Engineering
Can you talk about how data
engineering has evolved over the
past couple of decades?
• Data engineering landscape is very different
• The quantity of data that is handled was
unthinkable
• Variety and formats of data available have
evolved
• Big Data was unheard of
• Expectations have also evolved
• Today, there is widespread use of
automation tools
Can you talk about how data
engineering has evolved over the
past couple of decades?
• With cloud computing, data infrastructure is
now available as a service
• A DE, today, needs to do a lot les from
scratch and spend les time on setting up and
managing systems
• Before: Only relational databases and data
warehouses. Today: NoSQL and other types
of data repositories
• DE need to be able to work with Big Data
systems and pipelines
• DE need to know a variety of tolos, data
systems and how to suse them for solving
data rpoblems
27. Can you talk about how data
engineering has evolved over the
past couple of decades?
• Before: The approach hierarchical – Enterprise Architects of Data Architects made
decisions regarding data storage
• DE work closely with developers to see how best to meet their requirements
• Availability of different types of data sources, such as, the IoT and API feeds
• Variety of databases:
• Wide column stores (Cassandra and Hbase)
• Key-value pair databases
• Hadoop for Big Data
• Doc stores such as MongoDB and Coachbase
• Evolution of data sources and the variety of data have contributed
• Before: Focused on database management, ETL pipelines and data visualization
• Now: Distributed computing, DevOps and Implementation of Machine Learnign models
29. Responsabilities and Skillsets of a
Data Engineer
At a broad level, Data Engineers:
• Extract, organize, and integrate data from disparate sources
• Prepare data for analysis and reporting by transforming and
cleansing it
• Design and manage data pipelines that encompass the
journey of data from source to destination systems
• Setup and manage the infrastructure required for the
ingestion, processing, and storage of data: Data Platforms,
Data Stores, Distributed systems, Data Repositories
S 1 2 1
The overarching responsibility of a data engineer is to
provide analytics-ready data to data consumers
Data is
analytics-ready
when:
• Accurate
• Reliable
• Complies to
regulations it
is governed
• Accessible to
consumers
when they
need it.
32. Experience of working with ETL tools such as IBM
Infosphere Information Server, AWS Glue, and
Improvado.
33.
34. Functional Skills
• The ability to convert business requirements into technical specifications.
• The ability to work with the complete software development lifecycle which includes
ideation, architecture, design, prototyping, testing, deployment, and monitoring.
• An understanding of data’s potential application in business.
• And an understanding of the risks of poor data management which essentially covers
data quality, privacy, security, and compliance.
software
engineering
data science
DATA
ENGI
NEER
ING
35. • Data engineering is a team sport.
• Interpersonal skills, teamwork, and collaboration
• As a data engineer, you need to be able to
communicate effectively with both technical and
non-technical stakeholders in a manner that a clear
understanding can be established.
Soft Skills
DATA
ENGINEER
ANALYSTS
DATA
SCIENTISTS
BUSINESS
USERS
TECHNICAL
TEAMS
36. S 1 2 2
Viewpoints: Skills and
Qualities to be a Data
Engineer
What are some of the technical skills
required to be a successful Data
Engineer?
Data engineering
is
• A fast-moving
technology
Data Engineers
need
• To love data
Technical skills
depend
• On the job role
Experience
required in Retail
• Relational
databases
• Cassandra
• Google
Bigtable
• Kafka Stream
• WebSphere
MQ
In Healthcare or
Socical Media
• Need of
different
skillsets
37. What are some of the technical skills
required to be a successful Data
Engineer?
• Acquiring technical skills is the easier part of
the overall skills required
• Important to understant the basic theory
about data structures and how to work with
data
• Willing to roll with change
What are some of the technical skills
required to be a successful Data
Engineer?
• Good understanding of networking, both
LAN and WAN
• Good understanding of different types of
storage, local as well as cloud-based
• Very Good knowledge of operating systems
• Expertise in databases, both RDBMS and
NoSQL
• Programming skills in any language such as
Java, C, or Python
• Automation skills
38. Knowledge of
•One or more
programming
languages and
operating systems
Knowledge of IT
infrastructure and
landscape:
•Computer
architecture
•Cloud
•Virtual machines
•Different types of
storage
Proficiency in SQL Working knowledge of
•One or more
databases, both
relational and NoSQL
Knowledge and
experience of:
•ETL
•Data Pipelines
•Data Warehouses
•Data Lakes
•Other Big Data
systems
What are some of the technical skills required
to be a successful Data Engineer?
39. • Curiosity
• Questioning skills – asking questions to business and
technical users
• Enjoy problema-solving and troubleshooting
• Being Good at teamwork, collaboration, and
communication
• Being Good at logic and having an interest in coding
What are some of the qualities and soft skills
required to be a successful Data Engineer?
• Good problem solving and communication skills
• Detail-oriented control-freak
• There is a lot of detail involved in working with data
• Its important to get into the details and make sure that
you don´t leave any of the boxes unchecked
• Being in control and aware of the choices that are being
made in the data environment, and their trade-offs
• Interact with developers
• Defend your choices to management
• Advocate for the choices you make
• Good work ethic
• Passion to learn
What are some of the qualities and soft skills
required to be a successful Data Engineer?
What are some of the qualities and soft skills
required to be a successful Data Engineer?
What are some of the qualities and soft skills
required to be a successful Data Engineer?
• Curiosity
• Good communication
• Eagerness to learn
What are some of the qualities and soft skills
required to be a successful Data Engineer?
40. S 1 2 3
Viewpoints: A Day in the
Life of a Data Engineer
The need
Launching a new shampoo in the
market
How a product gets talked about on social media has an immediate impact on sales numbers
and brand perception (customer sentiment).
They want to keep a close eye on all social media and know what online communities say about
their product—positive feedback, negative comments, suggestions, even comparisons with
their current favorites.
41. The Goal • Prototype of a dashboard using a
sentiment analysis algorithm and
dummy data to serve this goal.
• The dashboard displayed graphs of
different customer sentiments
plotted as scores across social media
sources and consumer
demographics.
• The idea was an instant hit and got a
go-ahead for implementation by the
business team.
• This is where our team of data
engineers stepped in—to make this
prototype a reality.
42. The Solution
My first task was to pull in data from all the social
media sites and online sources identified by the
business teams into the organization’s
environment.
I started with APIs to pull the tweets and posts with
our product’s hashtag into temporary storage. I
then turned to the eCommerce portals and product
review blogs to collect our product-specific data
and move that into temporary storage as well—an
activity known as web scraping. Tweets, posts,
comments, articles, even memes—there was quite a
mix of formats and structures.
Once I had all the data I needed, I inspected it to
assess the transformations that will need to be
performed on this data before it can be loaded into
the database.
We decided to create a Python program for
processing the data and loading it into
the database. I cleaned up the data and
transformed it into a format that would be stored in
the database. This is the database from which the
dashboard is pulling the data to display the reports.
43. Next Steps
Build a data pipeline that extracts, transforms, and loads the data on an
ongoing basis.
Once this process is in place, business teams will be able to see a real-time
projection each time they log into the dashboard.