Introduction to Data Analytics and data analytics life cycle

F
e
b
r
u
a
r
y
1
9
,
2
0
2
4
/
D
r
.
R
S
Data Analytics (KIT-601)
Unit-1: Introduction to Data Analytics & Data
Analytics Lifecycle
Dr. Radhey Shyam
Professor
Department of Information Technology
SRMCEM Lucknow
(Affiliated to Dr. A.P.J. Abdul Kalam Technical University, Lucknow)
Unit-1 has been prepared and compiled by Dr. Radhey Shyam, with grateful acknowledgment to those who
made their course contents freely available or (Contributed directly or indirectly). Feel free to use this
study material for your own academic purposes. For any query, communication can be made through this
email : shyam0058@gmail.com.
February 19, 2024

Data Analytics (KIT 601)
Course Outcome ( CO) Bloom’s Knowledge Level (KL)
At the end of course , the student will be able to
CO 1 Discuss various concepts of data analytics pipeline K1, K2
CO 2 Apply classification and regression techniques K3
CO 3 Explain and apply mining techniques on streaming data K2, K3
CO 4 Compare different clustering and frequent pattern mining algorithms K4
CO 5 Describe the concept of R programming and implement analytics on Big data using R. K2,K3
DETAILED SYLLABUS 3-0-0
Unit Topic Proposed
Lecture
I
Introduction to Data Analytics: Sources and nature of data, classification of data
(structured, semi-structured, unstructured), characteristics of data, introduction to Big Data
platform, need of data analytics, evolution of analytic scalability, analytic process and
tools, analysis vs reporting, modern data analytic tools, applications of data analytics.
Data Analytics Lifecycle: Need, key roles for successful analytic projects, various phases
of data analytics lifecycle – discovery, data preparation, model planning, model building,
communicating results, operationalization.
08
II
Data Analysis: Regression modeling, multivariate analysis, Bayesian modeling, inference
and Bayesian networks, support vector and kernel methods, analysis of time series: linear
systems analysis & nonlinear dynamics, rule induction, neural networks: learning and
generalisation, competitive learning, principal component analysis and neural networks,
fuzzy logic: extracting fuzzy models from data, fuzzy decision trees, stochastic search
methods.
08
III
Mining Data Streams: Introduction to streams concepts, stream data model and
architecture, stream computing, sampling data in a stream, filtering streams, counting
distinct elements in a stream, estimating moments, counting oneness in a window,
decaying window, Real-time Analytics Platform ( RTAP) applications, Case studies – real
time sentiment analysis, stock market predictions.
08
IV
Frequent Itemsets and Clustering: Mining frequent itemsets, market based modelling,
Apriori algorithm, handling large data sets in main memory, limited pass algorithm,
counting frequent itemsets in a stream, clustering techniques: hierarchical, K-means,
clustering high dimensional data, CLIQUE and ProCLUS, frequent pattern based clustering
methods, clustering in non-euclidean space, clustering for streams and parallelism.
08
V
Frame Works and Visualization: MapReduce, Hadoop, Pig, Hive, HBase, MapR,
Sharding, NoSQL Databases, S3, Hadoop Distributed File Systems, Visualization: visual
data analysis techniques, interaction techniques, systems and applications.
Introduction to R - R graphical user interfaces, data import and export, attribute and data
types, descriptive statistics, exploratory data analysis, visualization before analysis,
analytics for unstructured data.
08
Text books and References:
1. Michael Berthold, David J. Hand, Intelligent Data Analysis, Springer
2. Anand Rajaraman and Jeffrey David Ullman, Mining of Massive Datasets, Cambridge University Press.
3. John Garrett,Data Analytics for IT Networks : Developing Innovative Use Cases, Pearson Education
Curriculum & Evaluation Scheme IT & CSI (V & VI semester) 23
F
e
b
r
u
a
r
y
1
9
,
2
0
2
4
/
D
r
.
R
S

F
e
b
r
u
a
r
y
1
9
,
2
0
2
4
/
D
r
.
R
S
Part-I: Introduction to Data Analytics
1 Introduction To Big Data
What Is Big Data?
ˆ Big Data is often described as extremely large data sets that have grown beyond the ability to manage
and analyze them with traditional data processing tools.
ˆ Big Data defines a situation in which data sets have grown to such enormous sizes that conventional
information technologies can no longer effectively handle either the size of the data set or the scale and
growth of the data set.
ˆ In other words, the data set has grown so large that it is difficult to manage and even harder to garner
value out of it.
ˆ The primary difficulties are the acquisition, storage, searching, sharing, analytics, and visual-
ization of data.
ˆ Big Data has its roots in the scientific and medical communities, where the complex analysis of massive
amounts of data has been done for drug development, physics modeling, and other forms of research,
all of which involve large data sets.
These 4Vs (See Figure 1) [13] 1
of Big Data lay out the path to analytics, with each having intrinsic value
in the process of discovering value. Nevertheless, the complexity of Big Data does not end with just four
1Volume—Organizations collect data from a variety of sources, including transactions, smart (IoT) devices, industrial
equipment, videos, images, audio, social media and more. In the past, storing all that data would have been too costly – but
cheaper storage using data lakes, Hadoop and the cloud have eased the burden.
Velocity—With the growth in the Internet of Things, data streams into businesses at an unprecedented speed and must be
handled in a timely manner. RFID tags, sensors and smart meters are driving the need to deal with these torrents of data in
near-real time.
Variety—Data comes in all types of formats – from structured, numeric data in traditional databases to unstructured text
documents, emails, videos, audios, stock ticker data and financial transactions.
Veracity—Veracity refers to the quality of data. Because data comes from so many different sources, it’s difficult to link,
match, cleanse and transform data across systems. Businesses need to connect and correlate relationships, hierarchies and
multiple data linkages. Otherwise, their data can quickly spiral out of control.
Value—This refers to the value that the big data can provide and it relates directly to what organizations can do with that
collected data. It is often quantified as the potential social or economic value that the data might create.
3

F
e
b
r
u
a
r
y
1
9
,
2
0
2
4
/
D
r
.
R
S
Figure 1: Illustration of Big Data [14].
dimensions. There are other factors at work as well: the processes that Big Data drives. These processes
are a conglomeration of technologies and analytics that are used to define the value of data sources, which
translates to actionable elements that move businesses forward.
Many of those technologies or concepts are not new but have come to fall under the umbrella of Big
Volatility—It deals with “How long the data is valid?”
Validity—It refers to accuracy and correctness of data. Any data picked up for analysis needs to be accurate.
Variability—In addition to the increasing velocities and varieties of data, data flows are unpredictable – changing often and
varying greatly. It’s challenging, but businesses need to know when something is trending in social media, and how to manage
daily, seasonal and event-triggered peak data loads.
4

F
e
b
r
u
a
r
y
1
9
,
2
0
2
4
/
D
r
.
R
S
Data. Best defined as analysis categories, these technologies and concepts include the following:
Traditional business intelligence (BI): This consists of a broad category of applications and tech-
nologies for gathering, storing, analyzing, and providing access to data. BI delivers actionable information,
which helps enterprise users make better business decisions using fact-based support systems. BI works by
using an in-depth analysis of detailed business data, provided by databases, application data, and other
tangible data sources.
In some circles, BI can provide historical, current, and predictive views of business operations.
Data mining: This is a process in which data are analyzed from different perspectives and then turned
into summary data that are deemed useful. Data mining is normally used with data at rest or with archival
data. Data mining techniques focus on modeling and knowledge discovery for predictive, rather than purely
descriptive, purposes—an ideal process for uncovering new patterns from large data sets.
Statistical applications: These look at data using algorithms based on statistical principles and nor-
mally concentrate on data sets related to polls, census, and other static data sets. Statistical applications
ideally deliver sample observations that can be used to study populated data sets for the purpose of esti-
mating, testing, and predictive analysis. Empirical data, such as surveys and experimental reporting, are
the primary sources for analyzable information.
Predictive analysis: This is a subset of statistical applications in which data sets are examined to
come up with predictions, based on trends and information gleaned from databases. Predictive analysis
tends to be big in the financial and scientific worlds, where trending tends to drive predictions, once external
elements are added to the data set. One of the main goals of predictive analysis is to identify the risks and
opportunities for business process, markets, and manufacturing.
Data modeling: This is a conceptual application of analytics in which multiple “what-if” scenarios
can be applied via algorithms to multiple data sets. Ideally, the modeled information changes based on the
information made available to the algorithms, which then provide insight to the effects of the change on the
data sets. Data modeling works hand in hand with data visualization, in which uncovering information can
5

F
e
b
r
u
a
r
y
1
9
,
2
0
2
4
/
D
r
.
R
S
help with a particular business endeavor.
The preceding analysis categories constitute only a portion of where Big Data is headed and why it has
intrinsic value to business. That value is driven by the never ending quest for a competitive advantage,
encouraging organizations to turn to large repositories of corporate and external data to uncover trends,
statistics, and other actionable information to help them decide on their next move. This has helped the
concept of Big Data to gain popularity with technologists and executives alike, along with its associated
tools, platforms, and analytics.
1.1 ARRIVAL OF ANALYTICS
ˆ As analytics and research were applied to large data sets, scientists came to the conclusion that more
is better—in this case, more data, more analysis, and more results.
ˆ Researchers started to incorporate related data sets, unstructured data, archival data, and
real-time data into the process
ˆ In the business world, Big Data is all about opportunity.
ˆ According to IBM, every day we create 2.5 quintillion (2.5 × 1018
) bytes of data, so much that 90
percent of the data in the world today has been created in the last two years.
ˆ These data come from everywhere: sensors used to gather climate information, posts to so-
cial media sites, digital pictures and videos posted online, transaction records of online
purchases, and cell phone GPS signals, to name just a few.
ˆ That is the catalyst for Big Data, along with the more important fact that all of these data have
intrinsic value that can be extrapolated using analytics, algorithms, and other techniques.
ˆ National Oceanic and Atmospheric Administration (NOAA) uses Big Data approaches to aid in climate,
ecosystem, weather, and commercial research, while National Aeronautics and Space Administration
(NASA) uses Big Data for aeronautical and other research. Pharmaceutical companies and energy
companies have leveraged Big Data for more tangible results, such as drug testing and geophysical
analysis.
6

F
e
b
r
u
a
r
y
1
9
,
2
0
2
4
/
D
r
.
R
S
ˆ New York Times has used Big Data tools for text analysis and Web mining, while the Walt Disney
Company uses them to correlate and understand customer behavior in all of its stores, theme parks,
and Web properties.
ˆ Big Data is full of challenges, ranging from the technical to the conceptual to the operational, any of
which can derail the ability to discover value and leverage what Big Data is all about.
2 Characteristics of Data
ˆ Data is a collection of details in the form of either figures or texts or symbols, or descriptions etc.
ˆ Data contains raw figures and facts. Information unlike data provides insights analyzed through the
data collected. Data has three characteristics:
1. Composition:— The composition of data deals with the structure of data, i.e; the sources of
data, the granularity, the types and nature of data as to whether it is static or real time streaming.
2. Condition:—The condition of data deals with the state of data, i.e; “Can one use this data as is
for analysis?” or “Does it require cleaning for further enhancement and enrichment?” data?”
3. Context:— The context of data deals with “Where has this data been generated?”. “Why was
this data generated?”, “How sensitive is this data?”,“What are the events associated with this”.
3 Data Classification
The volume and overall size of the data set is only one portion of the Big Data equation. There is a growing
consensus that both semi-structured and unstructured data sources contain business-critical information and
must therefore be made accessible for both BI and operational needs. It is also clear that the amount of
relevant unstructured business data is not only growing but will continue to grow for the foreseeable future.
Data can be classified under several categories:
1. Structured data:—Structured data are normally found in traditional databases (SQL or others)
where data are organized into tables based on defined business rules. Structured data usually prove
to be the easiest type of data to work with, simply because the data are defined and indexed, making
access and filtering easier. For example, Database, Spread sheets, OLTP systems, etc.
7

F
e
b
r
u
a
r
y
1
9
,
2
0
2
4
/
D
r
.
R
S
2. Semi-structured data:—Sem-istructured data fall between unstructured and structured data. Semi-
structured data do not have a formal structure like a database with tables and relationships. However,
unlike unstructured data, semi-structured data have tags or other markers to separate the elements
and provide a hierarchy of records and fields, which define the data. For example, XML, JSON, E-mail,
etc.
3. Unstructured data:—unstructured data, in contrast, normally have no BI behind them. Unstruc-
tured data are not organized into tables and cannot be natively used by applications or interpreted
by a database. A good example of unstructured data would be a collection of binary image files.
For example, memos, chat-rooms, PowerPoint presentations, images, videos, letters, researches, white
papers, body of an email, etc.
4 Introduction to Big Data Platform
Big data platforms refer to software technologies that are designed to manage and process large volumes of
data, often in real-time or near-real-time. These platforms are typically used by businesses and organizations
that generate or collect massive amounts of data, such as social media companies, financial institutions, and
healthcare providers.
There are several key components of big data platforms, including:
ˆ Data storage: Big data platforms provide large-scale data storage capabilities, often utilizing dis-
tributed file systems or NoSQL 2
databases to accommodate large amounts of data.
ˆ Data processing: Big data platforms offer powerful data processing capabilities, often utilizing par-
allel processing, distributed computing, and real-time streaming processing to analyze and transform
data.
ˆ Data analytics: Big data platforms provide advanced analytics capabilities, often utilizing machine
learning algorithms, statistical models, and visualization tools to extract insights from large datasets.
ˆ Data integration: Big data platforms allow for integration with other data sources, such as databases,
APIs, and streaming data sources, to provide a unified view of data.
2To overcome the rigidity of normalized RDBMS schemas, big data system accepts NoSQL. NOSQL is a method to manage
and store unstructured and non-relational data, also known as “Not Only SQL” [15], for example, HBase database.
8

F
e
b
r
u
a
r
y
1
9
,
2
0
2
4
/
D
r
.
R
S
Some of the most popular big data platforms include Hadoop, Apache Spark, Apache Cassandra, Apache
Storm, and Apache Kafka. These platforms are open source and freely available, making them accessible to
organizations of all sizes.
5 Need of Data Analytics
Data analytics is the process of examining and analyzing large sets of data to uncover useful insights, patterns,
and trends. There are several reasons why organizations and businesses need data analytics:
1. Better decision-making: Data analytics can provide valuable insights that enable organizations to
make better-informed decisions. By analyzing data, organizations can identify patterns and trends
that may not be visible through intuition or traditional methods of analysis.
2. Improved efficiency: Data analytics can help organizations optimize their operations and improve
efficiency. By analyzing data on business processes, organizations can identify areas for improvement
and streamline operations to reduce costs and increase productivity.
3. Enhanced customer experience: Data analytics can help organizations gain a better understanding
of their customers and their preferences. By analyzing customer data, organizations can tailor their
products and services to better meet customer needs, resulting in a more satisfying customer experience.
4. Competitive advantage: Data analytics can provide organizations with a competitive advantage
by enabling them to make better-informed decisions and identify new opportunities for growth. By
leveraging data analytics, organizations can stay ahead of their competitors and position themselves
for success.
5. Risk management: Data analytics can help organizations identify potential risks and mitigate them
before they become major issues. By analyzing data on business processes and operations, organizations
can identify potential areas of risk and take steps to prevent them from occurring.
In summary, data analytics is essential for organizations looking to improve their decision-making, ef-
ficiency, customer experience, competitive advantage, and risk management. By leveraging the insights
provided by data analytics, organizations can stay ahead of the curve and position themselves for long-term
success.
9

F
e
b
r
u
a
r
y
1
9
,
2
0
2
4
/
D
r
.
R
S
6 Evolution of Data Analytics Scalability
The evolution of data analytics scalability has been driven by the need to process and analyze ever-increasing
volumes of data. Here are some of the key stages in the evolution of data analytics scalability:
1. Traditional databases: In the early days of data analytics, traditional databases were used to store
and analyze data. These databases were limited in their ability to handle large volumes of data, which
made them unsuitable for many analytics use cases.
2. Data warehouses: To address the limitations of traditional databases, data warehouses were devel-
oped in the 1990s. Data warehouses were designed to store and manage large volumes of structured
data, providing a more scalable solution for data analytics.
3. Hadoop and MapReduce: In the mid-2000s, Hadoop and MapReduce were developed as open-
source solutions for big data processing. These technologies enabled organizations to store and analyze
massive volumes of data in a distributed computing environment, making data analytics more scalable
and cost-effective.
4. Cloud computing: With the rise of cloud computing in the 2010s, organizations were able to scale
their data analytics infrastructure more easily and cost-effectively. Cloud-based data analytics plat-
forms such as Amazon Web Services (AWS) and Microsoft Azure provided scalable storage and pro-
cessing capabilities for big data.
5. Real-time analytics: With the growth of the Internet of Things (IoT) and other real-time data
sources, the need for real-time analytics capabilities became increasingly important. Technologies such
as Apache Kafka and Apache Spark Streaming were developed to enable real-time processing and
analysis of streaming data.
6. Machine learning and AI: In recent years, machine learning and artificial intelligence (AI) have
become key components of data analytics scalability. These technologies enable organizations to analyze
and make predictions based on massive volumes of data, providing valuable insights for decision-making
and business optimization.
Overall, the evolution of data analytics scalability has been driven by the need to process and analyze
increasingly large and complex datasets. With the development of new technologies and approaches, orga-
nizations are now able to derive insights from data at a scale that would have been unimaginable just a few
10

F
e
b
r
u
a
r
y
1
9
,
2
0
2
4
/
D
r
.
R
S
decades ago.
Figure 2: Illustration of types of analytics.
7 What is Data Analytics?
Data analytics is the process of examining large sets of data to extract insights, identify patterns, and make
informed decisions. It involves using various techniques, including statistical analysis, machine learning, and
data visualization, to analyze data and draw conclusions from it.
Data analytics can be applied to different types of data, including structured data (e.g., data stored in
databases) and unstructured data (e.g., social media posts, emails, and images). The goal of data analytics
is to turn raw data into meaningful and actionable insights that can help organizations make better decisions
and improve their operations.
Data analytics is used in many different fields, including business, healthcare, finance, marketing, and
social sciences. It can help businesses identify opportunities for growth, optimize their marketing strategies,
reduce costs, and improve customer experiences. In healthcare, data analytics can be used to predict and
prevent diseases, improve patient outcomes, and optimize resource allocation.
Overall, data analytics is a powerful tool that enables organizations to make informed decisions and gain
11

F
e
b
r
u
a
r
y
1
9
,
2
0
2
4
/
D
r
.
R
S
a competitive edge in today’s data-driven world.
7.1 Types of Data Analytics
There are five types of data analytics (See Figure 2):
1. Descriptive Analytics:—what is happening in your business? it gives us only insight about every-
thing is going well or not in our business without explaining the root cause.
2. Diagnostic Analytics:—why it is happening in your business? it explain the root cause behind the
outcome of descriptive analytic.
3. Predictive Analytics:—explains what likely to happen in the future based on previous trends and
patterns. By utilizing various statistical and machine learning algorithms to provide recommendations
and provide answers to questions related to what might happen in the future, that can be answer BI.
4. Prescriptive Analytics:—helps you to determine the best course of action to choose to bypass or
eliminate future issues. You can use prescriptive analytics to advise users on possible outcomes and
what should they do to maximize their key metrics i.e., business metrics.
5. Cognitive Analytics:—it combines a number of intelligent techniques like AI, ML, DL, etc. to apply
human brain like intelligence to perform certain task.
8 Analytic processes and tools
There are several analytic processes and tools used in data analytics to extract insights from data. Here are
some of the most commonly used:
1. Data collection: This involves gathering relevant data from various sources, including databases,
data warehouses, and data lakes.
2. Data cleaning: Once the data is collected, it needs to be cleaned and preprocessed to remove any
errors, duplicates, or inconsistencies.
3. Data integration: This involves combining data from different sources into a single, unified dataset
that can be used for analysis.
12

F
e
b
r
u
a
r
y
1
9
,
2
0
2
4
/
D
r
.
R
S
4. Data analysis: This is the core of data analytics, where various techniques such as statistical analysis,
machine learning, and data mining are used to extract insights from the data.
5. Data visualization: Once the data has been analyzed, it is often visualized using graphs, charts, and
other visual aids to make it easier to understand and communicate the findings.
6. Business intelligence (BI) tools: These are software tools that help organizations make sense of
their data by providing dashboards, reports, and other tools for data visualization and analysis.
7. Big data tools: These are specialized tools designed to handle large volumes of data and process it
efficiently. Examples include Apache Hadoop, Apache Spark, and Apache Storm.
8. Machine learning tools: These are tools that use algorithms to learn from data and make predictions
or decisions based on that learning. Examples include scikit-learn, TensorFlow, and Keras.
Overall, the tools and processes used in data analytics are constantly evolving, driven by advances in tech-
nology and the increasing demand for data-driven insights in various industries.
9 Analysis vs Reporting
Analysis and reporting are two important aspects of data management and interpretation, but they serve
different purposes.
Reporting involves the presentation of information in a standardized format, typically using charts,
graphs, or tables. The purpose of reporting is to provide a clear and concise overview of data and to
communicate key insights to stakeholders. Reporting is often used to provide regular updates on business
performance, highlight trends, or share key metrics with stakeholders.
Analysis, on the other hand, involves the exploration and interpretation of data to gain insights and make
informed decisions. Analysis involves digging deeper into the data to identify patterns, relationships, and
trends that may not be immediately apparent from simple reporting. Analysis often involves using statistical
techniques, modeling, and machine learning to extract insights from the data.
In summary, reporting is focused on presenting data in a clear and concise way, while analysis is focused
on exploring and interpreting the data to gain insights and make decisions. Both reporting and analysis are
important for effective data management, but they serve different purposes and require different skills and
tools.
13

F
e
b
r
u
a
r
y
1
9
,
2
0
2
4
/
D
r
.
R
S
10 Modern Data Analytic Tools
There are many modern data analytic tools available today that are designed to help organizations analyze
and interpret large volumes of data. Here are some of the most popular ones:
1. Tableau: This is a popular data visualization tool that allows users to create interactive dashboards
and reports from their data. It supports a wide range of data sources and is used by many organizations
to quickly visualize and explore data.
2. Power BI: This is a business analytics service provided by Microsoft that allows users to create
interactive visualizations and reports from their data. It integrates with other Microsoft products like
Excel and SharePoint, making it a popular choice for organizations that use these tools.
3. Google Analytics: This is a free web analytics service provided by Google that allows users to track
and analyze website traffic. It provides a wealth of data on user behavior, including pageviews, bounce
rates, and conversion rates.
4. Apache Spark: This is a fast and powerful open-source data processing engine that can be used for
large-scale data processing, machine learning, and graph processing. It supports multiple programming
languages, including Java, Scala, and Python.
5. Python: This is a popular programming language for data analysis and machine learning. It has a
large and active community that has developed many libraries and tools for data analysis, including
pandas, NumPy, and scikit-learn.
6. R: This is another popular programming language for data analysis and statistical computing. It has a
large library of statistical and graphical techniques and is used by many researchers and data analysts.
Overall, these are just a few examples of the many modern data analytic tools available today. Organizations
can choose the tools that best fit their needs and use them to gain insights and make informed decisions
based on their data.
11 Applications of Data Analytics
Data analytics has a wide range of applications across industries and organizations. Here are some of the
most common applications:
14

F
e
b
r
u
a
r
y
1
9
,
2
0
2
4
/
D
r
.
R
S
1. Business intelligence: Data analytics is used to analyze data and generate insights that help orga-
nizations make data-driven decisions. Business intelligence tools and techniques are used to track key
performance indicators (KPIs), monitor business processes, and identify trends and patterns.
2. Marketing: Data analytics is used to analyze customer behavior, preferences, and demographics to de-
velop targeted marketing campaigns. This includes analyzing website traffic, social media engagement,
and email marketing campaigns.
3. Healthcare: Data analytics is used in healthcare to analyze patient data and improve patient out-
comes. This includes analyzing electronic health records (EHRs) to identify disease patterns and
improve treatment plans, as well as analyzing clinical trial data to develop new treatments and drugs.
4. Finance: Data analytics is used in finance to analyze financial data and identify trends and patterns.
This includes analyzing stock prices, predicting market trends, and identifying fraudulent activity.
5. Manufacturing: Data analytics is used in manufacturing to optimize production processes and im-
prove product quality. This includes analyzing sensor data from production lines, predicting equipment
failures, and identifying quality issues.
6. Human resources: Data analytics is used in human resources to analyze employee data and identify
areas for improvement. This includes analyzing employee performance, identifying training needs, and
predicting employee turnover.
7. Transportation: Data analytics is used in transportation to optimize logistics and improve cus-
tomer service. This includes analyzing shipping data to optimize routes and delivery times, as well as
analyzing customer data to improve the customer experience.
Overall, data analytics has a wide range of applications across industries and organizations, and is increas-
ingly seen as a critical tool for success in the modern business world.
15

F
e
b
r
u
a
r
y
1
9
,
2
0
2
4
/
D
r
.
R
S
Part-II: Data Analytics Life-cycle
1 What is Data Analytics Life Cycle?
Data is precious in today’s digital environment. It goes through several life stages, including creation,
testing, processing, consumption, and reuse. These stages are mapped out in the Data Analytics Life Cycle
for professionals working on data analytics initiatives. Each stage has its significance and characteristics.
1.1 key roles for successful analytic projects
There are several key roles that are essential for successful analytic projects. These roles are:
ˆ Project Sponsor: The project sponsor is the person who champions the project and is responsible
for securing funding and resources. They are the driving force behind the project and are accountable
for its success.
ˆ Project Manager: The project manager is responsible for the overall planning, coordination, and
execution of the project. They ensure that the project is completed on time, within budget, and meets
the required quality standards.
ˆ Data Analyst: The data analyst is responsible for collecting, analyzing, and interpreting data. They
use statistical methods and software tools to identify patterns and relationships in the data, and to
develop insights and recommendations.
ˆ Data Scientist: The data scientist is responsible for developing predictive models and algorithms.
They use machine learning and other advanced techniques to analyze complex data sets and to uncover
hidden patterns and trends.
ˆ Subject Matter Expert: The subject matter expert (SME) is an individual who has deep knowledge
and expertise in a particular domain. They provide insights into the context and meaning of the data,
and help to ensure that the project aligns with the business objectives.
ˆ IT Specialist: The IT specialist is responsible for managing the technical infrastructure that supports
the project. They ensure that the necessary hardware and software are in place, and that the system
is secure, scalable, and reliable.
16

F
e
b
r
u
a
r
y
1
9
,
2
0
2
4
/
D
r
.
R
S
ˆ Business Analyst: The business analyst is responsible for understanding the business requirements
and translating them into technical specifications. They work closely with the project manager and
data analyst to ensure that the project meets the needs of the business.
ˆ Quality Assurance Specialist: The quality assurance specialist is responsible for testing the project
deliverables to ensure that they meet the required quality standards. They perform various tests and
evaluations to identify defects and ensure that the system is functioning as intended.
Each of these roles is essential for the success of analytic projects, and the team must work together closely
to achieve the project objectives.
1.2 Importance of Data Analytics Life Cycle
In today’s digital-first world, data is of immense importance. It undergoes various stages throughout its life,
during its creation, testing, processing, consumption, and reuse. Data Analytics Lifecycle maps out these
stages for professionals working on data analytics projects. These phases are arranged in a circular structure
that forms a Data Analytics Lifecycle. (See Figure 3). Each step has its significance and characteristics.
The Data Analytics Lifecycle is designed to be used with significant big data projects. It is used to
portray the actual project correctly; the cycle is iterative. A step-by-step technique is needed to arrange the
actions and tasks involved in gathering, processing, analyzing, and reusing data to explore the various needs
for assessing the information on big data. Data analysis is modifying, processing, and cleaning raw data to
obtain useful, significant information that supports business decision-making.
1.3 Data Analytics Lifecycle Phases
There’s no defined structure of the phases in the life cycle of Data Analytics; thus, there may not be
uniformity in these steps. There can be some data professionals that follow additional steps, while there
may be some who skip some stages altogether or work on different phases simultaneously. Let us discuss the
various phases of the data analytics life cycle.
This guide talks about the fundamental phases of each data analytics process. Hence, they are more likely
to be present in most data analytics projects’ lifecycles. The Data Analytics lifecycle primarily consists of 6
phases.
17

F
e
b
r
u
a
r
y
1
9
,
2
0
2
4
/
D
r
.
R
S
Figure 3: Illustration of phases of data analytics lifecycle [12].
1.3.1 Phase 1: Data Discovery
This phase is all about defining the data’s purpose and how to achieve it by the end of the data analytics
lifecycle. The stage consists of identifying critical objectives a business is trying to discover by mapping out
the data. During this process, the team learns about the business domain and checks whether the business
unit or organization has worked on similar projects to refer to any learnings.
The team also evaluates technology, people, data, and time in this phase. For example, the team can
use Excel while dealing with a small dataset. However, heftier tasks demand more rigid tools for data
preparation and exploration. The team will need to use Python, R, Tableau Desktop or Tableau Prep, and
other data-cleaning tools in such scenarios.
This phase’s critical activities include framing the business problem, formulating initial hypotheses to
test, and beginning data learning.
18

F
e
b
r
u
a
r
y
1
9
,
2
0
2
4
/
D
r
.
R
S
1.3.2 Phase 2: Data Preparation
In this phase, the experts’ focus shifts from business requirements to information requirements. One of the
essential aspects of this phase is ensuring data availability for processing. The stage encompasses collecting,
processing, and cleansing the accumulated data.
1.3.3 Phase 3: Model Planning
This phase needs the availability of an analytic sandbox for the team to work with data and perform analytics
throughout the project duration. The team can load data in several ways.
ˆ Extract, Transform, Load (ETL) – It transforms the data based on a set of business rules before loading
it into the sandbox.
ˆ Extract, Load, Transform (ELT) – It loads the data into the sandbox and then transforms it based on
a set of business rules.
ˆ Extract, Transform, Load, Transform (ETLT) – It’s the combination of ETL and ELT and has two
transformation levels.
The team identifies variables for categorizing data, and identifies and amends data errors. Data errors can
be anything, including missing data, illogical values, duplicates, and spelling errors. For example, the team
imputes the average data score for categories for missing values. It enables more efficient data processing
without skewing the data.
After cleaning the data, the team determines the techniques, methods, and workflow for building a model
in the next phase. The team explores the data, identifies relations between data points to select the key
variables, and eventually devises a suitable model.
1.3.4 Phase 4: Model Building
The team develops testing, training, and production datasets in this phase. Further, the team builds and
executes models meticulously as planned during the model planning phase. They test data and try to find out
answers to the given objectives. They use various statistical modeling methods such as regression techniques,
decision trees, random forest modeling, and neural networks and perform a trial run to determine whether
it corresponds to the datasets.
19

F
e
b
r
u
a
r
y
1
9
,
2
0
2
4
/
D
r
.
R
S
1.3.5 Phase 5: Communication and Publication of Results
This phase aims to determine whether the project results are a success or failure and start collaborating
with significant stakeholders. The team identifies the vital findings of their analysis, measures the associated
business value, and creates a summarized narrative to convey the stakeholders’ results.
1.3.6 Phase 6: Operationalize/Measuring of Effectiveness
In this final phase, the team presents an in-depth report with coding, briefing, key findings, and technical
documents and papers to the stakeholders. Besides this, the data is moved to a live environment and
monitored to measure the analysis’s effectiveness. If the findings are in line with the objective, the results
and reports are finalized. On the other hand, if they deviate from the set intent, the team moves backward
in the lifecycle to any previous phase to change the input and get a different outcome.
1.4 Data Analytics Lifecycle Example
Consider an example of a retail store chain that wants to optimize its products’ prices to boost its revenue.
The store chain has thousands of products over hundreds of outlets, making it a highly complex scenario.
Once you identify the store chain’s objective, you find the data you need, prepare it, and go through the
Data Analytics lifecycle process.
You observe different types of customers, such as ordinary customers and customers like contractors who
buy in bulk. According to you, treating various types of customers differently can give you the solution.
However, you don’t have enough information about it and need to discuss this with the client team.
In this case, you need to get the definition, find data, and conduct hypothesis testing to check whether
various customer types impact the model results and get the right output. Once you are convinced with the
model results, you can deploy the model, and integrate it into the business, and you are all set to deploy the
prices you think are the most optimal across the outlets of the store.
20

F
e
b
r
u
a
r
y
1
9
,
2
0
2
4
/
D
r
.
R
S
*Additional Study Material*
21

U N I T 1
INTRODUCTION TO
BIG DATAANALYTICS
WHAT IS BIG DATA
Big Data is a collection of data that is huge in volume, yet growing exponentially with time. It is a data
with so large size and complexity that none of traditional data management tools can store it or
process it efficiently. Big data is also a data but with huge size.
WHAT IS BIG DATA
• As Gartner defines it – “Big Data are high volume, high velocity, or high-variety information assets
that require new forms of processing to enable enhanced decision making, insight discovery, and
process optimization.”
• The term ‘big data’ is self-explanatory − a collection of huge data sets that normal computing
techniques cannot process.
• The term not only refers to the data, but also to the various frameworks, tools, and techniques
involved.
• Technological advancement and the advent of new channels of communication (like social
networking) and new, stronger devices have presented a challenge to industry players in the sense
that they have to find other ways to handle the data.
• Big data is an all-inclusive term, representing the enormous volume of complex data sets that
companies and governments generate in the present-day digital environment.
• Big data, typically measured in petabytes or terabytes, materializes from three major sources—
transactional data, machine data, and social data.
TYPES OF BIG-DATA
Big Data is generally categorized into three different varieties. They are as shown below:
•Structured Data
•Semi-Structured Data
•Unstructured Data
F
e
b
r
u
a
r
y
1
9
,
2
0
2
4
/
D
r
.
R
S

TYPES OF BIG-DATA
•Structured Data owns a dedicated data model, It also has a well-defined structure,
it follows a consistent order and it is designed in such a way that it can be easily
accessed and used by a person or a computer. Structured data is usually stored in
well-defined columns and also Databases.
Example: Database Management Systems(DBMS)
•Semi-Structured Data can be considered as another form of Structured Data. It
inherits a few properties of Structured Data, but the major part of this kind of data
fails to have a definite structure and also, it does not obey the formal structure of
data models such as an RDBMS.
Example: Comma Separated Values(CSV) File.
•Unstructured Data is completely a different type of which neither has a structure
nor obeys to follow the formal structural rules of data models. It does not even
have a consistent format and it found to be varying all the time. But, rarely it may
have information related to data and time.
Example: Audio Files, Images etc
TYPES OF BIG-DATA
THE CHARACTERISTICS OF BIG DATA THE CHARACTERISTICS OF BIG DATA
Volume
Volume refers to the unimaginable amounts of information generated every second from social media, cell
phones, cars, credit cards, M2M sensors, images, video, and whatnot. We are currently using distributed
systems, to store data in several locations and brought together by a software Framework like Hadoop.
Facebook alone can generate about billion messages, 4.5 billion times that the “like” button is recorded, and
over 350 million new posts are uploaded each day. Such a huge amount of data can only be handled by Big
Data Technologies
F
e
b
r
u
a
r
y
1
9
,
2
0
2
4
/
D
r
.
R
S

THE CHARACTERISTICS OF BIG DATA
Variety
As Discussed before, Big Data is generated in multiple varieties. Compared to the traditional data like phone
numbers and addresses, the latest trend of data is in the form of photos, videos, and audios and many more,
making about 80% of the data to be completely unstructured
Veracity
Veracity basically means the degree of reliability that the data has to offer. Since a major part of the data is
unstructured and irrelevant, Big Data needs to find an alternate way to filter them or to translate them out as the
data is crucial in business developments
Value
Value is the major issue that we need to concentrate on. It is not just the amount of data that we store or
process. It is actually the amount of valuable, reliable and trustworthy data that needs to be stored, processed,
analyzed to find insights.
Velocity
Last but never least, Velocity plays a major role compared to the others, there is no point in investing so much to
end up waiting for the data. So, the major aspect of Big Data is to provide data on demand and at a faster pace.
APPLICATIONS OF BIG DATA
 Retail
 Leading online retail platforms are wholeheartedly deploying big data throughout a
customer’s purchase journey, to predict trends, forecast demands, optimize pricing, and
identify customer behavioral patterns.
 Big data is helping retailers implement clear strategies that minimize risk and maximize
profit.
 Healthcare
 Big data is revolutionizing the healthcare industry, especially the way medical professionals in
the past diagnosed and treated diseases.
 In recent times, effective analysis and processing of big data by machine learning algorithms
provide significant advantages for the evaluation and assimilation of complex clinical data,
which prevent deaths and improve the quality of life by enabling healthcare workers to detect
early warning signs and symptoms.
F
e
b
r
u
a
r
y
1
9
,
2
0
2
4
/
D
r
.
R
S

 Financial Services and Insurance
 The increased ability to analyze and process big data is dramatically impacting the financial
services, banking, and insurance landscape.
 In addition to using big data for swift detection of fraudulent transactions, lowering risks, and
supercharging marketing efforts, few companies are taking the applications to the next levels.
 Manufacturing
 Advancements in robotics and automation technologies, modern-day manufacturers are
becoming more and more data focused, heavily investing in automated factories that exploit
big data to streamline production and lower operational costs.
 Top global manufacturers are also integrating sensors into their products, capturing big data
to provide valuable insights on product performance and its usage.
 Energy
 To combat the rising costs of oil extraction and exploration difficulties because of economic
and political turmoil, the energy industry is turning toward data-driven solutions to increase
profitability.
 Big data is optimizing every process while cutting down energy waste from drilling to
exploring new reserves, production, and distribution.
 Logistics & Transportation
 State-of-the-art warehouses use digital cameras to capture stock level data, which, when fed
into ML algorithms, facilitates intelligent inventory management with prediction capabilities
that indicate when restocking is required.
 In the transportation industry, leading transport companies now promote the collection and
analysis of vehicle telematics data, using big data to optimize routes, driving behavior, and
maintenance.
 Government
 Cities worldwide are undergoing large-scale transformations to become “smart”, through the
use of data collected from various Internet of Things (IoT) sensors.
 Governments are leveraging this big data to ensure good governance via the efficient
management of resources and assets, which increases urban mobility, improves solid waste
management, and facilitates better delivery of public utility services.
WHAT IS ANALYTICS
Data analytics is a discipline focused on extracting insights from data, including the analysis, collection,
organization, and storage of data, as well as the tools and techniques used to do so.
F
e
b
r
u
a
r
y
1
9
,
2
0
2
4
/
D
r
.
R
S

DATAANALYTICS DEFINITION
• Data analytics is a discipline focused on extracting insights from
data.
• It comprises the processes, tools and techniques of data analysis
and management, including the collection, organization, and
storage of data.
• The chief aim of data analytics is to apply statistical analysis and
technologies on data to find trends and solve problems.
• Data analytics has become increasingly important in the
enterprise as a means for analyzing and shaping business
processes and improving decision-making and business results.
• Data analytics draws from a range of disciplines — including
computer programming, mathematics, and statistics — to
perform analysis on data in an effort to describe, predict, and
improve performance.
• To ensure robust analysis, data analytics teams leverage a range
of data management techniques, including data mining, data
cleansing, data transformation, data modeling, and more.
DATAANALYTICS VS. DATAANALYSIS
• While the terms data analytics and data analysis are frequently used interchangeably, data analysis is a
subset of data analytics concerned with examining, cleansing, transforming, and modeling data to derive
conclusions.
• Data analytics includes the tools and techniques used to perform data analysis.
DATAANALYTICS VS. DATA SCIENCE
• Data analytics and data science are closely related.
• Data analytics is a component of data science, used
to understand what an organization’s data looks like.
• Generally, the output of data analytics are reports and
visualizations.
• Data science takes the output of analytics to study
and solve problems.
• The difference between data analytics and data
science is often seen as one of timescale.
• Data analytics describes the current or historical state
of reality, whereas data science uses that data to
predict and/or understand the future.
DATAANALYTICS VS. BUSINESS
ANALYTICS
• Business analytics is another subset of data analytics.
• Business analytics uses data analytics techniques,
including data mining, statistical analysis, and
predictive modeling, to drive better business
decisions.
• Gartner defines business analytics as “solutions used
to build analysis models and simulations to create
scenarios, understand realities, and predict future
states.”
F
e
b
r
u
a
r
y
1
9
,
2
0
2
4
/
D
r
.
R
S

TYPES OF DATAANALYTICS TYPES OF DATAANALYTICS
1.Descriptive analytics: What has happened and
what is happening right now? Descriptive analytics
uses historical and current data from multiple
sources to describe the present state by identifying
trends and patterns. In business analytics, this is the
purview of business intelligence (BI).
2.Diagnostic analytics: Why is it happening?
Diagnostic analytics uses data (often generated via
descriptive analytics) to discover the factors or
reasons for past performance.
TYPES OF DATAANALYTICS
3. Predictive analytics: What is likely to happen in the
future? Predictive analytics applies techniques such as
statistical modeling, forecasting, and machine learning
to the output of descriptive and diagnostic analytics to
make predictions about future outcomes. Predictive
analytics is often considered a type of “advanced
analytics,” and frequently depends on machine
learning and/or deep learning.
4.Prescriptive analytics: What do we need to do?
Prescriptive analytics is a type of advanced analytics that
involves the application of testing and other techniques to
recommend specific solutions that will deliver desired
outcomes. In business, predictive analytics uses machine
learning, business rules, and algorithms.
DATAANALYTICS METHODS AND
TECHNIQUES
1.Regression analysis: Regression analysis is a set of statistical
processes used to estimate the relationships between variables to
determine how changes to one or more variables might affect another.
For example, how might social media spending affect sales?
2.Monte Carlo simulation: “Monte Carlo simulations are used to
model the probability of different outcomes in a process that cannot
easily be predicted due to the intervention of random variables.” It is
frequently used for risk analysis.
3.Factor analysis: Factor analysis is a statistical method for taking a
massive data set and reducing it to a smaller, more manageable one.
This has the added benefit of often uncovering hidden patterns. In a
business setting, factor analysis is often used to explore things like
customer loyalty.
F
e
b
r
u
a
r
y
1
9
,
2
0
2
4
/
D
r
.
R
S

TECHNIQUES
3.Cohort analysis: Cohort analysis is used to break a dataset down
into groups that share common characteristics, or cohorts, for analysis.
This is often used to understand customer segments.
4.Cluster analysis: Cluster analysis as “a class of techniques that are
used to classify objects or cases into relative groups called clusters.” It
can be used to reveal structures in data — insurance firms might use
cluster analysis to investigate why certain locations are associated with
particular insurance claims, for instance.
5.Time series analysis: Time series analysis as “a statistical technique
that deals with time series data, or trend analysis. Time series data
means that data is in a series of particular time periods or intervals.
Time series analysis can be used to identify trends and cycles over
time, e.g., weekly sales numbers. It is frequently used for economic and
sales forecasting.
TECHNIQUES
6. Sentiment analysis: Sentiment analysis uses tools such as natural language processing, text analysis,
computational linguistics, and so on, to understand the feelings expressed in the data. While the previous six
methods seek to analyze quantitative data (data that can be measured), sentiment analysis seeks to
interpret and classify qualitative data by organizing it into themes. It is often used to understand how
customers feel about a brand, product, or service.
BIG DATAANALYTICS
• Big data analytics describes the process of uncovering trends,
patterns, and correlations in large amounts of raw data to help make
data-informed decisions.
• Big data analytics is the use of advanced analytic techniques against
very large, diverse data sets that include structured, semi-structured
and unstructured data, from different sources, and in different sizes
from terabytes to zettabytes.
• Analysis of big data allows analysts, researchers and business users
to make better and faster decisions using data that was previously
inaccessible or unusable.
• Businesses can use advanced analytics techniques such as text
analytics, machine learning, predictive analytics, data mining, statistics
and natural language processing to gain new insights from previously
untapped data sources independently or together with existing
HOW BIG DATAANALYTICS WORKS
Big data analytics refers to collecting, processing, cleaning, and analyzing large datasets to help organizations
operationalize their big data.
1. Collect Data
• Data collection looks different for every organization. With today’s technology, organizations can gather both
structured and unstructured data from a variety of sources — from cloud storage to mobile applications to in-store
IoT sensors and beyond.
• Some data will be stored in data warehouses where business intelligence tools and solutions can access it easily.
• Raw or unstructured data that is too diverse or complex for a warehouse may be assigned metadata and stored in a
data lake.
F
e
b
r
u
a
r
y
1
9
,
2
0
2
4
/
D
r
.
R
S

2. Process Data
• Once data is collected and stored, it must be organized properly to get accurate results on analytical queries,
especially when it’s large and unstructured.
• Available data is growing exponentially, making data processing a challenge for organizations.
• One processing option is batch processing, which looks at large data blocks over time.
• Batch processing is useful when there is a longer turnaround time between collecting and analyzing data.
• Stream processing looks at small batches of data at once, shortening the delay time between collection and
analysis for quicker decision-making.
• Stream processing is more complex and often more expensive.
3. Clean Data
• Data big or small requires scrubbing to improve data quality and get stronger results; all data must be
formatted correctly, and any duplicative or irrelevant data must be eliminated or accounted for.
• Dirty data can obscure and mislead, creating flawed insights.
4. Analyze Data
Getting big data into a usable state takes time. Once it’s ready, advanced analytics processes can turn big data
into big insights. Some of these big data analysis methods include:
•Data mining sorts through large datasets to identify patterns and relationships by identifying anomalies
and creating data clusters.
•Predictive analytics uses an organization’s historical data to make predictions about the future,
identifying upcoming risks and opportunities.
•Deep learning imitates human learning patterns by using artificial intelligence and machine learning to
layer algorithms and find patterns in the most complex and abstract data.
WHAT IS DATA LAKE?
• A Data Lake is a storage repository that can store large
amount of structured, semi-structured, and unstructured
data.
• It is a place to store every type of data in its native format
with no fixed limits on account size or file.
• It offers high data quantity to increase analytic
performance and native integration.
• Data Lake is like a large container which is very similar to
real lake and rivers.
• Just like in a lake you have multiple tributaries coming in,
a data lake has structured data, unstructured data,
machine to machine, logs flowing through in real-time.
WHAT IS DATA LAKE?
• The Data Lake democratizes data and is a cost-effective way to store all data of an organization for later
processing.
• Research Analyst can focus on finding meaning patterns in data and not data itself.
• Unlike a hierarchal Dataware house where data is stored in Files and Folder, Data lake has a flat architecture.
Every data elements in a Data Lake is given a unique identifier and tagged with a set of metadata
information.
F
e
b
r
u
a
r
y
1
9
,
2
0
2
4
/
D
r
.
R
S

WHY DATA LAKE?
The main objective of building a data lake is to offer an unrefined view of data to data scientists.
Reasons for using Data Lake are:
 With the onset of storage engines like Hadoop storing disparate information has become easy. There is no
need to model data into an enterprise-wide schema with a Data Lake.
 With the increase in data volume, data quality, and metadata, the quality of analyses also increases.
 Data Lake offers business Agility
 Machine Learning and Artificial Intelligence can be used to make profitable predictions.
 It offers a competitive advantage to the implementing organization.
 There is no data silo structure. Data Lake gives 360 degrees view of customers and makes analysis more
robust.
DATA LAKE ARCHITECTURE
DATA LAKE ARCHITECTURE
The figure shows the architecture of a Business Data Lake. The lower levels represent data that is mostly at rest
while the upper levels show real-time transactional data. This data flow through the system with no or little
latency.
Following are important tiers in Data Lake Architecture:
1. Ingestion Tier: The tiers on the left side depict the data sources. The data could be loaded into the data lake
in batches or in real-time
2. Insights Tier: The tiers on the right represent the research side where insights from the system are used.
SQL, NoSQL queries, or even excel could be used for data analysis.
3. HDFS is a cost-effective solution for both structured and unstructured data. It is a landing zone for all data
that is at rest in the system.
4. Distillation tier takes data from the storage tire and converts it to structured data for easier analysis.
5. Processing tier run analytical algorithms and users queries with varying real time, interactive, batch to
generate structured data for easier analysis.
6. Unified operations tier governs system management and monitoring. It includes auditing and proficiency
management, data management, workflow management.
KEY DATA LAKE CONCEPTS
Following are Key Data Lake concepts that one needs to understand to completely understand the Data Lake
Architecture
F
e
b
r
u
a
r
y
1
9
,
2
0
2
4
/
D
r
.
R
S

• Data Ingestion
Data Ingestion allows connectors to get data from a different data sources and load into the Data lake.
Data Ingestion supports:
1. All types of Structured, Semi-Structured, and Unstructured data.
2. Multiple ingestions like Batch, Real-Time, One-time load.
3. Many types of data sources like Databases, Webservers, Emails, IoT, and FTP.
• Data Storage
Data storage should be scalable, offers cost-effective storage and allow fast access to data exploration. It should
support various data formats.
• Data Governance
Data governance is a process of managing availability, usability, security, and integrity of data used in an
organization.
• Security
Security needs to be implemented in every layer of the Data lake. It starts with Storage, Unearthing, and
Consumption. The basic need is to stop access for unauthorized users. It should support different tools to access
data with easy to navigate GUI and Dashboards.
Authentication, Accounting, Authorization and Data Protection are some important features of data lake security.
• Data Quality:
Data quality is an essential component of Data Lake architecture. Data is used to exact business value. Extracting
insights from poor quality data will lead to poor quality insights.
• Data Discovery
Data Discovery is another important stage before you can begin preparing data or analysis. In this stage, tagging
technique is used to express the data understanding, by organizing and interpreting the data ingested in the Data
lake.
• Data Auditing
Two major Data auditing tasks are tracking changes to the key dataset.
1. Tracking changes to important dataset elements
2. Captures how/ when/ and who changes to these elements.
3. Data auditing helps to evaluate risk and compliance.
• Data Lineage
This component deals with data's origins. It mainly deals with where it movers over time and what happens to it. It
eases errors corrections in a data analytics process from origin to destination.
• Data Exploration
It is the beginning stage of data analysis. It helps to identify right dataset is vital before starting Data Exploration.
All given components need to work together to play an important part in Data lake building easily evolve and
explore the environment.
MATURITY STAGES OF DATA LAKE
Stage 1: Handle and ingest data at scale
This first stage of Data Maturity Involves improving the ability to transform and analyze data. Here, business
owners need to find the tools according to their skillset for obtaining more data and build analytical applications.
Stage 2: Building the analytical muscle
This is a second stage which involves improving the ability to transform and analyze data. In this stage,
companies use the tool which is most appropriate to their skillset. They start acquiring more data and building
applications. Here, capabilities of the enterprise data warehouse and data lake are used together.
F
e
b
r
u
a
r
y
1
9
,
2
0
2
4
/
D
r
.
R
S

MATURITY STAGES OF DATA LAKE
Stage 3: EDW and Data Lake work in unison
This step involves getting data and analytics into the hands of as many people as possible. In this stage, the
data lake and the enterprise data warehouse start to work in a union. Both playing their part in analytics
Stage 4: Enterprise capability in the lake
In this maturity stage of the data lake, enterprise capabilities are added to the Data Lake. Adoption of information
governance, information lifecycle management capabilities, and Metadata management. However, very few
organizations can reach this level of maturity, but this tally will increase in the future.
BEST PRACTICES FOR DATA LAKE
IMPLEMENTATION
• Architectural components, their interaction and identified products should support native data types
• Design of Data Lake should be driven by what is available instead of what is required. The schema and data
requirement is not defined until it is queried
• Design should be guided by disposable components integrated with service API.
• Data discovery, ingestion, storage, administration, quality, transformation, and visualization should be
managed independently.
• The Data Lake architecture should be tailored to a specific industry. It should ensure that capabilities
necessary for that domain are an inherent part of the design
• Faster on-boarding of newly discovered data sources is important
• Data Lake helps customized management to extract maximum value
• The Data Lake should support existing enterprise data management techniques and methods
F
e
b
r
u
a
r
y
1
9
,
2
0
2
4
/
D
r
.
R
S

Printed Page: 1 of 2
Subject Code: KIT601
0Roll No: 0 0 0 0 0 0 0 0 0 0 0 0 0
BTECH
(SEM VI) THEORY EXAMINATION 2021-22
DATA ANALYTICS
Time: 3 Hours Total Marks: 100
Note: Attempt all Sections. If you require any missing data, then choose suitably.
SECTION A
1. Attempt all questions in brief. 2*10 = 20
Qno Questions CO
(a) Discuss the need of data analytics. 1
(b) Give the classification of data. 1
(c) Define neural network. 2
(d) What is multivariate analysis? 2
(e) Give the full form of RTAP and discuss its application. 3
(f) What is the role of sampling data in a stream? 3
(g) Discuss the use of limited pass algorithm. 4
(h) What is the principle behind hierarchical clustering technique? 4
(i) List five R functions used in descriptive statistics. 5
(j) List the names of any 2 visualization tools. 5
SECTION B
2. Attempt any three of the following: 10*3 = 30
Qno Questions CO
(a) Explain the process model and computation model for Big data
platform.
1
(b) Explain the use and advantages of decision trees. 2
(c) Explain the architecture of data stream model. 3
(d) Illustrate the K-means algorithm in detail with its advantages. 4
(e) Differentiate between NoSQL and RDBMS databases. 5
SECTION C
3. Attempt any one part of the following: 10*1 = 10
Qno Questions CO
(a) Explain the various phases of data analytics life cycle. 1
(b) Explain modern data analytics tools in detail. 1
4. Attempt any one part of the following: 10 *1 = 10
Qno Questions CO
(a) Compare various types of support vector and kernel methods of data
analysis.
2
(b) Given data= {2,3,4,5,6,7;1,5,3,6,7,8}. Compute the principal
component using PCA algorithm.
2
F
e
b
r
u
a
r
y
1
9
,
2
0
2
4
/
D
r
.
R
S

Printed Page: 2 of 2
Subject Code: KIT601
0Roll No: 0 0 0 0 0 0 0 0 0 0 0 0 0
BTECH
(SEM VI) THEORY EXAMINATION 2021-22
DATA ANALYTICS
Qno Questions CO
(a) Explain any one algorithm to count number of distinct elements in a
data stream.
3
(b) Discuss the case study of stock market predictions in detail. 3
Qno Questions CO
(a) Differentiate between CLIQUE and ProCLUS clustering. 4
(b) A database has 5 transactions. Let min_sup=60% and min_conf=80%.
TID Items_Bought
T100 {M, O, N, K, E, Y}
T200 {D, O, N, K, E, Y}
T300 {M, A, K, E}
T400 {M, U, C, K, Y}
T500 {C, O, O, K, I, E}
i) Find all frequent itemsets using Apriori algorithm.
ii) List all the strong association rules (with support s and confidence
c).
4
Qno Questions CO
(a) Explain the HIVE architecture with its features in detail. 5
(b) Write R function to check whether the given number is prime or not. 5
F
e
b
r
u
a
r
y
1
9
,
2
0
2
4
/
D
r
.
R
S

F
e
b
r
u
a
r
y
1
9
,
2
0
2
4
/
D
r
.
R
S
2 Reference
[1] https://www.jigsawacademy.com/blogs/hr-analytics/data-analytics-lifecycle/
[2] https://statacumen.com/teach/ADA1/ADA1_notes_F14.pdf
[3] https://www.youtube.com/watch?v=fDRa82lxzaU
[4] https://www.investopedia.com/terms/d/data-analytics.asp
[5] http://egyankosh.ac.in/bitstream/123456789/10935/1/Unit-2.pdf
[6] http://epgp.inflibnet.ac.in/epgpdata/uploads/epgp_content/computer_science/16._data_analytics/
03._evolution_of_analytical_scalability/et/9280_et_3_et.pdf
[7] https://bhavanakhivsara.files.wordpress.com/2018/06/data-science-and-big-data-analy-nieizv_
book.pdf
[8] https://www.researchgate.net/publication/317214679_Sentiment_Analysis_for_Effective_Stock_
Market_Prediction
[9] https://snscourseware.org/snscenew/files/1569681518.pdf
[10] http://csis.pace.edu/ctappert/cs816-19fall/books/2015DataScience&BigDataAnalytics.pdf
[11] https://www.youtube.com/watch?v=mccsmoh2_3c
[12] https://mentalmodels4life.net/2015/11/18/agile-data-science-applying-kanban-in-the-analytics-li
[13] https://www.sas.com/en_in/insights/big-data/what-is-big-data.html#:~:text=Big%20data%
20refers%20to%20data,around%20for%20a%20long%20time.
[14] https://www.javatpoint.com/big-data-characteristics
[15] Liu, S., Wang, M., Zhan, Y., & Shi, J. (2009). Daily work stress and alcohol use: Testing the cross-
level moderation effects of neuroticism and job involvement. Personnel Psychology,62(3), 575–597.
http://dx.doi.org/10.1111/j.1744-6570.2009.01149.x
********************
35

Introduction to Data Analytics and data analytics life cycle

Recommended

Recommended

More Related Content

Similar to Introduction to Data Analytics and data analytics life cycle

Similar to Introduction to Data Analytics and data analytics life cycle (20)

More from Dr. Radhey Shyam

More from Dr. Radhey Shyam (20)

Recently uploaded

Recently uploaded (20)

Introduction to Data Analytics and data analytics life cycle