Data Science and Analysis.pptx

SEMINAR
DEPARTMENT OF COMPUTER SCIENCE
TOPIC
Data Science & Analysis
PRESENTED BY
Prashant Yadav
M.Tech (CS)
Roll No. 223410
BABASAHEB BHIMRAO AMBEDKAR UNIVERSITY
LUCKNOW UTTAR PRADESH

CONTENTS
• Introduction
• Open Source Tools
• Methodology
• Python For Data Science
• Data Analysis
• Applications
• Challenges

INTRODUCTION
• Data science is an interdisciplinary field that involves the use of statistical and
computational methods to extract insights and knowledge from data. It combines
techniques from mathematics, statistics, computer science, and domain-specific
knowledge to analyze and interpret complex data sets.
• Data science involves various stages, including data collection, data cleaning, data
analysis, and data visualization.
• The goal of data science is to uncover patterns, trends, and insights that can be used to
inform decision-making and solve real-world problems. It has applications in a wide range
of fields, including business, healthcare, finance, and social sciences.

• NEED FOR DATA SCIENCE
Parameter | Description
Data-driven decision making | Data science enables organizations to make informed decisions based on data insights, rather than relying on intuition or guesswork.
Predictive analytics | Data science allows organizations to use historical data to make predictions about future events or trends, such as customer behavior or market trends.
Improved efficiency and productivity | By automating repetitive tasks and streamlining processes, data science can help organizations improve efficiency and productivity.
Personalization | Data science enables organizations to personalize their products or services to individual customers, based on their preferences and behavior.
Fraud detection | Data science can be used to detect fraudulent activity, such as credit card fraud or insurance fraud, by analyzing patterns and anomalies in data.
Risk management | Data science can help organizations identify and mitigate risks, such as financial risks or cybersecurity risks, by analyzing data and identifying potential threats.
Improved customer experience | By analyzing customer data, data science can help organizations improve the customer experience by identifying pain points and areas for improvement.
Competitive advantage | Data science can provide organizations with a competitive advantage by enabling them to make data-

• Real-World Examples of Data Science
1. Credit Risk Assessment:
• A bank uses data science to analyze customer data and credit history to assess
the risk of default on loans.
• Machine learning algorithms are used to identify patterns in customer behavior
and credit history that are associated with higher risk.
• Based on these insights, the bank can make informed decisions about loan
approvals and interest rates.
2. Predictive Maintenance:
• A manufacturing company uses data science to predict when equipment is likely to
fail.
• Sensor data is collected from the equipment and analyzed using machine learning
algorithms to identify patterns that are associated with equipment failure.
• Based on these insights, the company can schedule maintenance before equipment
failure occurs, reducing downtime and maintenance costs.

OPEN SOURCE TOOLS
Tool | Description | Suitable for
Python | A popular programming language for data science, with a wide range of libraries and frameworks for data analysis, machine learning, and visualization. | Programmers
R | A programming language and environment for statistical computing and graphics, with a wide range of packages for data analysis and visualization. | Programmers
Jupyter Notebook | An open-source web application that allows users to create and share documents that contain live code, equations, visualizations, and narrative text. | Both
Apache Spark | An open-source distributed computing system for big data processing, with support for data analysis, machine learning, and graph processing. | Programmers
Apache Hadoop | An open-source distributed computing system for storing and processing large data sets, with support for data analysis and machine learning. | Programmers
Tableau | A data visualization tool (commercial, not open source) that allows users to create interactive dashboards and reports. | Non-programmers
KNIME | An open-source data analytics platform that allows users to create workflows for data analysis, machine learning, and visualization. | Both
RapidMiner | An open-source data science platform that allows users to create workflows for data analysis, machine learning, and visualization. | Both
Orange | An open-source data visualization and analysis tool that allows users to create workflows for data analysis and machine learning. | Both
Weka | An open-source machine learning tool that allows users to create and apply machine learning models to data sets. | Both

METHODOLOGY
• The Business Understanding stage is crucial because it clarifies the customer's goal. In this
stage, we ask the customer detailed questions about every aspect of the problem.
• The next step is the Analytic Approach, where, once the business problem has been clearly stated, the
data scientist can define the analytic approach to solve the problem.
• Data Requirements is the stage where
we identify the necessary data content,
formats, and sources for initial data
collection, and we use this data inside the
algorithm of the approach we chose.
• In the Data Collection stage, data
scientists identify the available data
resources relevant to the problem
domain. To retrieve data, we can scrape
a related website, or we can use a
repository of ready-made datasets.
(Figure: Decision Tree)

METHODOLOGY
• In the Data Understanding stage, data scientists try to learn more about the data collected earlier.
We check the data type of each attribute and learn more about the attributes and their names.
• In the Data Preparation stage, data scientists prepare data for modeling, which is one of the most crucial
steps because the data must be clean and free of errors.
• In the Modeling stage, the data scientist can determine whether the work is ready to go or
needs review. Modeling focuses on developing models that are either descriptive or predictive, and these
models are based on the analytic approach that was taken, whether statistical or machine learning.
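The Data Understanding and Data Preparation stages above can be sketched with the Python standard library alone. The sample records, the column names, and the choice to drop rows whose numeric fields do not parse are all assumptions made for illustration; a real project would pick its own cleaning rules.

```python
import csv
import io

# Hypothetical raw data: column names and values are invented for illustration.
RAW_CSV = """age,income,city
34,52000,Lucknow
,48000,Kanpur
29,not available,Delhi
41,61000,Lucknow
"""

def infer_type(value):
    """Return a rough type label for one raw string value."""
    if value.strip() == "":
        return "missing"
    try:
        float(value)
        return "number"
    except ValueError:
        return "text"

def summarize_types(rows):
    """Data Understanding: count the value types seen in each attribute."""
    summary = {}
    for row in rows:
        for name, value in row.items():
            counts = summary.setdefault(name, {})
            label = infer_type(value)
            counts[label] = counts.get(label, 0) + 1
    return summary

def prepare(rows, numeric_columns):
    """Data Preparation: keep rows whose numeric columns parse, then convert them."""
    clean = []
    for row in rows:
        if all(infer_type(row[col]) == "number" for col in numeric_columns):
            converted = dict(row)
            for col in numeric_columns:
                converted[col] = float(row[col])
            clean.append(converted)
    return clean

rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
type_summary = summarize_types(rows)
clean_rows = prepare(rows, numeric_columns=["age", "income"])
print(type_summary)
print(clean_rows)
```

The type summary shows which attributes contain missing or non-numeric entries before any modeling starts, and `prepare` passes only the rows that survive that check to the next stage.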

METHODOLOGY
• In the Model Evaluation stage, data scientists can evaluate the model in two ways: Hold-Out
and Cross-Validation. In the Hold-Out method, the dataset is divided into three subsets:
a training set used to build the model in the modeling stage; a validation set used to
assess the performance of the model built in the training phase; and a test set used to
estimate the likely future performance of the model. In Cross-Validation, the dataset is
divided into k folds, and the model is trained and evaluated k times, each time holding out
a different fold for evaluation.
• How the model is deployed depends on its purpose; it may first be rolled out to a
limited group of users or in a test environment.
• In the Feedback stage, most of the input usually comes from the customer.
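The Hold-Out and Cross-Validation methods named above can be sketched as follows. The 60/20/20 split ratio, the fixed random seed, and the function names `hold_out_split` and `k_fold_splits` are assumptions chosen for illustration, not values taken from the slides.

```python
import random

def hold_out_split(items, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle items and split them into training, validation and test subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

def k_fold_splits(items, k=5):
    """Yield (train, test) pairs; each fold serves as the test set exactly once."""
    items = list(items)
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test_fold = folds[i]
        train_folds = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train_folds, test_fold

data = list(range(20))  # stand-in for 20 records
train, val, test = hold_out_split(data)
print(len(train), len(val), len(test))

folds = list(k_fold_splits(data, k=5))
print(len(folds), len(folds[0][0]), len(folds[0][1]))
```

Hold-Out evaluates the model once on data it has never seen, while Cross-Validation repeats the evaluation so that every record is used for testing once, which is usually preferable when the dataset is small.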

METHODOLOGY
Common Algorithms:
1. Linear Regression
A statistical method used to model the relationship between a dependent variable and one or
more independent variables.
We can use simple linear regression when we want to know:
1. How strong the relationship is between two variables (e.g., the relationship between rainfall
and soil erosion).
2. The value of the dependent variable at a certain value of the independent variable (e.g., the
amount of soil erosion at a certain level of rainfall).

METHODOLOGY
Simple linear regression formula:
y = B0 + B1x + e
y is the predicted value of the dependent
variable (y) for any given value of the
independent variable (x).
B0 is the intercept, the predicted value
of y when x is 0.
B1 is the regression coefficient: how much
we expect y to change as x increases.
x is the independent variable (the variable
we expect is influencing y).
e is the error of the estimate, or how much
variation there is in our estimate of the
regression coefficient.
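As a minimal sketch, the intercept B0 and regression coefficient B1 of the model y = B0 + B1x + e can be estimated by ordinary least squares. The slides do not say which fitting method is intended, so least squares is an assumption here, and the rainfall and soil erosion numbers are invented purely for illustration.

```python
def fit_simple_linear_regression(xs, ys):
    """Estimate B0 (intercept) and B1 (slope) by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # B1 = sum of cross-deviations divided by sum of squared x deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y)
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Hypothetical measurements: rainfall (x) and soil erosion (y).
rainfall = [1.0, 2.0, 3.0, 4.0, 5.0]
erosion = [2.1, 3.9, 6.2, 7.8, 10.0]

b0, b1 = fit_simple_linear_regression(rainfall, erosion)
predicted = b0 + b1 * 6.0  # predicted erosion at a rainfall level of 6.0
residuals = [y - (b0 + b1 * x) for x, y in zip(rainfall, erosion)]
print(b0, b1, predicted)
```

The residuals are the observed counterpart of the error term e: the part of each y value that the fitted line does not explain.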

METHODOLOGY
2. Decision Tree:
• A decision tree is a machine learning
algorithm that uses a tree-like model of
decisions and their possible consequences
to predict outcomes. It is a supervised
learning algorithm that can be used for both
classification and regression tasks.
• The decision tree algorithm works by
recursively splitting the data into subsets
based on the values of the input features.
The goal is to create a tree that predicts the
target variable with high accuracy.
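One splitting step of a decision tree can be sketched as below. The slides do not name a splitting criterion, so the use of Gini impurity is an assumption, as are the function names and the tiny income/default dataset, which is invented for illustration.

```python
def gini(labels):
    """Gini impurity: 0.0 for a pure group, larger for more mixed groups."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Find the threshold on one feature that minimises weighted Gini impurity."""
    best_threshold, best_score = None, float("inf")
    distinct = sorted(set(values))
    # Candidate thresholds are the midpoints between neighbouring distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

# Hypothetical data: customer income (in thousands) and whether the loan defaulted.
income = [20, 25, 30, 45, 50, 60]
defaulted = ["yes", "yes", "yes", "no", "no", "no"]

threshold, impurity = best_split(income, defaulted)
print(threshold, impurity)
```

A full decision tree applies the same search recursively to the left and right subsets, over all input features, until the subsets are pure or a stopping rule such as a maximum depth is reached.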

METHODOLOGY
(Figure: Example of a decision tree)

PYTHON FOR DATA SCIENCE
Python is a popular programming language for data science due to its simplicity, versatility, and
extensive libraries and frameworks for data analysis, machine learning, and visualization.
Here are some of the key libraries and frameworks in Python for data science:
• NumPy: A library for numerical computing in Python, with support for arrays, matrices, and
mathematical functions.
• Pandas: A library for data manipulation and analysis in Python, with support for data structures
such as data frames and series.
• Matplotlib: A library for data visualization in Python, with support for creating a wide range of
charts and graphs.
• Scikit-learn: A library for machine learning in Python, with support for a wide range of algorithms
for classification, regression, clustering, and more.
• TensorFlow: A library for machine learning and deep learning in Python, with support for building
and training neural networks.

DATA ANALYSIS
• Data analysis using Python involves using the Python programming language and its
associated libraries and frameworks to manipulate, analyze, and visualize data.
• By using Python for data analysis, we can gain insights into complex data sets and
make informed decisions. Python's popularity in data analysis is
also due to its ease of use and readability, making it accessible to both experienced
programmers and beginners.
• The process of data analysis using Python typically involves several steps, including
data cleaning, data manipulation, data analysis, and data visualization. Python libraries
such as Pandas, NumPy, and Matplotlib are commonly used for these tasks.
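The cleaning, manipulation, and analysis steps listed above might look like this in Pandas. The sales records, the column names, and the specific cleaning rules (drop rows with missing values, then drop exact duplicates) are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sales records: all values are invented for illustration.
raw = pd.DataFrame({
    "region": ["North", "South", "North", "South", "North"],
    "units": [10, None, 7, 12, 10],
    "price": [2.5, 3.0, 2.5, 3.0, 2.5],
})

# Data cleaning: drop rows with missing values, then exact duplicate rows.
clean = raw.dropna().drop_duplicates().reset_index(drop=True)

# Data manipulation: derive a revenue column from existing columns.
clean["revenue"] = clean["units"] * clean["price"]

# Data analysis: total revenue per region.
revenue_by_region = clean.groupby("region")["revenue"].sum()
print(revenue_by_region)
```

For the visualization step, `revenue_by_region.plot(kind="bar")` would draw a bar chart of the result, provided Matplotlib is installed.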

APPLICATIONS
• Business: Data science and analysis are widely used in business to analyze
customer data, sales data, and market trends to inform decision-making. This
includes customer segmentation, product recommendations, and pricing
optimization.
• Healthcare: Data science and analysis are used in healthcare to analyze patient
data, identify disease patterns, and improve patient outcomes. This includes
disease diagnosis, drug discovery, and personalized medicine.
• Finance: Data science and analysis are used in finance to analyze financial data,
identify market trends, and inform investment decisions. This includes risk
assessment, fraud detection, and portfolio optimization.
• Social media: Data science and analysis are used in social media to analyze user
behavior, identify trends, and improve user engagement. This includes sentiment
analysis, user profiling, and content recommendation. Many more such
applications of data science and analysis exist.

CHALLENGES
• Data Quality: One of the biggest challenges in data science is ensuring that the data being used is accurate,
complete, and reliable. Poor data quality can lead to inaccurate results and flawed insights.
• Data Volume: With the increasing amount of data being generated, managing and processing large volumes
of data can be a significant challenge. This requires specialized tools and techniques for data storage,
processing, and analysis.
• Data Variety: Data comes in many different forms, including structured, semi-structured, and unstructured
data. Working with unstructured data, such as text, images, and video, can be particularly challenging and
requires specialized techniques for natural language processing, computer vision, and other areas of artificial
intelligence.
• Data Privacy and Security: As data becomes more valuable, ensuring its privacy and security becomes
increasingly important. Data scientists need to be aware of privacy regulations and take steps to protect
sensitive data from unauthorized access.
• Model Interpretability: Machine learning models can be complex and difficult to interpret, making it
challenging to understand how they arrived at their conclusions. This can be particularly problematic in
applications where decisions have significant consequences, such as healthcare or finance.
• Business Understanding: Data scientists need to have a deep understanding of the business context in
which they are working in order to develop insights that are relevant and actionable.
