Statistical Inference
Statistical inference is the process of drawing conclusions about a
population from a sample of data. It uses statistical methods to infer
the values of population parameters from the information contained in
the sample.
There are two types of statistical inference: estimation and
hypothesis testing.
Estimation involves using sample data to estimate the value of a
population parameter, such as the population mean or standard
deviation. The most common estimator for the population mean is the
sample mean, which is an unbiased estimator whenever the sample is
drawn at random, regardless of the shape of the underlying population.
Hypothesis testing involves testing a hypothesis about a population
parameter. A hypothesis is a statement about the value of a population
parameter that can be tested using sample data. Hypothesis testing
involves specifying a null hypothesis and an alternative hypothesis,
and then using sample data to determine whether there is enough
evidence to reject the null hypothesis in favor of the alternative
hypothesis.
The process of hypothesis testing involves four steps:
1. Formulate the null and alternative hypotheses.
2. Choose an appropriate test statistic and calculate its value based
on the sample data.
3. Determine the p-value, which is the probability of obtaining a test
statistic as extreme or more extreme than the observed value if the
null hypothesis is true.
4. Compare the p-value to a significance level, such as 0.05, and
decide whether to reject or fail to reject the null hypothesis.
Statistical inference is a fundamental tool in data analysis, and it
is used in many fields such as medicine, economics, and social
sciences. By using statistical inference, researchers and analysts can
make informed decisions based on the information obtained from a
sample of data.
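As a rough illustration of the four steps above, the following R sketch
runs a one-sample t-test on simulated data; the sample itself, the null
value of 50, and the 0.05 significance level are assumptions made only
for the example.
# Hypothetical example: test H0: mu = 50 against H1: mu != 50
set.seed(42)
sample_data <- rnorm(30, mean = 52, sd = 5)  # simulated sample of n = 30
# Steps 2-3: t.test() computes the test statistic and the p-value
result <- t.test(sample_data, mu = 50)
result$statistic  # observed test statistic
result$p.value    # p-value
# Step 4: compare the p-value to the significance level
if (result$p.value < 0.05) {
  print("Reject the null hypothesis")
} else {
  print("Fail to reject the null hypothesis")
}
# The same object also returns an estimate (the sample mean) and a 95%
# confidence interval, which connects the test back to estimation
result$estimate
result$conf.int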
Statistical Modeling
Statistical modeling is the process of building a mathematical model
to describe the relationship between variables in a dataset. The model
is designed to capture the underlying patterns or trends in the data
and to make predictions about future observations.
Statistical models can be used for a variety of purposes, including
prediction, inference, and causal analysis. They are used extensively
in many fields, including economics, finance, marketing, engineering,
and the social sciences.
The process of building a statistical model typically involves several
steps:
1. Define the problem: The first step in building a statistical model
is to clearly define the problem you are trying to solve. This
involves specifying the variables of interest, the data that will be
used, and the type of model that will be built.
2. Collect and clean the data: The next step is to collect the data
and prepare it for analysis. This may involve cleaning the data,
transforming it into a different format, or dealing with missing data.
3. Explore the data: Once the data has been collected and cleaned, the
next step is to explore the data and identify any patterns or
relationships that may exist between the variables. This can be done
using exploratory data analysis (EDA) techniques such as histograms,
scatter plots, and correlation matrices.
4. Choose a modeling approach: Based on the insights gained from EDA,
the next step is to choose an appropriate modeling approach. This may
involve selecting a specific type of model, such as linear regression,
logistic regression, or decision trees, or choosing a more general
approach such as machine learning or time series analysis.
5. Build the model: Once the modeling approach has been chosen, the
next step is to build the model. This involves fitting the model to
the data using statistical software or programming languages such as R
or Python.
6. Evaluate the model: Once the model has been built, the next step is
to evaluate its performance. This may involve using metrics such as
accuracy, precision, recall, or root mean square error (RMSE) to
assess the model's predictive power.
7. Use the model: The final step in the modeling process is to use the
model to make predictions or draw inferences about the underlying
data. This may involve using the model to make predictions about
future outcomes, to identify important predictors of a particular
variable, or to test hypotheses about the relationship between
variables.
Statistical modeling is a powerful tool for analyzing complex datasets
and making informed decisions based on data. By following these steps
in the modeling process, researchers and analysts can build accurate
and reliable models that can be used to solve a wide range of
problems.
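As a brief, hedged illustration of steps 3 through 6, the R sketch below
uses the built-in mtcars dataset (chosen purely for the example) to
explore the data, fit a simple linear regression, and evaluate it with
RMSE.
# Step 3: explore the data
data(mtcars)
summary(mtcars[, c("mpg", "wt", "hp")])
plot(mtcars$wt, mtcars$mpg, xlab = "Weight", ylab = "Miles per gallon")
# Steps 4-5: choose and fit a linear regression of mpg on weight and horsepower
model <- lm(mpg ~ wt + hp, data = mtcars)
summary(model)
# Step 6: evaluate with root mean square error on the training data
predictions <- predict(model, mtcars)
rmse <- sqrt(mean((mtcars$mpg - predictions)^2))
rmse
In practice the evaluation would use a held-out test set or
cross-validation rather than the training data, but the workflow is the
same.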
Probability Distributions
Probability distributions are mathematical functions that describe the
likelihood of different outcomes or values of a random variable in a
given population or sample. Probability distributions are used in
statistics to describe the behavior of data and to make predictions
based on probability theory.
There are two main types of probability distributions: discrete and
continuous.
Discrete probability distributions describe the probability of
obtaining a specific outcome from a discrete set of possible outcomes.
Examples of discrete probability distributions include the binomial
distribution, the Poisson distribution, and the hypergeometric
distribution.
The binomial distribution describes the probability of obtaining a
certain number of successes in a fixed number of trials, where each
trial has only two possible outcomes (success or failure). The Poisson
distribution describes the probability of a certain number of events
occurring in a fixed interval of time or space, where the events occur
independently and at a constant rate. The hypergeometric distribution
describes the probability of obtaining a certain number of successes
in a sample of fixed size drawn without replacement from a population
of known size and composition.
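The discrete probabilities described above can be computed directly
with R's distribution functions; the parameter values below are
arbitrary and serve only to illustrate the calls.
# Binomial: probability of exactly 3 successes in 10 trials with success probability 0.5
dbinom(3, size = 10, prob = 0.5)
# Poisson: probability of exactly 2 events when the average rate is 4 per interval
dpois(2, lambda = 4)
# Hypergeometric: probability of 2 successes in a sample of 5 drawn without
# replacement from a population of 7 successes and 13 failures
dhyper(2, m = 7, n = 13, k = 5)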
Continuous probability distributions describe the probability of
obtaining a value within a continuous range of possible values.
Examples of continuous probability distributions include the normal
distribution, the exponential distribution, and the beta distribution.
The normal distribution is perhaps the most well-known probability
distribution and is often used to model natural phenomena such as
height or weight. The exponential distribution is often used to model
waiting times or lifetimes of systems, and the beta distribution is
used to model quantities that lie between 0 and 1, such as an unknown
probability of success.
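Continuous distributions are handled the same way through density and
cumulative distribution functions; the parameter values below are
again arbitrary examples.
# Normal: probability that a value from N(mean = 170, sd = 10) is at most 180
pnorm(180, mean = 170, sd = 10)
# Exponential: probability that a waiting time with rate 0.5 exceeds 3 time units
1 - pexp(3, rate = 0.5)
# Beta: density at 0.7 of a Beta(2, 5) distribution for an uncertain success probability
dbeta(0.7, shape1 = 2, shape2 = 5)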
Probability distributions play an important role in statistical
inference, as they allow analysts to make predictions about the
behavior of a population or sample based on the information obtained
from a sample. By understanding the properties of different
probability distributions, analysts can choose the appropriate
distribution to model their data and use statistical methods to draw
meaningful conclusions from it.
Fitting a Model
Fitting a model is the process of estimating the parameters of a
statistical model to best fit the data. The goal is to find the values
of the model's parameters that minimize the difference between the
model's predictions and the observed data.
The process of fitting a model typically involves the following steps:
1. Choose a model: The first step in fitting a model is to choose a
suitable model that can capture the relationship between the variables
in the data. This may involve selecting a specific type of model, such
as linear regression or logistic regression, or choosing a more
complex model such as a neural network or decision tree.
2. Define the objective function: The objective function is a
mathematical function that measures the discrepancy between the
model's predictions and the observed data, for example the mean
squared error. The goal is to find the values of the model's
parameters that minimize the value of the objective function.
3. Estimate the parameters: Once the objective function has been
defined, the next step is to estimate the values of the model's
parameters that minimize the value of the objective function. This is
typically done using an optimization algorithm such as gradient
descent or a variant of it.
4. Evaluate the model: Once the model has been fitted to the data, it
is important to evaluate its performance. This may involve using
metrics such as mean squared error or accuracy to assess the model's
predictive power and its ability to generalize to new data.
5. Refine the model: If the model does not perform well, it may be
necessary to refine the model by adding or removing variables,
changing the functional form of the model, or modifying the objective
function.
Fitting a model is a critical step in statistical modeling and is
essential for making accurate predictions and drawing meaningful
conclusions from data. By following these steps, analysts can fit
models that accurately capture the underlying patterns and
relationships in the data and make reliable predictions about future
observations.
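To make the idea of minimizing an objective function concrete, here is
a minimal R sketch that fits a straight-line model by minimizing the
mean squared error with a general-purpose optimizer; the simulated
data and starting values are assumptions made for the example.
# Simulated data from a hypothetical straight-line relationship
set.seed(1)
x <- runif(100, 0, 10)
y <- 2 + 3 * x + rnorm(100, sd = 2)
# Objective function: mean squared error between predictions and observations
mse <- function(params) {
  intercept <- params[1]
  slope <- params[2]
  predictions <- intercept + slope * x
  mean((y - predictions)^2)
}
# Estimate the parameters by minimizing the objective function
fit <- optim(par = c(0, 0), fn = mse)
fit$par  # estimated intercept and slope
# For comparison, the closed-form least-squares fit gives nearly the same values
coef(lm(y ~ x))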
Intro to R
R is a popular language in the data science community, widely used for
data analysis, visualization, and modeling. Here are some of the
reasons why R is such a popular tool in data science:
1. Open source: R is open-source software, which means that it is free
to use and modify. This makes it accessible to a wide range of users,
including students, researchers, and businesses.
2. Wide range of packages: R has a large and active community of
developers who have created a wide range of packages for data
analysis, modeling, and visualization. These packages can be easily
installed and loaded into R, making it easy to perform complex
analyses and create advanced visualizations.
3. Powerful graphics capabilities: R is known for its powerful
graphics capabilities, particularly the ggplot2 package, which allows
users to create high-quality visualizations for publication.
4. Interoperability: R can work with a wide range of data formats and
can easily interface with other programming languages and tools. For
example, R can connect to SQL databases and web APIs, making it easy
to extract and analyze data from a variety of sources.
5. Reproducibility: R is designed to make it easy to document and
reproduce analyses. By using scripts and markdown documents, analysts
can create fully reproducible analyses that can be easily shared and
replicated.
Overall, R is a powerful and flexible tool for data science that has
become an essential part of the data science toolkit. Its wide range
of packages, powerful graphics capabilities, and interoperability make
it a popular choice for data analysts and scientists.
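As a small taste of these points, the snippet below installs and loads
ggplot2 and draws a simple publication-quality scatterplot from the
built-in mtcars dataset; the variables plotted are just an example.
# Install a package from CRAN (only needed once) and load it
install.packages("ggplot2")
library(ggplot2)
# A scatterplot of fuel efficiency against weight
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon",
       title = "Fuel efficiency versus weight")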
EDA
Exploratory Data Analysis (EDA) is a critical step in the data science
process. EDA is the process of analyzing and visualizing data to
understand its underlying patterns, distributions, and relationships.
The goal of EDA is to gain insight into the data and identify any
potential problems or issues that need to be addressed before modeling
and analysis.
The data science process typically includes the following steps:
1. Problem Definition: The first step in the data science process is
to define the problem that needs to be solved. This involves
identifying the business question or problem that needs to be answered
and determining the data needed to address it.
2. Data Collection: The next step is to collect the data needed to
address the problem. This may involve collecting data from internal or
external sources, or acquiring data through web scraping or other
means.
3. Data Cleaning and Preparation: Once the data is collected, it must
be cleaned and prepared for analysis. This involves identifying and
correcting any errors or inconsistencies in the data, dealing with
missing values, and transforming the data into a format that can be
easily analyzed.
4. Exploratory Data Analysis: The next step is to perform EDA on the
data. This involves using descriptive statistics, visualizations, and
other techniques to explore the data and gain insight into its
underlying patterns and relationships.
5. Statistical Modeling: Once the data has been cleaned and explored,
statistical models can be built to address the business question or
problem. This may involve building regression models, decision trees,
or other types of models.
6. Model Evaluation: The models are then evaluated to determine their
accuracy and effectiveness in addressing the problem. This may involve
testing the models on a separate data set or using cross-validation
techniques.
7. Deployment: Once the models have been evaluated, they can be
deployed in a production environment to address the business question
or problem.
8. Monitoring and Maintenance: Finally, the models must be monitored
and maintained to ensure that they continue to perform effectively
over time.
Overall, EDA plays a critical role in the data science process. By
exploring the data and gaining insight into its underlying patterns
and relationships, analysts can identify potential problems and
address them before modeling and analysis. This helps to ensure that
the resulting models are accurate and effective in addressing the
business question or problem.
Basic tools of EDA (plots, graphs, and summary statistics)
Exploratory Data Analysis (EDA) involves using a variety of tools to
visualize and summarize data. Some of the basic tools used in EDA
include:
1. Histograms: Histograms are used to visualize the distribution of a
numerical variable. They display the frequency of values within
specific intervals or bins.
2. Boxplots: Boxplots are used to visualize the distribution of a
numerical variable and to identify potential outliers. They display
the median, quartiles, and range of the data.
3. Scatterplots: Scatterplots are used to visualize the relationship
between two numerical variables. They display the data points as dots
on a two-dimensional graph.
4. Bar charts: Bar charts are used to visualize the frequency or
proportion of categorical variables.
5. Summary statistics: Summary statistics, such as mean, median, and
standard deviation, are used to summarize numerical variables. They
provide information about the central tendency and variability of the
data.
6. Heatmaps: Heatmaps display values on a color-coded grid. In EDA
they are often used to show the frequency or proportion of each
combination of two categorical variables, or the entries of a
correlation matrix.
7. Density plots: Density plots are used to visualize the distribution
of a numerical variable. They display the probability density function
of the data.
8. Box-and-whisker plots: Box-and-whisker plot is simply another name
for the boxplot; the whiskers extend beyond the quartiles, typically
up to 1.5 times the interquartile range, and points beyond the
whiskers are flagged as potential outliers.
These tools can be used to explore data and identify potential
patterns, trends, and outliers. By using a combination of these tools,
analysts can gain a better understanding of the data and identify
potential issues or areas for further investigation.
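Most of these tools are one-liners in base R; the sketch below applies
several of them to the built-in iris dataset, which stands in for any
dataset being explored.
data(iris)
# Summary statistics for every variable
summary(iris)
# Histogram and density plot of a numerical variable
hist(iris$Sepal.Length, main = "Histogram of sepal length")
plot(density(iris$Sepal.Length), main = "Density of sepal length")
# Boxplot of a numerical variable split by a categorical variable
boxplot(Sepal.Length ~ Species, data = iris)
# Scatterplot of two numerical variables
plot(iris$Sepal.Length, iris$Petal.Length,
     xlab = "Sepal length", ylab = "Petal length")
# Bar chart of a categorical variable and a correlation matrix of the numeric columns
barplot(table(iris$Species))
cor(iris[, 1:4])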
The Data Science Process - Case Study
Let's walk through a simple case study to illustrate the data science
process:
1. Problem Definition: A marketing team wants to increase sales of
their new product, a healthy snack bar, and they want to identify the
most effective marketing channels to reach their target audience.
2. Data Collection: The team collects sales data and marketing data
from different sources, including social media, email campaigns, and
customer surveys.
3. Data Cleaning and Preparation: The team cleans and prepares the
data by removing duplicates, filling in missing values, and converting
data into a consistent format.
4. Exploratory Data Analysis: The team performs EDA on the data to
identify patterns, trends, and relationships. They create
visualizations, such as histograms and scatterplots, to better
understand the distribution of sales and the relationship between
different marketing channels and sales.
5. Statistical Modeling: The team uses statistical modeling
techniques, such as regression analysis, to identify the most
significant factors that affect sales. They build a model that
predicts sales based on different marketing channels, demographics,
and other variables.
6. Model Evaluation: The team evaluates the model by comparing its
predictions to the actual sales data. They use different evaluation
metrics, such as mean squared error (MSE), to determine the accuracy
of the model.
7. Deployment: The team deploys the model in a production environment
and uses it to make predictions about the effectiveness of different
marketing channels.
8. Monitoring and Maintenance: The team monitors the performance of
the model over time and makes adjustments as needed. They continue to
collect data and update the model to improve its accuracy and
effectiveness.
Overall, the data science process involves identifying a problem or
question, collecting and preparing data, performing EDA, building and
evaluating a model, deploying the model, and monitoring its
performance over time. By following this process, data scientists can
effectively analyze data and use it to make informed decisions and
drive business value.
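A hedged sketch of steps 5 and 6 of this case study might look like the
R code below; the data frame, the column names, and the coefficients
are entirely hypothetical and only illustrate the workflow.
# Hypothetical weekly data: sales and spend on two marketing channels
set.seed(7)
marketing <- data.frame(social = runif(52, 0, 10),
                        email  = runif(52, 0, 10))
marketing$sales <- 5 + 2 * marketing$social + 1.5 * marketing$email +
  rnorm(52, sd = 3)
# Step 5: regression model of sales on the marketing channels
model <- lm(sales ~ social + email, data = marketing)
summary(model)
# Step 6: evaluate the model with mean squared error (MSE)
predictions <- predict(model, marketing)
mean((marketing$sales - predictions)^2)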
Real Direct (online real estate firm).
Real Direct is an online real estate firm that uses data science to
provide a more streamlined and efficient buying and selling experience
for customers. Here are some ways Real Direct uses data science:
1. Predictive Analytics: Real Direct uses predictive analytics to
identify potential buyers and sellers, as well as to estimate home
values. By analyzing data on historical sales, property features, and
market trends, Real Direct can provide accurate predictions of home
values and identify potential customers.
2. Matching Algorithms: Real Direct uses matching algorithms to
connect buyers with sellers who meet their specific criteria. By
analyzing data on buyer preferences, property features, and location,
Real Direct can quickly and accurately match buyers with properties
that meet their needs.
3. Data Visualization: Real Direct uses data visualization techniques
to display property data in a user-friendly and informative way. This
includes interactive maps, graphs, and charts that allow users to
explore and compare property data.
4. Chatbots: Real Direct uses chatbots to provide instant customer
support and answer frequently asked questions. By using natural
language processing and machine learning, the chatbots can quickly and
accurately respond to customer inquiries and provide personalized
recommendations.
5. Image Recognition: Real Direct uses image recognition technology to
automatically identify and classify property images. This allows for
faster and more accurate listing creation, as well as improved search
functionality for users.
Overall, Real Direct uses data science to provide a more efficient and
personalized real estate experience for its customers. By leveraging
data and advanced technologies, Real Direct is able to streamline the
buying and selling process and provide valuable insights to customers.
