UNIT I
INTRODUCTION TO DATA SCIENCE
What is Data Science?
► Data Science is an interdisciplinary field that uses scientific
methods, processes, algorithms and systems to extract
knowledge and insights from structured and unstructured data, and
applies that knowledge and actionable insight across a broad
range of application domains.
Data Science Definition
► Data science is the practice of mining large sets of raw data, both
structured and unstructured, to identify patterns and extract
actionable insight from them. It is an interdisciplinary field whose
foundations include statistics, inference, computer science,
predictive analytics, machine learning algorithm development, and
new technologies for gaining insights from big data.
► The data science life cycle begins with acquiring data, extracting it
and entering it into the system.
► The next stage is maintenance, which includes data warehousing,
data cleaning, data processing, data staging, and data
architecture.
Stages of Data Science Lifecycle
Data science has five stages:
► Capture: Data acquisition, data entry, signal reception, data
extraction
► Maintain: Data warehousing, data cleansing, data staging, data
processing, data architecture
► Process: Data mining, clustering/classification, data modeling, data
summarization
► Analyze: Exploratory/confirmatory analysis, predictive analysis,
regression, text mining, qualitative analysis
► Communicate: Data reporting, data visualization, business
intelligence, decision making
Why Businesses Need Data Science?
► The amount of data created every day has created a need for
professionals who can handle it and make sense of it.
► There is a huge mine of unstructured and semi-structured data
coming from various sources, and traditional business
intelligence tools are not sufficient to make sense of it.
► Data science offers advanced tools for working on large volumes of
data coming from various types of sources such as financial logs,
marketing forms, sensors, instruments, text files, and multimedia files.
Job Roles in Data Science
► Data Analyst
► Data Engineers
► Database Administrator
► Machine Learning Engineer
► Data Scientist
► Data Architect
► Statistician
► Business Analyst
► Data and Analytics Manager
Skill Set Needed for a Data Scientist
► Technical
► Statistical analysis and computing
► Machine Learning
► Deep Learning
► Processing large data sets
► Data Visualization
► Data Wrangling
► Mathematics
► Programming
► Statistics
► Big Data
Skill Set Needed for a Data Scientist
► Non-Technical
► Critical Thinking
► Effective Communication
► Proactive Problem Solving
► Intellectual Curiosity
► Business Sense
Statistical Inference
Statistical inference is the process of using data analysis to infer
properties of an underlying probability distribution.
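As a rough illustration of inferring a property of a population from a sample, the sketch below estimates a population mean and an approximate 95% confidence interval. The height values are made up for the example, and the normal approximation is an assumption; for a sample this small a t-distribution quantile would usually be preferred.

```python
import math
import statistics

# Hypothetical sample: heights (cm) of 10 people drawn from a larger population.
sample = [168.0, 172.5, 165.0, 180.0, 175.5, 169.0, 171.0, 177.5, 166.5, 174.0]

n = len(sample)
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)        # sample standard deviation (divides by n - 1)
standard_error = sample_sd / math.sqrt(n)

# Approximate 95% interval using the standard normal quantile (about 1.96).
# Assumption: the normal approximation is acceptable for this illustration.
z = statistics.NormalDist().inv_cdf(0.975)
ci_low = sample_mean - z * standard_error
ci_high = sample_mean + z * standard_error

print(f"mean = {sample_mean:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

The interval is the inferred statement about the population: it describes a plausible range for the unknown population mean, not just the sample.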
EDA and the Data Science Process
Basic Tools of EDA
Some of the most common tools used to perform exploratory data analysis (EDA) are:
1. R: An open-source programming language and free software environment
for statistical computing and graphics, supported by the R Foundation for
Statistical Computing. The R language is widely used among statisticians for
developing statistical observations and data analysis.
2. Python: An interpreted, object-oriented programming language with
dynamic semantics. Its high-level, built-in data structures, combined with
dynamic binding, make it very attractive for rapid application development,
as well as for use as a scripting or glue language to connect existing
components. Python is often used in EDA to spot missing values in a data
set, which is important for deciding how to handle those missing values
before applying machine learning.
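Spotting missing values, as mentioned for Python above, can be sketched with the standard library alone. The data set, the column names and the helper `count_missing` are hypothetical; in practice a data-analysis library is often used for this step instead.

```python
import csv
import io

# Hypothetical raw data; empty fields stand for missing values.
raw = """age,income,city
34,52000,Pune
,61000,Delhi
29,,Mumbai
41,58000,
"""

def count_missing(csv_text):
    """Return {column: number of empty cells} for CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            # A cell counts as missing if it is absent or only whitespace.
            if value is None or value.strip() == "":
                missing[name] += 1
    return missing

missing_counts = count_missing(raw)
print(missing_counts)
```

The per-column counts show where values are absent, which informs the choice between dropping rows and filling in substitute values.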
Application of Data Science
► Anomaly detection (fraud, disease, crime, etc.)
► Automation and decision-making (background checks,
creditworthiness, etc.)
► Classifications (in an email server, this could mean classifying emails
as important or junk)
► Forecasting (sales, revenue and customer retention)
► Pattern detection (weather patterns, financial market patterns, etc.)
► Recognition (facial, voice, text, etc.)
► Recommendations (based on learned preferences,
recommendation engines can refer you to movies, restaurants and
books you may like)
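To make one of these applications concrete, here is a minimal sketch of anomaly detection using z-scores. The transaction amounts, the helper `find_anomalies` and the threshold of 2.0 are assumptions chosen for illustration; real fraud detection systems use far richer models.

```python
import statistics

def find_anomalies(values, threshold=2.0):
    """Return the values whose z-score magnitude exceeds the threshold."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)   # population standard deviation
    if sd == 0:
        return []                    # all values identical: nothing stands out
    return [v for v in values if abs(v - mean) / sd > threshold]

# Hypothetical daily transaction amounts; 950 is unusually large.
amounts = [52, 48, 50, 47, 53, 49, 51, 950, 50, 46]
print(find_anomalies(amounts))
```

A single extreme value inflates the standard deviation in a small sample, which is why the assumed threshold here is lower than the value of 3 often quoted for large data sets.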
Data Science in Business
► Gain Customer Insights
► Increase Security
► Inform Internal Finances
► Streamline Manufacturing
► Predict Future Market Trends
Business Intelligence vs. Data Science
S.No | Factor | Data Science | Business Intelligence
1 | Concept | It is a field that uses mathematics, statistics and various other tools to discover hidden patterns in data. | It is a set of technologies, applications and processes that enterprises use for business data analysis.
2 | Focus | It focuses on the future. | It focuses on the past and present.
3 | Data | It deals with both structured and unstructured data. | It deals mainly with structured data.
4 | Flexibility | It is much more flexible, as data sources can be added as required. | It is less flexible, as data sources need to be pre-planned.
5 | Method | It makes use of the scientific method. | It makes use of the analytic method.
Data Analytics Lifecycle
Machine Learning
Machine learning (ML) is a type of artificial intelligence (AI) that allows software
applications to become more accurate at predicting outcomes without being explicitly
programmed to do so. Machine learning algorithms use historical data as input to
predict new output values.
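The idea of using historical data as input to predict new output values can be sketched with a least-squares line fit. The advertising and sales figures and the helper `fit_line` are hypothetical, and a straight line is only one of many possible models.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical historical data: advertising spend vs. units sold.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
units_sold = [12.0, 14.0, 16.0, 18.0, 20.0]

# "Learning" here means estimating the two parameters from past data.
slope, intercept = fit_line(ad_spend, units_sold)

# Predict the output for an input that was not in the historical data.
predicted = slope * 6.0 + intercept
print(predicted)
```

The program was never told the rule linking spend to sales; it estimated the rule from examples, which is the sense in which the model is not explicitly programmed.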
Why is machine learning important?
Machine learning is important because it gives enterprises a view of trends in customer
behavior and business operational patterns, and supports the development of
new products. Many of today's leading companies, such as Facebook, Google and
Uber, make machine learning a central part of their operations. Machine learning has
become a significant competitive differentiator for many companies.
What are the different types of machine learning?
Classical machine learning is often categorized by how an algorithm learns to become
more accurate in its predictions. There are four basic approaches: supervised learning,
unsupervised learning, semi-supervised learning and reinforcement learning. The type
of algorithm data scientists choose to use depends on what type of data they want to
predict.
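The difference between the first two approaches can be sketched on a tiny, made-up data set: a supervised method learns from labelled examples, while an unsupervised method groups unlabelled values on its own. The helpers `nearest_neighbour_predict` and `two_means` are illustrative names, and semi-supervised and reinforcement learning are not shown.

```python
# Supervised: labelled examples (feature, label) are available.
labelled = [(1.0, "small"), (1.5, "small"), (8.0, "large"), (9.0, "large")]

def nearest_neighbour_predict(x, examples):
    """Predict the label of the closest labelled example (1-nearest neighbour)."""
    _, label = min(examples, key=lambda pair: abs(pair[0] - x))
    return label

# Unsupervised: only features, no labels; group them into two clusters.
def two_means(values, iterations=10):
    """Very small 1-D k-means with k = 2, seeded with the min and max values."""
    centres = [min(values), max(values)]
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            idx = 0 if abs(v - centres[0]) <= abs(v - centres[1]) else 1
            groups[idx].append(v)
        # Move each centre to the mean of its group (keep it if the group is empty).
        centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
    return groups

print(nearest_neighbour_predict(2.0, labelled))
print(two_means([1.0, 1.5, 8.0, 9.0]))
```

In the supervised case the answer is one of the labels supplied in advance; in the unsupervised case the algorithm only reports which values belong together, and naming the groups is left to the analyst.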
