UNIT-I
Data: Data can be defined as an elementary value or a collection of values. For example,
a student's name and ID are data about the student.
DATA SCIENCE :
• Data science is a deep study of the massive amount of data, which involves
extracting meaningful insights from raw, structured, and unstructured data that is
processed using the scientific method, different technologies, and algorithms.
• Data Science is an interdisciplinary field that focuses on extracting knowledge
from data sets, which are typically very large. The field encompasses
analysis, preparing data for analysis, and presenting findings to inform high-level
decisions in an organization. As such, it incorporates skills from computer
science, mathematics, statistics, information visualization, graphics, and business.
In short, we can say that data science is all about:
o Asking the correct questions and analyzing the raw data.
o Modeling the data using various complex and efficient algorithms.
o Visualizing the data to get a better perspective.
o Understanding the data to make better decisions and finding the final result.
NEED FOR DATA SCIENCE:
• Traditionally, the data that we had was mostly structured and small in size, and it
could be analyzed using simple Business Intelligence (BI) tools. Today, most data is
unstructured or semi-structured.
• This data is generated from different sources like financial logs, text files,
multimedia forms, sensors, and instruments. Simple BI tools are not capable
of processing this huge volume and variety of data. This is why we need more
complex and advanced analytical tools and algorithms for processing, analyzing
and drawing meaningful insights out of it.
LIFECYCLE OF DATA SCIENCE
Step 1: Define Problem Statement: Creating a well-defined problem statement is the first
and a critical step in data science. It is a brief description of the problem that you are going
to solve.
Step 2: Data Collection:
You need to collect the data which can help to solve the problem. Data collection is a
systematic approach to gather relevant information from a variety of sources. Depending
on the problem statement, the data collection method is broadly classified into two
categories.
• Primary Data Collection:
Use this method when you have a unique problem and no related research has been done on the
subject, so you need to collect new data. This method is called primary data
collection.
For example, suppose you want information on the average time that employees spend in a
cafeteria across companies. No public data is available on this, but you can collect
the data through methods such as surveys, interviews with employees, and
monitoring the time employees spend in the cafeteria. This method is time-consuming.
• Secondary Data Collection:
Use this method when the data is readily available or has already been collected by someone else.
Such data can be found on the internet and in news articles, government censuses, magazines, and so on. This
method is called secondary data collection. It is less time-consuming than
the primary method.
Step 3: Data Quality Check and Remediation:
One of the most important aspects, and one that data scientists often ignore, is ensuring
that the data used for analysis and interpretation is of good quality.
After collecting the data, most people start analysing it right away. Often, they forget to
do a sanity check on the data. If the data is of bad quality, it can give misleading
information.
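A sanity check of this kind can be sketched with pandas. The table below is made up for illustration (the column names employee_id and minutes_in_cafeteria are assumptions, not part of any real data set); the sketch counts missing values, repeated rows, and impossible values before any analysis starts.

```python
import pandas as pd

# Hypothetical survey data with typical quality problems.
raw = pd.DataFrame({
    "employee_id": [1, 2, 2, 3, 4],
    "minutes_in_cafeteria": [25.0, 40.0, 40.0, None, -5.0],
})

# Missing values in each column.
missing_per_column = raw.isnull().sum()

# Rows that repeat an earlier row exactly.
duplicate_rows = int(raw.duplicated().sum())

# Values that cannot be right: a duration cannot be negative.
negative_minutes = int((raw["minutes_in_cafeteria"] < 0).sum())

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print("negative durations:", negative_minutes)
```

Which checks matter depends on the data; these three are only common starting points.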
Step 4: Exploratory Data Analysis:
Before modelling the data, it is important to explore it. This is the most exciting
step, as it helps you build familiarity with the data and extract useful insights. If this step
is skipped, you might end up building inaccurate models and choosing
insignificant variables for your model.
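As a minimal sketch of exploratory analysis, assuming a small invented table of study hours and exam scores, pandas can summarise each column and show how strongly the variables are related. A variable with almost no correlation to the target is a candidate for being insignificant in a later model.

```python
import pandas as pd

# Hypothetical data set: hours studied and exam score for six students.
students = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "exam_score": [52, 55, 61, 68, 72, 79],
})

# Count, mean, standard deviation, minimum, quartiles, and maximum per column.
summary = students.describe()

# Pairwise Pearson correlation between the numeric columns.
correlation = students.corr()

print(summary)
print(correlation)
```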
Step 5: Data Modelling:
Modelling means formulating every step and gathering the techniques required to
achieve the solution. You need to list the flow of the calculations, which is nothing
but the modelling steps that lead to the solution. The important factor is how to perform the
calculations. There are various techniques in Statistics and Machine Learning that you
can choose from based on the requirement.
Step 6: Data Communication:
This is the final step where you present the results from your analysis to the
stakeholders. You explain to them how you came to a specific conclusion and your critical
findings.
Most often you need to present your findings to a non-technical audience, such as
the marketing team or business executives. You need to communicate the results in a
manner that is simple to understand, so that the stakeholders are able to chalk out an
actionable plan from them.
DATA SCIENCE COMPONENTS:
The main components of Data Science are given below:
1. Statistics: Statistics is one of the most important components of data science. Statistics
is a way to collect and analyze large amounts of numerical data and to find
meaningful insights in it.
2. Domain Expertise: In data science, domain expertise binds data science together.
Domain expertise means specialized knowledge or skills in a particular area. In data
science, there are various areas for which we need domain experts.
3. Data engineering: Data engineering is a part of data science that involves acquiring,
storing, retrieving, and transforming the data. Data engineering also includes adding metadata
(data about data) to the data.
4. Visualization: Data visualization means representing data in a visual context so
that people can easily understand its significance. Data visualization makes
large amounts of data easy to take in through visuals.
5. Advanced computing: Advanced computing does the heavy lifting of data science. It
involves designing, writing, debugging, and maintaining the source code of
computer programs.
6. Mathematics: Mathematics is a critical part of data science. Mathematics involves
the study of quantity, structure, space, and change. For a data scientist, a good knowledge of
mathematics is essential.
7. Machine learning: Machine learning is the backbone of data science. Machine learning is
all about training a machine so that it can act like a human brain. In data
science, we use various machine learning algorithms to solve problems.
TOOLS FOR DATA SCIENCE
Following are some tools required for data science:
o Data Analysis tools: R, Python, Statistics, SAS, Jupyter, R Studio, MATLAB, Excel,
RapidMiner.
o Data Warehousing: ETL, SQL, Hadoop, Informatica/Talend, AWS Redshift
o Data Visualization tools: R, Jupyter, Tableau, Cognos.
o Machine learning tools: Spark, Mahout, Azure ML studio.
APPLICATIONS OF DATA SCIENCE:
o Image recognition and speech recognition:
Data science is currently used for image and speech recognition. When you
upload an image on Facebook, you start getting suggestions to tag your
friends. This automatic tagging suggestion uses an image recognition algorithm,
which is part of data science.
When you say something to "Ok Google", Siri, Cortana, etc., these devices
respond to your voice commands. This is possible because of speech recognition
algorithms.
o Gaming world:
In the gaming world, the use of machine learning algorithms is increasing day by
day. EA Sports, Sony, and Nintendo widely use data science to enhance the user
experience.
o Internet search:
When we want to search for something on the internet, we use
search engines such as Google, Yahoo, Bing, Ask, etc. All these search
engines use data science technology to make the search experience better,
so that you get a search result within a fraction of a second.
o Transport:
Transport industries are also using data science technology to create self-driving
cars. Self-driving cars are expected to help reduce the number of road
accidents.
o Healthcare:
In the healthcare sector, data science is providing lots of benefits. Data science is
being used for tumor detection, drug discovery, medical image analysis, virtual
medical bots, etc.
o Recommendation systems:
Most companies, such as Amazon, Netflix, Google Play, etc., use data
science technology to provide a better user experience with personalized
recommendations. For example, when you search for something on Amazon and
start getting suggestions for similar products, this is because of data science
technology.
o Risk detection:
Finance industries have always had issues with fraud and the risk of losses, but with the
help of data science, these can be reduced.
Most finance companies look for data scientists to avoid risk and
losses of any type while increasing customer satisfaction.
PYTHON FOR DATA SCIENCE
• Python is an open-source, interpreted, high-level language that provides
a great approach to object-oriented programming. It is one of the best
languages used by data scientists for various data science
projects and applications.
• Python provides great functionality for dealing with mathematics, statistics,
and scientific functions. It provides great libraries for data
science applications.
• One of the main reasons why Python is widely used in the scientific and
research communities is because of its ease of use and simple syntax
which makes it easy to adapt for people who do not have an engineering
background. It is also more suited for quick prototyping.
Features of Python language:
• It uses an elegant syntax, so programs are easier to read.
• It is a simple language to pick up, which makes it easy to get a program
working.
• It has a large standard library and community support.
• The interactive mode of Python makes it simple to test code.
• In Python, it is also simple to extend the code by adding new modules that
are implemented in a compiled language such as C++ or C.
• Python is an expressive language that can be embedded into
applications to offer a programmable interface.
• It allows developers to run the code anywhere, including Windows, Mac OS X,
UNIX, and Linux.
• It is free software in a couple of senses: it does not cost anything to use or
download Python or to add it to an application.
NEED FOR PYTHON IN DATA SCIENCE
Python is arguably the best-suited language for a data scientist. The following
points will help you understand why people go with Python for data science:
• Python is a free, flexible, and powerful open-source language
• Python can cut development time considerably with its simple and easy-to-read syntax
• With Python, you can perform data manipulation, analysis, and visualization
• Python provides powerful libraries for machine learning applications and other
scientific computations
PYTHON IDES FOR DATA SCIENCE
Data Science is a field that is used to study and understand data and draw various
conclusions with the help of different scientific processes. Python is a popular language
that is quite useful for data science because of its capacity for statistical analysis and its
easy readability. Python also has various packages for machine learning, natural
language processing, data visualization, data analysis, etc. that make it suited for data
science. Some of the Python IDEs that are used for data science are given as follows:
1. Jupyter notebook – Jupyter notebook is an open source IDE that is used to
create Jupyter documents that contain live code and can be shared.
Also, it is a web-based interactive computational environment. The Jupyter
notebook can support various languages that are popular in data science
such as Python, Julia, Scala, R, etc.
2. Spyder – Spyder is an open source IDE that was originally created and
developed by Pierre Raybaut in 2009. It can be integrated with many
different Python packages such as NumPy, SymPy, SciPy, pandas, IPython,
etc. The Spyder editor also supports code introspection, code completion,
syntax highlighting, horizontal and vertical splitting, etc.
3. Sublime Text – Sublime Text is a proprietary code editor that supports a
Python API. Some of the features of Sublime Text are project-specific
preferences, quick navigation, cross-platform plugin support, etc.
While Sublime Text is quite fast and has a good support community, it is not
available for free.
4. Visual Studio Code –
Visual Studio Code is a code editor that was developed by Microsoft. It was
developed using Electron but it does not use Atom. Some of the features of
Visual Studio Code are embedded Git control, intelligent code completion,
support for debugging, syntax highlighting, code refactoring, etc. It is also
quite fast and lightweight.
5. PyCharm –
PyCharm is an IDE developed by JetBrains and created specifically for
Python. It has various features such as code analysis, an integrated unit tester,
an integrated Python debugger, support for web frameworks, etc. PyCharm is
particularly useful in machine learning because it supports libraries such as
Pandas, Matplotlib, Scikit-Learn, NumPy, etc.
6. Rodeo –
Rodeo is an open source IDE that was developed by Yhat for data science in
Python. Rodeo includes Python tutorials and also cheat sheets that can
be used for reference if required. Some of the features of Rodeo are syntax
highlighting, auto-completion, easy interaction with data frames and plots,
built-in IPython support, etc.
7. Thonny –
Thonny is an IDE that was developed at the University of Tartu for
Python. It is created for beginners who are learning to program in Python
or for those who are teaching it. Some of the features of Thonny are statement
stepping without breakpoints, a simple pip GUI, line numbers, live variables
during debugging, etc.
8. Atom –
Atom is an open source text and code editor that was developed using
Electron. It has multiple features such as a sleek interface, a file system
browser, various extensions, etc. Atom also has extensions that add
Python support.
9. Geany –
Geany is a free text editor that supports Python and contains IDE features as
well. It was originally authored by Enrico Tröger in C and C++. Some of the
features of Geany are Symbol lists, Auto-completion, Syntax highlighting,
Code navigation, Multiple document support, etc.
MOST COMMONLY USED PYTHON LIBRARIES FOR DATA SCIENCE :
• NumPy: NumPy is a Python library that provides mathematical functions to handle
large multidimensional arrays. It provides various methods and functions for arrays, matrices,
and linear algebra.
NumPy stands for Numerical Python. It provides lots of useful features for
operations on n-dimensional arrays and matrices in Python. The library provides
vectorization of mathematical operations on the NumPy array type, which
enhances performance and speeds up execution. It is very easy to work with
large multidimensional arrays and matrices using NumPy.
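Vectorization can be illustrated with a short sketch. The arrays below are invented values; the point is that arithmetic is written once for the whole array rather than in an explicit Python loop.

```python
import numpy as np

prices = np.array([10.0, 20.0, 30.0])
quantities = np.array([2, 1, 4])

# Vectorized arithmetic: multiplied element by element, no Python loop.
totals = prices * quantities

# A 2-D array (matrix) and a matrix-vector product from linear algebra.
matrix = np.array([[1, 2], [3, 4]])
vector = np.array([1, 1])
product = matrix @ vector

print(totals.tolist())   # [20.0, 20.0, 120.0]
print(product.tolist())  # [3, 7]
```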
• Pandas: Pandas is one of the most popular Python libraries for data manipulation
and analysis. Pandas provides useful functions to manipulate large amounts of
structured data and easy methods to perform analysis. It provides
data structures and operations for manipulating numerical tables and time series data.
Pandas is a perfect tool for data wrangling. Pandas is designed for quick and easy
data manipulation, aggregation, and visualization. There are two main data structures in
Pandas:
Series – handles and stores one-dimensional data.
DataFrame – handles and stores two-dimensional data.
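The two data structures can be sketched as follows. The names, departments, and marks are invented for illustration.

```python
import pandas as pd

# Series: one-dimensional labelled data.
marks = pd.Series([78, 85, 91], index=["Asha", "Ravi", "Meena"])

# DataFrame: two-dimensional labelled data (rows and columns).
students = pd.DataFrame({
    "name": ["Asha", "Ravi", "Meena"],
    "department": ["CSE", "ECE", "CSE"],
    "marks": [78, 85, 91],
})

# Aggregation: average marks per department.
average_by_department = students.groupby("department")["marks"].mean()

print(marks["Ravi"])
print(average_by_department)
```

Each column of a DataFrame is itself a Series, so the same operations apply to both.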
• Matplotlib: Matplotlib is another useful Python library for data
visualization. Descriptive analysis and data visualization are very important for
any organization. Matplotlib provides various methods to visualize data in
a more effective way. Matplotlib allows you to quickly make line graphs, pie charts,
histograms, and other professional-grade figures. Using Matplotlib, one can
customize every aspect of a figure. Matplotlib has interactive features such as
zooming and panning, and it can save a graph in common graphics formats.
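A minimal line-graph sketch, using made-up monthly sales figures, shows the usual steps: create a figure, plot, label, and save. The output file name and the choice of the non-interactive "Agg" backend are assumptions made so that the example runs without a display.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 135, 128, 150]  # invented values

fig, ax = plt.subplots()
ax.plot(months, sales, marker="o")
ax.set_title("Monthly sales (made-up data)")
ax.set_xlabel("Month")
ax.set_ylabel("Units sold")

# Save the figure as a PNG file in the system's temporary directory.
output_path = os.path.join(tempfile.gettempdir(), "monthly_sales.png")
fig.savefig(output_path)
plt.close(fig)
```

In a Jupyter notebook the figure would normally be shown inline instead of being saved to a file.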
• SciPy: SciPy is another popular Python library for data science and scientific
computing. SciPy provides great functionality for scientific mathematics and
computing. SciPy contains sub-modules for optimization,
linear algebra, integration, interpolation, special functions, FFT, signal and
image processing, ODE solvers, statistics, and other tasks common in science
and engineering.
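Two of these sub-modules, integration and optimization, can be sketched on simple functions whose answers are known in advance. The functions are chosen only for illustration.

```python
from scipy import integrate, optimize

# Numerical integration: the integral of x**2 from 0 to 3 is exactly 9.
area, error_estimate = integrate.quad(lambda x: x ** 2, 0, 3)

# Optimization: find the x that minimizes (x - 2)**2 + 1.
# The minimum is at x = 2, where the function value is 1.
result = optimize.minimize_scalar(lambda x: (x - 2) ** 2 + 1)

print(area)
print(result.x, result.fun)
```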
• Scikit-learn: Scikit-learn (sklearn) is a Python library for machine learning. It
provides various algorithms and functions that are used in machine learning.
Scikit-learn is built on NumPy, SciPy, and Matplotlib. It provides easy and
simple tools for data mining and data analysis, and it offers a set of common
machine learning algorithms to users through a consistent interface. Scikit-learn
helps to quickly implement popular algorithms on datasets and solve
real-world problems.
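The consistent interface mentioned above usually means creating a model, calling fit on training data, and calling predict on new data. A minimal sketch with linear regression, using invented data that follows y = 2x + 1, looks like this:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data that follows y = 2x + 1 exactly.
X = np.array([[1], [2], [3], [4], [5]])   # one feature per row
y = np.array([3, 5, 7, 9, 11])

model = LinearRegression()
model.fit(X, y)                               # learn slope and intercept
prediction = model.predict(np.array([[6]]))   # predict for an unseen x

print(model.coef_[0], model.intercept_)
print(prediction[0])
```

Other scikit-learn estimators follow the same fit and predict pattern, so swapping the algorithm typically changes only the line that creates the model.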
PYTHON BASICS FOR DATA SCIENCE
Basic concept of Python Programming :
• Variables: Variables refer to reserved memory locations that store values.
In Python, you don't need to declare variables before using them or declare
their type.
• Data Types: Python supports numerous data types, which define the operations
possible on the variables and the storage method. The list of data types includes
numeric types, lists, strings, tuples, sets, and dictionaries.
• Operators: Operators help to manipulate the values of operands. The list of
operators in Python includes arithmetic, comparison, assignment, logical,
bitwise, membership, and identity operators.
• Conditional Statements: Conditional statements help to execute a set of
statements based on a condition. There are three conditional keywords:
if, elif, and else.
• Loops: Loops are used to execute a block of code repeatedly. Python has two
loop statements, while and for, and loops can be nested inside one another.
• Functions: Functions are used to divide your code into useful blocks, allowing you
to organize the code, make it more readable, reuse it, and save time.
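The concepts above can be seen together in one short sketch. All names and values here (student_name, marks, the grade boundaries) are invented for illustration.

```python
# Variables: created on assignment, no type declaration needed.
student_name = "Asha"                          # string
marks = [72, 88, 95]                           # list
student = {"id": 101, "name": student_name}    # dictionary


def grade(score):
    """Return a letter grade using if / elif / else."""
    if score >= 90:
        return "A"
    elif score >= 75:
        return "B"
    else:
        return "C"


# for loop: visit each element of a sequence.
grades = []
for score in marks:
    grades.append(grade(score))

# while loop: repeat until the condition becomes false.
total = 0
index = 0
while index < len(marks):
    total += marks[index]      # arithmetic and assignment operators
    index += 1

average = total / len(marks)
has_top_score = 95 in marks    # membership operator

print(student["name"], grades, average, has_top_score)
```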
Practical implementation using Python code:
Loading The Data
The very first step is loading the data into your program. We can do so by
using the read_csv() function from the Python pandas library.
import pandas as pd
data = pd.read_csv("file_name.csv")
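As a minimal sketch of what read_csv() returns, the example below reads CSV text from memory instead of a file, so it runs without any file on disk. The column names and values are invented; with a real file you would pass its path as shown above.

```python
import io

import pandas as pd

# In-memory stand-in for a CSV file; the columns are invented for this example.
csv_text = "name,age\nAsha,21\nRavi,\nMeena,23\n"

data = pd.read_csv(io.StringIO(csv_text))

print(data.shape)    # (number of rows, number of columns)
print(data.head())   # first rows of the table
print(data.dtypes)   # type inferred for each column
```

The empty age for the second row is read as a missing value (NaN), which is why a cleaning step normally follows loading.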
Cleaning the Data
The next step is to look for irregularities in the data by doing some data exploration.
This phase involves finding the null values and either replacing them with other values or
dropping the affected rows altogether.
# summary statistics for the numeric columns
data.describe()
# check for null values in each column
data.isnull().sum()
# drop the rows that contain null values
df = data.dropna()
# check again to be doubly sure
df.isnull().sum()
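Replacing null values instead of dropping rows can be sketched with fillna(). The table and the choice of the column mean as the replacement value are assumptions for illustration; the right replacement depends on the data.

```python
import pandas as pd

# Invented table with one missing age.
data = pd.DataFrame({
    "name": ["Asha", "Ravi", "Meena"],
    "age": [21.0, None, 23.0],
})

# Replace the missing age with the mean of the known ages.
mean_age = data["age"].mean()
filled = data.fillna({"age": mean_age})

print(filled)
```

Unlike dropna(), this keeps every row, at the cost of introducing an estimated value.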
Visualization
After we are done cleaning, we can move ahead and make some visualizations to
understand the relationship between various aspects of our dataset.
import seaborn as sns
sns.scatterplot(x=df["npg"], y=df["birth_rate"])

More Related Content

Similar to Data Science Unit1 AMET.pdf

Data Analytics in Industry Verticals, Data Analytics Lifecycle, Challenges of...
Data Analytics in Industry Verticals, Data Analytics Lifecycle, Challenges of...Data Analytics in Industry Verticals, Data Analytics Lifecycle, Challenges of...
Data Analytics in Industry Verticals, Data Analytics Lifecycle, Challenges of...
Sahilakhurana
 
INTRODUCTION TO DATA SCIENCE -CONCEPTS.pptx
INTRODUCTION TO DATA SCIENCE -CONCEPTS.pptxINTRODUCTION TO DATA SCIENCE -CONCEPTS.pptx
INTRODUCTION TO DATA SCIENCE -CONCEPTS.pptx
Madhumitha N
 
data science and business analytics
data science and business analyticsdata science and business analytics
data science and business analytics
sunnypatil1778
 
Untitled document.pdf
Untitled document.pdfUntitled document.pdf
Untitled document.pdf
MuhammadTahiriqbal13
 
Emerging_Exponential_Technologies[1]_[Autosaved]_[Autosaved][1].pptx
Emerging_Exponential_Technologies[1]_[Autosaved]_[Autosaved][1].pptxEmerging_Exponential_Technologies[1]_[Autosaved]_[Autosaved][1].pptx
Emerging_Exponential_Technologies[1]_[Autosaved]_[Autosaved][1].pptx
sahanagowda464633
 
Ch1IntroductiontoDataScience.pptx
Ch1IntroductiontoDataScience.pptxCh1IntroductiontoDataScience.pptx
Ch1IntroductiontoDataScience.pptx
AbderrahmanABID2
 
Unit 1 (DSBDA) PD.pptx
Unit 1 (DSBDA)  PD.pptxUnit 1 (DSBDA)  PD.pptx
Unit 1 (DSBDA) PD.pptx
Samiksha880257
 
Data Analytics Course In Surat.pdf
Data Analytics Course In Surat.pdfData Analytics Course In Surat.pdf
Data Analytics Course In Surat.pdf
Sujata Gupta
 
Introduction-to-Data-Science.pdf
Introduction-to-Data-Science.pdfIntroduction-to-Data-Science.pdf
Introduction-to-Data-Science.pdf
mallikarjuntalakal
 
Introduction-to-Data-Science.pdf
Introduction-to-Data-Science.pdfIntroduction-to-Data-Science.pdf
Introduction-to-Data-Science.pdf
ikenossama03
 
1 UNIT-DSP.pptx
1 UNIT-DSP.pptx1 UNIT-DSP.pptx
1 UNIT-DSP.pptx
PothyeswariPothyes
 
Data Science for Beginners: A Step-by-Step Introduction
Data Science for Beginners: A Step-by-Step IntroductionData Science for Beginners: A Step-by-Step Introduction
Data Science for Beginners: A Step-by-Step Introduction
Uncodemy
 
Lecture-1-Introduction to Deep learning.pptx
Lecture-1-Introduction to Deep learning.pptxLecture-1-Introduction to Deep learning.pptx
Lecture-1-Introduction to Deep learning.pptx
JayChauhan100
 
L3 Big Data and Application.pptx
L3  Big Data and Application.pptxL3  Big Data and Application.pptx
L3 Big Data and Application.pptx
Shambhavi Vats
 
Data Science.pptx NEW COURICUUMN IN DATA
Data Science.pptx NEW COURICUUMN IN DATAData Science.pptx NEW COURICUUMN IN DATA
Data Science.pptx NEW COURICUUMN IN DATA
javed75
 
Data science course in ameerpet Hyderabad
Data science course in ameerpet HyderabadData science course in ameerpet Hyderabad
Data science course in ameerpet Hyderabad
ShivaKanukuntla33
 
best data science course institutes in Hyderabad
best data science course institutes in Hyderabadbest data science course institutes in Hyderabad
best data science course institutes in Hyderabad
rajasrichalamala3zen
 
Data Science course in Hyderabad .
Data Science course in Hyderabad            .Data Science course in Hyderabad            .
Data Science course in Hyderabad .
rajasrichalamala3zen
 
Data Science course in Hyderabad .
Data Science course in Hyderabad         .Data Science course in Hyderabad         .
Data Science course in Hyderabad .
rajasrichalamala3zen
 
data science course in Hyderabad data science course in Hyderabad
data science course in Hyderabad data science course in Hyderabaddata science course in Hyderabad data science course in Hyderabad
data science course in Hyderabad data science course in Hyderabad
akhilamadupativibhin
 

Similar to Data Science Unit1 AMET.pdf (20)

Data Analytics in Industry Verticals, Data Analytics Lifecycle, Challenges of...
Data Analytics in Industry Verticals, Data Analytics Lifecycle, Challenges of...Data Analytics in Industry Verticals, Data Analytics Lifecycle, Challenges of...
Data Analytics in Industry Verticals, Data Analytics Lifecycle, Challenges of...
 
INTRODUCTION TO DATA SCIENCE -CONCEPTS.pptx
INTRODUCTION TO DATA SCIENCE -CONCEPTS.pptxINTRODUCTION TO DATA SCIENCE -CONCEPTS.pptx
INTRODUCTION TO DATA SCIENCE -CONCEPTS.pptx
 
data science and business analytics
data science and business analyticsdata science and business analytics
data science and business analytics
 
Untitled document.pdf
Untitled document.pdfUntitled document.pdf
Untitled document.pdf
 
Emerging_Exponential_Technologies[1]_[Autosaved]_[Autosaved][1].pptx
Emerging_Exponential_Technologies[1]_[Autosaved]_[Autosaved][1].pptxEmerging_Exponential_Technologies[1]_[Autosaved]_[Autosaved][1].pptx
Emerging_Exponential_Technologies[1]_[Autosaved]_[Autosaved][1].pptx
 
Ch1IntroductiontoDataScience.pptx
Ch1IntroductiontoDataScience.pptxCh1IntroductiontoDataScience.pptx
Ch1IntroductiontoDataScience.pptx
 
Unit 1 (DSBDA) PD.pptx
Unit 1 (DSBDA)  PD.pptxUnit 1 (DSBDA)  PD.pptx
Unit 1 (DSBDA) PD.pptx
 
Data Analytics Course In Surat.pdf
Data Analytics Course In Surat.pdfData Analytics Course In Surat.pdf
Data Analytics Course In Surat.pdf
 
Introduction-to-Data-Science.pdf
Introduction-to-Data-Science.pdfIntroduction-to-Data-Science.pdf
Introduction-to-Data-Science.pdf
 
Introduction-to-Data-Science.pdf
Introduction-to-Data-Science.pdfIntroduction-to-Data-Science.pdf
Introduction-to-Data-Science.pdf
 
1 UNIT-DSP.pptx
1 UNIT-DSP.pptx1 UNIT-DSP.pptx
1 UNIT-DSP.pptx
 
Data Science for Beginners: A Step-by-Step Introduction
Data Science for Beginners: A Step-by-Step IntroductionData Science for Beginners: A Step-by-Step Introduction
Data Science for Beginners: A Step-by-Step Introduction
 
Lecture-1-Introduction to Deep learning.pptx
Lecture-1-Introduction to Deep learning.pptxLecture-1-Introduction to Deep learning.pptx
Lecture-1-Introduction to Deep learning.pptx
 
L3 Big Data and Application.pptx
L3  Big Data and Application.pptxL3  Big Data and Application.pptx
L3 Big Data and Application.pptx
 
Data Science.pptx NEW COURICUUMN IN DATA
Data Science.pptx NEW COURICUUMN IN DATAData Science.pptx NEW COURICUUMN IN DATA
Data Science.pptx NEW COURICUUMN IN DATA
 
Data science course in ameerpet Hyderabad
Data science course in ameerpet HyderabadData science course in ameerpet Hyderabad
Data science course in ameerpet Hyderabad
 
best data science course institutes in Hyderabad
best data science course institutes in Hyderabadbest data science course institutes in Hyderabad
best data science course institutes in Hyderabad
 
Data Science course in Hyderabad .
Data Science course in Hyderabad            .Data Science course in Hyderabad            .
Data Science course in Hyderabad .
 
Data Science course in Hyderabad .
Data Science course in Hyderabad         .Data Science course in Hyderabad         .
Data Science course in Hyderabad .
 
data science course in Hyderabad data science course in Hyderabad
data science course in Hyderabad data science course in Hyderabaddata science course in Hyderabad data science course in Hyderabad
data science course in Hyderabad data science course in Hyderabad
 

Recently uploaded

The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
javier ramirez
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
g4dpvqap0
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
Sm321
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
roli9797
 
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
74nqk8xf
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
slg6lamcq
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
v3tuleee
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Aggregage
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
ahzuo
 
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
nuttdpt
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
slg6lamcq
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
Bill641377
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
jerlynmaetalle
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
aqzctr7x
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
kuntobimo2016
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
Sachin Paul
 
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
g4dpvqap0
 
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
74nqk8xf
 
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
u86oixdj
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
bopyb
 

Recently uploaded (20)

The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理

Data Science Unit1 AMET.pdf

  • 1. UNIT-I
Data: Data can be defined as an elementary value or a collection of values. For example, a student's name and ID are data about the student.
DATA SCIENCE:
• Data science is the in-depth study of massive amounts of data. It involves extracting meaningful insights from raw, structured, and unstructured data using the scientific method, different technologies, and algorithms.
• Data science is an interdisciplinary field that focuses on extracting knowledge from data sets, which are typically very large. The field encompasses analysis, preparing data for analysis, and presenting findings to inform high-level decisions in an organization. As such, it incorporates skills from computer science, mathematics, statistics, information visualization, graphics, and business.
In short, we can say that data science is all about:
o Asking the correct questions and analyzing the raw data.
o Modeling the data using various complex and efficient algorithms.
o Visualizing the data to get a better perspective.
o Understanding the data to make better decisions and find the final result.
  • 2. NEED FOR DATA SCIENCE:
• Traditionally, the data we had was mostly structured and small in size, so it could be analyzed with simple Business Intelligence (BI) tools. Today, most data is unstructured or semi-structured.
• This data is generated from different sources such as financial logs, text files, multimedia forms, sensors, and instruments. Simple BI tools are not capable of processing this volume and variety of data. This is why we need more complex and advanced analytical tools and algorithms for processing it, analyzing it, and drawing meaningful insights out of it.
LIFECYCLE OF DATA SCIENCE
Step 1: Define Problem Statement: Creating a well-defined problem statement is the first and a critical step in data science. It is a brief description of the problem that you are going to solve.
Step 2: Data Collection: You need to collect the data that can help to solve the problem. Data collection is a systematic approach to gathering relevant information from a variety of sources. Depending on the problem statement, data collection methods are broadly classified into two categories.
  • 3. • Primary Data Collection: When you have a unique problem and no related research has been done on the subject, you need to collect new data. This method is called primary data collection. For example, suppose you want information on the average time that employees spend in a cafeteria across companies. No public data is available on this, but you can collect it through methods such as surveys, employee interviews, and monitoring the time employees spend in the cafeteria. This method is time-consuming.
• Secondary Data Collection: This is data that is readily available or has been collected by someone else. It can be found on the internet, in news articles, government censuses, magazines, and so on. This method is called secondary data collection, and it is less time-consuming than the primary method.
Step 3: Data Quality Check and Remediation: One of the most important aspects, and one often ignored by data scientists, is ensuring that the data used for analysis and interpretation is of good quality. After collecting the data, most people start analyzing it right away and forget to do a sanity check. If the data is of bad quality, it can give misleading information.
Step 4: Exploratory Data Analysis: Before modelling the data, it is important to explore it. This is the most exciting step, as it helps you build familiarity with the data and extract useful insights. If this step is skipped, you might end up generating inaccurate models and choosing insignificant variables for your model.
Step 5: Data Modelling: Modelling means formulating every step and gathering the techniques required to reach the solution. You need to list the flow of calculations, which is simply the sequence of modelling steps that lead to the solution. The important factor is how to perform the calculations. There are various techniques in statistics and machine learning that you can choose from based on the requirement.
  • 4. Step 6: Data Communication: This is the final step, where you present the results of your analysis to the stakeholders. You explain how you reached a specific conclusion and what your critical findings are. Most often you need to present your findings to a non-technical audience, such as the marketing team or business executives, so you need to communicate the results in a manner that is simple to understand. The stakeholders should be able to draw up an actionable plan from it.
DATA SCIENCE COMPONENTS:
The main components of data science are given below:
1. Statistics: Statistics is one of the most important components of data science. It is a way to collect and analyze large amounts of numerical data and find meaningful insights in it.
2. Domain Expertise: Domain expertise binds data science together. It means specialized knowledge or skills in a particular area. Data science is applied in various areas, each of which needs domain experts.
3. Data engineering: Data engineering is the part of data science that involves acquiring, storing, retrieving, and transforming data. It also includes adding metadata (data about data) to the data.
4. Visualization: Data visualization means representing data in a visual context so that people can easily understand its significance. Visuals make it easy to take in a huge amount of data.
5. Advanced computing: Advanced computing does the heavy lifting of data science. It involves designing, writing, debugging, and maintaining the source code of computer programs.
6. Mathematics: Mathematics is a critical part of data science. It involves the study of quantity, structure, space, and change. For a data scientist, a good knowledge of mathematics is essential.
7. Machine learning: Machine learning is the backbone of data science. It is about training a machine on data so that it can make decisions the way a human would. In data science, we use various machine learning algorithms to solve problems.
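To make the idea of "training" a machine concrete, here is a minimal sketch using only plain Python. It fits a straight line by ordinary least squares, which is a deliberately simple stand-in for the machine learning algorithms mentioned above, not a method the slides prescribe. The function names `fit_line` and `predict` and the data are hypothetical and made up for illustration.

```python
def fit_line(xs, ys):
    """Learn slope and intercept of y = slope * x + intercept by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the learned line to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Made-up training data (illustrative only): hours studied vs. exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 56, 58, 60]

model = fit_line(hours, scores)   # "training": learn parameters from data
prediction = predict(model, 6)    # "inference": use them on unseen input
print(model, prediction)
```

The point of the sketch is the split between learning parameters from examples and then applying them to new inputs; real projects would normally use a library such as scikit-learn for this.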
  • 5. TOOLS FOR DATA SCIENCE
Following are some tools required for data science:
o Data Analysis tools: R, Python, Statistics, SAS, Jupyter, R Studio, MATLAB, Excel, RapidMiner.
o Data Warehousing: ETL, SQL, Hadoop, Informatica/Talend, AWS Redshift.
o Data Visualization tools: R, Jupyter, Tableau, Cognos.
o Machine learning tools: Spark, Mahout, Azure ML Studio.
  • 6. APPLICATIONS OF DATA SCIENCE:
o Image recognition and speech recognition: Data science is currently used for image and speech recognition. When you upload an image on Facebook, you get suggestions to tag your friends. This automatic tagging suggestion uses an image recognition algorithm, which is part of data science. When you speak to an assistant such as "Ok Google", Siri, or Cortana and the device responds to your voice, this is possible because of a speech recognition algorithm.
o Gaming world: In the gaming world, the use of machine learning algorithms is increasing day by day. EA Sports, Sony, and Nintendo widely use data science to enhance the user experience.
o Internet search: When we want to search for something on the internet, we use search engines such as Google, Yahoo, Bing, and Ask. All these search engines use data science to make the search experience better, so that you get a result within a fraction of a second.
o Transport: Transport industries are also using data science to create self-driving cars. Self-driving cars are expected to help reduce the number of road accidents.
o Healthcare: In the healthcare sector, data science provides many benefits. It is used for tumor detection, drug discovery, medical image analysis, virtual medical bots, etc.
o Recommendation systems: Many companies, such as Amazon, Netflix, and Google Play, use data science to improve the user experience with personalized recommendations. For example, when you search for something on Amazon and start getting suggestions for similar products, that is data science at work.
o Risk detection: Finance industries have always had an issue with fraud and the risk of losses, but with the help of data science this can be reduced.
  • 7. Most finance companies are looking for data scientists to help them avoid risk and losses while increasing customer satisfaction.
PYTHON FOR DATA SCIENCE
• Python is an open-source, interpreted, high-level language with good support for object-oriented programming. It is one of the languages most widely used by data scientists for data science projects and applications.
• Python provides rich functionality for mathematics, statistics, and scientific functions, and it has strong libraries for data science applications.
• One of the main reasons Python is widely used in the scientific and research communities is its ease of use and simple syntax, which make it easy to pick up for people who do not have an engineering background. It is also well suited to quick prototyping.
Features of Python language:
• It uses an elegant syntax, so programs are easier to read.
• It is a simple language to get started with, which makes it easy to get a program working.
• It has a large standard library and strong community support.
  • 8. • The interactive mode of Python makes it simple to test code.
• It is also simple to extend Python by adding new modules implemented in a compiled language such as C or C++.
• Python is an expressive language that can be embedded into applications to offer a programmable interface.
• It allows developers to run the code anywhere, including Windows, Mac OS X, UNIX, and Linux.
• It is free software: it does not cost anything to download or use Python, or to include it in an application.
NEED FOR PYTHON IN DATA SCIENCE
Python is one of the best-suited languages for a data scientist. Here are a few points that will help you understand why people choose Python for data science:
• Python is a free, flexible, and powerful open-source language.
• Python can shorten development time with its simple and easy-to-read syntax.
• With Python, you can perform data manipulation, analysis, and visualization.
• Python provides powerful libraries for machine learning applications and other scientific computations.
PYTHON IDES FOR DATA SCIENCE
Data science is a field that studies data and draws conclusions from it with the help of different scientific processes. Python is a popular language for data science because of its capacity for statistical analysis and its readability. Python also has various packages for machine learning, natural language processing, data visualization, data analysis, etc. that make it suited to data science. Some of the Python IDEs used for data science are as follows:
1. Jupyter notebook – Jupyter notebook is an open-source IDE used to create Jupyter documents that can be shared together with live code. It is a web-based interactive computational environment. The Jupyter notebook supports various languages that are popular in data science
  • 9. such as Python, Julia, Scala, R, etc.
2. Spyder – Spyder is an open-source IDE that was originally created and developed by Pierre Raybaut in 2009. It can be integrated with many Python packages such as NumPy, SymPy, SciPy, pandas, IPython, etc. The Spyder editor also supports code introspection, code completion, syntax highlighting, horizontal and vertical splitting, etc.
3. Sublime Text – Sublime Text is a proprietary code editor that supports a Python API. Some of its features are project-specific preferences, quick navigation, cross-platform plugin support, etc. While Sublime Text is quite fast and has a good support community, it is not available for free.
4. Visual Studio Code – Visual Studio Code is a code editor developed by Microsoft. It was built using Electron but does not use Atom. Some of its features are embedded Git control, intelligent code completion, debugging support, syntax highlighting, code refactoring, etc. It is also quite fast and lightweight.
5. PyCharm – PyCharm is an IDE developed by JetBrains specifically for Python. It has various features such as code analysis, an integrated unit tester, an integrated Python debugger, support for web frameworks, etc. PyCharm is particularly useful in machine learning because it works well with libraries such as pandas, Matplotlib, scikit-learn, NumPy, etc.
6. Rodeo – Rodeo is an open-source IDE that was developed by Yhat for data science in Python. It includes Python tutorials and cheat sheets that can be used for reference. Some of its features are syntax highlighting, auto-completion, easy interaction with data frames and plots, built-in IPython support, etc.
7. Thonny – Thonny is an IDE for Python that was developed at the University of Tartu. It is created for beginners who are learning to program in Python
  • 10. or for those who are teaching it. Some of the features of Thonny are statement stepping without breakpoints, a simple pip GUI, line numbers, live variables during debugging, etc.
8. Atom – Atom is an open-source text and code editor that was developed using Electron. It has multiple features such as a sleek interface, a file system browser, various extensions, etc. Atom also has an extension that can support Python while it is running.
9. Geany – Geany is a free text editor that supports Python and contains IDE features as well. It was originally authored by Enrico Tröger in C and C++. Some of its features are symbol lists, auto-completion, syntax highlighting, code navigation, multiple document support, etc.
MOST COMMONLY USED PYTHON LIBRARIES FOR DATA SCIENCE:
• NumPy: NumPy (short for Numerical Python) is a Python library that provides mathematical functions for handling large multidimensional arrays. It offers methods and functions for arrays, matrices, and linear algebra. The library vectorizes mathematical operations on the NumPy array type, which improves performance and speeds up execution. This makes it easy to work with large multidimensional arrays and matrices.
• Pandas: Pandas is one of the most popular Python libraries for data manipulation and analysis. It provides functions for manipulating large amounts of structured data, along with data structures and operations for numerical tables and time series. Pandas is designed for quick and easy data wrangling, aggregation, and visualization. There are two main data structures in Pandas:
Series – handles and stores one-dimensional data.
DataFrame – handles and stores two-dimensional data.
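As a small illustration of the NumPy and Pandas descriptions above, the sketch below shows a vectorized operation on an array, a matrix product, a Series, and a DataFrame. It assumes `numpy` and `pandas` are installed; all variable names and values are made up for the example.

```python
import numpy as np
import pandas as pd

# Vectorized arithmetic: the formula is applied to every element
# of the array without writing a Python loop.
celsius = np.array([0.0, 25.0, 100.0])
fahrenheit = celsius * 9 / 5 + 32

# A two-dimensional array and a linear-algebra operation (matrix product).
m = np.array([[1, 2], [3, 4]])
product = m @ m

# Series: one-dimensional labeled data.
ages = pd.Series([21, 34, 29], index=["asha", "ben", "chen"])

# DataFrame: two-dimensional data with named columns.
people = pd.DataFrame(
    {
        "name": ["asha", "ben", "chen"],
        "age": [21, 34, 29],
        "city": ["Pune", "Pune", "Oslo"],
    }
)

# A typical aggregation: mean age per city.
mean_age_by_city = people.groupby("city")["age"].mean()

print(fahrenheit)
print(product)
print(ages["ben"])
print(mean_age_by_city)
```

The Series and the `age` column of the DataFrame hold the same values on purpose: a DataFrame can be thought of as several Series sharing one index.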
  • 11. • Matplotlib: Matplotlib is another useful Python library, used for data visualization. Descriptive analysis and data visualization are very important for any organization. Matplotlib provides various methods for visualizing data effectively. It lets you quickly make line graphs, pie charts, histograms, and other professional-grade figures, and every aspect of a figure can be customized. Matplotlib also has interactive features such as zooming and panning, and it can save a graph in common graphics formats.
• SciPy: SciPy is another popular Python library for data science and scientific computing. It provides extensive functionality for scientific mathematics and computing. SciPy contains sub-modules for optimization, linear algebra, integration, interpolation, special functions, FFT, signal and image processing, ODE solvers, statistics, and other tasks common in science and engineering.
• Scikit-learn: Scikit-learn (imported as sklearn) is a Python library for machine learning. It provides various algorithms and functions used in machine learning and is built on NumPy, SciPy, and Matplotlib. Scikit-learn offers simple tools for data mining and data analysis and exposes a set of common machine learning algorithms through a consistent interface. It helps you quickly apply popular algorithms to datasets and solve real-world problems.
PYTHON BASICS FOR DATA SCIENCE
Basic concepts of Python programming:
• Variables: Variables refer to reserved memory locations that store values. In Python, you do not need to declare variables before using them, or declare their type.
• Data Types: Python supports numerous data types, which define the operations possible on the variables and the storage method. The data types include numeric types, lists, strings, tuples, sets, and dictionaries.
• Operators: Operators help to manipulate the values of operands. The operators in Python include arithmetic, comparison, assignment, logical, bitwise, membership, and identity operators.
• Conditional Statements: Conditional statements execute a set of statements based on a condition. Python has three conditional keywords: if, elif, and else.
  • 12. • Loops: Loops are used to execute a piece of code repeatedly. Python has while loops and for loops, and loops can be nested inside one another.
• Functions: Functions divide your code into useful blocks, allowing you to organize the code, make it more readable, reuse it, and save time.
Practical implementations, using Python coding.
Loading the Data
The very first step is loading the data into your program. We can do so by using read_csv() from the Python pandas library.
    import pandas as pd
    data = pd.read_csv("file_name.csv")
Cleaning the Data
The next step is to look for irregularities in the data by doing some data exploration. This phase involves finding null values and either replacing them with other values or dropping the affected rows altogether.
    # summary statistics for the numeric columns
    data.describe()
    # to check for null values
    data.isnull().sum()
    # drop the null values
    df = data.dropna()
    # checking again to be double sure
    df.isnull().sum()
Visualization
After we are done cleaning, we can move ahead and make some visualizations to understand the relationships between various aspects of our dataset.
    # sns is assumed to be seaborn, imported under its conventional alias
    import seaborn as sns
    sns.scatterplot(x=df["npg"], y=df["birth_rate"])
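The loading and cleaning steps above can be tried end to end without a file on disk. The following is a minimal sketch, assuming `pandas` is installed: it reads a small in-memory CSV in place of "file_name.csv", counts and drops the null values, and computes a correlation as a numeric companion to the scatter plot. The column names `npg` and `birth_rate` come from the text; the `country` column and every value are made up for illustration.

```python
import io

import pandas as pd

# Stand-in for "file_name.csv": made-up rows, two of them with a missing value.
csv_text = """country,npg,birth_rate
A,1.2,18.5
B,,12.1
C,2.4,30.2
D,0.3,
E,1.8,24.0
"""

# Loading: read_csv accepts a file path or any file-like object.
data = pd.read_csv(io.StringIO(csv_text))

# Cleaning: count nulls per column, drop incomplete rows, then check again.
missing_before = data.isnull().sum()
df = data.dropna()
missing_after = df.isnull().sum()

# A number to read alongside the scatter plot: the Pearson correlation
# between the two columns that the plot compares.
correlation = df["npg"].corr(df["birth_rate"])

print(missing_before)
print(len(df), "rows kept out of", len(data))
print(missing_after)
print(round(correlation, 3))
```

Dropping rows is the simplest remedy; when too many rows would be lost, replacing nulls (for example with `fillna`) is the alternative the text mentions.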