Data Science Unit1 AMET.pdf

UNIT-I
Data: Data can be defined as an elementary value or a collection of values. For example, a
student's name and ID are data about the student.
DATA SCIENCE :
• Data science is the in-depth study of large amounts of data. It involves
extracting meaningful insights from raw, structured, and unstructured data that is
processed using the scientific method, different technologies, and algorithms.
• Data science is an interdisciplinary field that focuses on extracting knowledge
from data sets, which are typically very large. The field encompasses
analysis, preparing data for analysis, and presenting findings to inform high-level
decisions in an organization. As such, it incorporates skills from computer
science, mathematics, statistics, information visualization, graphic design, and business.
In short, we can say that data science is all about:
o Asking the correct questions and analyzing the raw data.
o Modeling the data using various complex and efficient algorithms.
o Visualizing the data to get a better perspective.
o Understanding the data to make better decisions and arrive at the final result.

NEED FOR DATA SCIENCE:
• Traditionally, the data that we had was mostly structured and small in size, and
could be analyzed using simple Business Intelligence (BI) tools. Today, by contrast,
most data is unstructured or semi-structured.
• This data is generated from different sources such as financial logs, text files,
multimedia forms, sensors, and instruments. Simple BI tools are not capable
of processing this huge volume and variety of data. This is why we need more
complex and advanced analytical tools and algorithms for processing, analyzing,
and drawing meaningful insights from it.
LIFECYCLE OF DATA SCIENCE
Step 1: Define Problem Statement: Creating a well-defined problem statement is the first
and a critical step in data science. It is a brief description of the problem that you are going
to solve.
Step 2: Data Collection:
You need to collect the data which can help to solve the problem. Data collection is a
systematic approach to gather relevant information from a variety of sources. Depending
on the problem statement, the data collection method is broadly classified into two
categories.

• Primary Data Collection:
When you have a unique problem and no related research has been done on the
subject, you need to collect new data. This method is called primary data
collection.
For example, suppose you want information on the average time that employees spend in a
cafeteria across companies. No public data is available on this, but you can collect
the data through various methods such as surveys, interviews with employees, and
monitoring the time spent by employees in the cafeteria. This method is time-consuming.
• Secondary Data Collection:
This is data that is readily available or has been collected by someone else. Such data can be
found on the internet, in news articles, government censuses, magazines, and so on. This
method is called secondary data collection. It is less time-consuming than
the primary method.
Step 3: Data Quality Check and Remediation:
One of the most important aspects, and one often ignored by data scientists, is ensuring
that the data used for analysis and interpretation is of good quality.
After collecting the data, most people start the analysis right away and often forget to
do a sanity check on the data. If the data is of bad quality, it can give misleading
information.
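A sanity check of this kind can be sketched in a few lines of pandas. The example below is only an illustration under stated assumptions: the table, its column names (student_id, age), and the valid age range of 0 to 120 are hypothetical and chosen to show three common problems (missing values, duplicate rows, and out-of-range values).

```python
import pandas as pd

# Hypothetical toy data with three deliberate problems:
# a missing age, a duplicated row, and an impossible age.
records = pd.DataFrame(
    {
        "student_id": [1, 2, 2, 3, 4],
        "age": [19.0, 21.0, 21.0, None, 230.0],
    }
)

# Number of missing values in each column
missing_per_column = records.isnull().sum()

# Number of rows that are exact copies of an earlier row
duplicate_rows = int(records.duplicated().sum())

# Rows whose age lies outside the assumed valid range 0..120.
# between() is False for missing values, so those are excluded explicitly.
out_of_range = records[records["age"].notna() & ~records["age"].between(0, 120)]

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print(out_of_range)
```

Which checks are worth running depends on the data set; the three shown here are a starting point rather than a complete list.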
Step 4: Exploratory Data Analysis:
Before modelling the data, it is important to explore it. This is the most exciting
step, as it helps you build familiarity with the data and extract useful insights. If this step
is skipped, you might end up generating inaccurate models and choosing
insignificant variables for your model.
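As a rough illustration of exploratory analysis, the sketch below looks at the first rows, the summary statistics, and the correlation between two columns. The data and the column names (hours, score) are hypothetical; real exploration would normally also include plots.

```python
import pandas as pd

# Hypothetical toy data: study hours and exam scores for five students.
df = pd.DataFrame(
    {
        "hours": [1, 2, 3, 4, 5],
        "score": [52, 58, 61, 70, 74],
    }
)

# count, mean, standard deviation, min, quartiles, and max for each column
summary = df.describe()

# Pearson correlation between the two columns
correlation = df["hours"].corr(df["score"])

print(df.head())
print(summary)
print("correlation:", round(correlation, 3))
```

A strong correlation like the one in this toy data suggests that hours is a useful variable for predicting score; a value near zero would suggest the opposite.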
Step 5: Data Modelling:
Modelling means formulating every step and gathering the techniques required to
achieve the solution. You need to list the flow of the calculations, which is nothing
but the modelling steps to the solution. The important factor is how to perform the
calculations. There are various techniques under Statistics and Machine Learning that you
can choose from based on the requirement.

Step 6: Data Communication:
This is the final step where you present the results from your analysis to the
stakeholders. You explain to them how you came to a specific conclusion and your critical
findings.
Most often you need to present your findings to a non-technical audience, such as
the marketing team or business executives. You need to communicate the results in a
simple, easy-to-understand manner, so that the stakeholders are able to chalk out an
actionable plan from them.
DATA SCIENCE COMPONENTS:
The main components of Data Science are given below:
1. Statistics: Statistics is one of the most important components of data science. Statistics
is a way to collect and analyze large amounts of numerical data and find
meaningful insights in it.
2. Domain Expertise: Domain expertise binds data science together.
Domain expertise means specialized knowledge or skills in a particular area. In data
science, there are various areas for which we need domain experts.
3. Data engineering: Data engineering is the part of data science that involves acquiring,
storing, retrieving, and transforming the data. Data engineering also includes adding metadata
(data about data) to the data.
4. Visualization: Data visualization means representing data in a visual context so
that people can easily understand its significance. Data visualization makes
large amounts of data easy to take in visually.
5. Advanced computing: Advanced computing does the heavy lifting of data science. It
involves designing, writing, debugging, and maintaining the source code of
computer programs.
6. Mathematics: Mathematics is a critical part of data science. Mathematics involves
the study of quantity, structure, space, and change. For a data scientist, a good knowledge of
mathematics is essential.
7. Machine learning: Machine learning is the backbone of data science. Machine learning is
all about training a machine so that it can learn from data and make predictions or decisions. In data
science, we use various machine learning algorithms to solve problems.

TOOLS FOR DATA SCIENCE
Following are some tools required for data science:
o Data Analysis tools: R, Python, Statistics, SAS, Jupyter, RStudio, MATLAB, Excel,
RapidMiner.
o Data Warehousing: ETL, SQL, Hadoop, Informatica/Talend, AWS Redshift.
o Data Visualization tools: R, Jupyter, Tableau, Cognos.
o Machine learning tools: Spark, Mahout, Azure ML Studio.

APPLICATIONS OF DATA SCIENCE:
o Image recognition and speech recognition:
Data science is currently used for image and speech recognition. When you
upload an image on Facebook, you start getting suggestions to tag your
friends. This automatic tagging suggestion uses an image recognition algorithm,
which is part of data science.
When you say something to "Ok Google", Siri, Cortana, etc., these devices
respond to your voice; this is made possible by speech recognition
algorithms.
o Gaming world:
In the gaming world, the use of machine learning algorithms is increasing day by
day. Companies such as EA Sports, Sony, and Nintendo widely use data science to enhance the user
experience.
o Internet search:
When we want to search for something on the internet, we use
search engines such as Google, Yahoo, Bing, Ask, etc. All these search
engines use data science technology to make the search experience better,
so you can get a search result within a fraction of a second.
o Transport:
Transport industries are also using data science technology to create self-driving
cars. Self-driving cars are expected to help reduce the number of road
accidents.
o Healthcare:
In the healthcare sector, data science is providing lots of benefits. Data science is
being used for tumor detection, drug discovery, medical image analysis, virtual
medical bots, etc.
o Recommendation systems:
Many companies, such as Amazon, Netflix, Google Play, etc., use data
science technology to create a better user experience with personalized
recommendations. For example, when you search for something on Amazon and
start getting suggestions for similar products, this is because of data science
technology.
o Risk detection:
Finance industries have always had issues with fraud and the risk of losses, but with the
help of data science, these can be reduced.
Most finance companies are looking for data scientists to avoid risk and
losses of any type while increasing customer satisfaction.
PYTHON FOR DATA SCIENCE
• Python is an open source, interpreted, high-level language that provides
good support for object-oriented programming. It is one of the most widely used
languages among data scientists for data science
projects and applications.
• Python provides rich functionality for mathematics, statistics,
and scientific computing, along with powerful libraries for data
science applications.
• One of the main reasons why Python is widely used in the scientific and
research communities is its ease of use and simple syntax,
which make it easy to pick up for people who do not have an engineering
background. It is also well suited to quick prototyping.
Features of Python language:
• It uses an elegant syntax, so programs are easier to read.
• It is an easy language to get started with, which makes it simple to get a program
working.
• It has a large standard library and strong community support.
• Python's interactive mode makes it simple to test code.
• It is also simple to extend Python code with new modules that
are implemented in a compiled language such as C++ or C.
• Python is an expressive language that can be embedded into
applications to offer a programmable interface.
• It allows developers to run the code anywhere, including Windows, Mac OS X,
UNIX, and Linux.
• It is free software: it does not cost anything to download or use
Python or to include it in an application.
NEED FOR PYTHON IN DATA SCIENCE
Python is widely regarded as one of the best-suited languages for a data scientist. The following
points help explain why people choose Python for data science:
• Python is a free, flexible, and powerful open-source language
• Python can shorten development time thanks to its simple, easy-to-read syntax
• With Python, you can perform data manipulation, analysis, and visualization
• Python provides powerful libraries for machine learning applications and other
scientific computations
PYTHON IDES FOR DATA SCIENCE
Data Science is a field that is used to study and understand data and draw various
conclusions with the help of different scientific processes. Python is a popular language
that is quite useful for data science because of its capacity for statistical analysis and its
easy readability. Python also has various packages for machine learning, natural
language processing, data visualization, data analysis, etc. that make it suited for data
science. Some of the Python IDEs that are used for data science are as follows:
1. Jupyter notebook – Jupyter notebook is an open source IDE that is used to
create Jupyter documents, which can be shared and contain live code.
It is also a web-based interactive computational environment. The Jupyter
notebook supports various languages that are popular in data science,
such as Python, Julia, Scala, R, etc.
2. Spyder – Spyder is an open source IDE that was originally created and
developed by Pierre Raybaut in 2009. It can be integrated with many
different Python packages such as NumPy, SymPy, SciPy, pandas, IPython,
etc. The Spyder editor also supports code introspection, code completion,
syntax highlighting, horizontal and vertical splitting, etc.
3. Sublime Text – Sublime Text is a proprietary code editor that provides a
Python API. Some of the features of Sublime Text are project-specific
preferences, quick navigation, cross-platform support, plugins, etc.
While Sublime Text is quite fast and has a good support community, it is not
available for free.
4. Visual Studio Code –
Visual Studio Code is a code editor that was developed by Microsoft. It was
developed using Electron but it does not use Atom. Some of the features of
Visual Studio Code are embedded Git control, intelligent code completion,
support for debugging, syntax highlighting, code refactoring, etc. It is also
quite fast and lightweight.
5. PyCharm –
PyCharm is an IDE developed by JetBrains and created specifically for
Python. It has various features such as code analysis, an integrated unit tester,
an integrated Python debugger, support for web frameworks, etc. PyCharm is
particularly useful in machine learning because it supports libraries such as
Pandas, Matplotlib, Scikit-Learn, NumPy, etc.
6. Rodeo –
Rodeo is an open source IDE that was developed by Yhat for data science in
Python. Rodeo includes Python tutorials and cheat sheets that can
be used for reference if required. Some of the features of Rodeo are syntax
highlighting, auto-completion, easy interaction with data frames and plots,
built-in IPython support, etc.
7. Thonny –
Thonny is an IDE that was developed at the University of Tartu for
Python. It was created for beginners who are learning to program in Python
or for those who are teaching it. Some of the features of Thonny are statement
stepping without breakpoints, a simple pip GUI, line numbers, live variables
during debugging, etc.
8. Atom –
Atom is an open source text and code editor that was developed using
Electron. It has multiple features such as a sleek interface, a file system
browser, various extensions, etc. Atom also has extensions that add support
for running Python code from within the editor.
9. Geany –
Geany is a free text editor that supports Python and contains IDE features as
well. It was originally authored by Enrico Tröger in C and C++. Some of the
features of Geany are Symbol lists, Auto-completion, Syntax highlighting,
Code navigation, Multiple document support, etc.
MOST COMMONLY USED PYTHON LIBRARIES FOR DATA SCIENCE :
• NumPy: NumPy is a Python library that provides mathematical functions for handling
large multidimensional arrays. It provides various methods and functions for arrays, matrices,
and linear algebra.
NumPy stands for Numerical Python. It provides many useful features for
operations on n-dimensional arrays and matrices in Python. The library provides
vectorization of mathematical operations on the NumPy array type, which
enhances performance and speeds up execution. It is very easy to work with
large multidimensional arrays and matrices using NumPy.
• Pandas: Pandas is one of the most popular Python libraries for data manipulation
and analysis. Pandas provides useful functions for manipulating large amounts of
structured data and makes analysis easy to perform. It provides
data structures and operations for manipulating numerical tables and time series data.
Pandas is well suited to data wrangling. It is designed for quick and easy
data manipulation, aggregation, and visualization. There are two main data structures in
Pandas:
Series – handles and stores one-dimensional data.
DataFrame – handles and stores two-dimensional data.

• Matplotlib: Matplotlib is another useful Python library, used for data
visualization. Descriptive analysis and data visualization are very important for
any organization. Matplotlib provides various methods to visualize data in
a more effective way. Matplotlib lets you quickly make line graphs, pie charts,
histograms, and other professional-grade figures. Using Matplotlib, one can
customize every aspect of a figure. Matplotlib has interactive features such as
zooming and panning, and can save the graph in common graphics formats.
• SciPy: SciPy is another popular Python library for data science and scientific
computing. SciPy provides rich functionality for scientific mathematics and
computing. SciPy contains sub-modules for optimization,
linear algebra, integration, interpolation, special functions, FFT, signal and
image processing, ODE solvers, statistics, and other tasks common in science
and engineering.
• Scikit-learn: Scikit-learn (sklearn) is a Python library for machine learning. It
provides various algorithms and functions that are used in machine learning.
Scikit-learn is built on NumPy, SciPy, and Matplotlib. It provides easy and
simple tools for data mining and data analysis, and offers a set of common
machine learning algorithms to users through a consistent interface. Scikit-learn
helps to quickly implement popular algorithms on datasets and solve
real-world problems.
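To make the first two libraries in this list concrete, here is a minimal sketch of NumPy vectorization and of the two pandas data structures. All values and names (prices, quantities, the student table) are hypothetical and used only for illustration.

```python
import numpy as np
import pandas as pd

# NumPy: vectorized arithmetic works on whole arrays without a Python loop.
prices = np.array([10.0, 20.0, 30.0])
quantities = np.array([1, 2, 3])
totals = prices * quantities          # element-wise multiplication

matrix = np.arange(6).reshape(2, 3)   # 2 x 3 array holding the values 0..5
column_sums = matrix.sum(axis=0)      # sum down each column

# pandas: a Series is one-dimensional, a DataFrame is two-dimensional.
ages = pd.Series([19, 21, 20], name="age")
students = pd.DataFrame(
    {"name": ["Asha", "Ravi", "Meena"], "dept": ["CS", "CS", "EE"], "age": ages}
)

# A simple aggregation: mean age per department
mean_age_by_dept = students.groupby("dept")["age"].mean()

print(totals, column_sums)
print(ages.ndim, students.shape)
print(mean_age_by_dept)
```

The multiplication of prices and quantities is applied to every element at once, which is what vectorization means in practice.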
PYTHON BASICS FOR DATA SCIENCE
Basic concepts of Python programming:
• Variables: Variables are names that refer to memory locations used to store values.
In Python, you don’t need to declare variables before using them or even declare
their type.
• Data Types: Python supports numerous data types, which define the operations
possible on the variables and the storage method. The list of data types includes
numeric types, lists, strings, tuples, sets, and dictionaries.
• Operators: Operators help to manipulate the values of operands. The list of
operators in Python includes arithmetic, comparison, assignment, logical,
bitwise, membership, and identity operators.
• Conditional Statements: Conditional statements help to execute a set of
statements based on a condition. There are three conditional keywords:
if, elif, and else.
• Loops: Loops are used to execute a piece of code repeatedly. There are two loop
statements, while and for, and loops can also be nested inside one another.
• Functions: Functions are used to divide your code into useful blocks, allowing you
to organize the code, make it more readable, reuse it, and save time.
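The concepts above can be seen together in one short program. This is only an illustrative sketch; the marks, the pass mark of 50, and the grade boundaries are made-up values.

```python
# Variables: no declaration or type annotation is required.
course = "Data Science"        # string
marks = [72, 45, 88, 91, 38]   # list of numeric values
pass_mark = 50                 # integer


def grade(mark):
    """Return a grade label using if / elif / else."""
    if mark >= 85:
        return "distinction"
    elif mark >= pass_mark:
        return "pass"
    else:
        return "fail"


# for loop: visit every element of the list and fill a dictionary
grades = {}
for mark in marks:
    grades[mark] = grade(mark)

# while loop: count the passing marks using a membership test (in)
passed = 0
index = 0
while index < len(marks):
    if grades[marks[index]] in ("pass", "distinction"):
        passed += 1
    index += 1

print(course, grades, passed)
```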
Practical implementation using Python
Loading The Data
The very first step is loading the data into your program. We can do so by
using read_csv() from the Python pandas library.
import pandas as pd
data = pd.read_csv("file_name.csv")
Cleaning the Data
The next step is to look for irregularities in the data by doing some data exploration.
This phase involves finding the null values and either replacing them with other values or
dropping those rows altogether.
# summary statistics for the numeric columns
data.describe()
# check for null values
data.isnull().sum()
# drop the rows that contain null values
df = data.dropna()
# check again to be sure
df.isnull().sum()
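The code above drops rows with null values. The other option mentioned, replacing null values with other values, could look like the sketch below. It is self-contained for illustration: the in-memory CSV text, the column names (city, temperature), and the choice of the median as the replacement value are all assumptions, not part of the original data set.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for "file_name.csv".
csv_text = "city,temperature\nChennai,34.0\nDelhi,\nMumbai,30.0\n"
data = pd.read_csv(io.StringIO(csv_text))

# Replace the missing temperature with the column median
# instead of dropping the whole row.
median_temp = data["temperature"].median()
filled = data.fillna({"temperature": median_temp})

print(data.isnull().sum())
print(filled)
```

Filling keeps every row, which matters when the data set is small; dropping is simpler when only a few rows are affected.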
Visualization
After we are done cleaning, we can move ahead and make some visualizations to
understand the relationship between various aspects of our dataset.
import seaborn as sns
sns.scatterplot(x=df["npg"], y=df["birth_rate"])
