Introduction to Data Science
PURNA CHANDER RAO . KATHULA
Agenda
● What is Data Science?
● Domains - Need for Data Science
● Data Life Cycle
● Data Science Sub-Domains
● Why Python for Data Science?
● Python - Modules in Data science
○ Introduction to Pandas
○ Introduction to Numpy
○ Introduction to Matplotlib
○ Introduction to Seaborn
● What is Machine Learning?
What is Data Science
Data Science is the field of study that combines domain expertise,
programming skills, and knowledge of math and statistics to extract
meaningful insights from data.
Analysts and business users then translate those insights into
tangible business value.
Data Life Cycle
Data Science - Sub Domains
Domains - Need of Data Science
● E-commerce
○ Recommendation systems, customer sentiment analysis,
inventory management, improved customer service.
● Healthcare
○ Castlight - helps customers / clients choose an appropriate plan
● Financials
○ Chatbots, call-center automation, paperwork automation
● And more
Why Python for Data Science
● It is easy to learn
○ Now the language of choice for 8 of 10 US computer science
programs
● Full featured
○ Not just a statistics language; it has full capabilities for data
acquisition, cleaning, databases, high-performance computing, and
more
● Strong data science libraries
○ Pandas, NumPy, Matplotlib, SciPy, Seaborn, NLTK, scikit-learn,
and more
Anaconda
What is Anaconda?
● Essentially a large (~400 MB) Python installation
● Contains everything you need for data engineering, analytics, and
machine learning
● Unless you have a special reason not to, you should just install and use
it.
Introduction to Pandas
What is Pandas ?
Pandas is a Python library for data analysis and data manipulation. Its
DataFrame is roughly a Python counterpart of R's data.frame.
Key Features of Pandas
● It has APIs for loading data from different file formats into memory
(Excel, TSV, CSV, databases, etc.).
● Data is structured in the form of rows and columns.
● Retrieval of data is similar to SQL; it supports operations such as
group-by, joins, and views.
● Merging of data from multiple datasets.
● Supports much date/time series functionality: time zones,
business days, holidays, etc.
● Boolean Indexing
● Fancy Indexing
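A few of the features listed above can be sketched in a handful of lines. This is only an illustration: the two tables and their column names (customer_id, amount, city) are made up and do not come from the slides.

```python
import pandas as pd

# Hypothetical data: the tables and column names are illustrative only.
orders = pd.DataFrame({
    "customer_id": [1, 2, 1, 3],
    "amount": [250.0, 40.0, 125.0, 90.0],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "city": ["Hyderabad", "Pune", "Hyderabad"],
})

# Boolean indexing: keep only the rows where the condition is True.
large_orders = orders[orders["amount"] > 100]

# Merging two datasets, similar to a SQL inner join on customer_id.
joined = orders.merge(customers, on="customer_id", how="inner")

# Group-by with aggregation, similar to SQL GROUP BY with SUM.
total_by_city = joined.groupby("city")["amount"].sum()

print(large_orders)
print(total_by_city)
```

In practice the two tables would usually be loaded from files with functions such as pd.read_csv or pd.read_excel rather than typed in by hand.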
Core Data Structures of Pandas
● DataFrames
● Series
Core Operations
Create Select Insert Map
Join Sort Clean ApplyMap
View Update Filter Append
Group Summarise Confirm Rotate
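As a rough sketch of the two core data structures and a few of the operations named above (create, select, insert, filter, sort), assuming a small made-up price list:

```python
import pandas as pd

# Create: a Series is a one-dimensional labeled array.
prices = pd.Series([3.5, 1.2, 4.8], index=["apple", "banana", "cherry"])

# Create: a DataFrame is a two-dimensional table of labeled columns.
# The column names and values here are illustrative only.
df = pd.DataFrame(
    {"price": [3.5, 1.2, 4.8], "stock": [10, 0, 25]},
    index=["apple", "banana", "cherry"],
)

# Select: a single column comes back as a Series.
stock = df["stock"]

# Insert: add a column derived from existing ones.
df["value"] = df["price"] * df["stock"]

# Filter: keep only the rows that are in stock.
in_stock = df[df["stock"] > 0]

# Sort: order the remaining rows by price, highest first.
by_price = in_stock.sort_values("price", ascending=False)

print(by_price)
```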
Introduction to Numpy
● NumPy is widely used in scientific computing
● 3 main benefits of using a NumPy array over a list
○ Less memory
○ Fast
○ Convenient
● Broadcasting allows universal functions to deal in a meaningful way with
NumPy arrays that do not have exactly the same shape.
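Broadcasting is easiest to see in a small example. The sketch below uses made-up values; it shows a scalar being stretched across an array, and a column combined with a row by the universal function np.multiply.

```python
import numpy as np

# Element-wise arithmetic on a whole array, without a Python loop.
# The scalars 9, 5 and 32 are broadcast to every element.
celsius = np.array([0.0, 25.0, 100.0])
fahrenheit = celsius * 9 / 5 + 32

# Broadcasting a (3, 1) column against a (1, 4) row gives a (3, 4) table.
rows = np.arange(1, 4).reshape(3, 1)
cols = np.arange(1, 5).reshape(1, 4)
table = np.multiply(rows, cols)  # np.multiply is a universal function (ufunc)

print(fahrenheit)
print(table.shape)
```

The same multiplication table built from Python lists would need a nested loop, which is the "convenient" and "fast" part of the comparison above.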
Introduction to Matplotlib
A picture is worth a thousand words. Matplotlib is a Python library for
creating 2D graphs and plots from Python scripts, with graphs and
visualizations modeled on MATLAB's.
Its pyplot module makes plotting easy by providing features to control
line styles, font properties, axis formatting, etc. It supports a wide
variety of graphs and plots, including histograms, bar charts, power
spectra, and error charts. Used along with NumPy, it provides an
environment that is an effective open source alternative to MATLAB.
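A minimal pyplot sketch of the controls mentioned above (line style, font size, axis labels). The sine curve is an arbitrary example, and the off-screen "Agg" backend and in-memory buffer are assumptions made so the script runs without a display; in a notebook you would normally just call plt.show().

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: no display available, so render off-screen
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

fig, ax = plt.subplots()
# Line style and color are set per plotted line.
ax.plot(x, np.sin(x), linestyle="--", color="tab:blue", label="sin(x)")
# Font properties and axis formatting.
ax.set_xlabel("x", fontsize=12)
ax.set_ylabel("sin(x)", fontsize=12)
ax.set_title("Line plot with pyplot")
ax.legend()

# Write the figure as PNG into memory; pass a filename instead to save to disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```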
Introduction to Seaborn
Seaborn is a Python data visualization library based on Matplotlib. It
provides a high-level interface for drawing attractive and informative
statistical graphics.
Important features of Seaborn
● Built-in themes for styling Matplotlib graphics
● Fitting and visualizing linear regression models
● Plotting statistical time series data
● Works well with NumPy and Pandas data structures
Example plots (figures not reproduced): box plots, violin plots, bar plots
Machine Learning
● What is Machine Learning
● Types of Machine Learning
● Supervised and Unsupervised Learning.
● Use Cases
○ Linear Regression (Supervised)
○ K-Means (Unsupervised)
○ Sentiment Analysis
What is Machine Learning
Machine Learning is a subset of Artificial Intelligence (AI) that
gives machines the ability to learn automatically and improve
from experience without being explicitly programmed.
Types of Machine Learning
● Supervised Learning.
● Unsupervised Learning.
● Reinforcement Learning.
Linear Regression (Supervised)
Linear Regression is a machine learning algorithm based on supervised
learning. It performs a regression task: it models a target (predicted)
value as a linear function of one or more independent variables. It is
mostly used for finding the relationship between variables and for
forecasting.
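As a sketch of the idea, the example below fits a straight line by ordinary least squares using plain NumPy. The data (hours studied versus exam score) is made up for illustration, and in practice a library such as scikit-learn would typically be used instead.

```python
import numpy as np

# Hypothetical training data: hours studied (independent variable)
# and exam score (target). The values are illustrative only.
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
score = np.array([52.0, 57.0, 61.0, 68.0, 72.0])

# Design matrix with a column of ones so the model also learns an intercept.
X = np.column_stack([hours, np.ones_like(hours)])

# Ordinary least squares: the slope and intercept that minimise squared error.
(slope, intercept), *_ = np.linalg.lstsq(X, score, rcond=None)

# Forecasting: predict the score for an unseen value of the input.
predicted = slope * 6.0 + intercept

print(f"score = {slope:.2f} * hours + {intercept:.2f}")
print(f"predicted score for 6 hours: {predicted:.1f}")
```

The slope describes the relationship between the variables (score gained per extra hour), and the fitted line is then reused for forecasting.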
K-Means (Unsupervised)
K-means clustering is a type of unsupervised learning, which is used when
you have unlabeled data (i.e., data without defined categories or groups).
The goal of this algorithm is to find groups in the data, with the number of
groups represented by the variable K. The algorithm works iteratively to
assign each data point to one of K groups based on the features that are
provided. Data points are clustered based on feature similarity. The results of
the K-means clustering algorithm are:
● The centroids of the K clusters, which can be used to label new data
● Labels for the training data (each data point is assigned to a single cluster)
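The iterative procedure described above can be sketched in plain NumPy. This is a minimal illustration, not a production implementation: the function name k_means, its parameters, and the six sample points are all assumptions made for the example.

```python
import numpy as np

def k_means(points, k, n_iter=100, seed=0):
    """Plain K-means (Lloyd's algorithm); returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        distances = np.linalg.norm(
            points[:, None, :] - centroids[None, :, :], axis=2
        )
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j)
            else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # assignments no longer change the centroids
        centroids = new_centroids
    return centroids, labels

# Hypothetical unlabeled data: two visibly separate groups of 2-D points.
points = np.array([
    [1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
    [8.0, 8.0], [9.0, 8.5], [8.5, 9.0],
])
centroids, labels = k_means(points, k=2)

# The centroids can label new data: pick the nearest one.
new_point = np.array([2.0, 2.0])
new_label = np.linalg.norm(centroids - new_point, axis=1).argmin()

print(centroids)
print(labels, new_label)
```

Which group ends up as cluster 0 or cluster 1 depends on the random starting points, so only the grouping itself is meaningful, not the label numbers.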
References
Python / Anaconda - https://www.anaconda.com/distribution/
Pandas - https://pandas.pydata.org/
Numpy - https://numpy.org/
Matplotlib - https://matplotlib.org/
Seaborn - https://seaborn.pydata.org/
Scipy - https://www.scipy.org/
Bokeh - https://bokeh.pydata.org/en/latest/
