2. Private and Confidential www.futureconnect.net 2
AGENDA
UNIT NAME: 1. DATA SCIENCE
TOPICS:
1. DATA SCIENCE LIBRARIES
2. NUMPY
3. PANDAS
4. MATPLOTLIB
5. DATA EXPLORATION
HOURS COUNT: 2
SESSION: 2
3. OBJECTIVES
• Gain knowledge of data science libraries
• Understand the packages used for data manipulation
• Demonstrate data exploration using these packages
4. Data Mining
Scrapy
• One of the most popular Python data science libraries, Scrapy helps to build crawling programs
(spider bots) that can retrieve structured data from the web – for example, URLs or contact info.
• It's a great tool for scraping data used in, for example, Python machine learning models.
• Developers use it for gathering data from APIs.
BeautifulSoup
• BeautifulSoup is another really popular library for web crawling and data scraping.
• If you want to collect data that’s available on some website but not via a proper CSV or API,
BeautifulSoup can help you scrape it and arrange it into the format you need.
5. Data Processing and Modeling
NumPy
• NumPy (Numerical Python) is a core tool for scientific computing and for performing basic and
advanced array operations.
• The library offers many handy features for performing operations on n-dimensional arrays and
matrices in Python.
SciPy
• This useful library includes modules for linear algebra, integration, optimization, and statistics.
• Its main functionality was built upon NumPy, so its arrays make use of this library.
• SciPy works great for all kinds of scientific programming projects (science, mathematics, and
engineering).
6. Data Processing and Modeling
Pandas
• Pandas is a library created to help developers work with "labeled" and "relational" data intuitively.
• It's based on two main data structures: "Series" (one-dimensional, like a list of items) and
"DataFrames" (two-dimensional, like a table with multiple columns).
Keras
• Keras is a great library for building neural networks and modeling.
• It's very straightforward to use and provides developers with a good degree of extensibility. The
library uses other packages, such as Theano or TensorFlow, as its backends.
7. Data Processing and Modeling
SciKit-Learn
• This is an industry standard for Python-based data science projects.
• SciKits are a group of packages in the SciPy Stack that were created for specific functionalities –
for example, image processing. Scikit-learn uses the math operations of SciPy to expose a
concise interface to the most common machine learning algorithms.
PyTorch
• PyTorch is a framework that is perfect for data scientists who want to perform deep learning tasks
easily.
• The tool allows performing tensor computations with GPU acceleration. It's also used for other
tasks – for example, for creating dynamic computational graphs and calculating gradients
automatically.
8. Data Processing and Modeling
TensorFlow
• TensorFlow is a popular Python framework for machine learning and deep learning, which was
developed at Google Brain.
• It's widely used for tasks like object identification, speech recognition, and many others.
• It helps in working with artificial neural networks that need to handle multiple data sets.
XGBoost
• This library is used to implement machine learning algorithms under the Gradient Boosting
framework.
• XGBoost is portable, flexible, and efficient.
• It offers parallel tree boosting that helps teams to resolve many data science problems. Another
advantage is that developers can run the same code on major distributed environments such as
Hadoop, SGE, and MPI.
9. Data Visualization
Matplotlib
• This is a standard data science library that helps to generate data visualizations such as
two-dimensional diagrams and graphs (histograms, scatterplots, non-Cartesian coordinates graphs).
• Matplotlib is one of those plotting libraries that are really useful in data science projects:
it provides an object-oriented API for embedding plots into applications.
• Developers need to write more code than usual while using this library for generating advanced
visualizations.
Seaborn
• Seaborn is based on Matplotlib and serves as a useful Python machine learning tool for
visualizing statistical models – heatmaps and other types of visualizations that summarize data
and depict the overall distributions.
• When using this library, you get to benefit from an extensive gallery of visualizations (including
complex ones like time series, joint plots, and violin diagrams).
10. Data Visualization
Bokeh
• This library is a great tool for creating interactive and scalable visualizations inside browsers using
JavaScript widgets. Bokeh is fully independent of Matplotlib.
• It focuses on interactivity and presents visualizations through modern browsers – similarly to Data-
Driven Documents (d3.js). It offers a set of graphs, interaction abilities (like linking plots or adding
JavaScript widgets), and styling.
Plotly
• This is a web-based tool for data visualization that offers many useful out-of-the-box graphics;
you can find them on the Plot.ly website.
• The library works very well in interactive web applications.
pydot
• This library helps to generate directed and undirected graphs.
• Written in pure Python, it serves as an interface to Graphviz. The graphs created come in handy
when you're developing algorithms based on neural networks and decision trees.
11. Python Libraries for Data Science
• Pandas: Used for structured data operations
• NumPy: Creating Arrays
• Matplotlib: Data Visualization
• Scikit-learn: Machine Learning Operations
• SciPy: Perform Scientific operations
• TensorFlow: Symbolic math library
• BeautifulSoup: Parsing HTML and XML pages
These three Python libraries (Pandas, NumPy, and Matplotlib) will be
covered in the following slides
12. NumPy
• NumPy = Numerical Python
• Created in 2005 by Travis Oliphant.
• Consists of array objects and routines for processing them.
• NumPy arrays are faster than traditional Python lists because their elements are stored in one
contiguous block of memory.
• The array object in NumPy is called ndarray.
13. Top four benefits that NumPy can bring to your code:
1. More speed: NumPy uses algorithms written in C that complete in nanoseconds rather than
seconds.
2. Fewer loops: NumPy helps you to reduce loops and keep from getting tangled up in iteration
indices.
3. Clearer code: Without loops, your code will look more like the equations you’re trying to
calculate.
4. Better quality: There are thousands of contributors working to keep NumPy fast, friendly, and
bug free.
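A minimal sketch of the "fewer loops" and "clearer code" points, assuming a small hypothetical list of values: the same squares are computed once with an explicit loop and once with a single vectorized expression.

```python
import numpy as np

values = [1, 2, 3, 4, 5]  # hypothetical sample data

# Plain Python: an explicit loop over the list
squares_loop = []
for v in values:
    squares_loop.append(v * v)

# NumPy: one vectorized expression, no loop or index bookkeeping
squares_vec = np.array(values) ** 2

print(squares_loop)  # [1, 4, 9, 16, 25]
print(squares_vec)   # [ 1  4  9 16 25]
```

The speed advantage only becomes visible on large arrays; on five elements both versions are effectively instant.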
14. Numpy Installation and Importing
Pre-requirements: Python and Python Package Installer(pip)
Installation: pip install numpy
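Import: After installation, import the package with the “import” keyword.
import numpy
If this statement runs without an error, the NumPy package is properly installed and ready to use.
The code on the following slides assumes the common alias: import numpy as np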
15. Numpy-ndarray Object
• An ndarray is a collection of items that all belong to the same type.
• The type of each element is described by a data-type object: dtype
• Basic ndarray creation: numpy.array
OR
numpy.array(object, dtype = None, copy = True, order = None, subok = False, ndmin = 0)
Parameters:
object: array interface or 1D sequence
dtype: data type
copy: object copying
order: row/column major
subok: base class array
ndmin: number of dimensions
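A short sketch of how the dtype and ndmin parameters affect the result of numpy.array. The input list is a hypothetical example, and only parameters named in the signature above are used.

```python
import numpy as np

a = np.array([1, 2, 3])               # dtype inferred from the input
b = np.array([1, 2, 3], dtype=float)  # dtype set explicitly
c = np.array([1, 2, 3], ndmin=2)      # result has at least 2 dimensions

print(a.dtype)  # an integer type; the exact width depends on the platform
print(b)        # [1. 2. 3.]
print(c.shape)  # (1, 3)
```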
17. NumPy arrays can be multi-dimensional too.
np.array([[1,2,3,4],[5,6,7,8]])
array([[1, 2, 3, 4],
       [5, 6, 7, 8]])
• Here, we created a 2-dimensional array of values.
• Note: A matrix is just a rectangular array of numbers with shape N x M where N is
the number of rows and M is the number of columns in the matrix. The one you
just saw above is a 2 x 4 matrix.
18. Types of NumPy arrays
• Array of zeros
• Array of ones
• Random numbers in ndarrays
• Identity matrix in NumPy
• Evenly spaced ndarray
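The kinds of array listed above can each be created with one call. The slide does not name the functions, so this sketch assumes np.zeros, np.ones, a seeded random generator, np.eye (for the identity matrix), and np.linspace / np.arange are the intended ones.

```python
import numpy as np

zeros = np.zeros((2, 3))   # 2 x 3 array filled with 0.0
ones = np.ones((2, 3))     # 2 x 3 array filled with 1.0

rng = np.random.default_rng(seed=0)  # the seed value is an arbitrary choice
randoms = rng.random((2, 3))         # uniform random values in [0, 1)

identity = np.eye(3)       # 3 x 3 identity matrix

evenly = np.linspace(0, 1, 5)  # 5 points from 0 to 1 inclusive: 0, 0.25, 0.5, 0.75, 1
stepped = np.arange(0, 10, 2)  # fixed step, end excluded: 0, 2, 4, 6, 8

print(zeros.shape, ones.shape, randoms.shape)  # (2, 3) (2, 3) (2, 3)
print(identity)
print(evenly)
print(stepped)
```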
19. Numpy - Array Indexing and Slicing
• Indexing is used to access array elements by their index.
• The indexes in NumPy arrays start with 0.
arr = np.array([1, 2, 3, 4])
arr[0] Accesses the first element of the array. Hence, the value is 1.
arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])
arr[0,1] Accesses the element in the first row, second column of the 2D array. Hence, the value is 2.
Slicing: Taking elements of an array from a start index up to, but not including, an end index: [start:end] or [start:end:step]
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[1:5]) Ans: [2 3 4 5]
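A few more slicing cases, sketched on the same arrays used above (the name grid is introduced here for the 2D array): a step, a negative start, and slices along each axis of a 2D array.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[1:5])    # [2 3 4 5]
print(arr[1:6:2])  # [2 4 6]   every second element from index 1 up to index 5
print(arr[-3:])    # [5 6 7]   the last three elements

grid = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
print(grid[1, 1:4])  # [7 8 9]  second row, columns 1 to 3
print(grid[:, 0])    # [1 6]    first column of every row
```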
20. Dimensions of NumPy arrays
You can easily determine the number of dimensions or axes of a NumPy array using the ndim attribute:
# number of axes
a = np.array([[5,10,15],[20,25,20]])
print('Array :','\n',a)
print('Dimensions :','\n',a.ndim)
Array :
[[ 5 10 15]
[20 25 20]]
Dimensions :
2
This array has two dimensions: 2 rows and 3 columns.
21. Numpy- Array Shape and Reshape
• The shape of an array is the number of elements along each of its dimensions.
• Every array has an attribute called shape that returns this as a tuple.
arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
print(arr.shape)
Output: (2, 4)
• Reshaping changes the shape of an array; the new shape must hold the same number of elements.
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
newarr = arr.reshape(4, 3)
print(newarr)
Output: [[ 1  2  3]
 [ 4  5  6]
 [ 7  8  9]
 [10 11 12]]
22. Flattening a NumPy array
Sometimes when you have a multidimensional array and want to collapse it to a single-dimensional
array, you can either use the flatten() method or the ravel() method:
Syntax:
• flatten()
• ravel()
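Both methods give the same one-dimensional result; the practical difference is that flatten() always returns a copy, while ravel() returns a view of the original data when it can. A sketch on a hypothetical 2 x 3 array:

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])

flat_copy = a.flatten()  # always an independent copy
flat_view = a.ravel()    # a view here, because a is contiguous in memory

print(flat_copy)  # [1 2 3 4 5 6]
print(flat_view)  # [1 2 3 4 5 6]

flat_copy[0] = 100  # does not touch a
flat_view[1] = 200  # writes through to a

print(a)
# [[  1 200   3]
#  [  4   5   6]]
```

Prefer flatten() when the result will be modified independently of the original array.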
23. Transpose of a NumPy array
Another useful reshaping method of NumPy is the transpose() method. It takes the input
array and swaps its rows and columns, so each row becomes a column and each column becomes a row:
Syntax : numpy.transpose()
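A sketch of the transpose on a hypothetical 2 x 3 array. It also shows the .T attribute, which is a shorthand for the same operation.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
t = np.transpose(a)

print(a.shape)  # (2, 3)
print(t.shape)  # (3, 2)
print(t)
# [[1 4]
#  [2 5]
#  [3 6]]

print(np.array_equal(t, a.T))  # True
```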
24. Expanding and Squeezing a NumPy array
Expanding a NumPy array
• You can add a new axis to an array using the expand_dims() function by providing the array and the
axis along which to expand.
Squeezing a NumPy array
• On the other hand, if you instead want to remove an axis from the array, use the squeeze() method.
• It removes every axis of length one. This means that if you have created a 2 x 2 x 1 array,
squeeze() will remove the third dimension, leaving a 2 x 2 array.
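A sketch of both operations, using hypothetical arrays and printing only the shapes, since the values themselves do not change:

```python
import numpy as np

a = np.array([1, 2, 3])
row = np.expand_dims(a, axis=0)  # new axis in front
col = np.expand_dims(a, axis=1)  # new axis at the end

print(a.shape)    # (3,)
print(row.shape)  # (1, 3)
print(col.shape)  # (3, 1)

b = np.zeros((2, 2, 1))
squeezed = np.squeeze(b)  # drops the axis of length one

print(b.shape)         # (2, 2, 1)
print(squeezed.shape)  # (2, 2)
```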
25. Numpy- Arrays Join and Split
• Joining means merging two or more arrays.
• We use the concatenate() function to join arrays.
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr = np.concatenate((arr1, arr2))
print(arr)
Output: [1 2 3 4 5 6]
• Splitting means breaking one array into several.
arr = np.array([1, 2, 3, 4, 5, 6])
newarr = np.array_split(arr, 3)
print(newarr)
Output: [array([1, 2]), array([3, 4]), array([5, 6])]
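Two cases the one-dimensional examples do not cover, sketched with hypothetical arrays: joining 2D arrays along a chosen axis, and splitting an array whose length does not divide evenly.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

stacked_rows = np.concatenate((a, b), axis=0)  # b's rows placed below a's
stacked_cols = np.concatenate((a, b), axis=1)  # b's columns placed beside a's

print(stacked_rows.shape)  # (4, 2)
print(stacked_cols.shape)  # (2, 4)

# array_split accepts uneven splits: 7 elements into 3 parts gives sizes 3, 2, 2
parts = np.array_split(np.arange(7), 3)
print([p.tolist() for p in parts])  # [[0, 1, 2], [3, 4], [5, 6]]
```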
26. Pandas
• Data Analysis Tool
• Used for exploring, manipulating, analyzing data.
• The source code for Pandas is found in this GitHub repository:
https://github.com/pandas-dev/pandas
• Pandas converts messy data into a readable format suited to analysis.
27. Pandas Installation and Importing
Pre-requirements: Python and Python Package Installer(pip)
Installation: pip install pandas
Import: After installation, import the package with the “import” keyword.
import pandas
If this statement runs without an error, the Pandas package is properly installed and ready to use.
28. Pandas -Series and Dataframes
• A Series is a one-dimensional labeled array holding data of one type
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar)
Output:
0    1
1    7
2    2
dtype: int64
• A DataFrame is a two-dimensional labeled table made up of rows and columns
Loading data into a DataFrame:
import pandas as pd
data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}
df = pd.DataFrame(data)
print(df)
Output:
   calories  duration
0       420        50
1       380        40
2       390        45
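Once the data is in a Series or DataFrame, values are looked up by label. A sketch reusing the data above; the index labels "x", "y", "z" are hypothetical additions for illustration.

```python
import pandas as pd

# Series with custom labels instead of the default 0, 1, 2
s = pd.Series([1, 7, 2], index=["x", "y", "z"])
print(s["y"])  # 7

df = pd.DataFrame({
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
})

print(df["calories"])           # one column, returned as a Series
print(df.loc[1])                # one row, selected by its index label
print(df.loc[1, "duration"])    # 40
print(df["duration"].max())     # 50
```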
29. Pandas: Read CSV
• It is used to read CSV (comma-separated values) files.
• The pd.read_csv() function is used.
import pandas as pd
df = pd.read_csv('data.csv')
Input file: data.csv. The file is read and stored as a DataFrame in the df variable.
When we print a large df, we get the first 5 rows and the last 5 rows of the data by default.
df.head(10): Print first 10 rows
df.tail(10): Print last 10 rows
df.info(): Print information about the data
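A self-contained sketch of the same calls. Because the contents of data.csv are not shown, the CSV text below is hypothetical and is read from an in-memory buffer instead of a file; with a real file, pass its path to pd.read_csv instead.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for data.csv
csv_text = """duration,pulse,calories
60,110,409.1
60,117,479.0
45,109,282.4
45,117,406.0
30,103,195.1
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.head(2))  # first 2 rows
print(df.tail(2))  # last 2 rows
df.info()          # column names, non-null counts, and dtypes
```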
30. Python Matplotlib
• Graph Plotting Library
• Created by John D. Hunter
• The source code for Matplotlib is located in this GitHub repository:
https://github.com/matplotlib/matplotlib
• It makes use of NumPy, the numerical mathematics extension of Python
• Version 2.2.0 was released in 2018; newer stable releases have followed.
31. Matplotlib Installation and Importing
Pre-requirements: Python and Python Package Installer(pip)
Installation: pip install matplotlib
Import: After installation, import the package with the “import” keyword.
import matplotlib
If this statement runs without an error, the Matplotlib package is properly installed and ready to use.
32. Matplotlib Pyplot
• Most Matplotlib utilities live in the pyplot submodule, which is usually imported under the alias plt as shown below:
import matplotlib.pyplot as plt
Now, pyplot can be referred to as plt
• The plot() function is used to draw lines from point to point
• The show() function is used to display the graph
import matplotlib.pyplot as plt
import numpy as np
xpoints = np.array([0, 6])
ypoints = np.array([0, 250])
plt.plot(xpoints, ypoints)
plt.show()
33. Matplotlib Functions
• The xlabel() and ylabel() functions are used to add axis labels
• The subplots() function is used to draw multiple plots in one figure
• The scatter() function is used to construct scatter plots
• The bar() function is used to draw bar graphs
[Figures: Scatter Plot, Bar Plot]
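A sketch that combines these functions: one figure with a scatter plot and a bar plot side by side. The data, labels, and output file name are hypothetical. The axes-level methods set_xlabel() and set_ylabel() are used here as the per-subplot counterparts of xlabel() and ylabel(), and the Agg backend is selected only so the sketch runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; remove to use plt.show() on a desktop

import matplotlib.pyplot as plt
import numpy as np

# Hypothetical sample data
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 1, 5, 3])
categories = ["A", "B", "C"]
counts = [3, 7, 5]

# One figure, two plots side by side
fig, (ax_scatter, ax_bar) = plt.subplots(1, 2, figsize=(8, 3))

ax_scatter.scatter(x, y)
ax_scatter.set_xlabel("x value")
ax_scatter.set_ylabel("y value")

ax_bar.bar(categories, counts)
ax_bar.set_xlabel("category")
ax_bar.set_ylabel("count")

fig.tight_layout()
fig.savefig("scatter_and_bar.png")  # hypothetical output file name
```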