2. Introduction
A toolbox is a box in which many different built-in functions are stored. Toolboxes help data scientists and programmers perform tasks efficiently and successfully. Choosing the right toolbox can save a lot of time when completing a specific task within the targeted time. Using toolboxes also enhances the overall performance of any kind of task, such as analyzing a big data set and computing the desired result. For example, if we want to calculate the correlation coefficient, it is impractical to use a single piece of hand-written code to handle a big data set or extract the desired information from it. Here a toolbox can help us perform the task effectively: using a toolbox, we can call different built-in functions to perform the desired task, and with several toolboxes at hand we can work on all of these kinds of tasks simultaneously, as the short sketch below illustrates.
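As a hedged illustration of that correlation example, here is a minimal sketch using NumPy's corrcoef function; the two arrays are made-up data.
import numpy as np
# made-up illustration data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is the Pearson correlation between x and y
print(np.corrcoef(x, y)[0, 1])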
3. Different Tools and Their Benefits
There are different kinds of tools, and using them is very beneficial for a data scientist. Examples of toolboxes are:
• Statistical Tools
R/Python - used for statistical analysis.
Basic statistics such as mean, median, mode and standard deviation also belong to the statistical tools (a short sketch of these appears at the end of this section).
• Mathematical Tools
SAS - strong data analysis abilities, data management, data encryption.
Matlab - a numeric computing environment with a powerful graphics library that can process complex mathematical operations.
• Database Tools
Apache Cassandra - an open-source, highly scalable NoSQL database for managing massive amounts of data quickly.
SQL - very popular and widely used; in data science it is recommended for its (i) flexibility, (ii) ease of use, (iii) lack of redundancy and (iv) reliability.
In data science, the statistical toolbox is not the only toolbox: data science also needs mathematical calculations/functions and a database to read and write data, so all of them together can be called the data science toolset. Toolboxes hold a lot of benefits for any data scientist. Here are some tasks that can be done using toolboxes:
• Big data analysis
• Handling massive volumes of data
• Collecting large-scale data sets
• Building a structure for the operational data
• Finding patterns and deriving valuable insights from a chosen data set.
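As promised above, here is a minimal sketch of those basic statistics using Python's standard-library statistics module; the data values are invented for illustration.
import statistics
data = [2, 4, 4, 4, 5, 5, 7, 9]  # invented sample
print(statistics.mean(data))    # 5.0
print(statistics.median(data))  # 4.5
print(statistics.mode(data))    # 4 (most frequent value)
print(statistics.stdev(data))   # sample standard deviation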
4. Toolboxes' Advantages over Other Programming Languages, and Similarities among Them
There are many programming languages, like C, Fortran, C++, Java etc., which are generally used for developing high-performance production code or for prototyping a certain kind of task or project. The problem is that in those languages many basic tools are not available, so the same things have to be re-implemented again and again. The advantages of toolboxes over plain programming languages are:
A toolbox has a number of built-in functions which can be used anywhere in the code just by calling them.
There is no need to write specific code for a specific task when using a toolbox, because we can perform the needed task by calling the specific function stored in the toolbox.
We can avoid re-implementing anything when introducing any kind of new functionality into the task.
Toolboxes are easy to use, and all the basic functions are available in them.
To find the similarities among them, we can identify some basic similarities in the working procedure. A toolbox is a collection of built-in functions, while a programming language also gains its own functions by declaring them in the code. To accomplish a task using a toolbox, we call the specific built-in function in the code; in a plain programming language, the needed function has to be written to complete the same task. If we consider the performance of completing a task, we can find further similarities: both can be used for developing high-performance production code, for prototyping and for building data structures. From an environmental perspective they are also similar: both support object-oriented programming, and both have basic statements for functional programming in their core libraries.
5. Why is Python the best choice?
• Python is a widely used and very popular programming language. It has great properties for those who are new to writing computer programs, or who have never programmed at all. Moreover, Python has the features for doing data science tasks effectively: as noted above, data science is not only about statistical functions but also involves mathematical and database functions, and the combination of these three is what we call data science. The Python toolboxes cover all of them, which is a major reason to choose Python. Among its most remarkable properties are easy-to-read code, suppression of non-mandatory delimiters, dynamic typing and dynamic memory usage. Python is an interpreted language, so code is executed immediately in a Python console such as IPython, which gives us a richer environment in which to execute Python code. Flexibility is another reason for choosing Python: thanks to this characteristic it can be seen as a multiparadigm language. It can interoperate with other languages, it supports the object-oriented paradigm, and C code can be mixed with Python code using Cython. Python also has basic statements for functional programming in its own core library. The large ecosystem is another major reason for choosing Python.
6. Python Libraries for Data Scientists and Their Usage
• The Python community offers a huge number of developed toolboxes, and it is very exciting to know that most of them can be used for data science. The most popular Python toolboxes for any data scientist are:
NumPy
SciPy
Pandas
Scikit-Learn
7. NumPy and SciPy
NumPy is known as the basis of the scientific computing toolboxes. It serves various kinds of operational functions. SciPy, in turn, is a collection of domain-specific toolboxes, and it also offers several statistical and mathematical functions.
NumPy is the library for scientific computing with Python.
It provides multidimensional arrays with basic operations on them.
It is very useful for linear algebra functions.
Several toolboxes use the NumPy array representation as an efficient basic data structure.
SciPy provides a collection of numeric algorithms and domain-specific toolboxes.
SciPy can process signals, perform optimization and handle statistical tasks.
SciPy is complemented by Matplotlib, the plotting library, which offers many tools for data visualization.
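To make the division of labor concrete, here is a small hedged sketch: NumPy supplies the array and its basic operations, and SciPy's stats toolbox supplies a domain-specific routine. The sample data are random and purely illustrative.
import numpy as np
from scipy import stats
# NumPy: a multidimensional array with a basic operation on it
a = np.arange(12).reshape(3, 4)  # 3x4 array holding 0..11
print(a.mean(axis=0))            # mean of each column
# SciPy: a statistical routine from the scipy.stats toolbox
sample = np.random.normal(loc=0.0, scale=1.0, size=100)
t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)
print(t_stat, p_value)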
8. Scikit-Learn
• It is a machine learning library built on top of NumPy, SciPy and Matplotlib.
• It offers simple and efficient tools for common tasks in data analysis, such as:
Classification
Regression
Clustering
Dimensionality reduction
Model selection
Preprocessing
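A minimal classification sketch with Scikit-Learn follows; the bundled iris dataset and the k-nearest-neighbors model are chosen only for illustration.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
# load a small bundled dataset and hold out a test split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
clf = KNeighborsClassifier(n_neighbors=3)  # a simple classifier
clf.fit(X_train, y_train)                  # training
print(clf.score(X_test, y_test))           # accuracy on the test split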
9. Pandas
Pandas has both statistical and database tools, and it provides high performance, different types of tools and several key features.
• It provides high-performance data structures and data-analysis tools.
• A key feature is its fast and efficient DataFrame object for data manipulation with integrated indexing.
• The DataFrame can be seen as a spreadsheet, but it offers much more flexibility.
• In Pandas we can easily transform any dataset in the way we want: reshaping it, and adding or removing columns or rows.
• It provides high-performance functions for aggregating, merging and joining datasets.
• Pandas also has tools for importing and exporting data in different formats, like:
CSV
Microsoft Excel
SQL databases
The fast HDF5 format.
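A hedged sketch of that import/export round trip follows; the file names are placeholders, and to_excel/to_hdf need the optional openpyxl and PyTables packages installed.
import pandas as pd
df = pd.read_csv("data.csv")            # import from CSV
df.to_excel("data.xlsx", index=False)   # export to Microsoft Excel
df.to_hdf("data.h5", key="table")       # export to the fast HDF5 format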
10. Data Science Ecosystem
• After choosing Python, we can set up a data scientist's Python ecosystem by installing individual toolboxes, or by performing a bundled installation with all the needed toolboxes. For those who are new to this, one option is to install the toolboxes mentioned above one by one, on top of Python 2.X or Python 3.X, in that exact order.
• However, if a bundled installation is chosen, the Anaconda Python distribution is a good option, because the Anaconda distribution integrates all the Python toolboxes and applications needed by the data scientist into a single directory without mixing them with other Python toolboxes installed on the machine. It includes toolboxes and applications such as NumPy, Pandas, SciPy, Matplotlib, Scikit-Learn, IPython and Spyder, as well as more specific tools for other related tasks such as data visualization, code optimization and big data processing.
11. IDEs (Integrated Development Environments)
• An integrated development environment is a piece of software that is an essential tool for the data scientist. IDEs were created to serve different purposes for data scientists as well as programmers, and over the years this software has evolved to make the coding task less complicated. Selecting the right IDE is very important for each person, and unfortunately there is no "one size fits all" programming environment. The best solution is to try the most popular IDEs; their core components are the editor, the compiler and the debugger. Some IDEs can be used with multiple programming languages, supported through language-specific plugins, such as NetBeans or Eclipse.
• In the case of Python there are a large number of specific IDEs, both commercial, such as PyCharm and WingIDE, and open source. The open-source community helps IDEs to spring up, so anyone can customize their own environment and share it with the rest of the community. For example, Spyder (the Scientific Python Development Environment) is an IDE customized with the tasks of the data scientist in mind.
14. WIDE (Web Integrated Development Environment) - Jupyter
• Python has also been adapted for web applications, giving rise to a new generation of IDEs for interactive languages. Nowadays such sessions are called notebooks, and they are used not only in classrooms but also to show results in presentations or on business dashboards. The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. Uses include data cleaning and transformation, numerical simulation, statistical modeling, machine learning and much more. The recent spread of such notebooks is mainly due to IPython. Since December 2011, IPython has been issued as a browser version of its interactive console, called the IPython notebook, which shows Python execution results very clearly and concisely by means of cells. Cells can contain content other than code: for example, markdown cells can be added to introduce algorithms, and it is also possible to insert Matplotlib graphics to illustrate examples, or even web pages. The IPython notebook has since been separated from the IPython software and has become part of a larger project, Jupyter (named especially for Julia, Python and R), which aims to reuse the same WIDE for all these interpreted languages, not just Python. All old IPython notebooks are automatically imported to the new version when they are opened with the Jupyter platform.
15. Python, Used in Data Science
• So far we have come to know the Python ecosystem, the toolboxes it contains, the interactive IDEs available in that environment and their wide range of uses.
16. The Jupyter Notebook Environment
• Let us now discuss the Jupyter Notebook environment. We can start by launching the Jupyter notebook platform, which can be done by simply typing the following command in a terminal or command line:
$ jupyter notebook
• But if we chose the bundled installation, we can start the Jupyter notebook platform by clicking on the Jupyter Notebook icon installed by Anaconda in the start menu or on the desktop. If we use the command line, the root directory is the same directory where we launched the Jupyter notebook; otherwise, if we use the Anaconda launcher, the root directory is the current user directory. Now, to start a new notebook, we only need to press the
New Notebook -> Python 2
17. • button at the top on the right of the home page. We then import the toolboxes we will need for our program. In the first cell we put the code to import the Pandas library as pd. This is for convenience: every time we need to use some functionality from the Pandas library, we will write pd instead of pandas. We will also import the two core libraries mentioned above: the NumPy library as np and the Matplotlib library as plt.
• The commands to write are:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
• To execute just one cell, we press the run button in the toolbar, click Cell -> Run, or press Ctrl + Enter. While execution is underway, the header of the cell shows the * mark:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
18. • While a cell is being executed, no other cell can be executed; if we try to execute another cell, its execution will not start until the first cell has finished. Once the execution is finished, the header of the cell is replaced by the next execution number. Since this is the first cell executed, the number shown will be 1. If the libraries were imported correctly, no output cell is produced.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
19. The DataFrame Data Structure
• The key data structure in Pandas is the DataFrame object. A DataFrame is basically a tabular data structure, with rows and columns. Rows have a specific index to access them, which can be any name or value. In Pandas, the columns are called Series, a special type of data which in essence consists of a list of several values, where each value has an index. Therefore, the DataFrame data structure can be seen as a spreadsheet, but it is much more flexible. To understand how it works, let us see how to create a DataFrame from a common Python dictionary of lists (shown at the end of this section, after two simpler Series examples). First, we will create a new cell by clicking Insert -> Insert Cell Below or pressing Ctrl+B.
For example, the following code:
import pandas as pd
# a simple list of ints
values = [1, 2, 3, 4, 5]
20. # create a Series from the int list
res = pd.Series(values)
print(res)
The result will look like:
0    1
1    2
2    3
3    4
4    5
dtype: int64
import pandas as pd
dic = {'Id': 1013, 'Name': 'Sudipto', 'State': 'Khulna', 'Age': 27}
res = pd.Series(dic)
print(res)
The result will look like:
Id         1013
Name    Sudipto
State    Khulna
Age          27
dtype: object
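Finally, here is the DataFrame created from a common Python dictionary of lists, as described at the start of this section; the names and ages are invented.
import pandas as pd
data = {'Name': ['Anna', 'Bob', 'Carl'],
        'Age': [27, 31, 24]}
df = pd.DataFrame(data)   # each key becomes a column
print(df)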
Apart from creating DataFrame data structures, Pandas offers a lot of functions to manipulate them. Among other things, it offers functions for aggregation, manipulation and transformation of the data. In the following sections we will introduce some of these functions.
21. Data Analysis Example Using Pandas
• To see how we can use Pandas on a simple real problem, we will start by doing some basic analysis of real data. For the sake of transparency, the data used must be open, meaning that they can be freely used, reused, and distributed by anyone.
• Pandas is a Python library that provides extensive means for data analysis. Data
scientists often work with data stored in table formats like .csv, .tsv, or .xlsx. Pandas
makes it very convenient to load, process, and analyze such tabular data using
SQL-like queries. In conjunction with Matplotlib and Seaborn, Pandas provides a
wide range of opportunities for visual analysis of tabular data.
• The main data structures in Pandas are implemented with the Series and DataFrame classes. The former is a one-dimensional indexed array of some fixed data type. The latter is a two-dimensional data structure - a table - where each column contains data of the same type. You can see it as
a dictionary of Series instances. DataFrames are great for representing real
data: rows correspond to instances (examples, observations, etc.), and
columns correspond to features of these instances.
import numpy as np
import pandas as pd
pd.set_option("display.precision", 2)
22. • We will demonstrate the main methods in action by analyzing a dataset on the churn rate of telecom operator clients. Let's read the data and take a look at the first 5 lines using the head method:
df = pd.read_csv("../input/telecom_churn.csv")
df.head()
• When printing a DataFrame in a Jupyter notebook, recall that each row corresponds to one client, an instance, and the columns are features of this instance.
print(df.shape)
(3333, 20)
• From the output we can see that the table contains 3333 rows and 20 columns. If we want to print out the column names, we use columns:
print(df.columns)
• We can use the info() method to print some general information about the DataFrame:
print(df.info())
23. • We see that one feature is logical (bool), 3 features are of type object, and 16 features are numeric. With this same method, we can easily see if there are any missing values. Here there are none, because each column contains 3333 observations, the same number of rows we saw before with shape.
• We can change the column type with the astype method. Let's apply this to the Churn feature to convert it into int64:
df["Churn"] = df["Churn"].astype("int64")
• The describe method shows basic statistical characteristics of each numeric feature (int64 and float64 types): the number of non-missing values, mean, standard deviation, range, median, and the 0.25 and 0.75 quartiles.
df.describe()
• In order to see statistics on non-numerical features, one has to explicitly indicate the data types of interest in the include parameter:
df.describe(include=["object", "bool"])
24. • To delete columns or rows, use the drop method, passing the required indexes
and the axis parameter (1 if you delete columns, and nothing or 0 if you delete
rows). The inplace argument tells whether to change the original DataFrame.
With inplace=False, the drop method doesn't change the existing DataFrame
and returns a new one with dropped rows or columns. With inplace=True, it
alters the DataFrame.
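The "just created" columns are not shown on this slide; a plausible sketch of the step that creates them (assuming the charge and call columns of the telecom dataset used above) might be:
# assumed column names from the telecom_churn dataset
df["Total charge"] = (df["Total day charge"] + df["Total eve charge"]
                      + df["Total night charge"] + df["Total intl charge"])
df["Total calls"] = (df["Total day calls"] + df["Total eve calls"]
                     + df["Total night calls"] + df["Total intl calls"])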
# get rid of just created columns
df.drop(["Total charge", "Total calls"], axis=1, inplace=True)
# and here is how you can delete rows
df.drop([1, 2]).head()
25. Reading Data
• Let us read the data that we downloaded. First of all, we have to create a new notebook called Open Government Data Analysis and open it. Then, after ensuring that the educ_figdp_1_Data.csv file is stored in the same directory as our notebook, we will write the following code to read and show the content:
edu = pd.read_csv('files/ch02/educ_figdp_1_Data.csv', na_values=':', usecols=["TIME", "GEO", "Value"])
edu
• Besides this, Pandas also has functions for reading files with formats such as Excel, HDF5, tabulated files, or even the content of the clipboard (read_excel(), read_hdf(), read_table(), read_clipboard()).
• If we want to know the names of the columns or the names of the indexes, we can use the
DataFrame attributes columns and index respectively. The names of the columns or indexes can
be changed by assigning a new list of the same length to these attributes. The values of any
DataFrame can be retrieved as a Python array by calling its values attribute. If we just want quick
statistical information on all the numeric columns in a DataFrame, we can use the function
describe(). The result shows the count, the mean, the standard deviation, the minimum and
maximum, and the percentiles, by default, the 25th, 50th, and 75th, for all the values in each
column or series.
edu.describe()
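A short sketch of the attributes just mentioned (output omitted):
print(edu.columns)  # the column names
print(edu.index)    # the row index
print(edu.values)   # the data as a plain array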
26. Selecting Data
• If we want to select a subset of data from a DataFrame, it is necessary to indicate this subset using square brackets ([ ]) after the DataFrame. The subset can be specified in several ways. If we want to select only one column from a DataFrame, we only need to put its name between the square brackets. The result will be a Series data structure, not a DataFrame, because only one column is retrieved:
edu['Value']
• If we want to select a subset of rows from a DataFrame, we can do so by indicating a range of rows separated by a colon (:) inside the square brackets. This is commonly known as a slice of rows:
edu[10:14]
• This instruction returns the slice of rows from the 10th to the 13th position; the last row of that output, for example, reads: 13 2001 European Union (27 countries) 4.99. Note that the slice does not use the index labels as references, but the position. In this case, the labels of the rows simply coincide with the positions of the rows. If we want to select a subset of columns and rows using the labels as our references instead of the positions, we can use ix indexing (in current Pandas versions the label-based loc indexer plays this role):
edu.ix[90:94, ['TIME', 'GEO']]
27. Filtering Data
• Another way to select a subset of data is by applying Boolean indexing. This indexing is commonly known as a filter. For instance, if we want to filter for values greater than 6.5, we can do it like this:
edu[edu['Value'] > 6.5].tail()
• The Boolean operation edu['Value'] > 6.5 produces a Boolean mask. When an element in the "Value" column is greater than 6.5, the corresponding value in the mask is set to True; otherwise it is set to False. Then, when this mask is applied as an index in edu[edu['Value'] > 6.5], the result is a filtered DataFrame containing only rows with values higher than 6.5. Of course, any of the usual Boolean operators can be used for filtering: < (less than), <= (less than or equal to), > (greater than), >= (greater than or equal to), == (equal to), and != (not equal to).
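Boolean masks built from these operators can also be combined with & (and) and | (or); a small hedged sketch, assuming the TIME column holds numeric years:
# each condition needs its own parentheses before combining
mask = (edu['Value'] > 6.5) & (edu['TIME'] > 2005)
edu[mask].head()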
28. Filtering Missing Values
• Pandas uses the special value NaN (not a number) to represent missing values. In Python, NaN is a special floating-point value returned by certain operations when one of their results ends in an undefined value. A subtle feature of NaN values is that two NaNs are never equal. Because of this, the only safe way to tell whether a value is missing in a DataFrame is by using the isnull() function. Indeed, this function can be used to filter rows with missing values:
edu[edu["Value"].isnull()].head()
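Beyond filtering them, missing values can also be dropped or replaced; a short sketch using the standard Pandas methods dropna and fillna:
edu.dropna(how='any', subset=['Value']).head()  # drop rows whose Value is NaN
edu.fillna(value={'Value': 0}).head()           # or replace NaN with a constant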
29. Manipulating Data
• To manipulate data, we first need to know how to select the desired data. One of the most straightforward things we can do is to operate on columns or rows using aggregation functions. If a function is applied to a DataFrame or a selection of rows and columns, you can specify whether the function should be applied to the rows for each column (setting the axis=0 keyword on the invocation of the function) or to the columns for each row (setting the axis=1 keyword on the invocation of the function):
edu.max(axis=0)
• Note that these are functions specific to Pandas, not the generic Python functions; there are differences in their implementation. In Python, NaN values propagate through all operations without raising an exception. In contrast, Pandas operations exclude NaN values, interpreting them as missing data. For example, the Pandas max function excludes NaN values, while the standard Python max function takes the mathematical interpretation of NaN and returns it as the maximum:
Input:
print("Pandas max function:", edu['Value'].max())
print("Python max function:", max(edu['Value']))
Output:
Pandas max function: 8.81
Python max function: nan
30. • Besides these aggregation functions, we can apply operations over all the values in rows, columns or a selection of both. The rule of thumb is that an operation between columns means that it is applied to each row in that column, and an operation between rows means that it is applied to each column in that row. For example, we can apply any binary arithmetical operation (+, -, *, /) to an entire column:
Input:
s = edu["Value"] / 100
s.head()
Output:
0       NaN
1       NaN
2    0.0500
3    0.0503
4    0.0495
Name: Value, dtype: float64
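Arbitrary functions can be applied element-wise in the same spirit with the apply method; a brief sketch using NumPy's square root:
import numpy as np
s = edu['Value'].apply(np.sqrt)  # apply sqrt to every element of the column
s.head()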
31. Sorting
• An important functionality we will need when inspecting our data is sorting by columns. We can sort a DataFrame by any column using the sort_values function. If we want to see the first five rows of data sorted in descending order (i.e., from the largest to the smallest values) by the Value column, we just need to do this:
edu.sort_values(by='Value', ascending=False, inplace=True)
edu.head()
• Note that the inplace keyword means that the DataFrame will be overwritten, and hence no new DataFrame is returned. If instead of ascending=False we use ascending=True, the values are sorted in ascending order (i.e., from the smallest to the largest values). If we want to return to the original order, we can sort by the index using the sort_index function and specifying axis=0:
edu.sort_index(axis=0, ascending=True, inplace=True)
edu.head()
32. Ranking Data
• In statistics, "ranking" refers to the data transformation in which numerical or ordinal values are replaced by their rank when the data are sorted. If, for example, the numerical data 3.4, 5.1, 2.6, 7.3 are observed, the ranks of these data items would be 2, 3, 1 and 4, respectively (verified in a short sketch at the end of this section).
• Now we can perform the ranking using the rank function. Note here that the parameter ascending=False makes the ranking go from the highest values to the lowest values. The Pandas rank function supports different tie-breaking methods, specified with the method parameter. In our case we use the first method, in which ranks are assigned in the order they appear in the array, avoiding gaps between rankings.
pivedu = pivedu.drop([
    'Euro area (13 countries)',
    'Euro area (15 countries)',
    'Euro area (17 countries)',
    'Euro area (18 countries)',
    'European Union (25 countries)',
    'European Union (27 countries)',
    'European Union (28 countries)'
    ], axis=0)
pivedu = pivedu.rename(index={'Germany (until 1990 former territory of the FRG)': 'Germany'})
pivedu = pivedu.dropna()
pivedu.rank(ascending=False, method='first').head()
• If we want to make a global ranking taking into account all the years, we can sum up all the columns and rank the result. Then we can sort the resulting values to retrieve the top five countries for the last 6 years, in this way:
totalSum = pivedu.sum(axis=1)
totalSum.rank(ascending=False, method='dense').sort_values().head()
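As promised, the toy example from the start of this section can be checked directly with the rank function:
import pandas as pd
s = pd.Series([3.4, 5.1, 2.6, 7.3])
print(s.rank())  # yields the ranks 2.0, 3.0, 1.0, 4.0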
34. Plotting
• Pandas DataFrames and Series can be plotted using the plot function, which uses the Matplotlib graphics library. For example, if we want to plot the accumulated values for each country over the last 6 years, we can take the Series obtained in the previous example and plot it directly by calling the plot function as shown in the next cell:
totalSum = pivedu.sum(axis=1).sort_values(ascending=False)
totalSum.plot(kind='bar', style='b', alpha=0.4, title="Total Values for Country")
• It is also possible to plot a DataFrame directly. In this case, each column is treated as a separate Series. For example, instead of printing the accumulated value over the years, we can plot the value for each year:
my_colors = ['b', 'r', 'g', 'y', 'm', 'c']
ax = pivedu.plot(kind='barh', stacked=True, color=my_colors)
ax.legend(loc='center left', bbox_to_anchor=(1, .5))
35. Why a Toolbox Is an Improved Version of a Sub-functional Language
• A toolbox offers features beyond those of other programming languages, and toolboxes are the updated and improved version of any kind of sub-functional language, because a toolbox has all the features of the underlying programming language and more. In a toolbox we have every function ready to perform when it is needed, but in other programming languages all those features are not available as a package the way they are in a toolbox. We can call the built-in functions stored in a toolbox anywhere in the program without running into complications or errors. A sub-functional programming language does not offer those kinds of built-in functions; there we need to declare a function before we can use it, and sometimes such declared functions produce different kinds of errors, like missing arguments or functional errors. So, after considering all these points, we can say with confidence that a toolbox is definitely an improved version of other programming/sub-functional languages.
36. Conclusion
• Data science is like the sea, and the tools a data scientist uses are like the elements inside the sea water. To handle such massive tasks we need a complete package that runs efficiently, and the data scientist handles this in a smart manner with toolboxes. They help data scientists work more efficiently, with obvious performance benefits. We must mention Python's ecosystem as having all of those things: it offers a perfect way to perform like a pro. The Python ecosystem offers a data scientist a complete package for leading tasks in an efficient manner when developing any data science project.