Toolboxes
for
Data Scientists
Sudipto Krishna Dutta
20204021
Introduction to Data Science
Jahangirnagar University
Introduction
 A toolbox is a collection of built-in functions. Toolboxes help data
scientists and programmers perform tasks efficiently and successfully.
Choosing the right toolbox can save a lot of time when completing a
specific task within a deadline, and using toolboxes also enhances overall
performance on tasks such as analyzing a big data set and computing a
desired result. For example, if we want to calculate the correlation
coefficient, it is impractical to handle a big data set, or to extract the
desired information from it, with a single piece of hand-written code.
A toolbox lets us perform this task effectively: we simply call the
appropriate built-in functions, and several toolboxes can be used
simultaneously.
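As a minimal sketch (assuming NumPy is installed; the numbers are made up), a single built-in function computes the correlation coefficient mentioned above:
import numpy as np
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # hypothetical measurements
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])  # hypothetical responses
# np.corrcoef returns the correlation matrix; the off-diagonal
# entry is the Pearson correlation coefficient between x and y
print(np.corrcoef(x, y)[0, 1])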
Different Tools and their benefits
There are different kinds of tools, and using them is very beneficial for a data scientist. Toolboxes
include:
• Statistical Tool
R/Python - used for statistical analysis.
Statistical tools cover measures such as the mean, median, mode and standard deviation.
• Mathematical Tool
SAS – strong data analysis abilities, data management, data encryption
Matlab – a numeric computing environment with a powerful graphics library that can process
complex mathematical operations.
• Database Tool
Apache Cassandra - an open-source, highly scalable NoSQL database for managing
massive amounts of data quickly.
SQL – very popular and widely used; in data science it is recommended for its (i)
flexibility, (ii) ease of use, (iii) lack of redundancy and (iv) reliability.
In data science, the statistical toolbox is not the only toolbox needed: data science also requires
mathematical calculations/functions and a database to read and write data. Together, all of them
make up the data science toolset. Toolboxes bring many benefits to any data scientist. Here are
some tasks that can be done using a toolbox:
• Big data analysis
• Handling massive volumes of data
• Collecting large-scale data sets
• Building a structure for the operational data
• Identifying patterns and deriving valuable insights from a chosen data set.
Toolboxes' advantages over other programming languages
and similarities among them
 There are many programming languages, like C, Fortran, C++ and Java, that are generally
used for developing high-performance production code or for prototyping a task or
project. The problem is that in those languages many basic tools are not available and must
be re-implemented again and again. The advantages of toolboxes over such
programming languages include:
 A toolbox has a number of built-in functions that can be used anywhere in the code simply by
calling them.
 There is no need to write task-specific code from scratch, because the needed task can be
performed by calling the specific function stored in the toolbox.
 We can avoid re-implementing anything when introducing new functionality into a
task.
 Toolboxes are easy to use, and all the basic functions are available.
 To find the similarities among them, we can look at the working procedure. A toolbox is a
collection of built-in functions, while a programming language also gains functions once they
are declared in the code. To accomplish a task with a toolbox we call the specific built-in
function, while in a programming language the needed function must be written to complete
the same task. Considering how tasks are accomplished, the similarities become clear: both
can be used for developing high-performance production code, for prototyping, and for
building data structures. Both support object-oriented programming, and both provide basic
statements for functional programming in their core libraries.
Why is Python the best choice?
• Python is a widely used and very popular programming language. It has
great properties for those who are new to writing computer programs, or
who have never programmed at all. Python has the features for doing data
science tasks effectively, and since data science is not only about
statistical functions but also about mathematical and database functions,
the combination of those three is what we call data science; in Python's
tools we can see all of them. This is a major reason to choose Python.
Its most remarkable properties include easy-to-read code, suppression
of non-mandatory delimiters, dynamic typing and dynamic memory management.
Because Python is an interpreted language, code is executed immediately in
a Python console such as IPython, which gives us a richer environment in
which to execute Python code. Flexibility is another reason for choosing
Python: it can be seen as a multiparadigm language. It has the
ability to interoperate with other languages, it supports the object-
oriented paradigm, and C code can be mixed with Python
code using Cython. Python also has basic statements for functional
programming in its own core library (see the sketch below). Its large
ecosystem is another major reason for choosing Python.
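• As a small illustration of those core-library functional statements (a sketch with made-up data; map, filter and lambda ship with Python itself):
squares = list(map(lambda n: n * n, range(5)))         # [0, 1, 4, 9, 16]
evens = list(filter(lambda n: n % 2 == 0, range(10)))  # [0, 2, 4, 6, 8]
print(squares, evens)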
Python libraries for Data Scientists and their usage
• The Python community offers a huge number of developed toolboxes, and it is
very exciting to know that most of them can be used for data science. The
most popular Python toolboxes for any data scientist are:
 NumPy
 SciPy
 Pandas
 Scikit-Learn
NumPy and SciPy
 NumPy is known as the basis of the scientific computing toolboxes. It
provides various kinds of operational functions, while SciPy is a domain-
specific toolbox built on top of it with several additional functions.
Together they cover statistical and mathematical tools.
 NumPy is the fundamental package for scientific computing with Python.
 It provides multidimensional arrays with basic operations on them (a short
sketch follows this list).
 It is very useful for linear algebra functions.
 Several toolboxes use the NumPy array representation as an
efficient basic data structure.
 SciPy provides a collection of numeric algorithms and domain-specific
toolboxes.
 SciPy can do signal processing and optimization and handle statistical tasks.
 SciPy is complemented by the plotting library Matplotlib, which provides
many tools for data visualization.
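 A short sketch of those basic NumPy operations (assuming only NumPy; the values are made up):
import numpy as np
a = np.array([[1, 2], [3, 4]])  # a 2x2 multidimensional array
print(a + 10)                   # elementwise basic operation
print(a.T @ a)                  # linear algebra: matrix product
print(np.linalg.inv(a))         # inverse, from the linear algebra module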
Scikit-Learn
• It is a machine learning library built on NumPy, SciPy and
Matplotlib.
• It offers simple and efficient tools for common tasks in data
analysis (a short sketch follows this list), such as:
 Classification
 Regression
 Clustering
 Dimensionality reduction
 Model selection
 Preprocessing
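• A minimal classification sketch (assuming Scikit-Learn is installed; the iris data is one of its bundled example datasets):
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)       # features and labels
clf = LogisticRegression(max_iter=200)  # a simple classifier
clf.fit(X, y)                           # learn from the data
print(clf.score(X, y))                  # mean accuracy on the training data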
Pandas
Pandas offers both statistical and database tools, high performance, and
several key features.
• It provides high-performance data structures and data analysis tools.
• Its key feature is a fast and efficient DataFrame object for data
manipulation with integrated indexing.
• The DataFrame can be seen as a spreadsheet, but one that is much more flexible.
• In Pandas we can easily transform any dataset the way we want:
reshaping it, or adding and removing columns or rows.
• It provides high-performance functions for aggregating, merging and joining
datasets.
• Pandas also has tools for importing and exporting data in different formats
(a short I/O sketch follows the list), like:
 CSV
 Microsoft Excel
 SQL databases
 Fast HDF5 format
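• A brief import/export sketch (the file names are hypothetical, and the Excel and HDF5 calls need their optional dependencies, e.g. openpyxl and PyTables):
import pandas as pd
df = pd.read_csv("data.csv")       # import from CSV
xls = pd.read_excel("data.xlsx")   # import from Microsoft Excel
df.to_hdf("data.h5", key="table")  # export to the fast HDF5 format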
Data Science Eco System
• After choosing Python, we can set up a data scientist's Python
ecosystem either by installing individual toolboxes or by performing a
bundled installation with all the needed toolboxes. For newcomers,
the toolboxes mentioned above can be installed one by one, for either
Python 2.X or Python 3.X, in the given order.
• However, if a bundled installation is chosen, the Anaconda
Python distribution is a good option, because the Anaconda
distribution integrates all the Python toolboxes
and applications needed by a data scientist into a single
directory, without mixing them with other Python toolboxes
installed on the machine. It contains toolboxes and applications such
as NumPy, Pandas, SciPy, Matplotlib, Scikit-Learn, IPython and
Spyder, as well as more specific tools for other related tasks such
as data visualization, code optimization and big data
processing.
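• For example, a comparable environment can also be assembled with conda, Anaconda's package manager (a sketch; exact package names may vary between distributions):
$ conda install numpy scipy pandas matplotlib scikit-learn ipython jupyter spyder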
IDE (Integrated Development Environments)
• An integrated development environment is software, and it is a very essential tool
for a data scientist. IDEs are created to serve different purposes for data scientists as
well as programmers, and over the years this software has evolved to
make the coding task less complicated. Selecting the right IDE is very
crucial for each person, and unfortunately there is no "one size fits all" programming environment.
The most basic pieces of any IDE are the editor, the compiler (or interpreter)
and the debugger. Some IDEs can be used with multiple programming languages
through language-specific plugins, such as NetBeans or Eclipse.
• In the case of Python there are a large number of specific IDEs, both commercial,
such as PyCharm and WingIDE, and open source. The open-source community
helps IDEs spring up, so anyone can customize their own environment and
share it with the rest of the community. For example, Spyder (the Scientific
Python Development Environment) is an IDE customized with the tasks of the data
scientist in mind.
WIDE (Web Integrated Development Environment) - Jupyter
• With the development of Python for web applications, a new
generation of IDEs for interactive languages has appeared. Nowadays, such sessions are
called notebooks, and they are used not only in classrooms but also
to show results in presentations or on business dashboards. The Jupyter
Notebook is an open-source web application that allows you to create and
share documents that contain live code, equations, visualizations and
explanatory text. Uses include data cleaning and transformation,
numerical simulation, statistical modeling, machine learning and much
more. The recent spread of such notebooks is mainly due to IPython. Since
December 2011, IPython has been issued as a browser version of its
interactive console, called the IPython notebook, which shows the Python
execution results very clearly and concisely by means of cells. Cells can
contain content other than code; for example, markdown cells can be
added to introduce algorithms. It is also possible
to insert Matplotlib graphics to illustrate examples, or even web pages.
The IPython notebook has since been separated from the IPython software and
has become part of a larger project, Jupyter, named for Julia, Python
and R, which aims to reuse the same WIDE for all these interpreted
languages and not just Python. All old IPython notebooks are
automatically imported to the new version when they are opened with the
Jupyter platform.
Python, Used in Data Science
• We have now seen the Python ecosystem, the toolboxes it contains, the
interactive IDEs in that environment, and their wide uses.
The Jupyter Notebook Environment
• Now let us discuss the Jupyter Notebook environment. We can start by
launching the Jupyter Notebook platform. This can be done by simply typing the
following command in a terminal or command line:
$ jupyter notebook
• If we chose the bundled installation instead, we can start the Jupyter Notebook
platform by clicking on the Jupyter Notebook icon installed by Anaconda in the
start menu or on the desktop. If we use the command line, the root directory is
the same directory where we launched Jupyter Notebook. Otherwise, if we use
the Anaconda launcher, the root directory is the current user directory. Now, to
start a new notebook, we only need to press the
New Notebook -> Python 2
• button at the top right of the home page. We begin by importing the toolboxes
that we will need for our program. In the first cell we put the code to import the
Pandas library as pd. This is for convenience: every time we need to use some
functionality from the Pandas library, we will write pd instead of pandas. We will
also import the two core libraries mentioned above: the NumPy library as np and
the Matplotlib library as plt.
• We need to write these commands:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
• To execute just one cell, we press the play button, click Cell ->
Run, or press the keys Ctrl + Enter. While execution is underway, the header of the
cell shows the * mark:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
• While a cell is being executed, no other cell can run; if we try to execute
another cell, its execution will not start until the first cell has finished.
Once the execution is finished, the header of the cell is replaced by its
execution number. Since this will be the first cell executed, the number shown
will be 1. If the libraries were imported correctly, no output cell is
produced.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
The DataFrame Data Structure
• The key data structure in Pandas is the DataFrame object. A DataFrame is basically a
tabular data structure, with rows and columns. Rows have a specific index to
access them, which can be any name or value. In Pandas, the columns are
called Series, a special type of data which in essence consists of a list of
several values, where each value has an index. Therefore, the DataFrame data
structure can be seen as a spreadsheet, but it is much more flexible. To
understand how it works, let us first look at Series, and then see how to create a
DataFrame from a common Python dictionary of lists. First, we create a new cell
by clicking Insert -> Insert Cell Below or pressing the keys Ctrl+B.
For example, the following code:
import pandas as pd
# a simple list of ints
nums = [1, 2, 3, 4, 5]
# create a Series from the list
res = pd.Series(nums)
print(res)
the result will be like:
0    1
1    2
2    3
3    4
4    5
dtype: int64
import pandas as pd
dic = {'Id': 1013, 'Name': 'Sudipto', 'State': 'Khulna', 'Age': 27}
res = pd.Series(dic)
print(res)
the result will be like: Id 1013
Name Sudipto
State Khulna
Age 27
dtype: object
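As mentioned above, a DataFrame itself can be created from a common Python dictionary of lists, where each key becomes a column (a minimal sketch with made-up values):
import pandas as pd
data = {'year': [2010, 2011, 2012],
        'value': [4.9, 5.0, 5.1]}
df = pd.DataFrame(data)
print(df)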
Apart from creating DataFrame data structures, Pandas offers a lot of functions to manipulate them. Among
other things, it offers functions for aggregation, manipulation and transformation of the data. In the
following sections, we will introduce some of these functions.
Data Analysis Example Using Pandas
• To use Pandas on a simple real problem, we will start by doing some basic
analysis of a dataset. For the sake of transparency, the data used must be
open, meaning that it can be freely used, reused and distributed by anyone.
• Pandas is a Python library that provides extensive means for data analysis. Data
scientists often work with data stored in table formats like .csv, .tsv or .xlsx. Pandas
makes it very convenient to load, process and analyze such tabular data using
SQL-like queries. In conjunction with Matplotlib and Seaborn, Pandas provides a
wide range of opportunities for visual analysis of tabular data.
• The main data structures in Pandas are implemented
with the Series and DataFrame classes. The former is a one-dimensional indexed
array of some fixed data type. The latter is a two-dimensional data structure -
a table - where each column contains data of the same type. You can see it as
a dictionary of Series instances. DataFrames are great for representing real
data: rows correspond to instances (examples, observations, etc.) and
columns correspond to features of these instances.
import numpy as np
import pandas as pd
pd.set_option("display.precision", 2)
• We will demonstrate the main methods in action by analyzing a dataset on the churn
rate of telecom operator clients. Let's read the data and take a look at the first 5 lines using
the head method:
df = pd.read_csv("../input/telecom_churn.csv")
df.head()
• When printing a DataFrame in a Jupyter notebook, recall that each row corresponds to
one client, an instance, and the columns are features of this instance.
print(df.shape)
(3333, 20)
• From the output, we can see that the table contains 3333 rows and 20 columns.
If we want to print out the column names, we can use the columns attribute:
print(df.columns)
• We can use the info() method to get some general information about the dataframe:
print(df.info())
• We see that one feature is logical (bool), 3 features are of type object, and
16 features are numeric. With this same method, we can easily see if
there are any missing values. Here there are none, because each column
contains 3333 observations, the same number of rows we saw before
with shape.
• We can change a column's type with the astype method. Let's apply it to
the Churn feature to convert it into int64:
df["Churn"] = df["Churn"].astype("int64")
• The describe method shows basic statistical characteristics of each numerical
feature (int64 and float64 types): the number of non-missing values, the mean,
standard deviation, range, median, and the 0.25 and 0.75 quartiles.
df.describe()
• In order to see statistics on non-numerical features, one has to explicitly
indicate the data types of interest in the include parameter.
df.describe(include=["object", "bool"])
• To delete columns or rows, use the drop method, passing the required indexes
and the axis parameter (1 if you delete columns, and nothing or 0 if you delete
rows). The inplace argument tells whether to change the original DataFrame.
With inplace=False, the drop method doesn't change the existing DataFrame
and returns a new one with the rows or columns dropped. With inplace=True, it
alters the DataFrame.
# get rid of just created columns
df.drop(["Total charge", "Total calls"], axis=1, inplace=True)
# and here is how you can delete rows
df.drop([1, 2]).head()
Reading Data
• To read the data that we downloaded, first of all we have to create a new notebook, called
Open Government Data Analysis, and open it. Then, after ensuring that the educ_figdp_1_Data.csv
file is stored in the same directory as our notebook, we write the following code to
read and show its content:
edu = pd.read_csv('files/ch02/educ_figdp_1_Data.csv', na_values=':',
                  usecols=["TIME", "GEO", "Value"])
edu
• Besides this, Pandas also has functions for reading files with formats such as Excel, HDF5, tabulated
files, or even the content of the clipboard
(read_excel(), read_hdf(), read_table(), read_clipboard()).
• If we want to know the names of the columns or the names of the indexes, we can use the
DataFrame attributes columns and index respectively. The names of the columns or indexes can
be changed by assigning a new list of the same length to these attributes. The values of any
DataFrame can be retrieved as a Python array by calling its values attribute. If we just want quick
statistical information on all the numeric columns in a DataFrame, we can use the function
describe(). The result shows the count, the mean, the standard deviation, the minimum and
maximum, and the percentiles, by default, the 25th, 50th, and 75th, for all the values in each
column or series.
edu.describe()
Selecting Data
• If we want to select a subset of data from a DataFrame, it is necessary to
indicate this subset using square brackets ([ ]) after the DataFrame. The
subset can be specified in several ways. If we want to select only one
column from a DataFrame, we only need to put its name between the
square brackets. The result will be a Series data structure, not a
DataFrame, because only one column is retrieved.
edu['Value']
• If we want to select a subset of rows from a DataFrame, we can do so by
indicating a range of rows separated by a colon (:) inside the square
brackets. This is commonly known as a slice of rows:
edu[10:14]
• For example, one row of the returned slice looks like this:
13  2001  European Union (27 countries)  4.99
This instruction returns the slice of rows from the 10th to the 13th
position. Note that the slice does not use the index labels as references,
but the position; in this case, the labels of the rows simply coincide with
the positions of the rows. If we want to select a subset of columns and rows
using the labels as our references instead of the positions, we can use ix
indexing (in recent versions of Pandas, ix has been removed and loc is used instead):
edu.ix[90:94, ['TIME', 'GEO']]
Filtering Data
• Another way to select a subset of data is by applying Boolean
indexing. This indexing is commonly known as a filter. For
instance, if we want to filter those rows with values higher than
6.5, we can do it like this:
edu[edu['Value'] > 6.5].tail()
• The Boolean operation edu['Value'] > 6.5 produces a Boolean
mask. When an element in the "Value" column is greater than
6.5, the corresponding value in the mask is set to True;
otherwise it is set to False. Then, when this mask is applied as
an index in edu[edu['Value'] > 6.5], the result is a filtered
DataFrame containing only rows with values higher than 6.5.
Of course, any of the usual Boolean operators can be used for
filtering:
< (less than), <= (less than or equal to), > (greater than), >=
(greater than or equal to), == (equal to), and != (not equal to).
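• Several conditions can be combined with the elementwise operators & (and), | (or) and ~ (not), wrapping each comparison in parentheses (a sketch, assuming the edu DataFrame loaded earlier):
edu[(edu['Value'] > 6.5) & (edu['TIME'] > 2005)].head()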
Filtering Missing Values
• Pandas uses the special value NaN (not a number) to represent missing
values. In Python, NaN is a special floating-point value returned by certain
operations when one of their results ends in an undefined value. A subtle
feature of NaN values is that two NaN are never equal. Because of this,
the only safe way to tell whether a value is missing in a DataFrame is by
using the isnull() function. Indeed, this function can be used to filter rows
with missing values:
edu[edu["Value"].isnull()].head()
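• The claim that two NaN are never equal can be checked directly (a minimal sketch using NumPy's nan value):
import numpy as np
print(np.nan == np.nan)  # False: NaN never equals anything, not even itself
print(np.isnan(np.nan))  # True: hence isnull()/isnan() rather than ==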
Manipulating Data
• To manipulate data, we first need to know how to select the desired data. One of the
most straightforward things we can do is to operate on columns or rows using
aggregation functions. If a function is applied to a DataFrame or a selection of
rows and columns, then you can specify whether the function should be applied to the
rows for each column (setting the axis=0 keyword on the invocation of the
function) or to the columns for each row (setting the axis=1
keyword on the invocation of the function).
edu.max(axis=0)
• Note that these are functions specific to Pandas, not the generic Python functions;
there are differences in their implementation. In Python, NaN values propagate
through all operations without raising an exception. In contrast, Pandas operations
exclude NaN values, interpreting them as missing data. For example, the Pandas max
function excludes NaN values, while the standard Python max function takes the
mathematical interpretation of NaN and returns it as the maximum:
Input:
print("Pandas max function:", edu['Value'].max())
print("Python max function:", max(edu['Value']))
Output:
Pandas max function: 8.81
Python max function: nan
• Besides these aggregation functions, we can apply operations over
all the values in rows, columns or a selection of both. The rule of
thumb is that an operation between columns means that it is
applied to each row in those columns, and an operation between rows
means that it is applied to each column in those rows. For example, we
can apply any binary arithmetical operation (+, -, *, /) to an entire
column:
Input:
s = edu["Value"] / 100
s.head()
Output:
0       NaN
1       NaN
2    0.0500
3    0.0503
4    0.0495
Name: Value, dtype: float64
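• Beyond the arithmetic operators, any function can be applied to every value of a column with the apply method (a sketch, assuming NumPy and the edu DataFrame from before):
import numpy as np
s = edu['Value'].apply(np.sqrt)  # apply the square root to each value
s.head()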
Sorting
• An important functionality we will need when inspecting our data is
sorting by columns. We can sort a DataFrame by any column using the
sort_values function. If we want to see the first five rows of data sorted in
descending order (i.e., from the largest to the smallest values) of
the Value column, then we just need to do this:
edu.sort_values(by='Value', ascending=False, inplace=True)
edu.head()
• Note that the inplace keyword means that the DataFrame will be overwritten,
and hence no new DataFrame is returned. If instead of ascending=False
we use ascending=True, the values are sorted in ascending order (i.e.,
from the smallest to the largest values). If we want to return to the
original order, we can sort by the index using the sort_index function and
specifying axis=0:
edu.sort_index(axis=0, ascending=True, inplace=True)
edu.head()
Ranking Data
• In statistics, "ranking" refers to the data transformation in which numerical or ordinal values are replaced by
their rank when the data are sorted. If, for example, the numerical data 3.4, 5.1, 2.6, 7.3 are observed, the ranks of
these data items would be 2, 3, 1 and 4 respectively (as the short sketch below shows).
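• That small example reproduces directly with the Pandas rank function (a minimal sketch):
import pandas as pd
s = pd.Series([3.4, 5.1, 2.6, 7.3])
print(s.rank())  # 2.0, 3.0, 1.0, 4.0 - each value replaced by its rank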
• Now we can perform the ranking using the rank function. Note here that the parameter ascending=False makes the
ranking go from the highest values to the lowest values. The Pandas rank function supports different tie-breaking
methods, specified with the method parameter. In our case we use the first method, in which ranks are assigned in
the order they appear in the array, avoiding gaps between rankings.
pivedu = pivedu.drop(['Euro area (13 countries)',
                      'Euro area (15 countries)',
                      'Euro area (17 countries)',
                      'Euro area (18 countries)',
                      'European Union (25 countries)',
                      'European Union (27 countries)',
                      'European Union (28 countries)'],
                     axis=0)
pivedu = pivedu.rename(index={'Germany (until 1990 former territory of the FRG)': 'Germany'})
pivedu = pivedu.dropna()
pivedu.rank(ascending=False, method='first').head()
• If we want to make a global ranking taking into account all the years, we can sum up all the columns and rank
the result. Then we can sort the resulting values to retrieve the top five countries for the last 6 years, in this
way:
totalSum = pivedu.sum(axis=1)
totalSum.rank(ascending=False, method='dense').sort_values().head()
Plotting
• Pandas DataFrames and Series can be plotted using the plot function, which uses
the graphics library Matplotlib. For example, if we want to plot the accumulated
values for each country over the last 6 years, we can take the Series obtained in the
previous example and plot it directly by calling the plot function, as shown in the
next cell:
totalSum = pivedu.sum(axis=1).sort_values(ascending=False)
totalSum.plot(kind='bar', style='b', alpha=0.4, title="Total Values for Country")
• It is also possible to plot a DataFrame directly. In this case, each column is treated
as a separate Series. For example, instead of printing the accumulated value over
the years, we can plot the value for each year:
my_colors = ['b', 'r', 'g', 'y', 'm', 'c']
ax = pivedu.plot(kind='barh',
                 stacked=True,
                 color=my_colors)
ax.legend(loc='center left', bbox_to_anchor=(1, .5))
Why a ToolBox is an improved version of a sub-functional language
• A toolbox offers features beyond those of other programming
languages, and toolboxes are updated and improved versions of any
kind of sub-functional language, because a toolbox has all the
features of those languages and more. In a toolbox we have every
function ready to use when it is needed, but in other programming
languages all those features are not available in a package like a
toolbox. We can call the built-in functions stored in a toolbox
anywhere in the program without running into complications or
errors. Other sub-functional programming languages do not offer
those kinds of built-in functions; there we need to declare a
function before we can use it, and such declared functions sometimes
produce different kinds of errors, like missing arguments or
functional errors. So, considering all these points, we can be
confident that a toolbox is definitely an improved version of
other programming/sub-functional languages.
Conclusion
• Data science is like the sea, and the tools a data scientist uses are like the
elements in the sea water. To handle this massive task we need a
complete package that runs efficiently, and a toolbox lets the data
scientist handle it in a smart manner. It helps data scientists work more
efficiently while keeping performance in mind. We must mention Python's
ecosystem for having all those things: it offers a complete package that
enables a data scientist to carry out any data science project in an
efficient manner.

More Related Content

What's hot

Register organization, stack
Register organization, stackRegister organization, stack
Register organization, stackAsif Iqbal
 
central processing unit and pipeline
central processing unit and pipelinecentral processing unit and pipeline
central processing unit and pipelineRai University
 
General register organization (computer organization)
General register organization  (computer organization)General register organization  (computer organization)
General register organization (computer organization)rishi ram khanal
 
80386 processor
80386 processor80386 processor
80386 processorRasmi M
 
Data transfer and manipulation
Data transfer and manipulationData transfer and manipulation
Data transfer and manipulationSanjeev Patel
 
Computer architecture addressing modes and formats
Computer architecture addressing modes and formatsComputer architecture addressing modes and formats
Computer architecture addressing modes and formatsMazin Alwaaly
 
Instruction pipeline: Computer Architecture
Instruction pipeline: Computer ArchitectureInstruction pipeline: Computer Architecture
Instruction pipeline: Computer ArchitectureInteX Research Lab
 
Types of Instruction Format
Types of Instruction FormatTypes of Instruction Format
Types of Instruction FormatDhrumil Panchal
 
Memory organization in computer architecture
Memory organization in computer architectureMemory organization in computer architecture
Memory organization in computer architectureFaisal Hussain
 
PARALLELISM IN MULTICORE PROCESSORS
PARALLELISM  IN MULTICORE PROCESSORSPARALLELISM  IN MULTICORE PROCESSORS
PARALLELISM IN MULTICORE PROCESSORSAmirthavalli Senthil
 
Instruction codes and computer registers
Instruction codes and computer registersInstruction codes and computer registers
Instruction codes and computer registersSanjeev Patel
 
Arithmetic for Computers.ppt
Arithmetic for Computers.pptArithmetic for Computers.ppt
Arithmetic for Computers.pptJEEVANANTHAMG6
 
Instruction Set Architecture
Instruction Set ArchitectureInstruction Set Architecture
Instruction Set ArchitectureJaffer Haadi
 
8086 architecture and pin description
8086 architecture and pin description 8086 architecture and pin description
8086 architecture and pin description Aswini Dharmaraj
 

What's hot (20)

Register organization, stack
Register organization, stackRegister organization, stack
Register organization, stack
 
central processing unit and pipeline
central processing unit and pipelinecentral processing unit and pipeline
central processing unit and pipeline
 
General register organization (computer organization)
General register organization  (computer organization)General register organization  (computer organization)
General register organization (computer organization)
 
80386 processor
80386 processor80386 processor
80386 processor
 
Data transfer and manipulation
Data transfer and manipulationData transfer and manipulation
Data transfer and manipulation
 
CO by Rakesh Roshan
CO by Rakesh RoshanCO by Rakesh Roshan
CO by Rakesh Roshan
 
Instruction codes
Instruction codesInstruction codes
Instruction codes
 
Memory Organization
Memory OrganizationMemory Organization
Memory Organization
 
Computer architecture addressing modes and formats
Computer architecture addressing modes and formatsComputer architecture addressing modes and formats
Computer architecture addressing modes and formats
 
Interrupts
InterruptsInterrupts
Interrupts
 
Instruction pipeline: Computer Architecture
Instruction pipeline: Computer ArchitectureInstruction pipeline: Computer Architecture
Instruction pipeline: Computer Architecture
 
Types of Instruction Format
Types of Instruction FormatTypes of Instruction Format
Types of Instruction Format
 
Assembly language
Assembly languageAssembly language
Assembly language
 
Memory organization in computer architecture
Memory organization in computer architectureMemory organization in computer architecture
Memory organization in computer architecture
 
PARALLELISM IN MULTICORE PROCESSORS
PARALLELISM  IN MULTICORE PROCESSORSPARALLELISM  IN MULTICORE PROCESSORS
PARALLELISM IN MULTICORE PROCESSORS
 
Instruction codes and computer registers
Instruction codes and computer registersInstruction codes and computer registers
Instruction codes and computer registers
 
Arithmetic for Computers.ppt
Arithmetic for Computers.pptArithmetic for Computers.ppt
Arithmetic for Computers.ppt
 
Instruction Set Architecture
Instruction Set ArchitectureInstruction Set Architecture
Instruction Set Architecture
 
8086 architecture and pin description
8086 architecture and pin description 8086 architecture and pin description
8086 architecture and pin description
 
Associative memory
Associative memoryAssociative memory
Associative memory
 

Similar to Toolboxes for data scientists

Python for Data Science: A Comprehensive Guide
Python for Data Science: A Comprehensive GuidePython for Data Science: A Comprehensive Guide
Python for Data Science: A Comprehensive Guidepriyanka rajput
 
Top 10 Data analytics tools to look for in 2021
Top 10 Data analytics tools to look for in 2021Top 10 Data analytics tools to look for in 2021
Top 10 Data analytics tools to look for in 2021Mobcoder
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
What Is The Future of Data Science With Python?
What Is The Future of Data Science With Python?What Is The Future of Data Science With Python?
What Is The Future of Data Science With Python?SofiaCarter4
 
Class 12th IP project on buisness management
Class 12th IP project on buisness managementClass 12th IP project on buisness management
Class 12th IP project on buisness managementsankhlasheetal3
 
Big Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onBig Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onDony Riyanto
 
Adarsh_Masekar(2GP19CS003).pptx
Adarsh_Masekar(2GP19CS003).pptxAdarsh_Masekar(2GP19CS003).pptx
Adarsh_Masekar(2GP19CS003).pptxhkabir55
 
employee turnover prediction document.docx
employee turnover prediction document.docxemployee turnover prediction document.docx
employee turnover prediction document.docxrohithprabhas1
 
Top 7 Frameworks for Integration AI in App Development
Top 7 Frameworks for Integration AI in App DevelopmentTop 7 Frameworks for Integration AI in App Development
Top 7 Frameworks for Integration AI in App DevelopmentInexture Solutions
 
Study of Various Tools for Data Science
Study of Various Tools for Data ScienceStudy of Various Tools for Data Science
Study of Various Tools for Data ScienceIRJET Journal
 
Data Science - Part II - Working with R & R studio
Data Science - Part II -  Working with R & R studioData Science - Part II -  Working with R & R studio
Data Science - Part II - Working with R & R studioDerek Kane
 
Breast Cancer Prediction.pdf
Breast Cancer Prediction.pdfBreast Cancer Prediction.pdf
Breast Cancer Prediction.pdfSouravNaga2
 
Data science tools of the trade
Data science tools of the tradeData science tools of the trade
Data science tools of the tradeFangda Wang
 
2019 DSA 105 Introduction to Data Science Week 4
2019 DSA 105 Introduction to Data Science Week 42019 DSA 105 Introduction to Data Science Week 4
2019 DSA 105 Introduction to Data Science Week 4Ferdin Joe John Joseph PhD
 
Overview data analyis and visualisation tools 2020
Overview data analyis and visualisation tools 2020Overview data analyis and visualisation tools 2020
Overview data analyis and visualisation tools 2020Marié Roux
 
Python vs. r for data science
Python vs. r for data sciencePython vs. r for data science
Python vs. r for data scienceHugo Shi
 

Similar to Toolboxes for data scientists (20)

Python for Data Science: A Comprehensive Guide
Python for Data Science: A Comprehensive GuidePython for Data Science: A Comprehensive Guide
Python for Data Science: A Comprehensive Guide
 
Top 10 Data analytics tools to look for in 2021
Top 10 Data analytics tools to look for in 2021Top 10 Data analytics tools to look for in 2021
Top 10 Data analytics tools to look for in 2021
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
What Is The Future of Data Science With Python?
What Is The Future of Data Science With Python?What Is The Future of Data Science With Python?
What Is The Future of Data Science With Python?
 
Class 12th IP project on buisness management
Class 12th IP project on buisness managementClass 12th IP project on buisness management
Class 12th IP project on buisness management
 
Big Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onBig Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-on
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Adarsh_Masekar(2GP19CS003).pptx
Adarsh_Masekar(2GP19CS003).pptxAdarsh_Masekar(2GP19CS003).pptx
Adarsh_Masekar(2GP19CS003).pptx
 
Datasciencetools
DatasciencetoolsDatasciencetools
Datasciencetools
 
employee turnover prediction document.docx
employee turnover prediction document.docxemployee turnover prediction document.docx
employee turnover prediction document.docx
 
Top 7 Frameworks for Integration AI in App Development
Top 7 Frameworks for Integration AI in App DevelopmentTop 7 Frameworks for Integration AI in App Development
Top 7 Frameworks for Integration AI in App Development
 
Study of Various Tools for Data Science
Study of Various Tools for Data ScienceStudy of Various Tools for Data Science
Study of Various Tools for Data Science
 
Data Science - Part II - Working with R & R studio
Data Science - Part II -  Working with R & R studioData Science - Part II -  Working with R & R studio
Data Science - Part II - Working with R & R studio
 
Breast Cancer Prediction.pdf
Breast Cancer Prediction.pdfBreast Cancer Prediction.pdf
Breast Cancer Prediction.pdf
 
Data science tools of the trade
Data science tools of the tradeData science tools of the trade
Data science tools of the trade
 
2019 DSA 105 Introduction to Data Science Week 4
2019 DSA 105 Introduction to Data Science Week 42019 DSA 105 Introduction to Data Science Week 4
2019 DSA 105 Introduction to Data Science Week 4
 
Overview data analyis and visualisation tools 2020
Overview data analyis and visualisation tools 2020Overview data analyis and visualisation tools 2020
Overview data analyis and visualisation tools 2020
 
Python vs. r for data science
Python vs. r for data sciencePython vs. r for data science
Python vs. r for data science
 
C++
C++C++
C++
 
Python libraries
Python librariesPython libraries
Python libraries
 

More from Sudipto Krishna Dutta

More from Sudipto Krishna Dutta (14)

A Project Report on RFID Based Attendance System.pdf
A Project Report on RFID Based Attendance System.pdfA Project Report on RFID Based Attendance System.pdf
A Project Report on RFID Based Attendance System.pdf
 
RFID BASED ATTENDANCE SYSTEM.pptx
RFID BASED ATTENDANCE SYSTEM.pptxRFID BASED ATTENDANCE SYSTEM.pptx
RFID BASED ATTENDANCE SYSTEM.pptx
 
Memory hierarchy (In Details)
Memory hierarchy (In Details)Memory hierarchy (In Details)
Memory hierarchy (In Details)
 
Character Recognition using Data Mining Technique (Artificial Neural Network)
Character Recognition using Data Mining Technique (Artificial Neural Network)Character Recognition using Data Mining Technique (Artificial Neural Network)
Character Recognition using Data Mining Technique (Artificial Neural Network)
 
Central tendency
Central tendency Central tendency
Central tendency
 
Determination and Analysis of Sample size
Determination and Analysis of Sample sizeDetermination and Analysis of Sample size
Determination and Analysis of Sample size
 
Newborn Care
Newborn CareNewborn Care
Newborn Care
 
English Literature Book for BCS
English Literature  Book for BCSEnglish Literature  Book for BCS
English Literature Book for BCS
 
How to prepare for Bank exam in Bangladesh
How to prepare for Bank exam in Bangladesh How to prepare for Bank exam in Bangladesh
How to prepare for Bank exam in Bangladesh
 
Bcs study roadmap
Bcs study roadmapBcs study roadmap
Bcs study roadmap
 
Rooppur Atomic Power Plant
Rooppur Atomic Power PlantRooppur Atomic Power Plant
Rooppur Atomic Power Plant
 
Acute myocardial-infraction
Acute myocardial-infraction Acute myocardial-infraction
Acute myocardial-infraction
 
Prospectus and Drawbacks of E-commerce in Bangladesh
Prospectus and Drawbacks of E-commerce in BangladeshProspectus and Drawbacks of E-commerce in Bangladesh
Prospectus and Drawbacks of E-commerce in Bangladesh
 
Cybersecurity fundamental
Cybersecurity fundamentalCybersecurity fundamental
Cybersecurity fundamental
 

Recently uploaded

EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxFurkanTasci3
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAbdelrhman abooda
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...ThinkInnovation
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 

Recently uploaded (20)

EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptx
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Decoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in ActionDecoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in Action
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 

Toolboxes for data scientists

  • 1. Toolboxes for Data Scientists Sudipto Krishna Dutta 20204021 Introduction to Data Science Jahangirnagar University
  • 2. Introduction  Toolbox is a box, where many different built in functions arestored. Toolbox helps to perform a task efficiently and successfully, especially for data scientist and programmer. Choosing right toolbox can save a lot of time for doing a specific task within the targeted time. Using toolboxes also help to enhance the overall performance of any kind of task like data analysis from big data set and calculating the desired result. For example, if we want to calculate the co-relation coefficient, it is impossible to use a single piece of written code to handle a big data set or extract desired information from this. So, here toolbox can help to perform this task in an effective manner. Using toolbox we can call different built in functions to perform the desired task and to keep all this types of toolbox we can work simultaneously.
  • 3. Different Tool and their benefits There is different kind of tool and to use them is very beneficial for a data scientist. Toolboxes are like • Statistical Tool R/Python - is used for statistical analysis. Mean, Median, Mode, Standard Deviation are also in statistical tool. • Mathematical Tool SAS – Strong data analysis abilities, data management, data encryption Matlab – A numeric computing environment, Powerful graphics librabry, Can process complex mathematical operations. • Database Tool Apache Cassandra - is an open source and high scalable NoSQL database to manage massive amount of data in a faster manner. SQL – is a very popular and widely used but in data science it is recommended for it (i) Flexibility, (ii) Ease of use, (iii) No redundancy and (iv) Reliability In data science statistical toolbox is not the only toolbox for naming the data science it also need mathematical calculations/functions and database to read or write the task. So, all of them together can be called the data science. A lot of benefits lie in toolboxes for any data scientist. Here is some tasks, those can be done by using toolbox. Like • Big data analysis • Handling massive volume of data • Collection a large scale of data set • Building a structure for the operational data • Make a pattern and derive the valuable insights from chosen data set.
  • 4. Toolbox’s’ advantages over the other programming languages and Similarity among them  There are many programming language, like C, Fortran, C++,JAVA etc which are generally used for developing high-performance production or prototyping any kind of certain task or project but the problem is in those language many basic tools are not available or to re- implement those things again and again. So the advantages of toolboxes over the programming languages like  It has a number of built in function which can be use anywhere in the code by just calling them.  It is not needed to write specific code for the specific task by using toolbox because we can perform the needed task by calling the specific function which is stored in toolbox.  We can avoid re-complications of anything for introducing any kind of new function in the task.  Easy and all the basic functions are available in toolbox.  To find out the similarity among them we can identify some basic similarity in the working procedure. In toolbox, a collection of built in functions are there as well as in the programming language it has also owned some function by declaring them in the code. To generate a task by using toolbox we can call the specific built in function in the code and in programming language the needed function need to be written to complete the same task. If we consider the performance to generate the task we can find out some similarity among them. Both can be used for developing high performance production and prototyping and building a data structure. In environmental perspective both have similarity and both are supported object oriented programming. Both has basic statements for functional programming in its own core library.
• 5. Why is Python the best choice? • Python is a widely used and very popular programming language. It also has great properties for those who are new to programming, or who have never programmed at all. More importantly, Python has the features needed to do data science tasks effectively: data science is not only about statistical functions but also needs mathematical and database functions, and Python's tools cover all three. This combination is a major reason to choose Python. Its other remarkable properties are easy-to-read code, suppression of non-mandatory delimiters, dynamic typing, and dynamic memory management. Python is an interpreted language, so code is executed immediately in a Python console such as IPython, which gives us a richer environment in which to run Python code. Flexibility is another reason for choosing Python: it can be seen as a multiparadigm language. It can interoperate with other languages, it supports the object-oriented paradigm, and C code can be mixed with Python code using Cython. Python also has basic statements for functional programming in its own core library. Finally, the large ecosystem around Python is another major reason for choosing it.
• 6. Python libraries for Data Scientists and their usages • The Python community offers a huge number of developed toolboxes, and, excitingly, most of them can be used for data science. The most popular Python toolboxes for any data scientist are  NumPy  SciPy  Pandas  Scikit-learn
• 7. NumPy and SciPy  NumPy is the basis of Python's scientific computing toolboxes and serves various kinds of operational functions, while SciPy is a domain-specific toolbox that adds several functions of its own, including statistical and mathematical tools.  NumPy is the fundamental package for scientific computing with Python.  It provides multidimensional arrays with basic operations on them.  It is very useful for linear algebra functions.  Several toolboxes use the NumPy array representation as an efficient basic data structure.  SciPy provides a collection of numeric algorithms and domain-specific toolboxes.  SciPy can handle signal processing, optimization, and statistical tasks.  SciPy integrates with Matplotlib, the plotting library, which offers many tools for data visualization.
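As a brief illustration (a minimal sketch, not taken from the slides; the array values are invented for the example), the following shows NumPy arrays and a SciPy statistics routine working together:
import numpy as np
from scipy import stats

# a 2-D NumPy array with basic vectorized operations
a = np.array([[1.0, 2.0], [3.0, 4.0]])
print(a.mean())   # mean of all elements: 2.5
print(a.T @ a)    # matrix product, a basic linear algebra operation

# SciPy builds on NumPy arrays, e.g., for descriptive statistics
data = np.array([2.1, 2.5, 2.2, 2.8, 2.4])
print(stats.describe(data))  # count, min/max, mean, variance, skewness, kurtosis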
• 8. Scikit-learn • It is a machine learning library built on top of NumPy, SciPy, and Matplotlib. • It offers simple and efficient tools for common tasks in data analysis, such as  Classification  Regression  Clustering  Dimensionality reduction  Model selection  Preprocessing
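To make these tasks concrete, here is a minimal hedged sketch of a Scikit-learn classification workflow, using the iris dataset bundled with the library (this example is illustrative and not part of the original slides):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# load a toy dataset and split it into training and test parts
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# fit a simple classifier and evaluate it on held-out data
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # classification accuracy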
• 9. Pandas • Pandas has both statistical and database tools, offering high performance, many kinds of tools, and several key features. • It provides high-performance data structures and data analysis tools. • Its key feature is a fast and efficient DataFrame object for data manipulation with integrated indexing. • The DataFrame can be seen as a spreadsheet, but it is far more flexible. • In Pandas we can easily transform any dataset in the way we want: reshaping it, or adding and removing columns or rows. • It provides high-performance functions for aggregating, merging, and joining datasets. • Pandas also has tools for importing and exporting data in different formats, such as  CSV  Microsoft Excel  SQL databases  the fast HDF5 format.
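As a small hedged sketch of the import/export tools listed above (the file name and data values here are hypothetical), a DataFrame can be round-tripped through CSV like this:
import pandas as pd

# build a tiny DataFrame and write it out, then read it back
df = pd.DataFrame({"city": ["Dhaka", "Khulna"], "value": [1, 2]})
df.to_csv("cities.csv", index=False)  # export to CSV
df2 = pd.read_csv("cities.csv")       # import from CSV
print(df2.head())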
• 10. Data Science Ecosystem • After choosing Python, we can set up a data scientist's Python ecosystem either by installing individual toolboxes or by performing a bundled installation with all the needed toolboxes. Newcomers can install the mentioned toolboxes one by one, taking care first to choose between Python 2.X and Python 3.X. • However, if a bundled installation is chosen, the Anaconda Python distribution is a good option, because it integrates all the Python toolboxes and applications needed by a data scientist into a single directory, without mixing them with other Python toolboxes installed on the machine. It includes toolboxes and applications such as NumPy, Pandas, SciPy, Matplotlib, Scikit-learn, IPython, and Spyder, as well as more specific tools for other related tasks such as data visualization, code optimization, and big data processing.
• 11. IDEs (Integrated Development Environments) • An integrated development environment is a piece of software that is an essential tool for any data scientist. IDEs are created to serve different purposes for data scientists as well as programmers, and over the years this software has evolved to make the coding task less complicated. Selecting the right IDE for each person is crucial, and unfortunately there is no "one size fits all" programming environment; the best approach is to try the most popular ones. The basic pieces of any IDE are the editor, the compiler (or interpreter), and the debugger. Some IDEs can be used with multiple programming languages through language-specific plugins, such as NetBeans or Eclipse. • In the case of Python there are a large number of specific IDEs, both commercial, such as PyCharm and WingIDE, and open source. The open source community helps IDEs to spring up, so anyone can customize their own environment and share it with the rest of the community. For example, Spyder (the Scientific Python Development Environment) is an IDE customized with the tasks of the data scientist in mind.
• 14. WIDE (Web Integrated Development Environment) – Jupyter • Python IDEs have also been developed for the web, as a new generation of IDEs for interactive languages. Nowadays, such sessions are called notebooks, and they are used not only in classrooms but also to show results in presentations or on business dashboards. The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and explanatory text. Its uses include data cleaning and transformation, numerical simulation, statistical modeling, machine learning, and much more. The recent spread of such notebooks is mainly due to IPython. Since December 2011, IPython has been issued as a browser version of its interactive console, called the IPython notebook, which shows Python execution results very clearly and concisely by means of cells. Cells can contain content other than code: for example, markdown cells can be added to introduce algorithms, and it is also possible to insert Matplotlib graphics to illustrate examples, or even web pages. The IPython notebook has since been separated from the IPython software and has become part of a larger project, Jupyter (covering, in particular, Julia, Python, and R), which aims to reuse the same WIDE for all these interpreted languages, not just Python. All old IPython notebooks are automatically imported to the new version when they are opened with the Jupyter platform.
• 15. Python, Used in Data Science • We have now seen the Python ecosystem, the toolboxes it contains, the interactive IDEs available in that environment, and their wide range of uses.
• 16. The Jupyter Notebook Environment • We now turn to the Jupyter Notebook environment. We can start by launching the Jupyter Notebook platform, which is done by simply typing the following command in a terminal or command line:
$ jupyter notebook
• If we chose the bundled installation instead, we can start the Jupyter Notebook platform by clicking on the Jupyter Notebook icon installed by Anaconda in the start menu or on the desktop. If we use the command line, the root directory is the directory where we launched the Jupyter Notebook; if we use the Anaconda launcher, the root directory is the current user directory. Now, to start a new notebook, we only need to press the New Notebook Python 2
• 17. • button at the top right of the home page. We begin by importing the toolboxes that we will need for our program. In the first cell we put the code to import the Pandas library as pd. This is for convenience: every time we need to use some functionality from the Pandas library, we will write pd instead of pandas. We will also import the two core libraries mentioned above: the NumPy library as np and the Matplotlib library as plt. • The commands to write are:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
• To execute just one cell, we press the run (play) button, click Cell -> Run, or press Ctrl + Enter. While execution is underway, the header of the cell shows the * mark:
In [*]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
• 18. • While a cell is being executed, no other cell can be executed: if we try to execute another cell, its execution will not start until the first cell has finished. Once the execution is finished, the header of the cell is replaced by its execution number; since this is the first cell executed, the number shown will be 1. If the process of importing the libraries is correct, no output cell is produced.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
• 19. The DataFrame Data Structure • The key data structure in Pandas is the DataFrame object. A DataFrame is basically a tabular data structure, with rows and columns. Rows have a specific index to access them, which can be any name or value. In Pandas, the columns are called Series, a special type of data which in essence consists of a list of several values, where each value has an index. Therefore, the DataFrame data structure can be seen as a spreadsheet, but it is much more flexible. To understand how it works, let us see how to create a DataFrame from a common Python dictionary of lists. First, we create a new cell by clicking Insert -> Insert Cell Below or pressing Ctrl + B. For example, the following code:
import pandas as pd
# a simple list of ints (renamed from "list" to avoid shadowing the built-in)
values = [1, 2, 3, 4, 5]
• 20.
# create a Series from the list of ints
res = pd.Series(values)
print(res)
The result will look like:
0    1
1    2
2    3
3    4
4    5
dtype: int64
A Series can also be created from a dictionary:
import pandas as pd
dic = {'Id': 1013, 'Name': 'Sudipto', 'State': 'Khulna', 'Age': 27}
res = pd.Series(dic)
print(res)
The result will look like:
Id        1013
Name   Sudipto
State   Khulna
Age         27
dtype: object
Apart from creating data structures, Pandas offers a lot of functions to manipulate them; among other things, functions for aggregation, manipulation, and transformation of the data. In the following sections, we will introduce some of these functions.
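The cells above create Series; since the previous slide promised a DataFrame built from a dictionary of lists, here is a hedged sketch of that construction as well (the extra rows are invented for illustration):
import pandas as pd

# each dictionary key becomes a column; each list holds that column's values
data = {"Id": [1013, 1014, 1015],
        "Name": ["Sudipto", "Rahim", "Karim"],
        "Age": [27, 25, 30]}
df = pd.DataFrame(data)
print(df)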
• 21. Data Analysis Example Using Pandas • To see how Pandas is used on a simple real problem, we will start by doing some basic analysis of real data. For the sake of transparency, the data must be open, meaning that they can be freely used, reused, and distributed by anyone. • Pandas is a Python library that provides extensive means for data analysis. Data scientists often work with data stored in table formats like .csv, .tsv, or .xlsx. Pandas makes it very convenient to load, process, and analyze such tabular data using SQL-like queries. In conjunction with Matplotlib and Seaborn, Pandas provides a wide range of opportunities for visual analysis of tabular data. • The main data structures in Pandas are implemented with the Series and DataFrame classes. The former is a one-dimensional indexed array of some fixed data type; the latter is a two-dimensional data structure, a table, where each column contains data of the same type. You can see it as a dictionary of Series instances. DataFrames are great for representing real data: rows correspond to instances (examples, observations, etc.), and columns correspond to features of these instances.
import numpy as np
import pandas as pd
pd.set_option("display.precision", 2)
• 22. • We will demonstrate the main methods in action by analyzing a dataset on the churn rate of telecom operator clients. Let's read the data and take a look at the first 5 lines using the head method:
df = pd.read_csv("../input/telecom_churn.csv")
df.head()
• When printing a DataFrame in a Jupyter notebook, recall that each row corresponds to one client, an instance, and the columns are features of this instance.
print(df.shape)
(3333, 20)
• From the output, we can see that the table contains 3333 rows and 20 columns. If we want to print out the column names, we use columns:
print(df.columns)
• We can use the info() method to get some general information about the DataFrame:
print(df.info())
• 23. • We see that one feature is logical (bool), 3 features are of type object, and 16 features are numeric. With this same method, we can easily see if there are any missing values; here there are none, because each column contains 3333 observations, the same number of rows we saw before with shape. • We can change a column's type with the astype method. Let's apply this to the Churn feature to convert it into int64:
df["Churn"] = df["Churn"].astype("int64")
• The describe method shows basic statistical characteristics of each numeric feature (int64 and float64 types): number of non-missing values, mean, standard deviation, range, median, and the 0.25 and 0.75 quartiles.
df.describe()
• In order to see statistics on non-numerical features, one has to explicitly indicate the data types of interest in the include parameter:
df.describe(include=["object", "bool"])
• 24. • To delete columns or rows, use the drop method, passing the required indexes and the axis parameter (1 if you delete columns, and nothing or 0 if you delete rows). The inplace argument tells whether to change the original DataFrame: with inplace=False, the drop method doesn't change the existing DataFrame and returns a new one with the rows or columns dropped; with inplace=True, it alters the DataFrame.
# get rid of just created columns
df.drop(["Total charge", "Total calls"], axis=1, inplace=True)
# and here is how you can delete rows
df.drop([1, 2]).head()
• 25. Reading Data • To read the data that we downloaded, we first create a new notebook called Open Government Data Analysis and open it. Then, after ensuring that the educ_figdp_1_Data.csv file is stored in the same directory as our notebook, we write the following code to read and show the content:
edu = pd.read_csv('files/ch02/educ_figdp_1_Data.csv', na_values=':', usecols=["TIME", "GEO", "Value"])
edu
• Besides this, Pandas also has functions for reading files in formats such as Excel, HDF5, tabulated files, or even the content of the clipboard (read_excel(), read_hdf(), read_table(), read_clipboard()). • If we want to know the names of the columns or the names of the indexes, we can use the DataFrame attributes columns and index, respectively. The names of the columns or indexes can be changed by assigning a new list of the same length to these attributes. The values of any DataFrame can be retrieved as a Python array by calling its values attribute. If we just want quick statistical information on all the numeric columns in a DataFrame, we can use the function describe(). The result shows the count, the mean, the standard deviation, the minimum and maximum, and the percentiles (by default, the 25th, 50th, and 75th) for all the values in each column or series.
edu.describe()
• 26. Selecting Data • If we want to select a subset of data from a DataFrame, it is necessary to indicate this subset using square brackets ([ ]) after the DataFrame. The subset can be specified in several ways. If we want to select only one column from a DataFrame, we only need to put its name between the square brackets. The result will be a Series data structure, not a DataFrame, because only one column is retrieved:
edu['Value']
• If we want to select a subset of rows from a DataFrame, we can do so by indicating a range of rows separated by a colon (:) inside the square brackets. This is commonly known as a slice of rows:
edu[10:14]
• This instruction returns the slice of rows from the 10th to the 13th position; for example, one of the rows returned looks like: 13 2001 European Union (27 countries) 4.99. Note that the slice does not use the index labels as references, but the position; in this case, the labels of the rows simply coincide with the positions of the rows. If we want to select a subset of columns and rows using the labels as our references instead of the positions, we can use ix indexing:
edu.ix[90:94, ['TIME', 'GEO']]
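Note that ix indexing has been deprecated and removed in recent versions of Pandas; an equivalent label-based selection today uses loc (a short sketch, assuming the same edu DataFrame):
# label-based selection with loc; with integer labels, both endpoints are included
edu.loc[90:94, ['TIME', 'GEO']]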
• 27. Filtering Data • Another way to select a subset of data is by applying Boolean indexing, commonly known as a filter. For instance, if we want to filter for those values greater than 6.5, we can do it like this:
edu[edu['Value'] > 6.5].tail()
• The Boolean operation edu['Value'] > 6.5 produces a Boolean mask. When an element in the 'Value' column is greater than 6.5, the corresponding value in the mask is set to True; otherwise it is set to False. Then, when this mask is applied as an index in edu[edu['Value'] > 6.5], the result is a filtered DataFrame containing only the rows with values higher than 6.5. Of course, any of the usual Boolean operators can be used for filtering: < (less than), <= (less than or equal to), > (greater than), >= (greater than or equal to), == (equal to), and != (not equal to).
• 28. Filtering Missing Values • Pandas uses the special value NaN (not a number) to represent missing values. In Python, NaN is a special floating-point value returned by certain operations when one of their results ends in an undefined value. A subtle feature of NaN values is that two NaN are never equal. Because of this, the only safe way to tell whether a value is missing in a DataFrame is by using the isnull() function. Indeed, this function can be used to filter rows with missing values:
edu[edu["Value"].isnull()].head()
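Beyond filtering, Pandas can also discard or replace missing values. A brief hedged sketch, assuming the same edu DataFrame as above:
# drop the rows whose "Value" entry is missing
edu_clean = edu.dropna(subset=["Value"])

# or replace the missing entries in "Value" with a fixed number instead
edu_filled = edu.fillna(value={"Value": 0})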
• 29. Manipulating Data • To manipulate data, we first need to know how to select the desired data. One of the most straightforward things we can do is to operate on columns or rows using aggregation functions. When a function is applied to a DataFrame or a selection of rows and columns, we can specify whether the function should be applied to the rows for each column (setting the axis=0 keyword on the invocation of the function) or to the columns for each row (setting the axis=1 keyword):
edu.max(axis=0)
• Note that these are functions specific to Pandas, not the generic Python functions; there are differences in their implementation. In Python, NaN values propagate through all operations without raising an exception. In contrast, Pandas operations exclude NaN values, interpreting them as missing data. For example, the Pandas max function excludes NaN values, while the standard Python max function takes the mathematical interpretation of NaN and returns it as the maximum:
Input:
print("Pandas max function:", edu['Value'].max())
print("Python max function:", max(edu['Value']))
Output:
Pandas max function: 8.81
Python max function: nan
• 30. • Besides these aggregation functions, we can apply operations over all the values in rows, columns, or a selection of both. The rule of thumb is that an operation between columns means that it is applied to each row in that column, and an operation between rows means that it is applied to each column in that row. For example, we can apply any binary arithmetical operation (+, -, *, /) to an entire column:
Input:
s = edu["Value"] / 100
s.head()
Output:
0       NaN
1       NaN
2    0.0500
3    0.0503
4    0.0495
Name: Value, dtype: float64
• 31. Sorting • An important piece of functionality we will need when inspecting our data is sorting by columns. We can sort a DataFrame by any column, using the sort_values function. If we want to see the first five rows of data sorted in descending order (i.e., from the largest to the smallest values) of the Value column, we just need to do this:
edu.sort_values(by='Value', ascending=False, inplace=True)
edu.head()
• Note that the inplace keyword means that the DataFrame will be overwritten, and hence no new DataFrame is returned. If instead of ascending=False we use ascending=True, the values are sorted in ascending order (i.e., from the smallest to the largest values). If we want to return to the original order, we can sort by the index using the sort_index function and specifying axis=0:
edu.sort_index(axis=0, ascending=True, inplace=True)
edu.head()
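The ranking examples on the next slides operate on a table called pivedu that is not built anywhere in these slides. A hedged sketch of how such a pivot table could be constructed from the edu DataFrame (the exact filter and column names are assumptions based on the surrounding slides):
import pandas as pd

# pivot so that rows are countries (GEO), columns are years (TIME),
# and each cell holds the corresponding "Value" entry
pivedu = pd.pivot_table(edu[edu["TIME"] > 2005],
                        values="Value", index=["GEO"], columns=["TIME"])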
• 32. Ranking Data • In statistics, "ranking" refers to the data transformation in which numerical or ordinal values are replaced by their rank when the data are sorted. If, for example, the numerical data 3.4, 5.1, 2.6, 7.3 are observed, the ranks of these data items would be 2, 3, 1, and 4, respectively. • We can perform the ranking using the rank function. Note here that the parameter ascending=False makes the ranking go from the highest values to the lowest values. The Pandas rank function supports different tie-breaking methods, specified with the method parameter; in our case we use the first method, in which ranks are assigned in the order they appear in the array, avoiding gaps between rankings.
pivedu = pivedu.drop(['Euro area (13 countries)',
                      'Euro area (15 countries)',
                      'Euro area (17 countries)',
                      'Euro area (18 countries)',
                      'European Union (25 countries)',
                      'European Union (27 countries)',
                      'European Union (28 countries)'], axis=0)
pivedu = pivedu.rename(index={'Germany (until 1990 former territory of the FRG)': 'Germany'})
pivedu = pivedu.dropna()
pivedu.rank(ascending=False, method='first').head()
• 33. • If we want to make a global ranking taking all the years into account, we can sum up all the columns and rank the result. Then we can sort the resulting values to retrieve the top five countries for the last 6 years, in this way:
totalSum = pivedu.sum(axis=1)
totalSum.rank(ascending=False, method='dense').sort_values().head()
• 34. Plotting • Pandas DataFrames and Series can be plotted using the plot function, which uses the graphics library Matplotlib. For example, if we want to plot the accumulated values for each country over the last 6 years, we can take the Series obtained in the previous example and plot it directly by calling the plot function, as shown in the next cell:
totalSum = pivedu.sum(axis=1).sort_values(ascending=False)
totalSum.plot(kind='bar', style='b', alpha=0.4, title="Total Values for Country")
• It is also possible to plot a DataFrame directly. In this case, each column is treated as a separate Series. For example, instead of printing the accumulated value over the years, we can plot the value for each year:
my_colors = ['b', 'r', 'g', 'y', 'm', 'c']
ax = pivedu.plot(kind='barh', stacked=True, color=my_colors)
ax.legend(loc='center left', bbox_to_anchor=(1, .5))
• 35. Why a toolbox is an improved version of a sub-functional language • A toolbox offers features beyond those of other programming languages; toolboxes can be seen as updated and improved versions of any sub-functional language, because a toolbox has all the features of the underlying language plus ready-made tools on top. In a toolbox, every function we need is available to call when it is needed, while in other programming languages those features do not come packaged together the way they do in a toolbox. We can call the built-in functions stored in a toolbox anywhere in the program without running into complications or errors. A sub-functional programming language does not offer those kinds of built-in functions; there we need to declare a function before we can use it, and such declared functions sometimes produce different kinds of errors, like missing arguments or functional errors. So, after considering all of this, we can conclude that a toolbox is indeed an improved version of a plain programming or sub-functional language.
• 36. Conclusion • Data science is like the sea, and the tools a data scientist uses are like the elements inside the sea water. To handle such a massive task, we need a complete package that runs efficiently, and the toolbox is how a data scientist handles it in a smart manner: it helps data scientists work more efficiently while keeping performance in mind. We must also mention Python's ecosystem, which has all of these things. It offers a complete package that lets a data scientist carry out tasks efficiently when developing any data science project.