Python (Data Analysis) cleaning and visualize

PYTHON FOR
DATA ANALYSIS
IRUOLAGBE PIUS

 Overview of Python Programming Language
PART 1

CONTENTS
 Introduction to Python
 Features of Python
 Uses of Python programming language
 How to create a simple “hello world” program in Python shell environment
 Comments in Python
 Python variables
 Variables naming rules
 Python Data types
 Python numbers
 Python strings
 Python booleans
 Python input function
 Python Operators
 Arithmetic operators
 Assignment operators
 Comparison operators
 Logical operators
 Membership operators
 Decision Making & Conditional Statements
 if statement
 elif statement
 else statement
 and & or
 Nested if statement
 pass statement
 Python Loops
 while loop
o break statement
o continue statement
o else statement
 for loop
o looping through a string
o break statement
o range() function
o nested loop

Introduction to Python
Python is a widely used, interpreted, high-level, general-purpose programming language, created by
Guido van Rossum and first released in 1991. Python files are stored with the extension “.py”.
Two major versions of Python exist:
• Python 3.x is the current version and is under active development.
• Python 2.x is the legacy version. It reached end of life in January 2020, so new code should use Python 3.
IDLE (Integrated Development and Learning Environment) is a simple editor for Python that comes bundled
with Python.
FEATURES OF PYTHON
Python's features include-
 Easy to learn, read and maintain: Python has few keywords, simple structure and a clearly defined
syntax. This allows students to pick up the language quickly. Also, Python's source code is easy to maintain.
 Interactive Mode: Python has support for an interactive mode, which allows interactive testing and
debugging of snippets of code.
 Portable: Python can run on a wide variety of hardware platforms and has the same interface on all
platforms.
USES OF PYTHON PROGRAMMING LANGUAGE: Python is used in virtually every industry and scientific
field, including: Web development, Software development, Machine Learning and Artificial Intelligence,
Medicine and Pharmacology, Astronomy, Robotics, autonomous vehicles, etc.

HOW TO CREATE A SIMPLE “HELLO WORLD” PROGRAM IN PYTHON SHELL ENVIRONMENT
 Open IDLE on your system of choice.
 Locate and open the “Cisco” folder on your desktop.
 Double-click on the Python IDLE shortcut to launch Python IDLE.
 It will open a shell with options along the top. In the shell, there is a prompt of three right angle brackets:
>>>
 Now write the following code in the prompt:
>>> print("HelloWorld")
 Hit the Enter key.
 You should see HelloWorld displayed on your screen. You have successfully written your first Python program.
In the Shell (console) window, your code is executed after every line. To write multiple lines of code in IDLE,
you have to create a file for it. You can do this by clicking New File under the File menu or by pressing Ctrl + N.
You save your file by pressing Ctrl + S.
[Figures: Python Shell window; Python File window]
COMMENTS IN PYTHON
A hash sign (#) that is not inside a string literal is the beginning of a comment. All characters after the #, up to the
end of the physical line, are part of the comment and the Python interpreter ignores them. You can also type a
comment on the same line after a statement or expression.

For example:
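The original example is not present in this text, so the following is only a minimal sketch of the two comment styles described above. The variable names greeting and tag are illustrative.

```python
# This whole line is a comment, so the interpreter ignores it.
greeting = "Hello"  # a comment can also follow a statement on the same line

# A hash sign inside a string literal does NOT start a comment.
tag = "#python"

print(greeting)
print(tag)
```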
TASK I: Make 2 comments on a Python program
PYTHON VARIABLES
In computer programming, variables are used to store information to be referenced and used by programs. It is
helpful to think of variables as containers that hold information. Their sole purpose is to label and store data in
memory. This data can then be used throughout your program. Variable names are also called identifiers in Python.
When naming variables, choose names that are accurately descriptive and understandable to another reader.
Sometimes that other reader is yourself when you revisit a program that you wrote months or even years earlier.
For example, it is more appropriate to use “username” to store a user’s name than to use “x”. To create a variable
in Python, all you need to do is specify the variable name, and then assign a value to it. The name of the variable
goes on the left and the value you want to store in the variable goes on the right.
Syntax: <variable name> = <value>
EXAMPLE:

• x=2
• y="Hello"
• mypython="PythonGuides"
• my_python="PythonGuides"
• _mypython="PythonGuides"
• MYPYTHON="PythonGuides"
• myPython7="PythonGuides"
N.B: An assignment stores the value on the right in the variable named on the left, so the following gives a syntax error.
0 = x => Output: SyntaxError: can't assign to literal (the exact wording varies between Python versions)
VARIABLE NAMING RULES IN PYTHON
There are some rules we need to follow while giving a name for a Python variable.
1. You must start variable names with a letter or the underscore (_) character.
2. A variable name can only contain A-Z, a-z, 0-9, and underscore (_).
3. You cannot start the variable name with a number.
4. You cannot use special characters in the variable name, such as $, %, #, &, @, ., -, ^ etc.
5. Variable names are case sensitive. For example, username and Username are two different variables.
6. Variable names cannot contain whitespace(s).
7. Do not use reserved keywords as a variable name, for example keywords like class, for, def, del, is, else, etc.
# Examples of variable names allowed: x, y, mypython, my_python, _mypython, MYPYTHON, myPython7 (as in the EXAMPLE above)
# Examples of variable names not allowed
• 7mypython="PythonGuides"
• -mypython="PythonGuides"
• myPy@thon="PythonGuides"
• my Python="PythonGuides"
• for="PythonGuides"
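One way to check these rules in code is sketched below, using the built-in str.isidentifier() method and the standard keyword module. The helper is_valid_name is a hypothetical name introduced here, not part of the original material. Note that isidentifier() also accepts non-English letters, so it is slightly more permissive than rule 2.

```python
import keyword


def is_valid_name(name):
    """Return True if name can be used as a Python variable name."""
    # isidentifier() checks the character rules; iskeyword() rejects reserved words.
    return name.isidentifier() and not keyword.iskeyword(name)


candidates = ["my_python", "_mypython", "myPython7",
              "7mypython", "-mypython", "myPy@thon", "my Python", "for"]
for candidate in candidates:
    verdict = "allowed" if is_valid_name(candidate) else "not allowed"
    print(candidate, "->", verdict)
```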
TASK II: Declare 5 variables in a Python program

Python Data types
A data type is a classification of data which tells the computer how the programmer intends to use the data.
Most programming languages support various types of data, including integer, real, character or string, and
Boolean.
Python variables do not need explicit declaration, the declaration happens automatically when you assign a
value to a variable. The equal sign (=) is used to assign values to variables. The operand to the left of the =
operator is the name of the variable and the operand to the right of the = operator is the value stored in the
variable. For example –
counter = 100 # An integer assignment
miles = 1000.0 # A floating point
name = "John" # A string
Python has 6 standard data types that are used to define the operations possible on them:
• Numbers
• String
• Boolean
• List
• Tuple
• Dictionary

PYTHON NUMBERS
Number data types store numeric values. Python supports three different numerical types −
i. int (signed integers): positive and negative whole numbers. e.g. 10, 100, -789, 1 etc.
ii. float (floating point real values): positive and negative floating (decimal / fractional) numbers. e.g. 0.0,
15.20, -21.9, 1.0, 32.8e+18 etc.
iii. complex (complex numbers): positive and negative complex numbers. e.g. 3.14j, .876j, -.6545+0j,
3e+26j etc.
A complex number consists of an ordered pair of real floating-point numbers denoted by x + yj, where x and y are real
numbers and j is the imaginary unit.
To verify the type of any object in Python, use the type() function:
You can convert from one type to another with the int(), float(), and complex() functions. This is also known as casting.
An error will be raised if you try to convert a complex value to an int or a float, for example: TypeError: can't convert complex to float (the exact wording varies between Python versions)
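The type() check and the casting functions described above can be sketched as follows. The variable names are illustrative, and the text of the TypeError depends on the Python version, so it is printed rather than assumed.

```python
count = 10
price = 15.20
z = 3 + 4j

print(type(count))  # <class 'int'>
print(type(price))  # <class 'float'>
print(type(z))      # <class 'complex'>

as_float = float(count)      # 10.0
as_int = int(price)          # 15 (the decimal part is dropped)
as_complex = complex(count)  # (10+0j)
print(as_float, as_int, as_complex)

# A complex value cannot be cast to int or float.
try:
    float(z)
    cast_failed = False
except TypeError as error:
    cast_failed = True
    print("TypeError:", error)
```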

PYTHON STRINGS
Strings (single lines) in Python are surrounded by either single quotation marks or double quotation marks:
'hello' is the same as "hello". You can display a string literal with the print() function: print("Hello")
 You can assign a multiline string to a variable by using three quotes.
 Assigning a string to a variable is done with the variable name followed by an equal sign and the string.
To modify Python strings, there is a set of built-in methods that you can use on the strings. Each method returns
a new string and leaves the original unchanged.
• The upper() method returns the string in upper case.
• The lower() method returns the string in lower case.
• The capitalize() method returns a capitalized version of the string.
• The replace() method replaces a substring with another string.
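As a short illustration of the string features above (the sample text and variable names are made up for this sketch):

```python
greeting = "hello, world"

# Three quotes allow a string to span several lines.
poem = """Roses are red,
violets are blue."""

shouted = greeting.upper()                      # 'HELLO, WORLD'
lowered = shouted.lower()                       # 'hello, world'
capitalized = greeting.capitalize()             # 'Hello, world'
replaced = greeting.replace("world", "Python")  # 'hello, Python'

print(shouted, lowered, capitalized, replaced, sep="\n")
print(poem)
print(greeting)  # the original string is unchanged
```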
PYTHON BOOLEANS
Booleans represent one of two values: True or False. In programming you often need to know if an expression is
True or False. You can evaluate any expression in Python, and get one of two answers, True or False. When you
compare two values, the expression is evaluated and Python returns the Boolean answer.
EXAMPLE:
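The original example is missing here. A minimal sketch of comparisons that evaluate to Boolean values could look like this:

```python
print(10 > 9)   # True
print(10 == 9)  # False
print(10 < 9)   # False

# The result of a comparison can be stored in a variable like any other value.
is_adult = 21 >= 18
print(is_adult, type(is_adult))  # True <class 'bool'>
```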

PYTHON INPUT FUNCTION
Python allows for user input. That means we are able to ask the user for input. Python 3.x uses the input() function.
The following example asks for the username, and when you enter the username, it gets printed on the screen:
Python stops executing when it gets to the input() function and continues when the user has given some input. By
default, the data taken through the input() function is seen as strings by the interpreter. This can be changed by
casting the input.
Casting is when you convert a variable value from one datatype to another. This is done with functions such
as int() or float() or str(). A very common pattern is that you convert a number, currently as a string into a proper
number. Casting in python is therefore done using constructor functions:
• int() - constructs an integer number from an integer literal, a float literal (by removing all decimals), or a
string literal (providing the string represents a whole number).
• float() - constructs a float number from an integer literal, a float literal or a string literal (provided the string
represents a float or an integer).
• str() - constructs a string from a wide variety of data types, including strings, integer literals and float
literals.
FOR EXAMPLE: The snippet below takes two inputs, converts them to integers and prints the sum.
NB: An error will be raised if you enter letters. This is because the interpreter cannot convert letters to an integer.
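The snippet itself is not in this text. The sketch below shows the same idea with a hypothetical helper, add_as_integers, that receives the text a user would type. It is written this way so it can run without a keyboard; with a real user you would pass the results of input() instead, as shown in the comment.

```python
def add_as_integers(first_text, second_text):
    """Cast two pieces of text to int and return their sum."""
    return int(first_text) + int(second_text)


# With a real user:
#   total = add_as_integers(input("Enter a number: "), input("Enter another number: "))
total = add_as_integers("12", "30")
print(total)  # 42

# Letters cannot be converted to an integer, so int() raises a ValueError.
try:
    add_as_integers("twelve", "30")
    conversion_failed = False
except ValueError as error:
    conversion_failed = True
    print("ValueError:", error)
```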
TASK III: Write a Python program to calculate the area of a rectangle

Python Operators
Operators are used to perform operations on variables and values.
For example, the + operator adds together two values, as in 10 + 5.
Python divides the operators in the following groups:
• Arithmetic operators
• Assignment operators
• Comparison operators
• Logical operators
• Identity operators
• Membership operators
• Bitwise operators
PYTHON ARITHMETIC OPERATORS
Arithmetic operators are used with numeric values to perform common mathematical operations:
Operator Name Example
+ Addition x + y
- Subtraction x - y
* Multiplication x * y
/ Division x / y
** Exponentiation x ** y
// Floor division x // y
% Modulus x % y

PYTHON ASSIGNMENT OPERATORS
Assignment operators are used to assign values to variables:
Operator Example Same As
= x = 5 x = 5
+= x += 3 x = x + 3
-= x -= 3 x = x - 3
*= x *= 3 x = x * 3
/= x /= 3 x = x / 3
PYTHON COMPARISON OPERATORS
Comparison operators are used to compare two values:
Operator Name Example
== Equal x == y
!= Not equal x != y
> Greater than x > y
< Less than x < y
>= Greater than or equal to x >= y
<= Less than or equal to x <= y

PYTHON LOGICAL OPERATORS
Logical operators are used to combine conditional statements:
Operator   Description                                               Example
and        Returns True if both statements are true                  x < 5 and x < 10
or         Returns True if one of the statements is true             x < 5 or x < 4
not        Reverse the result, returns False if the result is true   not(x < 5 and x < 10)
PYTHON MEMBERSHIP OPERATORS
Membership operators are used to test if a sequence is present in an object:
Operator   Description                                                                          Example
in         Returns True if a sequence with the specified value is present in the object         x in y
not in     Returns True if a sequence with the specified value is not present in the object     x not in y
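The operator groups tabulated above can be tried out with a few small values. This is only a sketch; x, y and total are arbitrary names and numbers chosen for illustration.

```python
x = 7
y = 2

# Arithmetic operators
print(x + y, x - y, x * y)  # 9 5 14
print(x / y)                # 3.5 (true division always gives a float)
print(x // y)               # 3   (floor division)
print(x % y)                # 1   (remainder)
print(x ** y)               # 49  (7 to the power of 2)

# Assignment operators
total = 5
total += 3  # same as total = total + 3, so total is now 8
total *= 2  # same as total = total * 2, so total is now 16

# Comparison, logical and membership operators all produce Booleans
print(x > y, x == y)       # True False
print(x < 10 and y < 10)   # True
print(x < 5 or y < 5)      # True
print(not (x < 5))         # True
print("a" in "banana")     # True
print(4 not in [1, 2, 3])  # True
```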

Decision Making & Conditional Statements
Decision-making is the anticipation of conditions occurring during the execution
of a program and specified actions taken according to the conditions. Conditional
expressions, involving keywords such as if, elif, and else, provide Python
programs with the ability to perform different actions depending on a boolean
condition.
[Figure: general form of a typical decision-making structure found in most programming languages]
Python programming language provides the following types of decision-making
statements.
Statement              Description
if statements          An if statement consists of a Boolean expression followed by one or more statements.
if...else statements   An if statement can be followed by an optional else statement, which executes when the
                       Boolean expression is False.
nested if statements   You can use one if or elif statement inside another if or elif statement(s).

IF STATEMENT
In this example, two variables, a and b, are used as part of the if statement to test whether b is greater
than a. As a is 33 and b is 200, we know that 200 is greater than 33, and so the program prints "b is greater than
a". The body of an if statement is always indented in Python.
ELIF
The elif keyword is Python’s way of saying "if the previous conditions were not true, then try this condition".
ELSE
The else keyword catches anything which isn't caught by the preceding conditions.
In this example a is greater than b, so the first condition is not true. The elif condition is also not true, so we go to the
else condition and print "a is greater than b".
You can also have an else without the elif.
TASK IV: Write a simple Python program to check result
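The if, elif and else examples described above are not included in this text. A sketch that matches the descriptions (a = 33 and b = 200 for the first test, then a greater than b for the second) might be:

```python
# if: b is 200 and a is 33, so the condition is True
a = 33
b = 200
if b > a:
    print("b is greater than a")

# if / elif / else: now a is greater than b, so only the else branch runs
a = 200
b = 33
if b > a:
    message = "b is greater than a"
elif a == b:
    message = "a and b are equal"
else:
    message = "a is greater than b"
print(message)
```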

“AND” & “OR”
The and & or keywords are logical operators and are used to combine conditional statements.
NESTED IF
You can have if statements inside if statements; these are called nested if statements.
THE PASS STATEMENT
if statements cannot be empty, but if you for some reason have an if statement with no content, put in the pass
statement to avoid getting an error.
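The three ideas above (combining conditions, nesting, and pass) can be sketched together. The values are arbitrary and chosen only so that each branch is easy to follow.

```python
a, b, c = 200, 33, 500

# and: both conditions must be True
if a > b and c > a:
    print("Both conditions are True")

# or: at least one condition must be True
if a > b or a > c:
    print("At least one of the conditions is True")

# Nested if: the inner test only runs when the outer test is True
x = 41
if x > 10:
    if x > 20:
        size = "above 10 and also above 20"
    else:
        size = "above 10 but not above 20"
    print(size)

# pass: a placeholder so that an empty if body does not raise an error
if b > a:
    pass
```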

Python Loops
Python has two primitive loop commands:
1. while loops
2. for loops
A. THE WHILE LOOP
With the while loop we can execute a set of statements as long as a condition is true.
EXAMPLE: Print i as long as i is less than 6 -
Remember to increment i, or else the loop will continue forever.
The while loop requires relevant variables to be ready. In this example we need to define an indexing variable,
i, which we set to 1.
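A minimal sketch of the loop described above, starting from i = 1:

```python
i = 1
while i < 6:
    print(i)  # prints 1, 2, 3, 4, 5 on separate lines
    i += 1    # without this increment the loop would never end
```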
THE BREAK STATEMENT
With the break statement we can stop the loop even if the while condition is true:
EXAMPLE: Exit the loop when i is 3 -

THE CONTINUE STATEMENT
With the continue statement we can stop the current iteration, and continue with the next.
EXAMPLE: Continue to the next iteration if i is 3 -
THE ELSE STATEMENT
With the else statement we can run a block of code once when the condition no longer is true.
EXAMPLE: Print a message once the condition is false -
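The break, continue and else examples are missing from this text. The sketch below collects the values in lists (illustrative names) so the effect of each statement is visible at a glance.

```python
# break: leave the loop when i is 3, even though i < 6 is still True
before_break = []
i = 1
while i < 6:
    before_break.append(i)
    if i == 3:
        break
    i += 1
print(before_break)  # [1, 2, 3]

# continue: skip the rest of the current iteration when i is 3
without_three = []
i = 0
while i < 6:
    i += 1  # increment first, otherwise continue would loop forever
    if i == 3:
        continue
    without_three.append(i)
print(without_three)  # [1, 2, 4, 5, 6]

# else: runs once, when the condition is no longer True
i = 1
while i < 6:
    i += 1
else:
    finished_message = "i is no longer less than 6"
print(finished_message)
```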
B. THE FOR LOOP
A for loop is used for iterating over a sequence (that is either a list, a tuple, a dictionary, a set, or a string).
With the for loop we can execute a set of statements, once for each item in a list, tuple, set etc. The for
loop does not require an indexing variable to be set beforehand.

EXAMPLE: Print each fruit in a fruit list: -
LOOPING THROUGH A STRING
Even strings are iterable objects, they contain a sequence of characters:
EXAMPLE: Loop through the letters in the word "banana" -
THE BREAK STATEMENT
With the break statement we can stop the loop before it has looped through all the items:
EXAMPLE: Exit the loop when x is "banana" -
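The three for loop examples named above (a fruit list, the word "banana", and break) can be sketched as follows. The list contents are assumed for illustration.

```python
fruits = ["apple", "banana", "cherry"]

# Print each fruit in the list
for fruit in fruits:
    print(fruit)

# Loop through the letters of a string
letters = []
for letter in "banana":
    letters.append(letter)
print(letters)  # ['b', 'a', 'n', 'a', 'n', 'a']

# Exit the loop when the current item is "banana"
seen = []
for fruit in fruits:
    seen.append(fruit)
    if fruit == "banana":
        break
print(seen)  # ['apple', 'banana']
```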
The continue statement also works in for loops.
THE RANGE() FUNCTION
To loop through a set of code a specified number of times, we can use the range() function. The range()
function returns a sequence of numbers, starting from 0 by default, incrementing by 1 (by default), and
ending before a specified number.
EXAMPLE: Using the range() function -
Note that range(6) is not the values of 0 to 6, but the values 0 to 5.
TASK V: Write a simple Python program to print even numbers between 1 - 20

The range() function defaults to 0 as a starting value, however it is possible to specify the starting value by
adding a parameter: range(2, 6), which means values from 2 to 6 (but not including 6).
EXAMPLE: Using the start parameter -
The range() function defaults to increment the sequence by 1, however it is possible to specify the increment
value by adding a third parameter: range(2, 30, 3).
EXAMPLE: Increment the sequence with 3 (default is 1) -
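A short sketch of the three forms of range() discussed above. Wrapping each range in list() is done here only to show all the values at once.

```python
default_start = list(range(6))     # stop only
custom_start = list(range(2, 6))   # start and stop
with_step = list(range(2, 30, 3))  # start, stop and step

print(default_start)  # [0, 1, 2, 3, 4, 5]
print(custom_start)   # [2, 3, 4, 5]
print(with_step)      # [2, 5, 8, 11, 14, 17, 20, 23, 26, 29]

# In practice range() is usually used directly in a for loop
for x in range(2, 6):
    print(x)
```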
NESTED LOOPS
A nested loop is a loop inside a loop. The "inner loop" will be executed one time for each iteration of the
"outer loop".
EXAMPLE: Print each adjective for every fruit: -
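A sketch of the nested loop named above. The adjectives and fruits are assumed values, and the pairs list is added only to make the order of execution easy to inspect.

```python
adjectives = ["red", "big", "tasty"]
fruits = ["apple", "banana", "cherry"]

pairs = []
for adjective in adjectives:  # outer loop
    for fruit in fruits:      # inner loop runs fully for every adjective
        print(adjective, fruit)
        pairs.append((adjective, fruit))

print(len(pairs))  # 9 combinations: 3 adjectives x 3 fruits
```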
The pass and else statements also work in for loops.

 Error Handling, Functions and Modules
PART 2

CONTENTS
 Error Handling
 Exceptions
 Exception handling
 Python Data types (contd.)
 List
 Nested list
 List operations
 List functions
 Tuples
 Dictionaries
 Python Functions
 Arguments
 Returning From Functions
 Python Modules
 Installing packages
 Mini-projects
 Multiplication table
 Number checker
 Password checker
 Time counter
 Simple calculator

Error Handling
EXCEPTIONS
Exceptions occur when something goes wrong, due to incorrect code or input. When an exception occurs and is
not handled, the program immediately stops. The following code produces the ZeroDivisionError exception by
trying to divide 7 by 0.
EXAMPLE:
Different exceptions are raised for different reasons. Common exceptions in Python include:
ImportError: an import fails;
IndexError: a list is indexed with an out-of-range number;
NameError: an unknown variable is used;
SyntaxError: the code can't be parsed properly;
TypeError: a function is called on a value of an inappropriate type;
ValueError: a function is called on a value of the correct type, but with an inappropriate value.
EXCEPTION HANDLING
To handle exceptions, and to call code when an exception occurs, you can use a try/except statement. The try
block contains code that might throw an exception. If that exception occurs, the code in the try block stops
being executed, and the code in the except block is run. If no error occurs, the code in the except block
doesn't run.

EXAMPLE:
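The example is missing from this text. A minimal sketch of try/except, reusing the division of 7 by 0 from the previous section (the outcome variable is illustrative):

```python
try:
    num1 = 7
    num2 = 0
    print(num1 / num2)                # raises ZeroDivisionError
    outcome = "calculation finished"  # skipped, because the line above failed
except ZeroDivisionError:
    outcome = "an error occurred due to zero division"

print(outcome)
```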
A try statement can have multiple different except blocks to handle different exceptions. Multiple exceptions can
also be put into a single except block using parentheses, to have the except block handle all of them.
An except statement without any exception specified will catch all errors.
Exception handling is particularly useful when dealing with user input.
EXAMPLE:
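As a sketch of handling user input with several except blocks, the hypothetical helper safe_divide below takes the text a user might type. Passing strings directly keeps the example runnable without a keyboard; in a real program the arguments would come from input().

```python
def safe_divide(numerator_text, denominator_text):
    """Divide two numbers given as text; return None if that is not possible."""
    try:
        return int(numerator_text) / int(denominator_text)
    except ValueError:
        # Raised by int() when the text is not a whole number
        print("Please enter whole numbers only.")
    except ZeroDivisionError:
        print("The denominator cannot be zero.")
    return None


print(safe_divide("7", "2"))    # 3.5
print(safe_divide("7", "0"))    # prints a message, then None
print(safe_divide("abc", "2"))  # prints a message, then None
```

The two handlers could also be merged into one block with except (ValueError, ZeroDivisionError): if the same response suits both errors.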

Python Data types (contd.)
LIST
Lists are used to store items. A list is created using square brackets with commas separating the list items.
EXAMPLE:
In the example above, the words list contains three string items: hello, world and !
If you want to access a certain item in the list, you can do this by using its index in square brackets. In our
example, that would look like this:
The first list item's index is 0, rather than 1, as you might expect.
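The words list described above, with its three items, can be written and indexed like this:

```python
words = ["hello", "world", "!"]

print(words[0])  # hello
print(words[1])  # world
print(words[2])  # !
```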

NESTED LIST
Nested lists can be used to represent 2D grids, such as matrices. This is useful because a matrix-like structure
can allow you to store data in row-column format, like in ticketing programs, that need to store the seat
numbers in a matrix, with their corresponding rows and numbers.
EXAMPLE:
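A small sketch of a nested list used as a seating grid. The seat labels are made up for illustration.

```python
# Each inner list is one row of seats
seats = [
    ["A1", "A2", "A3"],
    ["B1", "B2", "B3"],
]

print(seats[0])     # ['A1', 'A2', 'A3'] (the first row)
print(seats[1][2])  # B3 (second row, third seat)
```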
LIST OPERATIONS
To check if an item is in a list, the in operator can be used. It returns True if the item occurs one or more times
in the list, and False if it doesn't.
EXAMPLE:
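For instance, with an illustrative list of words:

```python
words = ["spam", "egg", "spam", "sausage"]

print("spam" in words)    # True (it occurs twice)
print("egg" in words)     # True
print("tomato" in words)  # False
```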
LIST FUNCTIONS
The append method adds an item to the end of an existing list. EXAMPLE:
To get the number of items in a list, you can use the len() function. EXAMPLE:
Unlike the index of the items, len does not start with 0. So, the list above contains 5 items, meaning len will return 5.
The insert method is similar to append, except that it allows you to insert a
new item at any position in the list, as opposed to just at the end.

The index method finds the first occurrence of a list item and returns its index. If the item isn't in the list, it raises a
ValueError.
EXAMPLE:
There are a few more useful functions and methods for lists.
max(list): Returns the list item with the maximum value
min(list): Returns the list item with minimum value
list.count(item): Returns a count of how many times an item occurs in a list
list.remove(item): Removes the first occurrence of an item from a list
list.reverse(): Reverses items in a list.
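The list functions and methods in this section can be sketched on one small list. The numbers are arbitrary, and the comments show the list after each step.

```python
nums = [1, 3, 5, 2, 4]
print(len(nums))  # 5

nums.append(6)     # [1, 3, 5, 2, 4, 6]
nums.insert(0, 9)  # [9, 1, 3, 5, 2, 4, 6]

position = nums.index(5)  # 3, the index of the first 5
largest = max(nums)       # 9
smallest = min(nums)      # 1
threes = nums.count(3)    # 1

nums.remove(9)  # [1, 3, 5, 2, 4, 6]
nums.reverse()  # [6, 4, 2, 5, 3, 1]
print(nums)
```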
TUPLES
Tuples are very similar to lists, except that they are immutable (they cannot be changed). Also, they are created
using parentheses, rather than square brackets. You can access the values in the tuple with their index, just as you
did with lists:
EXAMPLE:
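A minimal sketch of creating a tuple, reading an item, and seeing what happens on an attempt to change it (the contents are illustrative):

```python
words = ("spam", "eggs", "sausages")
print(words[0])  # spam

# Tuples are immutable, so assigning to an index raises a TypeError.
try:
    words[1] = "cheese"
    changed = True
except TypeError as error:
    changed = False
    print("TypeError:", error)
```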
DICTIONARIES
Dictionaries are data structures used to map arbitrary keys to values. Lists can be thought of as dictionaries
with integer keys within a certain range. Dictionaries can be indexed in the same way as lists, using square
brackets containing keys. Each element in a dictionary is represented by a key:value pair.
EXAMPLE:
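A sketch of a dictionary that maps names to ages. The keys and values are made up for illustration.

```python
ages = {"Dave": 24, "Mary": 42, "John": 58}

print(ages["Dave"])  # 24
print(ages["Mary"])  # 42

ages["Ann"] = 31       # add a new key:value pair
print("Ann" in ages)   # True (in checks the keys)
print(len(ages))       # 4
```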

Functions
Code reuse is a very important part of programming in any language. Increasing code size makes it
harder to maintain. For a large programming project to be successful, it is essential to abide by the
Don't Repeat Yourself (DRY) principle.
In addition to using pre-defined functions, you can create your own functions by using the def
statement. Here is an example of a function named my_func. It takes no arguments, and prints "The 8th
Gear Space" three times. It is defined, and then called. The statements in the function are executed only
when the function is called.
EXAMPLE:
The code block within every function starts with a colon (:) and is indented.
Once you’ve defined a function, you can call it multiple times in your code. You must define
functions before they are called, in the same way that you must assign variables before using them.
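The my_func example described above is not in this text. One way to write it is sketched here; using a loop for the three prints is a choice made for this sketch.

```python
def my_func():
    # The indented block after the colon is the body of the function.
    for _ in range(3):
        print("The 8th Gear Space")


# Nothing is printed until the function is called.
my_func()
```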

ARGUMENTS
All the function definitions we've looked at so far have been functions of zero arguments, which are called with empty
parentheses. However, most functions take arguments. The example below defines a function that takes one
argument:
EXAMPLE:
You can also define functions with more than one argument; separate them with commas.
EXAMPLE:
Function arguments can be used as variables inside the function definition. However, they cannot be referenced
outside of the function's definition, else the code will throw an error. This also applies to other variables created inside
a function.
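The following sketch shows a function with one argument, a function with two, and the scope rule described above. All function and variable names are illustrative.

```python
def print_with_exclamation(word):
    print(word + "!")


def print_sum_twice(x, y):
    print(x + y)
    print(x + y)


print_with_exclamation("spam")  # spam!
print_sum_twice(5, 8)           # 13, printed twice


def describe(item):
    local_note = "is only known inside describe()"
    print(item, local_note)


describe("local_note")

# Variables created inside a function cannot be used outside it.
try:
    print(local_note)
    scope_error = False
except NameError as error:
    scope_error = True
    print("NameError:", error)
```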
RETURNING FROM FUNCTIONS
Certain functions, such as int or str, return a value instead of outputting it. The
returned value can be used later in the code, for example, by getting assigned
to a variable. To do this for your defined functions, you can use the return
statement. Like this:
Once a function returns a value, it immediately stops executing. Any code after the return statement will never run.
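A sketch of the return statement, with two hypothetical functions: max_of returns the larger of two values, and add_numbers shows that code placed after return is never reached.

```python
def max_of(x, y):
    if x >= y:
        return x
    return y


def add_numbers(x, y):
    total = x + y
    return total
    print("This won't be printed")  # unreachable: the function has already returned


biggest = max_of(4, 7)
result = add_numbers(4, 5)
print(biggest)  # 7
print(result)   # 9
```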

Modules
Modules are pieces of code that other people have written to fulfill common tasks, such as generating
random numbers, performing mathematical operations, etc.
The basic way to use a module is to add import module_name at the top of your code, and then use
module_name.var to access functions and values with the name var in the module.
For example, the following code uses the random module to generate random numbers:
EXAMPLE:
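The example is missing here. A minimal sketch using random.randint from the standard library follows. The range 1 to 6 is an arbitrary choice, and the output differs on every run.

```python
import random

rolls = []
for _ in range(5):
    value = random.randint(1, 6)  # a random whole number from 1 to 6, both included
    rolls.append(value)

print(rolls)
```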
There is another kind of import that can be used if you only need certain functions from a module.
These take the form from module_name import var, and then var can be used as if it were defined
normally in your code.
For example, to import only the pi constant from the math module:
Use a comma separated list to import multiple objects. For example:
You can import a module or object under a different name using the as keyword.
This is mainly used when a module or object has a long or confusing name.
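The three import forms above can be sketched with the standard math and random modules. The alias names square_root and rnd are chosen here only for illustration.

```python
from math import pi         # import a single object
from math import sqrt, cos  # import several objects, separated by commas

print(pi)        # 3.141592653589793
print(sqrt(16))  # 4.0
print(cos(0))    # 1.0

# Import under a different name with the as keyword
from math import sqrt as square_root
import random as rnd

print(square_root(100))   # 10.0
print(rnd.randint(1, 3))  # 1, 2 or 3
```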

There are three main types of modules in Python:
1. Those you write yourself
2. Those you install from external sources, and
3. Those that are preinstalled with Python
The last type is called the standard library, and contains many useful modules. Some of the standard library's
useful modules include string, re, datetime, math, random, os, multiprocessing, subprocess, socket, etc.
Many third-party Python modules are stored on the Python Package Index (PyPI). The best way to install these
is using a program called pip. This comes installed by default with modern distributions of Python. If you don't
have it, it is easy to install online. Once you have it, installing libraries from PyPI is easy. Look up the name of the
library you want to install, go to the command line (for Windows it will be the Command Prompt), and enter pip
install library_name. Once you've done this, import the library and use it in your code.
It's important to enter pip commands at the command line, not the Python interpreter.
INSTALLING PACKAGES
1. Before you go any further, make sure you have Python and that the expected version is available from your
command line. You can check this by running: py --version
(py is the Python launcher on Windows; on Linux or macOS, use python3 instead.)
2. Additionally, you’ll need to make sure you have pip available. You can check this by running:
py -m pip --version
3. If pip isn’t already installed, then first try to bootstrap it from the standard library:
py -m ensurepip --default-pip
py -m pip install --upgrade pip setuptools wheel
4. To install the latest version of “SomeProject”: py -m pip install "SomeProject"
5. To upgrade an already installed SomeProject to the latest from PyPI:
py -m pip install --upgrade SomeProject
TASK I: Install the following modules – numpy, pandas, matplotlib

Mini-Project Solutions
1. MULTIPLICATION TABLE
# Program to print multiplication table
print("MULTIPLICATION TABLE\n")
num = int(input("Enter the number: "))
# loop for the 12 multipliers
for counter in range(1, 13):
    # printing the result
    print(num, "x", counter, "=", num * counter)
2. NUMBER CHECKER
# Program to check if a number is an even or odd number
print("This program checks if a number is even or odd")
num = int(input("Enter the number: "))
if num % 2 == 0:
    print("The number is EVEN")
else:
    print("The number is ODD")

3. PASSWORD CHECKER
# Program to check if passwords match
print("PASSWORD")
password = input("Pls enter your password: ")
print("\nCONFIRM PASSWORD")
password2 = input("Pls re-enter your password again: ")
if password != password2:
    print("Your password doesn't match")
else:
    print("Your password matches")
4. TIME COUNTER
# Program for time counter
import time  # importing time module
val = int(input("Enter the number to countdown from (in seconds): "))
print("Starting...\n")
# Count down from val to 0
for remaining in range(val, -1, -1):
    print(remaining)
    # Using the time module to delay for 1 second
    time.sleep(1)
print("\nDone...")

5. SIMPLE CALCULATOR
# Program for simple calculator
# Takes the name of the user and capitalizes the input
username = input("Please enter your name: ").capitalize()
print()
print(username, "you are welcome!\n")
print()
print("What will you like to do?")
print("1. Addition      3. Multiplication\n")
print("2. Subtraction   4. Division")
sel = input("Please make a selection... ")
# Loop to validate input
while sel != "1" and sel != "2" and sel != "3" and sel != "4":
    print("Wrong input.\nPlease enter 1, 2, 3 or 4")
    print()
    print("What will you like to do?")
    print("1. Addition      3. Multiplication\n")
    print("2. Subtraction   4. Division")
    sel = input("Please make a selection... ")
print()
# Condition for Addition
if sel == "1":
    print("ADDITION")
    num1 = int(input("\nEnter a number: "))
    num2 = int(input("Enter another number: "))
    ans = num1 + num2
    print("The addition of", num1, "and", num2, "is", ans)
# Condition for Subtraction
elif sel == "2":
    print("SUBTRACTION")
    num1 = int(input("\nEnter the first number: "))
    num2 = int(input("Enter the second number: "))
    ans = num1 - num2
    print("The subtraction of", num2, "from", num1, "is", ans)
# Condition for Multiplication
elif sel == "3":
    print("MULTIPLICATION")
    num1 = int(input("\nEnter a number: "))
    num2 = int(input("Enter another number: "))
    ans = num1 * num2
    print("The product of", num1, "and", num2, "is", ans)
# Condition for Division
elif sel == "4":
    print("DIVISION")
    # The input lines in this branch are missing from the source; the prompts below are assumed
    num1 = int(input("\nEnter the numerator: "))
    num2 = int(input("Enter the denominator: "))
    # Loop to validate input
    while num2 == 0:
        print("\nError!!! Denominator cannot be Zero (0).\nPlease enter a non-zero value.")
        num2 = int(input("Enter the denominator: "))
    ans = num1 / num2
    print(num1, "divided by", num2, "is", ans)

 Python for Data Analysis
PART 3

CONTENTS
 Data Analysis & Statistics
 Intro
 Who uses data analysis?
 Mean
 Median
 Standard deviation
 NumPy
 Intro
 NumPy array
 Indexing and slicing
 Conditions
 Zeroes, Ones and Full methods
 NumPy array datatypes
 Advanced indexing techniques
 Statistics with NumPy
 Array operations
 Single array operations
 Multi-array operations
 Vector product
 Pandas
 Series & dataframes
 Creating dataframes
 Attribute of a dataframe
 Indexing
 Data selection
 Conditions
 Reading data
 Dropping columns
 Creating columns
 WORKSHOP I
 Grouping
 Multi-index / hierarchical indexing
 Concatenation
 WORKSHOP II
 Matplotlib
 Intro
 Line plot
 Bar plot
 Box plot
 Histogram
 Area plot
 Scatter plot
 Pie chart
 Plot formatting

Data Analysis & Statistics
Data is everywhere and is said to double every 40 months. Hence the phenomenon “Big Data”.
Mastercard records 74 billion transactions per year
Twitter records 500 million tweets per day
Walmart records approximately 1 million transactions per hour
It is predicted that data would amount to 163 Zettabytes by 2025. To be in control of such enormous
volume of data, it is necessary that it be structured. From the structure, insights can be discovered. This is
where data analysis comes in. Data Analysis uses various techniques and methods to extract knowledge
and actionable insights from data using scientific methods, multidisciplinary knowledge and computing
technologies. Python is widely used in data science/analysis and has a robust suite of powerful tools to
communicate with data.
WHO USES DATA ANALYSIS?
Aerospace, Agriculture, Automobiles, Banking, Communications, Entertainment, Finance, Fitness,
Government, Healthcare, Information Technology, Mining, Real Estate, Robotics, Sales, Travel, and many
more.

Data Analysis refers to the process of examining in close detail the components of a given dataset – separating
them out and studying the parts individually and their relationship between one another. It is used to uncover
patterns, trends and anomalies lying within data, and thereby deliver the insights businesses need to enable
evidence-based decision making.
Let's dive into some basics of statistics first. These concepts form the main building blocks of data analysis.
In statistics, we have:
mean: the average of the values.
median: the middle value.
standard deviation: the measure of spread.
These statistics provide information about your data set and help you understand where your data values are
and how they are distributed.
As an example dataset, let's consider the prices of a group of products: [18, 24, 67, 55, 42, 14, 19, 26, 33]
MEAN
The given dataset includes prices of 9 products. The mean is the average value of the dataset. We can calculate it by
adding all prices together and dividing by the number of products:
mean = 298/9 = 33.1
MEDIAN
Another useful concept is median: the middle value of an ordered dataset. To calculate the median for our prices
dataset, let's first order it in ascending order:
[14, 18, 19, 24, 26, 33, 42, 55, 67]
The median is 26, as that's the middle value.
If our dataset had an even number of values, we would take the two values in the middle and calculate their
average value.
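The same statistics can be checked in code. The sketch below uses Python's built-in statistics module, which is a choice made here rather than something the original prescribes. It also assumes the population standard deviation (pstdev), which is the same definition NumPy's np.std uses by default.

```python
import statistics

prices = [18, 24, 67, 55, 42, 14, 19, 26, 33]

mean_price = statistics.mean(prices)      # sum of the prices divided by 9
median_price = statistics.median(prices)  # middle value of the sorted prices
spread = statistics.pstdev(prices)        # population standard deviation

print(sorted(prices))        # [14, 18, 19, 24, 26, 33, 42, 55, 67]
print(round(mean_price, 1))  # 33.1
print(median_price)          # 26
print(round(spread, 1))      # 17.1
```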

NumPy
NumPy (Numerical Python) is a Python library used to work with numerical data.
NumPy includes functions and data structures that can perform a wide variety
of mathematical operations.
To start using NumPy, we first need to import it: import numpy as np
np is the most common name used to import numpy.
NUMPY ARRAY
In Python, lists are used to store data. NumPy provides an array structure for performing operations with data.
NumPy arrays are faster and more compact than lists, but NumPy arrays are homogeneous, meaning they can
contain only a single data type, while lists can contain multiple different types of data.
A NumPy array can be created using the np.array() function, providing it a list as the argument.
NumPy arrays are often called ndarrays, which stands for "N-dimensional array", because they can have
multiple dimensions.
FOR EXAMPLE:
A 2-dimensional array can be created with 3 rows and 3 columns, and the value at its 2nd row and 3rd column can then be output.
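A minimal sketch of array creation follows. The variable names and values are illustrative only.

```python
import numpy as np

# 1-dimensional array built from a list
x = np.array([1, 2, 3, 4])
print(x[0])        # 1

# 2-dimensional array with 3 rows and 3 columns (a list of lists)
grid = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])
print(grid[1][2])  # 2nd row, 3rd column: 6
```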

Arrays have properties (attributes), which can be accessed using a dot.
ndim returns the number of dimensions of the array.
size returns the total number of elements of the array.
shape returns a tuple of integers that indicate the number of elements stored along each dimension of the array.
FOR EXAMPLE:
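The three attributes can be read like this (a small sketch on an illustrative 3 by 3 array):

```python
import numpy as np

x = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

print(x.ndim)   # 2
print(x.size)   # 9
print(x.shape)  # (3, 3)
```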
We can add elements to, remove elements from, and sort an array using the np.append(), np.delete() and np.sort()
functions. np.arange() allows you to create an array that contains a range of evenly spaced values (similar to a Python range):
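A short sketch of these four functions, with illustrative values. Note that each call returns a new array rather than changing the original in place.

```python
import numpy as np

x = np.array([2, 1, 3])

x = np.append(x, 0)   # add 0 at the end       -> [2 1 3 0]
x = np.delete(x, 0)   # remove index position 0 -> [1 3 0]
x = np.sort(x)        # sorted copy             -> [0 1 3]
print(x)

# start at 2, stop before 20, step of 3
steps = np.arange(2, 20, 3)
print(steps)          # [ 2  5  8 11 14 17]
```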
Reshape Method
NumPy allows us to change the shape of our arrays using the reshape() function. For example, we can change
our 1-dimensional array to an array with 3 rows and 2 columns:
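As a sketch, a 6-element array (values chosen for illustration) can be reshaped as follows. The new shape must hold exactly the same number of elements.

```python
import numpy as np

x = np.arange(1, 7)    # [1 2 3 4 5 6]
z = x.reshape(3, 2)    # 3 rows, 2 columns
print(z)
```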

INDEXING AND SLICING
NumPy arrays can be indexed and sliced the same way that Python lists are.
FOR EXAMPLE:
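A brief sketch of list-style indexing and slicing on an illustrative array:

```python
import numpy as np

x = np.arange(1, 10)  # [1 2 3 4 5 6 7 8 9]

print(x[0])    # 1
print(x[0:2])  # [1 2]
print(x[5:])   # [6 7 8 9]
print(x[:2])   # [1 2]
print(x[-3:])  # [7 8 9]
```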
TASK I: Create an array of numbers below 100 that are
multiples of both 3 and 5, then output it
CONDITIONS
You can provide a condition as the index to select the elements that fulfill the given condition. Conditions can be
combined using the & (and) and | (or) operators.
FOR EXAMPLE:
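A small sketch of conditional selection. Each condition needs its own parentheses when combined with & or |.

```python
import numpy as np

x = np.arange(1, 10)  # [1 2 3 4 5 6 7 8 9]

print(x[x < 4])                   # [1 2 3]
print(x[(x > 5) & (x % 2 == 0)])  # [6 8]
print(x[(x < 3) | (x > 7)])       # [1 2 8 9]
```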

NUMPY ARRAY
When you create an array, NumPy assigns a default datatype to it based on the elements within the array, choosing
the type that best fits the data. However, we can override this default by passing an extra argument when creating
the array. This argument is called dtype.
FOR EXAMPLE:
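A minimal sketch of default and overridden datatypes (illustrative values; the default integer type depends on the platform):

```python
import numpy as np

a = np.array([1, 2, 3])                    # integer dtype chosen by NumPy
b = np.array([1, 2, 3], dtype=np.float64)  # default overridden with dtype
c = np.array([1, 2.5, 3])                  # one float makes the whole array float

print(a.dtype)  # platform dependent, e.g. int64
print(b.dtype)  # float64
print(c.dtype)  # float64
```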
ZEROs, ONEs & FULL METHODS
To create an array of just zero values (a blank matrix), we use the zeros() function. It takes the shape of the matrix to
be created as its argument. To create a matrix of all 1s, use the ones() function. To fill the matrix with a number other
than 0 or 1, we use the full() function. It takes 2 arguments, the second being the number used to fill the matrix.
FOR EXAMPLE:
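A short sketch of the three functions, using a 2 by 3 shape and the fill value 7 purely as examples:

```python
import numpy as np

zeros = np.zeros((2, 3))     # 2 rows, 3 columns of 0.0
ones = np.ones((2, 3))       # same shape filled with 1.0
sevens = np.full((2, 3), 7)  # same shape filled with 7

print(zeros)
print(ones)
print(sevens)
```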

NUMPY ARRAY DATATYPES
The major datatypes in NumPy array are:
1. np.int16    Integer (-32768 to 32767)
2. np.int32    Integer (-2147483648 to 2147483647)
3. np.int64    Integer (-9223372036854775808 to 9223372036854775807)
4. np.uint16   Unsigned Integer (0 to 65535)
5. np.uint32   Unsigned Integer (0 to 4294967295)
6. np.uint64   Unsigned Integer (0 to 18446744073709551615)
7. np.float64  Same as Python's float (older tutorials write np.float, an alias that recent NumPy versions have removed)
etc.
Remember that Python automatically assigns datatypes to variables; the same applies to NumPy. However, we
may accidentally overflow a NumPy datatype. You can see the datatype of an array using its .dtype attribute.
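To illustrate what an overflow looks like, here is a small sketch that adds 1 to the largest int16 value. The array names are illustrative. With fixed-width integer arrays the result silently wraps around instead of growing.

```python
import numpy as np

small = np.array([32767], dtype=np.int16)  # largest value an int16 can hold
one = np.array([1], dtype=np.int16)

result = small + one
print(result.dtype)  # int16
print(result)        # [-32768], the value wrapped around
```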
ADVANCED INDEXING TECHNIQUES
To index out every 3rd and 4th value from every row, we can do this:
Python negative indexing can be applied here also.
The NumPy package can be used to create Boolean indexing. We do this by creating a separate array of the same size as
the array we want to index. This separate array will have True & False values:
We can use this True & False array to index out the values from the original array:
Notice we get back a 1-dimensional array that only has the elements that return True against this condition. Let's take
this a step further by retaining the original array shape.
We can achieve this using the where() function, which takes 3 arguments. The first is the condition, the next
argument is the value to use where the condition is true, and the last argument is the value to use where it is false.
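The indexing techniques on this slide can be sketched as follows. The contents of test_data are hypothetical, and -1 is an arbitrary filler value chosen for the where() call.

```python
import numpy as np

# Hypothetical 3x4 array used only for illustration
test_data = np.array([[1, 2, 3, 4],
                      [5, 6, 7, 8],
                      [9, 10, 11, 12]])

print(test_data[:, 2:4])  # 3rd and 4th value of every row
print(test_data[:, -1])   # negative index: last value of every row

mask = test_data > 6      # True/False array with the same shape
print(test_data[mask])    # [ 7  8  9 10 11 12]

# where(condition, value_if_true, value_if_false) keeps the original shape
kept = np.where(mask, test_data, -1)
print(kept)
```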

To apply multiple conditions to the array, we use the logical_and() function.
Now that we have a True & False matrix, we can apply it to the test_data array as we did before to give a
1-dimensional array.
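A minimal sketch of combining two conditions, with hypothetical test_data values and thresholds:

```python
import numpy as np

# Hypothetical values for the test_data array
test_data = np.array([[1, 2, 3, 4],
                      [5, 6, 7, 8],
                      [9, 10, 11, 12]])

# True only where both conditions hold
mask = np.logical_and(test_data > 3, test_data < 10)
print(mask)
print(test_data[mask])  # [4 5 6 7 8 9]
```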
TASK II: Upgrade the program above to display an array of the same shape as
the original, showing zeroes for elements that do not meet the conditions
STATISTICS WITH NUMPY
Statistics is a branch of mathematics dealing with the collection, analysis, interpretation, and presentation of
masses of numerical data. As discussed earlier, some common statistical tools and procedures include the
following: mean, median, variance, standard deviation etc. NumPy has built-in functions to return those
values.
FOR EXAMPLE:
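A sketch using the prices dataset from the statistics introduction. By default np.var() and np.std() compute the population variance and standard deviation.

```python
import numpy as np

prices = np.array([18, 24, 67, 55, 42, 14, 19, 26, 33])

print(np.mean(prices))    # about 33.1
print(np.median(prices))  # 26.0
print(np.var(prices))     # variance
print(np.std(prices))     # standard deviation (square root of the variance)
```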

ARRAY OPERATIONS
It is easy to perform basic mathematical operations with arrays. For example, to find the sum of all elements, we
use the sum() function. Similarly, min() and max() can be used to get the smallest and largest elements.
FOR EXAMPLE:
SINGLE ARRAY OPERATIONS
We can sum along an axis using the keyword argument axis: use axis=1 to sum each row, or axis=0 to sum each
column.
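A small sketch of these operations on an illustrative 2 by 3 array:

```python
import numpy as np

x = np.array([[1, 2, 3],
              [4, 5, 6]])

print(x.sum())        # 21
print(x.min())        # 1
print(x.max())        # 6
print(x.sum(axis=1))  # one total per row:    [ 6 15]
print(x.sum(axis=0))  # one total per column: [5 7 9]
```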

There are many more operations that we can apply to a NumPy array, but we won't be able to cover them all. We'll
treat the most important/common ones, which include:
The axis keyword can be applied to any of these.
MULTI-ARRAY OPERATIONS
We can use most math operators on NumPy arrays as we would in Python. They work element by element: each
element of one array is paired with the corresponding element of the other array, and the operation is applied to the pair.
Vector Product: To take the vector product of 2 arrays, we do this:
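The original example is not available, so the sketch below shows element-wise operators and two common readings of "vector product": the dot (inner) product with np.dot() and the cross product with np.cross(). Which one the slide intended is an assumption.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

print(a + b)  # [5 7 9]
print(a * b)  # [ 4 10 18], element by element

print(np.dot(a, b))    # 32 = 1*4 + 2*5 + 3*6
print(np.cross(a, b))  # [-3  6 -3]
```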

Pandas
Pandas (panel data) is one of the most popular data science libraries in Python.
Easy to use, it is built on top of NumPy and shares many functions and
properties. With Pandas, you can read and extract data from files, manipulate,
transform and analyze it, calculate statistics and correlations, and much more!
To start using pandas, we need to import it first: import pandas as pd
pd is the most common name used to import pandas.
SERIES & DATAFRAMES
The two primary components of pandas are the Series and the DataFrame.
A Series is essentially a column, and a DataFrame is a two-dimensional table made up of a collection of
Series.
You can think of a Series as a one-dimensional array, while a DataFrame is a two-dimensional array.
For example, the following DataFrame is made of two Series, ages and heights:

CREATING DATAFRAMES
A DataFrame is a 2-dimensional, size-mutable, potentially heterogeneous tabular data structure with labelled
axes (rows & columns). Arithmetic operations align on both row and column labels. It has many attributes
(properties) and methods.
Before working with real data, let's first create a DataFrame manually to explore its functions. The easiest way to
create a DataFrame is using a dictionary:
Each key is a column, while the value is an array representing the data for that column. Now, we can pass this
dictionary to the DataFrame constructor.
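A minimal sketch of building a DataFrame from a dictionary. The ages and heights values are made up for illustration.

```python
import pandas as pd

# Each key becomes a column; each list holds that column's values (made-up data)
data = {
    "ages": [14, 18, 24, 42],
    "heights": [165, 180, 176, 184],
}

df = pd.DataFrame(data)
print(df)
```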
ATTRIBUTES OF A DATAFRAME
Some attributes/properties of a DataFrame include:
1. .index - pulls out the index
2. .columns - pulls out the columns
3. .dtypes - pulls out the datatypes
We can see the index is given to us as a range (start = 0, stop = 4, step = 1).
.columns returns every column in our DataFrame, and they are all lined up on the same index.
.dtypes returns the datatype of each column.
It is important to remember that each column in a DataFrame is a Series, and each Series has 1 datatype.
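The three attributes can be inspected like this (a sketch on a made-up four-row DataFrame):

```python
import pandas as pd

df = pd.DataFrame({
    "ages": [14, 18, 24, 42],
    "heights": [165, 180, 176, 184],
})

print(df.index)    # RangeIndex(start=0, stop=4, step=1)
print(df.columns)  # the column labels
print(df.dtypes)   # one datatype per column
```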

INDEXING
The DataFrame automatically creates a numeric index for each row. We can specify a custom index when creating
the DataFrame. Also, we can access a row using its index and the loc[] indexer.
Note that loc uses square brackets to specify the index.
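A short sketch of a custom index and row access. James and Bob appear later in this section; the other names and all values are made up.

```python
import pandas as pd

df = pd.DataFrame(
    {"ages": [14, 18, 24, 42], "heights": [165, 180, 176, 184]},
    index=["James", "Bob", "Amy", "Dave"],  # custom row labels
)

bob = df.loc["Bob"]  # the row labelled "Bob", returned as a Series
print(bob)
```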
We can select a single column by specifying its name in square brackets:
The result is a Series object.
If we want to select multiple columns, we can specify a list of column names:
This time, the result is a DataFrame, as it includes multiple columns.
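As a sketch (made-up data), selecting one column returns a Series, while selecting a list of columns returns a DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "ages": [14, 18, 24, 42],
    "heights": [165, 180, 176, 184],
})

one_column = df["ages"]               # Series
two_columns = df[["ages", "heights"]] # DataFrame (note the list inside the brackets)

print(type(one_column))
print(type(two_columns))
```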

DATA SELECTION
Scalar values make up a Series, and Series make up a DataFrame.
We use square brackets "[ ]" when pulling a smaller object from a larger object.
So to pull a Series from a DataFrame we use square brackets, and to pull a scalar value from a Series we use square
brackets also.
This method of selecting data is fine for quick working code, but it's not advisable to use it in production code. There we
need to use the selection methods available in the pandas package.
This is because stacking square brackets one after another leads to what is called chained indexing. Chained indexing is a
problem because scripts can behave in an unpredictable way, especially with large datasets.

We have 2 accessing methods for groups of values and 2 accessing methods for single values. Both pull out
values either by their Labels or by their Integer Position in the DataFrame.
Accessing DataFrame Values
                   By Label    By Integer Position
Single Value       .at         .iat
Group of Values    .loc        .iloc
Let's say we want to pull out James' age using both methods for pulling single values from a DataFrame.
First, using the .at[] method. The syntax is at[row_select, column_select], where row_select is the Row
Label and column_select is the Column Label.
Secondly, using the integer positions. The syntax is iat[row_select, column_select], where row_select &
column_select are the integer positions of the value we want to pull.
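A minimal sketch of both single-value accessors. The data is made up, with James placed in the first row.

```python
import pandas as pd

df = pd.DataFrame(
    {"ages": [14, 18, 24, 42], "heights": [165, 180, 176, 184]},
    index=["James", "Bob", "Amy", "Dave"],
)

james_age = df.at["James", "ages"]  # by row label and column label
same_value = df.iat[0, 0]           # by row position and column position

print(james_age, same_value)        # 14 14
```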

Let's say we want to pull out a group of values, for instance the ages and heights of James and Bob.
We can use the .loc[] or .iloc[] method. The syntax for .loc[] is .loc[row_select, column_select].
But instead of a single label for each, we can pass either:
1. Label/List of labels:
2. Logical boolean:
3. Slice:
The syntax for the .iloc[] method is .iloc[row_select, column_select].
Here, we can pass either:
1. List:
2. Slice:
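The options listed above can be sketched as follows, on made-up data. One difference worth noting: a label slice with .loc includes its end label, while an integer slice with .iloc excludes its end position.

```python
import pandas as pd

df = pd.DataFrame(
    {"ages": [14, 18, 24, 42], "heights": [165, 180, 176, 184]},
    index=["James", "Bob", "Amy", "Dave"],
)

by_labels = df.loc[["James", "Bob"], ["ages", "heights"]]  # lists of labels
by_mask = df.loc[df["ages"] > 18, "heights"]               # logical boolean
by_slice = df.loc["James":"Bob", :]                        # label slice, end included

by_positions = df.iloc[[0, 1], [0, 1]]                     # lists of positions
by_int_slice = df.iloc[0:2, :]                             # integer slice, end excluded

print(by_labels)
print(by_mask)
```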
CONDITIONS
We can also select the data based on a condition. For example, let's select all rows where age is greater than 18 and
height is greater than 180:
Similarly, the or | operator can be used to combine conditions.
The .isin() method can also be used, passing the needed values as a list.
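A sketch of the three kinds of condition on made-up data. The column names ages and heights follow the earlier example, and each condition is wrapped in parentheses before combining.

```python
import pandas as pd

df = pd.DataFrame(
    {"ages": [14, 18, 24, 42], "heights": [165, 180, 176, 184]},
    index=["James", "Bob", "Amy", "Dave"],
)

both = df[(df["ages"] > 18) & (df["heights"] > 180)]    # and
either = df[(df["ages"] > 18) | (df["heights"] > 180)]  # or
listed = df[df["ages"].isin([14, 42])]                  # value is in the list

print(both)
print(either)
print(listed)
```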

READING DATA
It is quite common for data to come in a file format. One of the most popular formats is the CSV (comma-
separated values). Pandas supports reading data from a CSV file directly into a DataFrame.
The read_csv() function reads the data of a CSV file into a DataFrame. We need to provide the file path to the
read_csv() function:
Pandas also supports reading from Excel, HTML, JSON files, as well as SQL databases.
Once we have the data in a DataFrame, we can start exploring it. We can get the first rows of the data using the
head() function of the DataFrame:
By default it returns the first 5 rows. You can instruct it to return the number of rows you would like as an
argument. For example, df.head(8) will return the first 8 rows.
Similarly, you can get the last rows using the tail() function.
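A self-contained sketch of read_csv(), head() and tail(). To avoid depending on a file, the CSV text is held in a string and wrapped in io.StringIO; with a real file you would pass its path instead. The rows are placeholders, and only the column names come from this tutorial.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk (placeholder rows)
csv_text = """S/N,Name,Net worth,Source
1,Person A,100,Source A
2,Person B,90,Source B
3,Person C,80,Source C
"""

df = pd.read_csv(io.StringIO(csv_text))

first_two = df.head(2)  # first 2 rows
last_one = df.tail(1)   # last row
print(first_two)
print(last_one)
```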

The info() function is used to get essential information about your dataset, such as number of rows, columns, data
types, etc:
From the result, we can see that our dataset contains 10 rows and 4 columns: S/N, Name, Net worth and Source.
We also see that Pandas has added an auto generated index.
We can set our own index column by using the set_index() function:
The S/N column is a good choice for our index, as there is one row for each S/N.
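A brief sketch of info() and set_index() on placeholder rows. Note that set_index() returns a new DataFrame unless the result is assigned back.

```python
import pandas as pd

# Placeholder rows using the column names mentioned above
df = pd.DataFrame({
    "S/N": [1, 2, 3],
    "Name": ["Person A", "Person B", "Person C"],
    "Net worth": [100, 90, 80],
    "Source": ["Source A", "Source B", "Source C"],
})

df.info()                 # prints row count, columns and datatypes
df = df.set_index("S/N")  # use the S/N column as the index
print(df)
```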

DROPPING COLUMNS
We can remove a particular row/column from a dataset using the .drop() method.
drop()     - deletes rows and columns
axis=1     - specifies that we want to drop a column
axis=0     - will drop a row
inplace    - saves the change to the same DataFrame
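A sketch of the drop() options on made-up data:

```python
import pandas as pd

df = pd.DataFrame(
    {"ages": [14, 18, 24, 42], "heights": [165, 180, 176, 184]},
    index=["James", "Bob", "Amy", "Dave"],
)

without_heights = df.drop("heights", axis=1)  # new DataFrame without the column
without_bob = df.drop("Bob", axis=0)          # new DataFrame without the row

df.drop("heights", axis=1, inplace=True)      # changes df itself
print(df)
```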
CREATING COLUMNS
Pandas allows us to create our own columns. For example, we can add an age column to the dataset.
The length of the list passed must be the same as the number of rows in the dataset.
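A minimal sketch of adding a column, using placeholder names and ages:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Person A", "Person B", "Person C"]})

# One value per row: 3 rows, so the list has 3 items
df["age"] = [50, 61, 47]
print(df)
```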

WORKSHOP I – Analyzing California’s Covid-19 dataset
What can you deduce from this data?
Now, we can add a month column based on the date column. We do this by converting the date column to
datetime and extracting the month name from it, assigning the value to our new month column.
Our date is in DD.MM.YY format, which is why we need to specify the format argument.
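The conversion can be sketched as below. The rows are made up and the column names date, cases and deaths are assumed from the surrounding text; the real workshop file is not reproduced here.

```python
import pandas as pd

# Made-up rows in the assumed layout of the workshop data
df = pd.DataFrame({
    "date": ["25.01.20", "26.01.20", "01.02.20"],
    "cases": [1, 2, 3],
    "deaths": [0, 0, 1],
})

# %d.%m.%y matches the DD.MM.YY layout
df["date"] = pd.to_datetime(df["date"], format="%d.%m.%y")
df["month"] = df["date"].dt.month_name()
print(df)
```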

Now that our dataset is clean and set up, we are ready to look into some stats!
The describe() function returns the summary statistics for all the numeric columns:
This function will show the main statistics for the numeric columns, such as std, mean, min, max values, etc.
From the result, we see that the maximum number of cases recorded in a day is 64987, while the average
daily number of new cases is 6748.
We can also get the summary stats for a single column, for example: df['cases'].describe()
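A sketch of describe() on made-up numbers (not the California figures):

```python
import pandas as pd

df = pd.DataFrame({
    "cases": [10, 20, 30, 40],
    "deaths": [1, 2, 3, 4],
})

stats = df.describe()          # count, mean, std, min, quartiles, max
print(stats)
print(df["cases"].describe())  # the same summary for a single column
```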
GROUPING
Since we have a month column, we can see how many values each month has, by using the value_counts()
function:
value_counts() returns how many times a value appears in the dataset, also called the frequency of the values.
We can see that, for example, January has only 7 records, while the other months have data for all days.
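A small sketch of value_counts() on a made-up month column:

```python
import pandas as pd

df = pd.DataFrame({
    "month": ["January", "January", "February", "February", "February"],
})

counts = df["month"].value_counts()  # frequency of each value
print(counts)
```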

Now we can calculate data insights!
For example, let's determine the number of total infections in each month. To do this, we need to group our data
by the month column and then calculate the sum of the cases column for each month:
The groupby() function is used to group our dataset by the given column.
We can also calculate the number of total cases in the entire year:
We can see that California had 2,307,769 infection cases in 2020.
Similarly, we can use min(), max(), mean(), etc. to find the corresponding values for each group.
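The grouping steps can be sketched as follows, with made-up numbers rather than the California figures:

```python
import pandas as pd

df = pd.DataFrame({
    "month": ["January", "January", "February", "February"],
    "cases": [1, 2, 3, 4],
})

monthly_total = df.groupby("month")["cases"].sum()    # total per month
monthly_largest = df.groupby("month")["cases"].max()  # largest day per month
overall_total = df["cases"].sum()                     # total for all rows

print(monthly_total)
print(overall_total)  # 10
```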
TASK III: Using the COVID dataset for California (2020), find the
day when the deaths/case ratio was highest

MULTI-INDEX / HIERARCHICAL INDEXING
We'll look at what a pandas multi-index object is using a dummy stock Excel file. It includes a column for stock
tickers, another for the month and a last column for the values of the stocks.
FOR EXAMPLE:
We see we get an ordinary DataFrame.
A pandas multi-index object allows us to take a DataFrame that has a higher number of dimensions and reduce
those dimensions down into a lower-dimensional structure.
So far, we have only seen DataFrames indexed with 1 column; by setting the index to 2 columns we create a
multi-index object.
Instead of having something very similar to the Excel spreadsheet, we are returned a
DataFrame that has a multi-index.
Looking at the result, the duplicated entries in the stock column have been dropped, and we can
distinguish each stock very clearly for each month from January to May.
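Since the Excel file is not available, the sketch below builds a small stand-in DataFrame. The column names stock, month and value and all numbers are assumptions made for illustration.

```python
import pandas as pd

# Made-up stand-in for the dummy stock spreadsheet
df = pd.DataFrame({
    "stock": ["MSFT", "MSFT", "WM", "WM"],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "value": [10, 11, 20, 21],
})
print(df)  # ordinary DataFrame with a numeric index

# Two index columns create a MultiIndex
indexed = df.set_index(["stock", "month"])
print(indexed)
print(indexed.index.nlevels)  # 2
```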

A multi-index object can be thought of as an array of tuples, and in our case the tuples of MSFT & Jan, MSFT & Feb
and so on.
We are able to access the data more efficiently and we have a cleaner DataFrame in the terminal.
Let's say we want to access all the values for the stock WM (Waste Management):
We can see that we pulled the values out successfully. Without the multi-index we'd have to go into the
DataFrame manually and pull out all the WMs ourselves. A multi-index helps us to better organise our data so we
can pull it more effectively.
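A sketch of pulling one stock from a multi-indexed DataFrame, using the same made-up stand-in data and assumed column names:

```python
import pandas as pd

df = pd.DataFrame({
    "stock": ["MSFT", "MSFT", "WM", "WM"],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "value": [10, 11, 20, 21],
})
indexed = df.set_index(["stock", "month"])

# Selecting on the first index level returns every month for that stock
wm = indexed.loc["WM"]
print(wm)
```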
TASK IV: Pull out the stock of Microsoft for the month of
January
Let's talk about one more way to pull data from a multi-indexed DataFrame.
Let's say we want to return the values for each stock in the month of January. We would use the pandas
IndexSlice object (pd.IndexSlice, used with square brackets) to do this:
We don't need to master multi-indexing, as we can achieve the same with plain DataFrames. The idea is that it makes
it easier to pull certain data belonging to certain groups.
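A sketch of pd.IndexSlice on the made-up stand-in data. Sorting the index first is a precaution added here, since slicing a MultiIndex generally works best on a sorted index.

```python
import pandas as pd

df = pd.DataFrame({
    "stock": ["MSFT", "MSFT", "WM", "WM"],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "value": [10, 11, 20, 21],
})
indexed = df.set_index(["stock", "month"]).sort_index()

idx = pd.IndexSlice
# ":" keeps every stock; "Jan" picks one month on the second level
january = indexed.loc[idx[:, "Jan"], :]
print(january)
```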

CONCATENATION
We use the concat() function to join 2 Series objects or 2 DataFrames together.
The keys keyword can be used to specify labels that identify the rows coming from each DataFrame. This
will create a multi-indexed DataFrame.
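A minimal sketch of concat() with and without keys. The frames and the labels first and second are illustrative.

```python
import pandas as pd

a = pd.DataFrame({"value": [1, 2]})
b = pd.DataFrame({"value": [3, 4]})

joined = pd.concat([a, b])                              # rows of b placed under rows of a
labelled = pd.concat([a, b], keys=["first", "second"])  # adds an outer index level

print(joined)
print(labelled.loc["second"])  # only the rows that came from b
```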
Other methods available in pandas include: .merge(), .stack(), .unstack(), .pivot_table(), .duplicated(), .map(),
.rename(), .cut(), .agg(), .reindex(), .replace(), .rank(), .crosstab(), .idxmax(), .idxmin(), .sort_values(), etc.
WORKSHOP II
Rolling windows calculation with pandas

Matplotlib
Matplotlib is a library used for visualization, basically to create graphs, charts, and
figures. It also provides functions to customize your figures by changing the
colors, labels, etc.
To start using matplotlib, we need to import it first: import matplotlib.pyplot as plt
pyplot is the module we will be using to create our plots.
plt is a common name used for importing this module.
Matplotlib works really well with Pandas! To demonstrate the power of matplotlib, let's create a chart from
dummy data. We will create a pandas Series with some numbers and use it to create our chart:
The .plot() function is used to create a plot from the data in a Pandas Series or DataFrame.
The argument "kind" defines the type of chart to plot. If it is not specified, a line graph is drawn.
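A sketch of plotting a Series as a bar chart. The numbers are dummy data, and the Agg backend line is an assumption added so the sketch also runs where no display is available.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: draw off-screen so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

s = pd.Series([18, 42, 9, 32, 81, 64, 3])  # dummy numbers

fig, ax = plt.subplots()
s.plot(kind="bar", ax=ax)  # one bar per value, index on the X axis
plt.savefig("plot.png")    # write the chart to an image file
```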

The data from the series uses the Y axis, while the index is plotted on the X axis.
As we have not provided a custom index for our data, the default numeric index is
used.
plt.savefig('plot.png') saves the chart to an image file; in some online code playgrounds this is also what makes the
chart appear in the output.
In notebook environments this step is not needed, as calling the plot() function automatically
displays the chart. In a script, the show() function displays it.
LINE PLOT
Matplotlib supports the creation of different chart types. Let's start with the most basic one – a line chart.
We will use the COVID-19 data from the pandas section to create our charts.
For example: Let's show the number of cases in the month of December.
To create a line chart we simply need to call the plot() function on our DataFrame, which contains the
corresponding data:
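A sketch of a line chart for one month. The rows are made up and the columns month and cases are assumed to look like those prepared in Workshop I.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: draw off-screen so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows standing in for the COVID-19 data
df = pd.DataFrame({
    "month": ["November", "December", "December", "December"],
    "cases": [5, 8, 6, 9],
})

december = df[df["month"] == "December"]

fig, ax = plt.subplots()
december["cases"].plot(ax=ax)  # kind not given, so a line chart is drawn
```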
Here is the result:

We can also include multiple lines in our chart.
For example, let's also include the deaths column in our DataFrame:
The plot() function can also take a "grid" argument with value True or False to display grids in the chart.
As you can see from the result, matplotlib automatically added a legend to show the colours of the lines for the columns.
BAR PLOT
The plot() function can take a "kind" argument, specifying the type of the plot we want to produce. For bar plots,
provide kind="bar".
For example, let's make a bar plot for the monthly infection cases:
We first group the data by the month column, then calculate the
sum of the cases in that month.
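The multi-line chart and the monthly bar chart described above can be sketched as follows, on made-up rows with assumed column names. Each chart gets its own axes so they do not draw over each other.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: draw off-screen so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows standing in for the COVID-19 data
df = pd.DataFrame({
    "month": ["November", "November", "December", "December"],
    "cases": [5, 7, 8, 9],
    "deaths": [1, 2, 2, 3],
})

# Two lines (cases and deaths) with a grid; a legend is added automatically
december = df[df["month"] == "December"]
fig_lines, ax_lines = plt.subplots()
december[["cases", "deaths"]].plot(grid=True, ax=ax_lines)

# Group by month, sum the cases, then draw one bar per month
monthly_cases = df.groupby("month")["cases"].sum()
fig_bar, ax_bar = plt.subplots()
monthly_cases.plot(kind="bar", ax=ax_bar)
```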

We can also plot multiple columns. The stacked property can be used to specify if the bars should be stacked on top
of each other.
For example:
kind="barh" can be used to create a horizontal bar chart.
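A sketch of stacked and horizontal bar charts on made-up monthly totals:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: draw off-screen so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up monthly totals
monthly = pd.DataFrame(
    {"cases": [12, 17], "deaths": [3, 5]},
    index=["November", "December"],
)

fig, ax = plt.subplots()
monthly.plot(kind="bar", stacked=True, ax=ax)  # deaths stacked on top of cases

fig_h, ax_h = plt.subplots()
monthly.plot(kind="barh", ax=ax_h)             # horizontal bars, side by side
```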
BOX PLOT
A box plot is used to visualize the distribution of values in a column, basically visualizing the result of the describe()
function.
For example, let's create a box plot for the cases in June:
• The green line shows the median value.
• The box shows the upper and lower quartiles (25% of the data is greater or less than these values).
• The circles show the outliers, while the black lines show the min/max values excluding the outliers.
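A sketch of a box plot for one month, using made-up daily values in which 60 is deliberately far from the rest:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: draw off-screen so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up daily values for June
df = pd.DataFrame({
    "month": ["June"] * 8,
    "cases": [4, 5, 5, 6, 6, 7, 8, 60],
})

june = df[df["month"] == "June"]

fig, ax = plt.subplots()
june["cases"].plot(kind="box", ax=ax)
```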

HISTOGRAM
Similar to box plots, histograms show the distribution of data. Visually histograms are similar to bar charts; however,
histograms display frequencies for a group of data rather than an individual data point, and therefore no spaces are
present between the bars. Typically, a histogram groups data into chunks (or bins).
For example:
The histogram grouped the data into 9 bins and shows their frequency. You
can see that, for example, only single data points are greater than 6000.
You can manually specify the number of bins to use with the bins argument: e.g. plot(kind="hist", bins = 10)
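A sketch of a histogram with an explicit number of bins, on made-up values:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: draw off-screen so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

cases = pd.Series([4, 5, 5, 6, 6, 7, 8, 12, 15, 30])  # made-up daily values

fig, ax = plt.subplots()
cases.plot(kind="hist", bins=10, ax=ax)  # 10 equal-width bins
```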
AREA PLOT
kind='area' creates an Area plot:
Area plots are stacked by default, which is why we provided
stacked=False explicitly.
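A sketch of an unstacked area plot on made-up values with assumed column names:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: draw off-screen so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up daily values
june = pd.DataFrame({
    "cases": [4, 6, 5, 8],
    "deaths": [1, 2, 1, 3],
})

fig, ax = plt.subplots()
# stacked=False draws each column from zero instead of on top of the other
june[["cases", "deaths"]].plot(kind="area", stacked=False, ax=ax)
```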

SCATTER PLOT
A scatter plot is used to show the relationship between two variables.
For example, we can visualize how the cases/deaths are related. We need to specify the x and y columns to be
used for the plot.
The plot contains 30 points since we used the data for each day in June.
The data points look "scattered" around the graph, giving this type of data visualization its name.
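A sketch of a scatter plot on made-up values, with one point per row:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: draw off-screen so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up daily values
june = pd.DataFrame({
    "cases": [4, 6, 5, 8, 9],
    "deaths": [1, 2, 1, 3, 3],
})

fig, ax = plt.subplots()
june.plot(kind="scatter", x="cases", y="deaths", ax=ax)
```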
PIE CHART
Pie charts are generally used to show percentage or proportional data. We can create a pie chart using kind="pie".
Let's create one for cases by month:
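A sketch of a pie chart of cases by month, on made-up rows:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: draw off-screen so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows
df = pd.DataFrame({
    "month": ["October", "November", "November", "December"],
    "cases": [5, 4, 8, 9],
})

monthly_cases = df.groupby("month")["cases"].sum()

fig, ax = plt.subplots()
monthly_cases.plot(kind="pie", ax=ax)  # one slice per month
```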

PLOT FORMATTING
Matplotlib provides a number of arguments to customize your plot. The legend argument specifies whether or not
to show the legend. You can also change the labels of the axes by setting the xlabel and ylabel arguments:
By default, pandas selects the index name as xlabel, while leaving it empty for ylabel.
The suptitle() function can be used to set a plot title:
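A sketch of these formatting options on made-up values. The label and title texts are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: draw off-screen so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up daily values
june = pd.DataFrame({
    "cases": [4, 6, 5, 8],
    "deaths": [1, 2, 1, 3],
})

fig, ax = plt.subplots()
june.plot(legend=True, xlabel="Days in June", ylabel="Number", ax=ax)
plt.suptitle("COVID-19 in June")  # title for the whole figure
```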

We can also change the colors used in the plot by setting the color argument. It accepts a list of colors, such as hex codes.
For example, let's set the cases to blue and the deaths to red:
These arguments work for almost all chart types.
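A sketch of setting line colors with hex codes, on made-up values. The two hex values are simply pure blue and pure red.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: draw off-screen so no display is needed
import matplotlib.pyplot as plt
from matplotlib.colors import to_hex
import pandas as pd

# Made-up daily values
june = pd.DataFrame({
    "cases": [4, 6, 5, 8],
    "deaths": [1, 2, 1, 3],
})

fig, ax = plt.subplots()
# Colors are applied in column order: cases blue, deaths red
june.plot(color=["#0000FF", "#FF0000"], ax=ax)

print(to_hex(ax.lines[0].get_color()))  # #0000ff
```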
Other methods available in matplotlib.pyplot include: .colormaps(), .figure(), .pause(), .polar(), .minorticks_on(),
.minorticks_off(), .twinx(), .twiny(), .tripcolor(), .xcorr(), .xlim(), .xscale(), .xticks(), .ylim(), .yscale(), .yticks() etc.
© 2022
Feedback: mailemmydee@gmail.com

