3. Classical data analysis
The Bayesian approach incorporates prior probability distribution
knowledge into the analysis steps:
Probability distribution:
The probability distribution is an important concept in statistics.
It is widely applied in business, engineering, medicine, and other major
sectors.
It is mainly used to make predictions about a random experiment based on a
sample.
For example, in business it is used to predict whether a new strategy will bring
the company a profit or a loss, and in the medical field it is used to test
hypotheses.
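As a rough illustration of predicting from a sample, the sketch below estimates the probability of a profitable month from past outcomes. The sample values and the names outcomes, p_profit, and p_loss are assumptions made up for this example; they do not come from real data.

```python
import numpy as np

# Hypothetical sample: 1 = profitable month, 0 = loss-making month
outcomes = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])

# The relative frequency of 1s is a simple estimate of P(profit)
p_profit = outcomes.mean()
p_loss = 1 - p_profit

print("Estimated probability of profit:", p_profit)
print("Estimated probability of loss:", p_loss)
```

With such a small sample the estimate is only indicative; a larger sample would give a more reliable picture.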
4. Data analysts and data scientists freely mix the steps mentioned in the
preceding approaches to get meaningful insights from the data.
In addition, it is inherently difficult to judge or estimate which
model is best for data analysis.
Each of them has its own paradigm and is suitable for different types of
data analysis.
5. Software tools available for EDA
There are several software tools that are available to facilitate EDA. Here, we are going to
outline some of the open-source tools:
1. Python: This is an open-source programming language widely used in data analysis, data
mining, and data science.
2. R programming language: R is an open-source programming language that is widely utilized in
statistical computation and graphical data analysis.
3. Weka: This is an open-source data mining package that includes several EDA tools and
algorithms.
4. KNIME: This is an open-source tool for data analysis and is based on Eclipse.
7. NumPy - Basics
NumPy stands for Numerical Python.
Travis Oliphant created the NumPy package in 2005.
What is NumPy?
NumPy is a module for Python that allows you to work
with multidimensional arrays and matrices.
It’s perfect for scientific or mathematical
calculations because it’s fast and efficient.
Why NumPy?
NumPy provides a convenient and efficient way to handle vast
amounts of data.
NumPy is also very convenient for matrix multiplication and data
reshaping.
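A minimal sketch of matrix multiplication and reshaping, assuming two small 2x2 matrices chosen only for illustration:

```python
import numpy as np

# Two small example matrices (values are illustrative)
a = np.array([[1, 2],
              [3, 4]])
b = np.array([[5, 6],
              [7, 8]])

# Matrix multiplication with the @ operator (equivalent to np.matmul)
product = a @ b
print("Matrix product:\n", product)

# Reshape the 2x2 result into a 1-D array of 4 elements
flat = product.reshape(4)
print("Reshaped to 1-D:", flat)
```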
9. Example
import numpy as np
# Creating array object
arr = np.array([[1, 2, 3],
                [4, 2, 5]])
# Printing type of arr object
print("Array is of type: ", type(arr))
# Printing array dimensions (axes)
print("No. of dimensions: ", arr.ndim)
# Printing shape of array
print("Shape of array: ", arr.shape)
# Printing size (total number of elements) of array
print("Size of array: ", arr.size)
10. NumPy Array Creation
import numpy as np
# Creating array from list with type float
a = np.array([[1, 2, 4], [5, 8, 7]], dtype = 'float')
print("Array created using passed list:\n", a)
# Creating array from tuple
b = np.array((1, 3, 2))
print("\nArray created using passed tuple:\n", b)
11. NumPy Array Creation
import numpy as np
# Creating a 3x4 array with all zeros
c = np.zeros((3, 4))
print("An array initialized with all zeros:\n", c)
# Creating a 2x2 array with random values between 0 and 1
e = np.random.random((2, 2))
print("A random array:\n", e)
12. Arange Function in numpy
import numpy as np
# Create a sequence of integers
# from 0 up to (but not including) 30 with steps of 5
f = np.arange(0, 30, 5)
print("A sequential array with steps of 5:\n", f)
13. Reshape From 1-D to 2-D
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
newarr = arr.reshape(4, 3)
print(newarr)
14. Searching Arrays
Search using the where() method.
Find the indexes where the value is 4:
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 4, 4])
x = np.where(arr == 4)
print(x)
15. Data Distribution
A data distribution is a list of all possible values and how often each value occurs.
Such lists are important when working with statistics and data science.
NumPy's random module offers methods that return randomly generated data
distributions.
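One way to see a data distribution is to count how often each value occurs. The sketch below does this with np.unique; the array contents are made up for illustration.

```python
import numpy as np

# Illustrative data: each value is 3, 5, or 7
data = np.array([3, 5, 7, 7, 5, 7, 3, 7])

# Unique values and the number of times each one occurs
values, counts = np.unique(data, return_counts=True)

for value, count in zip(values, counts):
    print(value, "occurs", count, "times")
```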
16. Random Distribution
A random distribution is a set of random numbers.
It can be created using the choice() method of the random module.
The choice() method allows us to specify the probability for each value.
The probability is set by a number between 0 and 1, where
0 means that the value will never occur and
1 means that the value will always occur.
The probabilities of all values must sum to 1.
17. Example:
Generate a 1-D array containing 100 values, where each value has to be 3, 5, 7 or 9.
The probability for the value to be 3 is set to be 0.1
The probability for the value to be 5 is set to be 0.3
The probability for the value to be 7 is set to be 0.6
The probability for the value to be 9 is set to be 0
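The example described above could be written as follows. This is a sketch using np.random.choice, with the values and probabilities taken from the description; the variable name x is an arbitrary choice.

```python
import numpy as np

# 100 values drawn from 3, 5, 7, 9 with the stated probabilities.
# The probabilities must sum to 1; 9 has probability 0 and never appears.
x = np.random.choice([3, 5, 7, 9], p=[0.1, 0.3, 0.6, 0.0], size=100)

print(x)
```

Because the values are random, the printed array differs from run to run, but 7 should appear most often and 9 should never appear.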
19. What is Pandas?
Pandas is a Python library used for working with data sets.
It has functions for analyzing, cleaning, exploring, and manipulating data.
The name "Pandas" refers to both "Panel Data" and "Python Data Analysis".
The library was created by Wes McKinney in 2008.
What is a DataFrame?
A DataFrame is a data structure that organizes data into a 2-dimensional table of rows and
columns, much like a spreadsheet.
DataFrames are data structures used in modern data analytics because they are a flexible and
intuitive way of storing and working with data.
20. Why Use Pandas?
Pandas allows us to analyze big data and draw conclusions based
on statistical theories.
Pandas can clean messy data sets and make them readable and
relevant.
Relevant data is very important in data science.
21. What Can Pandas Do?
Pandas gives you answers about the data:
• Is there a correlation between two or more columns?
• What is the average value?
• Max value?
• Min value?
Pandas is also able to delete rows that are not relevant or that
contain wrong values, such as empty or NULL values. This is
called cleaning the data.
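The questions above can be answered with a few method calls. The sketch below assumes a small made-up data set; the column names Duration and Calories and all values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data set with one missing value (NaN)
df = pd.DataFrame({
    "Duration": [60, 45, 30, 60, 45],
    "Calories": [400.0, 300.0, np.nan, 420.0, 310.0],
})

# Cleaning: drop rows that contain empty (NaN) values
clean = df.dropna()

# Correlation between two columns
print("Correlation:", clean["Duration"].corr(clean["Calories"]))

# Average, max, and min of a column
print("Average:", clean["Calories"].mean())
print("Max:", clean["Calories"].max())
print("Min:", clean["Calories"].min())
```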
22. Create Labels
Example: Create your own labels
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a, index = ["x", "y", "z"])
print(myvar)
23. Pandas DataFrames
What is a DataFrame?
A Pandas DataFrame is a 2-dimensional
data structure, like a 2-dimensional array
or a table with rows and columns.
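A minimal sketch of creating a DataFrame from a dictionary of lists. The column names and values are illustrative assumptions.

```python
import pandas as pd

# Each key becomes a column; each list holds that column's values
data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
}

df = pd.DataFrame(data)

print(df)
```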
24. Pandas Read CSV and JSON
Read CSV Files
A simple way to store big data sets is to use CSV files.
CSV files contain plain text and are a well-known format that
can be read by almost any tool, including Pandas.
In our examples we will be using a CSV file called 'data.csv'.
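The contents of 'data.csv' are not shown here, so the sketch below reads a small in-memory CSV instead; its column names and values are assumptions for illustration. With a real file, pass the file name to pd.read_csv(), for example pd.read_csv('data.csv').

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = (
    "Duration,Pulse,Calories\n"
    "60,110,409.1\n"
    "60,117,479.0\n"
    "45,109,282.4\n"
)

# read_csv accepts a file path or a file-like object
df = pd.read_csv(io.StringIO(csv_text))

# to_string() prints the entire DataFrame
print(df.to_string())
```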
28. Pandas: Read JSON
Big data sets are often stored or extracted as JSON.
JSON is plain text, but has the format of an object, and
is well known in the world of programming, including
Pandas.
30. Load the JSON file into a DataFrame:
import pandas as pd
df = pd.read_json('data.json')
print(df.to_string())
31. Pandas - Analyzing DataFrames
Viewing the data
One of the most used methods for getting a quick overview of
the DataFrame is the head() method.
Get a quick overview by printing the first 10 rows of the
DataFrame:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head(10))