DATA SCIENCE PORTFOLIO
Jombaba.s7@gmail.com
I enjoy working with data, and this portfolio highlights one of the projects I have implemented using Python-based tools and programming platforms. The entire project is presented as five sub-projects, each capturing a specific part of the work.
Imagine that a financial institution / bank wishes to find a solution to a 'Customer Acquisition and Customer Retention' problem. As a data scientist, this is my attempt at providing a comprehensive solution, and the series of five projects illustrates a plausible approach to resolving the problem.
Data Cleaning Project
Purpose:
This is the first in a series of projects. The purpose of the data cleaning project is to clean up the 'Credit Card Application' dataset by handling missing values and other out-of-place characters. Initial examination of the dataset shows that there are 67 missing values to contend with.
The dataset comprises continuous and nominal attributes taking both small and large values. For reasons of privacy, the dataset was published with column labels A1–A16 replacing the actual descriptive labels.
Dataset:
The dataset is found at: https://archive.ics.uci.edu/ml/datasets/credit+approval
Number of instances (observations) = 690
Number of attributes = 15 (columns A1–A15)
There is one class attribute (column A16)
307 instances (44.5%) are classified "+" and 383 (55.5%) are classified "-".
Attribute Label    Value Type
A1                 Nominal
A2                 Continuous
A3                 Continuous
A4                 Nominal
A5                 Nominal
A6                 Nominal
A7                 Nominal
A8                 Continuous
A9                 Nominal
A10                Nominal
A11                Continuous (integer)
A12                Nominal
A13                Nominal
A14                Continuous (integer)
A15                Continuous (integer)
A16                Class attribute
Process:
Step 1:
The original dataset was in .txt format; below is an extract of it.
b,30.83,0,u,g,w,v,1.25,t,t,01,f,g,00202,0,+
a,58.67,4.46,u,g,q,h,3.04,t,t,06,f,g,00043,560,+
The .txt file was imported into Excel using the 'Import from Text' menu to obtain a .xls file, which looks like the following:
A1  A2     A3    A4  A5  A6  A7  A8    A9  A10  A11  A12  A13  A14  A15  A16
b   30.83  0     u   g   w   v   1.25  t   t    1    f    g    202  0    +
a   58.67  4.46  u   g   q   h   3.04  t   t    6    f    g    43   560  +
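(As an aside, the Excel conversion step can be skipped entirely: pandas can read the raw comma-separated text file directly. The following is only a minimal sketch of that alternative; the file name follows the UCI 'crx.data' convention, the path is illustrative, and '?' is assumed to be the missing-value marker as in the original file.)
import pandas as pd
cols = ['A' + str(i) for i in range(1, 17)]          # column labels A1 .. A16
df = pd.read_csv('C:/Users/Owner/Desktop/DATA/CAD/crx.data',
                 header=None, names=cols, na_values='?')
print(df.shape)                                       # expected: (690, 16)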
Step 2:
To prepare the required environment with the Python pandas and NumPy libraries, we issue the following commands.
import pandas as pd
import numpy as np
Step 3:
Next, to detect the character and distribution of the missing values, we use the following code.
missing_values = ["?"]
df=pd.read_csv('C:/Users/Owner/Desktop/DATA/CAD/ABC.csv', na_values=missing_values)
df.isnull().sum()
Output
A1 12
A2 12
A3 0
A4 6
A5 6
A6 9
A7 9
A8 0
A9 0
A10 0
A11 0
A12 0
A13 0
A14 13
A15 0
A16 0
dtype: int64
The total number of missing values is derived using the following code
df.isnull().sum().sum()
Output: 67
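To put these counts in perspective relative to the 690 observations, the percentage of missing values per column can also be computed. This is a minimal sketch of one way to do so (not part of the original workflow); the figures follow directly from the counts above (e.g. 12/690 ≈ 1.7% for A1, 13/690 ≈ 1.9% for A14).
missing_pct = df.isnull().sum() / len(df) * 100       # percentage of missing values per column
print(missing_pct[missing_pct > 0].round(2))          # only columns that actually have missing values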
To confirm whether or not the dataset was imported correctly, examine the first five (5) and the
last five (5) records of the dataset.
df.head(5)
Output:
A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A12 A13 A14 A15 A16
0 b 30.83 0.000 u g w v 1.25 t t 1 f g 202 0 +
1 a 58.67 4.460 u g q h 3.04 t t 6 f g 43 560 +
2 a 24.5 0.500 u g q h 1.50 t f 0 f g 280 824 +
3 b 27.83 1.540 u g w v 3.75 t t 5 t g 100 3 +
4 b 20.17 5.625 u g w v 1.71 t f 0 f s 120 0 +
df.tail(5)
Output:
A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A12 A13 A14 A15 A16
685 b 21.08 10.085 y p e h 1.25 f f 0 f g 260 0 -
686 a 22.67 0.750 u g c v 2.00 f t 2 t g 200 394 -
687 a 25.25 13.500 y p ff ff 2.00 f t 1 t g 200 1 -
688 b 17.92 0.205 u g aa v 0.04 f f 0 f g 280 750 -
689 b 35 3.375 u g c h 8.29 f f 0 t g 0 0 -
Next, we confirm that the entire dataset was captured and take a glimpse at the basic statistics of the numerical features using the following code.
df.shape
Output: (690, 16)
df.describe()
Output:
              A2          A3          A8         A11          A14            A15
count  678.000000  690.000000  690.000000   690.00000   677.000000     690.000000
mean    31.568171    4.758725    2.223406     2.40000   184.014771    1017.385507
std     11.957862    4.978163    3.346513     4.86294   173.806768    5210.102598
min     13.750000    0.000000    0.000000     0.00000     0.000000       0.000000
25%     22.602500    1.000000    0.165000     0.00000    75.000000       0.000000
50%     28.460000    2.750000    1.000000     0.00000   160.000000       5.000000
75%     38.230000    7.207500    2.625000     3.00000   276.000000     395.500000
max     80.250000   28.000000   28.500000    67.00000  2000.000000  100000.000000
Furthermore, we detect the unique values and their counts for all variables using the following code. We do this in order to determine the imputation inputs for the nominal columns.
df.apply(pd.Series.value_counts)
Output:
A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11
0.0 NaN NaN 19.0 NaN NaN NaN NaN 70.0 NaN NaN 395.0
0.04 NaN NaN 5.0 NaN NaN NaN NaN 33.0 NaN NaN NaN
0.08 NaN NaN 1.0 NaN NaN NaN NaN NaN NaN NaN NaN
0.085 NaN NaN 1.0 NaN NaN NaN NaN 26.0 NaN NaN NaN
0.125 NaN NaN 5.0 NaN NaN NaN NaN 30.0 NaN NaN NaN
0.165 NaN NaN 8.0 NaN NaN NaN NaN 22.0 NaN NaN NaN
0.17 NaN NaN 1.0 NaN NaN NaN NaN NaN NaN NaN NaN
0.205 NaN NaN 3.0 NaN NaN NaN NaN NaN NaN NaN NaN
0.21 NaN NaN 3.0 NaN NaN NaN NaN 6.0 NaN NaN NaN
0.25 NaN NaN 6.0 NaN NaN NaN NaN 35.0 NaN NaN NaN
0.29 NaN NaN 6.0 NaN NaN NaN NaN 12.0 NaN NaN NaN
0.335 NaN NaN 6.0 NaN NaN NaN NaN 5.0 NaN NaN NaN
0.375 NaN NaN 9.0 NaN NaN NaN NaN 7.0 NaN NaN NaN
0.415 NaN NaN 4.0 NaN NaN NaN NaN 8.0 NaN NaN NaN
0.42 NaN NaN 2.0 NaN NaN NaN NaN NaN NaN NaN NaN
0.455 NaN NaN NaN NaN NaN NaN NaN 1.0 NaN NaN NaN
0.46 NaN NaN 4.0 NaN NaN NaN NaN 1.0 NaN NaN NaN
0.5 NaN NaN 15.0 NaN NaN NaN NaN 28.0 NaN NaN NaN
0.54 NaN NaN 8.0 NaN NaN NaN NaN 6.0 NaN NaN NaN
0.58 NaN NaN 2.0 NaN NaN NaN NaN NaN NaN NaN NaN
0.585 NaN NaN 10.0 NaN NaN NaN NaN 6.0 NaN NaN NaN
0.625 NaN NaN 3.0 NaN NaN NaN NaN 2.0 NaN NaN NaN
0.665 NaN NaN 4.0 NaN NaN NaN NaN 6.0 NaN NaN NaN
0.67 NaN NaN 1.0 NaN NaN NaN NaN NaN NaN NaN NaN
0.705 NaN NaN 1.0 NaN NaN NaN NaN NaN NaN NaN NaN
0.71 NaN NaN 1.0 NaN NaN NaN NaN 1.0 NaN NaN NaN
0.75 NaN NaN 16.0 NaN NaN NaN NaN 12.0 NaN NaN NaN
0.79 NaN NaN 4.0 NaN NaN NaN NaN 1.0 NaN NaN NaN
0.795 NaN NaN NaN NaN NaN NaN NaN 1.0 NaN NaN NaN
0.83 NaN NaN 3.0 NaN NaN NaN NaN NaN NaN NaN NaN
... .. ... ... ... ... ... ... ... ... ...
b 468.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
bb NaN NaN NaN NaN NaN NaN 59.0 NaN NaN NaN NaN
c NaN NaN NaN NaN NaN 137.0 NaN NaN NaN NaN NaN
cc NaN NaN NaN NaN NaN 41.0 NaN NaN NaN NaN NaN
d NaN NaN NaN NaN NaN 30.0 NaN NaN NaN NaN NaN
dd NaN NaN NaN NaN NaN NaN 6.0 NaN NaN NaN NaN
e NaN NaN NaN NaN NaN 25.0 NaN NaN NaN NaN NaN
f NaN NaN NaN NaN NaN NaN NaN NaN 329.0 395.0 NaN
ff NaN NaN NaN NaN NaN 53.0 57.0 NaN NaN NaN NaN
g NaN NaN NaN NaN 519.0 NaN NaN NaN NaN NaN NaN
gg NaN NaN NaN NaN 2.0 NaN NaN NaN NaN NaN NaN
h NaN NaN NaN NaN NaN NaN 138.0 NaN NaN NaN NaN
i NaN NaN NaN NaN NaN 59.0 NaN NaN NaN NaN NaN
j NaN NaN NaN NaN NaN 10.0 8.0 NaN NaN NaN NaN
k NaN NaN NaN NaN NaN 51.0 NaN NaN NaN NaN NaN
l NaN NaN NaN 2.0 NaN NaN NaN NaN NaN NaN NaN
m NaN NaN NaN NaN NaN 38.0 NaN NaN NaN NaN NaN
n NaN NaN NaN NaN NaN NaN 4.0 NaN NaN NaN NaN
o NaN NaN NaN NaN NaN NaN 2.0 NaN NaN NaN NaN
p NaN NaN NaN NaN 163.0 NaN NaN NaN NaN NaN NaN
q NaN NaN NaN NaN NaN 78.0 NaN NaN NaN NaN NaN
r NaN NaN NaN NaN NaN 3.0 NaN NaN NaN NaN NaN
s NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
t NaN NaN NaN NaN NaN NaN NaN NaN 361.0 295.0 NaN
u NaN NaN NaN 519.0 NaN NaN NaN NaN NaN NaN NaN
v NaN NaN NaN NaN NaN NaN 399.0 NaN NaN NaN NaN
w NaN NaN NaN NaN NaN 64.0 NaN NaN NaN NaN NaN
x NaN NaN NaN NaN NaN 38.0 NaN NaN NaN NaN NaN
y NaN NaN NaN 163.0 NaN NaN NaN NaN NaN NaN NaN
z NaN NaN NaN NaN NaN NaN 8.0 NaN NaN NaN NaN
A12 A13 A14 A15 A16
0.0 NaN NaN 132.0 295.0 NaN
0.04 NaN NaN NaN NaN NaN
0.08 NaN NaN NaN NaN NaN
0.085 NaN NaN NaN NaN NaN
0.125 NaN NaN NaN NaN NaN
0.165 NaN NaN NaN NaN NaN
0.17 NaN NaN NaN NaN NaN
0.205 NaN NaN NaN NaN NaN
0.21 NaN NaN NaN NaN NaN
0.25 NaN NaN NaN NaN NaN
0.29 NaN NaN NaN NaN NaN
0.335 NaN NaN NaN NaN NaN
0.375 NaN NaN NaN NaN NaN
0.415 NaN NaN NaN NaN NaN
0.42 NaN NaN NaN NaN NaN
0.455 NaN NaN NaN NaN NaN
0.46 NaN NaN NaN NaN NaN
0.5 NaN NaN NaN NaN NaN
0.54 NaN NaN NaN NaN NaN
0.58 NaN NaN NaN NaN NaN
0.585 NaN NaN NaN NaN NaN
0.625 NaN NaN NaN NaN NaN
0.665 NaN NaN NaN NaN NaN
0.67 NaN NaN NaN NaN NaN
0.705 NaN NaN NaN NaN NaN
0.71 NaN NaN NaN NaN NaN
0.75 NaN NaN NaN NaN NaN
0.79 NaN NaN NaN NaN NaN
0.795 NaN NaN NaN NaN NaN
0.83 NaN NaN NaN NaN NaN
... ... ... ... ...
b NaN NaN NaN NaN NaN
bb NaN NaN NaN NaN NaN
c NaN NaN NaN NaN NaN
cc NaN NaN NaN NaN NaN
d NaN NaN NaN NaN NaN
dd NaN NaN NaN NaN NaN
e NaN NaN NaN NaN NaN
f 374.0 NaN NaN NaN NaN
ff NaN NaN NaN NaN NaN
g NaN 625.0 NaN NaN NaN
gg NaN NaN NaN NaN NaN
h NaN NaN NaN NaN NaN
i NaN NaN NaN NaN NaN
j NaN NaN NaN NaN NaN
k NaN NaN NaN NaN NaN
l NaN NaN NaN NaN NaN
m NaN NaN NaN NaN NaN
n NaN NaN NaN NaN NaN
o NaN NaN NaN NaN NaN
p NaN 8.0 NaN NaN NaN
q NaN NaN NaN NaN NaN
r NaN NaN NaN NaN NaN
s NaN 57.0 NaN NaN NaN
t 316.0 NaN NaN NaN NaN
u NaN NaN NaN NaN NaN
v NaN NaN NaN NaN NaN
w NaN NaN NaN NaN NaN
x NaN NaN NaN NaN NaN
y NaN NaN NaN NaN NaN
z NaN NaN NaN NaN NaN
[944 rows x 16 columns]
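Rather than scanning the full frequency table above, the most frequent category of each nominal column can also be read off programmatically. A brief sketch of one way to do this (not part of the original steps), restricted to the nominal columns that actually contain missing values:
for col in ['A1', 'A4', 'A5', 'A6', 'A7']:            # nominal columns with missing values
    print(col, df[col].mode()[0])                     # most frequent (modal) category
# expected: A1 -> b, A4 -> u, A5 -> g, A6 -> c, A7 -> v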
Step 4:
Using the results of Step 3 above, the mean will serve as the imputed value for missing entries in the numerical attributes, while the mode serves as the imputed value for the nominal columns.
Given that the ratio of missing values to the total number of observations is small for each affected column, between 0.8% and 1.9%, using the mean and the mode as imputed values is not expected to distort the dataset in any meaningful way.
The following code effects these imputations; the seven (7) variables involved are treated individually.
df['A1'].fillna('b', inplace=True)        # mode of A1
df['A2'].fillna(31.568, inplace=True)     # mean of A2
df['A4'].fillna('u', inplace=True)        # mode of A4
df['A5'].fillna('g', inplace=True)        # mode of A5
df['A6'].fillna('c', inplace=True)        # mode of A6
df['A7'].fillna('v', inplace=True)        # mode of A7
df['A14'].fillna(184, inplace=True)       # mean of A14, rounded to the nearest integer
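Equivalently, the imputed values can be computed programmatically rather than typed in by hand, which avoids copying numbers from the df.describe() output. This is only a sketch of that alternative under the same assumptions (mean for continuous columns, mode for nominal columns); note that it would impute A14 with the exact mean (≈184.01) rather than the rounded 184 used above.
for col in ['A2', 'A14']:                             # continuous columns with missing values
    df[col].fillna(df[col].mean(), inplace=True)      # impute with the column mean
for col in ['A1', 'A4', 'A5', 'A6', 'A7']:            # nominal columns with missing values
    df[col].fillna(df[col].mode()[0], inplace=True)   # impute with the most frequent value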
Step 5:
Next, we save the new, clean dataset. It is this missing-value-free dataset that we shall use in implementing the project (all five sub-projects) discussed in this portfolio.
df.to_csv('C:/Users/Owner/Desktop/DATA/CAD/ABC-1.csv')
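Before comparing individual rows, a quick sanity check (a sketch, not part of the original write-up) is to confirm that no missing values remain in the cleaned frame:
assert df.isnull().sum().sum() == 0                   # expect 0 missing values after imputation
As an optional tweak (not done here), passing index=False to df.to_csv would avoid the extra 'Unnamed: 0' index column that appears when the saved file is re-read below.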
To assess the clean-up exercise, we compare the contents of the old and new files for a specific row by using the following code.
df=pd.read_csv('C:/Users/Owner/Desktop/DATA/CAD/ABC.csv', na_values=missing_values)
df.iloc[248,:]
Output:
A1 NaN
A2 24.5
A3 12.75
A4 u
A5 g
A6 c
A7 bb
A8 4.75
A9 t
A10 t
A11 2
A12 f
A13 g
A14 73
A15 444
A16 +
Name: 248, dtype: object
df=pd.read_csv('C:/Users/Owner/Desktop/DATA/CAD/ABC-1.csv')
df.iloc[248,:]
Output:
Unnamed: 0 248
A1 b
A2 24.5
A3 12.75
A4 u
A5 g
A6 c
A7 bb
A8 4.75
A9 t
A10 t
A11 2
A12 f
A13 g
A14 73
A15 444
A16 +
Name: 248, dtype: object
From the results above, we observe that for element (row) number 248, the value of the feature labeled A1, which is NaN (missing) in the first file, has been imputed with the value 'b' in the second file. This is the expected result.
Summary and Conclusion:
The comparison of the files before and after the clean-up illustrates that the exercise was successfully implemented and that all 67 missing values were appropriately replaced. Thus, using Excel and the Python pandas library as our wrangling tools, we have been able to prepare the data for analysis. The improved quality of our data boosts the reliability of downstream analytical tools, thereby promoting plausible insights from the dataset.
For our purpose, we assume that this dataset characterizes the activities of a specific institution (e.g. a bank) over a given period, say a quarter. In a nutshell, the basic statistics (df.describe()) used in this sub-project capture a full description of what happened. Again, for our purpose this is satisfactory coverage of descriptive analytics.