Our final presentation on our project, Cleansing Big Data: we scraped over 70,000 notable deaths from Wikipedia to create a statistically analyzable dataset.
3. Background
The Client: Aleksi Aaltonen
The Data:
● User generated content from Wikipedia
● 70k+ celebrity, influencer, and politician deaths across the globe
● Key metrics: name, age, occupation, and nationality
● Range: 2004 to 2018
The Objective:
Develop a program that reads notable deaths extracted from Wikipedia and transforms that 'raw' dataset to match the 'ground truth' dataset as closely as possible.
4. Strategy
Coding Process: Research while coding; meetings with the client for consultations.
Examine Ground Truth Dataset: Determine the objectives and time range of the data.
Match Data: Visually compared our solution to the ground-truth dataset.
Decide on Python: Used Python with the Spyder IDE for editing, interactive testing, and debugging.
Evaluate Research Tools: Bioinformatic Studio study group; online research with w3schools.
5. Process
Consult with Client: Demonstrate the coding approach to the client and request feedback.
Reevaluate Code: Utilize client feedback, review the code, and implement changes.
Coding: Apply knowledge and research to extract the data.
Research: Leverage resources such as Stack Overflow and w3schools to better understand Python.
Use Excel: Clean the dataset with VLOOKUP and Filter to match it against the ground-truth dataset.
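The Excel matching step can also be sketched in Python: a hedged analogue of VLOOKUP that pairs scraped rows with ground-truth rows by name. The field names ("name", "age", "occupation") and sample records are hypothetical, not taken from the project's actual files.

```python
# Hypothetical Python analogue of the Excel VLOOKUP + Filter step:
# pair each scraped record with its ground-truth record by name.

def match_to_ground_truth(scraped, ground_truth):
    """Return (matched pairs, unmatched scraped rows), keyed by name."""
    truth_by_name = {row["name"]: row for row in ground_truth}
    matched, unmatched = [], []
    for row in scraped:
        if row["name"] in truth_by_name:
            matched.append((row, truth_by_name[row["name"]]))
        else:
            unmatched.append(row)   # rows needing manual review, like Filter
    return matched, unmatched

scraped = [{"name": "Ada Lovelace", "age": "36"},
           {"name": "Unknown Person", "age": "?"}]
truth = [{"name": "Ada Lovelace", "age": "36", "occupation": "mathematician"}]
matched, unmatched = match_to_ground_truth(scraped, truth)
```

Keying a dictionary on the lookup column mirrors what VLOOKUP does with its leftmost column, while the unmatched list plays the role of a Filter on non-matching rows.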
6. Demo: Importing Data
● Import packages and modules
● Variables correspond to the month and year
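The import step above might look like the following minimal sketch: packages are imported, and variables track the month and year of each Wikipedia "Deaths in ..." page being read. The file-name pattern is an assumption for illustration, not the project's actual naming.

```python
# Sketch of the import step: month/year variables drive which page is read.
import calendar

years = range(2004, 2019)          # dataset range: 2004 through 2018
months = calendar.month_name[1:]   # "January" .. "December"

# Hypothetical file name for each (month, year) notable-deaths page.
pages = [f"deaths_{month}_{year}.txt" for year in years for month in months]
```

Iterating over every (year, month) pair yields one input file per monthly page, 180 in total for the 15-year range.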
7. Demo: Concatenation
● Use NumPy to concatenate into one list
● A while loop to walk through the data
● A try statement to find names and titles
● An except clause to catch errors
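The bullets above can be sketched as follows: NumPy joins the per-month arrays into one list, a while loop walks the entries, and a try/except pair pulls out names and titles while catching malformed rows. The entry format ("Name, age, title") is an assumption for illustration.

```python
# Sketch of the concatenation step, under an assumed "Name, age, title" format.
import numpy as np

january = np.array(["Ada Lovelace, 36, mathematician"])
february = np.array(["badly formed entry"])
all_entries = np.concatenate([january, february])  # one combined list

names, errors = [], []
i = 0
while i < len(all_entries):            # while loop to find the data
    try:                               # try to split out name, age, title
        name, age, title = all_entries[i].split(", ")
        names.append((name, int(age), title))
    except ValueError:                 # except clause catches malformed rows
        errors.append(all_entries[i])
    i += 1
```

A row without exactly three comma-separated fields (or with a non-numeric age) raises `ValueError` at the unpack or `int()` call, so it lands in `errors` instead of crashing the loop.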
8. Demo: Exporting Data
● Find the third value in each record
● Get the first value in the strings
● An except clause to catch errors
● Use the csv module to write these variables into their respective columns
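The export step might be sketched as below: pull the third field from each record, take the first word of that string, skip rows missing those fields via an except clause, and write the result to CSV with the `csv` module. The record format and the column meanings (nationality as the first word of the occupation field) are assumptions based on the slide bullets.

```python
# Sketch of the export step, under an assumed "name, age, description" format.
import csv

records = ["Ada Lovelace, 36, English mathematician", "incomplete row"]
rows = []
for record in records:
    try:
        parts = record.split(", ")
        description = parts[2]                 # the third value in the record
        nationality = description.split()[0]   # the first word of that string
        rows.append([parts[0], parts[1], nationality, description])
    except IndexError:                         # record missing expected fields
        continue

# Write each variable into its respective column of the output CSV.
with open("notable_deaths.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "age", "nationality", "description"])
    writer.writerows(rows)
```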