Our final presentation on our project, Cleansing Big Data: we scraped over 70,000 notable deaths from Wikipedia to create a statistically analyzable dataset.
3. Background
The Client: Aleksi Aaltonen
The Data:
● User generated content from Wikipedia
● 70k+ celebrity, influencer, and politician deaths across the globe
● Key metrics: name, age, occupation, and nationality
● Range: 2004 to 2018
The Objective:
Develop a program that reads notable deaths extracted from Wikipedia and transforms that 'raw' dataset to match the 'ground truth' dataset as closely as possible.
4. Strategy
Coding Process: Research while coding; meetings with the client for consultations.
Examine Ground Truth Dataset: Determine the objectives and time range of the data.
Match Data: Visually compared our solution to the ground-truth dataset.
Decide on Python: Used Python with the Spyder IDE for editing, interactive testing, and debugging.
Evaluate Research Tools: Bioinformatic Studio study group; online research with w3schools.
5. Process
Consult with Client: Demonstrate the coding approach to the client and request feedback.
Reevaluate Code: Utilize client feedback, review the code, and implement changes.
Coding: Apply knowledge and research to extract the data.
Research: Leverage resources such as Stack Overflow and w3schools to better understand Python.
Use Excel: Clean the dataset with VLOOKUP and Filter to match it against the ground-truth dataset.
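The Excel matching step can also be sketched in Python: a hedged analogue of VLOOKUP that pairs scraped rows with ground-truth rows by name. The field names ("name", "age", "occupation") and sample records are hypothetical, not taken from the project's actual files.

```python
# Hypothetical Python analogue of the Excel VLOOKUP + Filter step:
# pair each scraped record with its ground-truth record by name.

def match_to_ground_truth(scraped, ground_truth):
    """Return (matched pairs, unmatched scraped rows), keyed by name."""
    truth_by_name = {row["name"]: row for row in ground_truth}
    matched, unmatched = [], []
    for row in scraped:
        if row["name"] in truth_by_name:
            matched.append((row, truth_by_name[row["name"]]))
        else:
            unmatched.append(row)   # rows needing manual review, like Filter
    return matched, unmatched

scraped = [{"name": "Ada Lovelace", "age": "36"},
           {"name": "Unknown Person", "age": "?"}]
truth = [{"name": "Ada Lovelace", "age": "36", "occupation": "mathematician"}]
matched, unmatched = match_to_ground_truth(scraped, truth)
```

Keying a dictionary on the lookup column mirrors what VLOOKUP does with its leftmost column, while the unmatched list plays the role of a Filter on non-matching rows.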
6. Demo: Importing Data
● Import packages and modules
● Variables correspond to the month and year
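The import step above might look like the following minimal sketch: packages are imported, and variables track the month and year of each Wikipedia "Deaths in ..." page being read. The file-name pattern is an assumption for illustration, not the project's actual naming.

```python
# Sketch of the import step: month/year variables drive which page is read.
import calendar

years = range(2004, 2019)          # dataset range: 2004 through 2018
months = calendar.month_name[1:]   # "January" .. "December"

# Hypothetical file name for each (month, year) notable-deaths page.
pages = [f"deaths_{month}_{year}.txt" for year in years for month in months]
```

Iterating over every (year, month) pair yields one input file per monthly page, 180 in total for the 15-year range.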
7. Demo: Concatenation
● Use NumPy to concatenate into one list
● A while loop to walk through the data
● A try statement to find names and titles
● An except clause to catch errors
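The bullets above can be sketched as follows: NumPy joins the per-month arrays into one list, a while loop walks the entries, and a try/except pair pulls out names and titles while catching malformed rows. The entry format ("Name, age, title") is an assumption for illustration.

```python
# Sketch of the concatenation step, under an assumed "Name, age, title" format.
import numpy as np

january = np.array(["Ada Lovelace, 36, mathematician"])
february = np.array(["badly formed entry"])
all_entries = np.concatenate([january, february])  # one combined list

names, errors = [], []
i = 0
while i < len(all_entries):            # while loop to find the data
    try:                               # try to split out name, age, title
        name, age, title = all_entries[i].split(", ")
        names.append((name, int(age), title))
    except ValueError:                 # except clause catches malformed rows
        errors.append(all_entries[i])
    i += 1
```

A row without exactly three comma-separated fields (or with a non-numeric age) raises `ValueError` at the unpack or `int()` call, so it lands in `errors` instead of crashing the loop.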
8. Demo: Exporting Data
● Find the third value in each record
● Get the first value in the strings
● An except clause to catch errors
● Use the csv module to write these variables into their respective columns
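The export step might be sketched as below: pull the third field from each record, take the first word of that string, skip rows missing those fields via an except clause, and write the result to CSV with the `csv` module. The record format and the column meanings (nationality as the first word of the occupation field) are assumptions based on the slide bullets.

```python
# Sketch of the export step, under an assumed "name, age, description" format.
import csv

records = ["Ada Lovelace, 36, English mathematician", "incomplete row"]
rows = []
for record in records:
    try:
        parts = record.split(", ")
        description = parts[2]                 # the third value in the record
        nationality = description.split()[0]   # the first word of that string
        rows.append([parts[0], parts[1], nationality, description])
    except IndexError:                         # record missing expected fields
        continue

# Write each variable into its respective column of the output CSV.
with open("notable_deaths.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "age", "nationality", "description"])
    writer.writerows(rows)
```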