Daniel Burseth 
Co-president MIT Big Data Explorers 
dburseth@mit.edu 
@dmbnyc 
GitHub: dburseth
• Acronyms abound
• Tremendous complexity
• Use building blocks, not code
• This is easy
EPPM of 10 requires 500 professionals
• http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html?emc=eta1&_r=0
Data preparation and cleansing: 
• Missing 
• Duplicative 
• Conventions (dates, time, geographies)
• Spacing
• Can we measure data cleanliness?
• What’s our Pareto point?
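One rough way to put a number on cleanliness, for readers who do want a script: count how many rows show each of the problems listed above. This is only a sketch. The function name `cleanliness_report`, the column names, and the sample rows are all made up for illustration.

```python
import csv
import io

def cleanliness_report(rows, key_field):
    """Return the share of rows affected by each issue (0.0 to 1.0)."""
    total = len(rows)
    if total == 0:
        return {"missing": 0.0, "duplicate": 0.0, "spacing": 0.0}

    # Missing: at least one empty cell in the row.
    missing = sum(
        1 for r in rows if any((v or "").strip() == "" for v in r.values())
    )

    # Duplicative: the key field repeats an earlier row (case-insensitive).
    seen, duplicate = set(), 0
    for r in rows:
        key = (r.get(key_field) or "").strip().lower()
        if key in seen:
            duplicate += 1
        seen.add(key)

    # Spacing: leading, trailing, or doubled whitespace in any cell.
    spacing = sum(
        1
        for r in rows
        if any(v is not None and v != " ".join(v.split()) for v in r.values())
    )

    return {
        "missing": missing / total,
        "duplicate": duplicate / total,
        "spacing": spacing / total,
    }

# Hypothetical tab-separated sample, not taken from the workshop file.
SAMPLE = (
    "title\tprice\tcity\n"
    "Sunny 2BR\t$2,400\tBoston\n"
    "Sunny 2BR\t$2,400\tBoston\n"
    " Loft  near MIT\t\tCambridge\n"
    "Studio\t$1,500\tboston\n"
)

rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))
report = cleanliness_report(rows, key_field="title")
print(report)
```

Tracking these rates before and after each cleaning pass is one way to judge when further effort stops paying off.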
• AWS -> EC2
• Launch instance: ami-c6b61fae (US-EAST)
• Instance type: m3.medium
• Connect
• You should see some software on the desktop
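The console steps above could also be scripted. The sketch below assumes the third-party boto3 library, configured AWS credentials, and that "US-EAST" means the `us-east-1` region. The helper names `build_launch_request` and `launch` are hypothetical, and the AMI from the slide may no longer be available.

```python
def build_launch_request(ami_id="ami-c6b61fae", instance_type="m3.medium"):
    """Build the keyword arguments for a single-instance launch."""
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
    }

def launch(region="us-east-1"):
    """Start one instance and return its ID (needs boto3 and credentials)."""
    import boto3  # third-party; imported here so the rest runs without it

    ec2 = boto3.client("ec2", region_name=region)  # region is an assumption
    response = ec2.run_instances(**build_launch_request())
    return response["Instances"][0]["InstanceId"]

request = build_launch_request()
print(request)
```

Only `build_launch_request` runs here. Calling `launch()` would create a billable instance.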
• Scrape all of Craigslist’s Boston apartment listings using WebHarvy
• Examine, clean, and prepare the data set using OpenRefine
• Map our data and apply filters using Tableau
…all without writing a single line of code.
• A hyper-intelligent utility to scrape website data.
• SysNucleus, makers of USBTrace
• Heavy-duty alternatives: Scrapy (scrapy.org), Beautiful Soup
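To give a feel for the scripted route, here is a minimal sketch of what "capture text" and "capture URL" amount to in code. It uses Python's built-in `html.parser` as a stand-in rather than Scrapy or Beautiful Soup, so it runs without extra installs. The HTML snippet and the `LinkCollector` class are invented for illustration and do not reflect any real page's markup.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect (link text, href) pairs from every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None

# Hypothetical listing markup.
PAGE = """
<ul>
  <li><a href="/apa/1.html">Sunny 2BR near Kendall</a> $2400</li>
  <li><a href="/apa/2.html">Studio in Back Bay</a> $1500</li>
</ul>
"""

collector = LinkCollector()
collector.feed(PAGE)
links = collector.links
print(links)
```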
HTTP://SHOUTKEY.COM/WIRE 
1. Start Config
2. Click on Hungry Mother – capture text
3. Click on Hungry Mother – capture URL
4. Click on Kendall Square/MIT – capture text
5. Click on the last review – capture text
CLEAR
1. Mine -> Scrape a list of similar links
2. Click on Hungry Mother
• Let’s start collecting information in the first sub-page.
• Edit Clear
• Navigate into a sub-page
• Start Config
• Set as Next Page Link
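Conceptually, "Set as Next Page Link" tells the scraper to keep following one link until it runs out. A sketch of that loop follows, with an in-memory dictionary standing in for real pages. The `crawl` function, the page structure, and the `max_pages` cap are assumptions for illustration, not WebHarvy's internals.

```python
def crawl(start, fetch, max_pages=50):
    """Follow 'next' links from start, collecting items from each page."""
    items, url, visited = [], start, set()
    # Stop when there is no next link, a page repeats, or the cap is hit.
    while url and url not in visited and len(visited) < max_pages:
        visited.add(url)
        page = fetch(url)
        items.extend(page["items"])
        url = page.get("next")
    return items

# Hypothetical pages keyed by URL; a real fetch would download and parse.
PAGES = {
    "p1": {"items": ["a"], "next": "p2"},
    "p2": {"items": ["b"], "next": "p3"},
    "p3": {"items": ["c"], "next": None},
}

collected = crawl("p1", PAGES.get)
print(collected)
```

The `visited` set guards against a "next" link that loops back to an earlier page.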
• Scheduler
• Input keywords
• Pause Inject (word of caution: scraping often violates a site’s terms of service (TOS) and may not be viable for apps or commercial purposes!)
• TRY VISITING CRAIGSLIST IN AWS BTW!!
• Proxy
• Database export
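As an illustration of what a database export involves, the sketch below writes scraped rows into SQLite using Python's built-in `sqlite3` module. The table name, columns, and sample rows are hypothetical. WebHarvy's own export targets and schema are not described in the slides.

```python
import sqlite3

def export_rows(conn, rows):
    """Insert (title, url, price) rows, skipping URLs already stored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS listings "
        "(title TEXT, url TEXT UNIQUE, price TEXT)"
    )
    # INSERT OR IGNORE drops rows that would violate the UNIQUE url constraint.
    conn.executemany(
        "INSERT OR IGNORE INTO listings (title, url, price) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()

conn = sqlite3.connect(":memory:")  # in-memory database for the demo
export_rows(
    conn,
    [
        ("Sunny 2BR", "/apa/1.html", "$2400"),
        ("Studio", "/apa/2.html", "$1500"),
        ("Sunny 2BR", "/apa/1.html", "$2400"),  # repeat from a re-scrape
    ],
)
count = conn.execute("SELECT COUNT(*) FROM listings").fetchone()[0]
print(count)
```

Keying on the URL means a scheduled re-scrape does not pile up repeated rows.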
• Download Craigslist Boston from http://shoutkey.com/glorify
• Look at our data: open Boston Dirty.csv (20k rows of mess!)
• Time to CLEAN: Launch GOOGLE-REFINE.EXE
• Within MOZILLA, navigate to http://127.0.0.1:3333/
• Create Project -> This Computer -> Browse
• Parse by tab
• Create Project
1. First, sort your column. 
2. Then, invoke "Re-order rows permanently" in the "Sort" dropdown menu that appears on top of the middle of the data table.
3. Then invoke Edit cells and Blank down on the Title column.
4. Then on that column, invoke menu Facet > Custom facets and Facet by blank.
5. Select true in that facet, and invoke Remove matching rows in the leftmost "all" dropdown menu.
6. Remove the facet.
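The six steps above amount to "sort, then drop every row whose Title repeats the row before it". A sketch of the same logic in Python follows. The `Title` column comes from the steps, while the function name and the sample rows are made up.

```python
def drop_duplicate_titles(rows):
    """Sort by Title and keep only the first row of each run of equal titles."""
    ordered = sorted(rows, key=lambda r: r["Title"])  # steps 1-2: sort, reorder
    kept = []
    previous = object()  # sentinel that equals no real title
    for row in ordered:
        # Step 3 blanks a cell equal to the one above it;
        # steps 4-5 then remove the blanked rows.
        if row["Title"] != previous:
            kept.append(row)
        previous = row["Title"]
    return kept

# Hypothetical rows.
listings = [
    {"Title": "Studio", "Price": "1500"},
    {"Title": "Loft", "Price": "2800"},
    {"Title": "Studio", "Price": "1500"},
    {"Title": "Loft", "Price": "2800"},
    {"Title": "2BR", "Price": "2400"},
]

unique = drop_duplicate_titles(listings)
print([r["Title"] for r in unique])
```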
• Then run the “To Number” transform again
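A number transform typically fails on cells that still contain currency symbols, commas, or stray text, which is presumably why it has to be run again after cleaning. The sketch below shows one way to do the same conversion in Python. The `to_number` helper and the sample strings are illustrative and do not reproduce OpenRefine's own implementation.

```python
import re

def to_number(text):
    """Strip everything except digits and dots, then parse as a float.

    Returns None when nothing numeric is left or the result is malformed.
    """
    cleaned = re.sub(r"[^0-9.]", "", text or "")
    if cleaned in ("", "."):
        return None
    try:
        return float(cleaned)
    except ValueError:  # e.g. "1.2.3"
        return None

samples = ["$2,400", " 1500 ", "call for price", "1.2.3"]
parsed = [to_number(s) for s in samples]
print(parsed)
```

Cells that come back as `None` are the ones worth inspecting by hand.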
• Increment the radius to 7 and make judgment calls along the way.
• Change the Distance Function and do the same thing
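To make "radius" and "distance function" concrete, here is a sketch that groups strings whose Levenshtein edit distance to a cluster's first member is within the radius. It is a simplified greedy version written for illustration. OpenRefine's nearest-neighbor clustering works differently in its details, and the place names are invented.

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]

def cluster(values, radius):
    """Greedily assign each value to the first cluster within the radius."""
    clusters = []
    for value in values:
        for group in clusters:
            if levenshtein(value.lower(), group[0].lower()) <= radius:
                group.append(value)
                break
        else:
            clusters.append([value])
    return clusters

names = ["Cambridge", "Cambrige", "Somerville", "Sommerville", "Boston"]
groups = cluster(names, radius=2)
print(groups)
```

A larger radius merges more aggressively, which is why each proposed merge still needs a judgment call.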
• Looks like we have SOME really expensive real estate. Data errors????
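One simple way to surface suspect prices in a script is to compare each value against the median. The factor of 10 below is an arbitrary assumption, and both the function name and the sample prices are hypothetical.

```python
from statistics import median

def flag_suspect_prices(prices, factor=10):
    """Return prices that are non-positive or more than factor x the median."""
    mid = median(prices)
    return [p for p in prices if p <= 0 or p > mid * factor]

# Hypothetical monthly rents; 240000 looks like a sale price or a typo.
prices = [1500, 2400, 1800, 2100, 240000, 0]
suspects = flag_suspect_prices(prices)
print(suspects)
```

Flagged values are candidates for review rather than automatic deletion, since a few listings may be genuinely expensive.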
• Boston Clean.csv
• Load Boston clean.csv
• “Go to Worksheet”
• Great “semantic” example. Tableau understands that this text translates to a lat/long
• Look on the map in the lower right corner
• Let’s “Filter Data”
• Under “Measures”, drag “Price” onto Size in “Marks”
• Change sum(Price) to avg(Price)
• Drag Price into Filters, change it to max(Price), and select “At Most”
• Right click on the filter and show “Quick Filter”
• Drag “City” onto “Label”
• Menu Map -> Map Options
• Click on a node for info and drill-down potential
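The aggregation behind this view can be sketched in a few lines: average the price per city and keep only cities whose highest price is within the "At Most" limit. This assumes the filter is evaluated per city, which depends on how the Tableau view is built. The function name and data are invented for illustration.

```python
from collections import defaultdict

def average_price_by_city(rows, at_most):
    """Return {city: average price} for cities whose max price <= at_most."""
    by_city = defaultdict(list)
    for city, price in rows:
        by_city[city].append(price)
    return {
        city: sum(prices) / len(prices)
        for city, prices in by_city.items()
        if max(prices) <= at_most
    }

# Hypothetical (city, price) pairs.
rows = [
    ("Boston", 2000),
    ("Boston", 3000),
    ("Cambridge", 2500),
    ("Brookline", 90000),
]

averages = average_price_by_city(rows, at_most=5000)
print(averages)
```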
1. Explored various webpage structures and scraped them 
2. Exported the data to Refine 
3. Parsed columns to extract critical price and location information 
4. Used clustering algorithms to merge related geographies 
5. Applied filters to identify errant prices 
6. Exported the data to Tableau 
7. Completed a real cursory mapping visualization
• Please come talk to me
MIT Big Data Explorers - presentation by Daniel Burseth


Editor's Notes

  1. http://datacleaner.org/ Certain algorithms This aspect has certainly lagged technology