MIT Big Data Explorers - presentation by Daniel Burseth

Daniel Burseth
Co-president MIT Big Data Explorers
dburseth@mit.edu
@dmbnyc
Github: dburseth

 Acronyms abound
 Tremendous complexity
 Use building blocks not code

 This is easy
EPPM of 10 requires 500 professionals

 http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html?emc=eta1&_r=0
Data preparation and cleansing:
• Missing
• Duplicative
• Conventions (dates, time, geographies)
• Spacing
• Can we measure data cleanliness?
• What’s our Pareto point?
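One rough way to put a number on cleanliness is to count missing cells and duplicate rows. The sketch below is only an illustration: the sample rows and the column names (title, price, city) are assumptions, not the workshop data.

```python
import csv
import io

# Hypothetical sample rows; the column names are assumptions for illustration.
SAMPLE = """title,price,city
Sunny 2BR,2400,Cambridge
Sunny 2BR,2400,Cambridge
Studio near T,,Boston
Loft,3100,
"""

def cleanliness_report(text):
    """Return the share of missing cells and the share of duplicate rows."""
    rows = list(csv.DictReader(io.StringIO(text)))
    cells = [value for row in rows for value in row.values()]
    # A cell counts as missing when it is absent or only whitespace.
    missing = sum(1 for value in cells if value is None or not value.strip())
    # Rows that are identical in every column collapse to one entry here.
    unique_rows = {tuple(row.items()) for row in rows}
    return {
        "rows": len(rows),
        "missing_cell_share": missing / len(cells),
        "duplicate_row_share": (len(rows) - len(unique_rows)) / len(rows),
    }

report = cleanliness_report(SAMPLE)
print(report)
```

Tracking these two shares before and after each cleaning pass gives a simple signal for when further effort stops paying off.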

 AWS -> EC2
 Launch instance: ami-c6b61fae (US-EAST)
 Instance type m3.medium
 Connect
 You should see some software on the desktop

 Scrape all of Craigslist’s Boston apartment listings using WebHarvy
 Examine, clean, and prepare the data set using OpenRefine
 Map our data and apply filters using Tableau
...all without writing a single line of code.

 A hyper-intelligent utility to scrape website data.
 SysNucleus, makers of USBTrace
 Heavy duty alternatives: Scrapy (scrapy.org), Beautiful Soup

HTTP://SHOUTKEY.COM/WIRE
1. Start Config
2. Click on Hungry Mother – capture text
3. Click on Hungry Mother – capture URL
4. Click on Kendall Square/MIT – capture text
5. Click last review – capture text
CLEAR
1. Mine -> Scrape a list of similar links
2. Click on Hungry Mother
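For comparison, here is roughly what "capture text" and "capture URL" over a list of similar links looks like when scripted. This is a minimal sketch using only Python's standard library; the markup, the class names, and the second listing are hypothetical, and a real page will be structured differently.

```python
from html.parser import HTMLParser

# Hypothetical listing markup; real pages will differ.
PAGE = """
<ul>
  <li><a class="biz" href="/biz/hungry-mother">Hungry Mother</a></li>
  <li><a class="biz" href="/biz/another-place">Another Place</a></li>
  <li><a class="nav" href="/about">About</a></li>
</ul>
"""

class LinkCollector(HTMLParser):
    """Collect (text, url) pairs for anchors carrying a given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.links = []
        self._href = None   # set while we are inside a matching <a>
        self._text = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("class") == self.css_class:
            self._href = attributes.get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None

collector = LinkCollector("biz")
collector.feed(PAGE)
links = collector.links
print(links)
```

The class filter plays the role of pointing at one link and asking for all similar ones: anchors that share the pattern are captured, navigation links are skipped.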

 Let’s start collecting information in the first sub-page.

 Edit Clear
 Navigate into a sub-page
 Start Config
 Set as Next Page Link

 Scheduler
 Input keywords
 Pause Inject (word of caution: scraping often violates TOS. Potentially not viable for apps, commercial purposes!)
 TRY VISITING CRAIGSLIST IN AWS BTW!!
 Proxy
 Database export

 Download Craigslist Boston from http://shoutkey.com/glorify
 Look at our data: open Boston Dirty.csv (20k rows of mess!)
 Time to CLEAN: Launch GOOGLE-REFINE.EXE
 Within MOZILLA, navigate to http://127.0.0.1:3333/
 Create Project -> This Computer -> Browse
 Parse by tab
 Create Project

1. First, sort your column.
2. Then, invoke "Re-order rows permanently" in the "Sort" dropdown menu that appears on top of the middle of the data table.
3. Then invoke Edit cells and Blank down on the Title column.
4. Then on that column, invoke menu Facet > Custom facets and Facet by blank.
5. Select true in that facet, and invoke Remove matching rows in the left most "all" dropdown menu.
6. Remove the facet.
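The six steps above amount to sorting on a column and keeping only the first row of each run of equal values. A minimal scripted sketch of the same idea follows; the sample rows are made up, and only the "Title" column name comes from the steps.

```python
def drop_repeated_titles(rows, key="Title"):
    """Sort rows by `key`, then keep the first row of each run of equal values."""
    ordered = sorted(rows, key=lambda row: row[key])
    kept = []
    previous = object()  # sentinel that never equals a real cell value
    for row in ordered:
        # Equivalent to "blank down" followed by removing the blanked rows.
        if row[key] != previous:
            kept.append(row)
        previous = row[key]
    return kept

# Hypothetical rows for illustration.
listings = [
    {"Title": "Sunny 2BR", "Price": "2400"},
    {"Title": "Loft", "Price": "3100"},
    {"Title": "Sunny 2BR", "Price": "2400"},
]
deduplicated = drop_repeated_titles(listings)
print(deduplicated)
```

Note that this treats two listings with the same title as duplicates even if other columns differ, which is also what the facet-based approach does.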

 Then run the “To Number” transform again
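As a rough scripted analogue of a text-to-number transform, the sketch below strips currency symbols and thousands separators before converting. It assumes each cell holds only a price, and it returns None for cells it cannot parse; neither choice is claimed to match Refine's exact behavior.

```python
import re

def to_number(text):
    """Convert a price-like string such as "$2,400" to a float, or None."""
    cleaned = re.sub(r"[^0-9.]", "", text)
    # Reject empty results, lone dots, and strings with several decimal points.
    if cleaned.count(".") > 1 or not cleaned.strip("."):
        return None
    return float(cleaned)

converted = [to_number(cell) for cell in ["$2,400", "1950.50", "call", ""]]
print(converted)
```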

 Increment the radius to 7 and make judgment calls along the way.
 Change the Distance Function and do the same thing
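To make "radius" and "distance function" concrete, here is a small sketch of distance-based grouping using Levenshtein edit distance. The greedy grouping against the first member of each cluster is a simplification chosen for brevity, not Refine's actual algorithm, and the place-name spellings are made up.

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]

def cluster(values, radius):
    """Greedily group values within `radius` edits of a cluster's first member."""
    clusters = []
    for value in values:
        for group in clusters:
            if levenshtein(value.lower(), group[0].lower()) <= radius:
                group.append(value)
                break
        else:
            clusters.append([value])
    return clusters

groups = cluster(["Cambridge", "Cambrige", "cambridge ", "Boston", "Bostn"], radius=2)
print(groups)
```

A larger radius merges more aggressively, which is why each proposed merge still needs a judgment call. Swapping `levenshtein` for another function changes which spellings count as close.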

 Looks like we have SOME really expensive real estate. Data errors?
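One simple way to surface suspicious prices in code is to compare each value with the median. In the sketch below the prices are invented and the cutoff of 10 times the median is an arbitrary assumption; a flagged value is a candidate for review, not a confirmed error.

```python
from statistics import median

def flag_errant_prices(prices, multiple=10):
    """Return prices that exceed `multiple` times the median price."""
    typical = median(prices)
    return [price for price in prices if price > multiple * typical]

# Hypothetical monthly rents, with one value that looks like a sale price.
prices = [1800, 2200, 2500, 3000, 250000]
flagged = flag_errant_prices(prices)
print(flagged)
```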

 Load Boston clean.csv
 “Go to Worksheet”

 Great “semantic” example. Tableau understands that this text translates to a lat/long

 Look on the map in the lower right corner
 Let’s “Filter Data”

 Under “Measures”, drag “Price” onto size in “Marks”
 Change sum(Price) to avg(Price)
 Drag Price into Filters, change it to max(Price), and select an “At Most” filter
 Right click on the filter and show “Quick Filter”
 Drag “City” onto “Label”
 Menu Map -> Map Options
 Click on a node for info and drill down potential
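The aggregation behind this map can be sketched in a few lines: average price per city, keeping only cities whose maximum price is at most a chosen threshold. This assumes the max(Price) filter is evaluated per city mark, which is how the view appears to behave; the rows and the threshold are made up.

```python
from collections import defaultdict

def city_summary(rows, at_most):
    """Average price per city, for cities whose max price is <= at_most."""
    prices = defaultdict(list)
    for city, price in rows:
        prices[city].append(price)
    return {
        city: sum(values) / len(values)
        for city, values in prices.items()
        if max(values) <= at_most  # aggregate-level "At Most" filter
    }

# Hypothetical (city, price) rows for illustration.
rows = [
    ("Boston", 2000), ("Boston", 3000),
    ("Cambridge", 2500), ("Cambridge", 90000),
]
summary = city_summary(rows, at_most=5000)
print(summary)
```

Here one errant listing removes the whole city from the view, which is a useful reminder to clean outliers before mapping rather than filter them afterwards.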

1. Explored various webpage structures and scraped them
2. Exported the data to Refine
3. Parsed columns to extract critical price and location information
4. Used clustering algorithms to merge related geographies
5. Applied filters to identify errant prices
6. Exported the data to Tableau
7. Completed a cursory mapping visualization
