Exploratory Data Analysis

Talk given by me at Gnunify 2014 on Exploratory Data Analysis

Transcript

  • 1. Exploratory Data Analysis. Aditya Laghate (Twitter: @thinrhino)
  • 2. Who am I?
    o A pseudo geek
    o Freelance software consultant
    o Wildlife photographer
  • 3. Agenda
    o Data gathering
    o Data cleaning
    o Usage of classic Unix tools
    o Data analysis
  • 4. Data Gathering
    o Public data websites: data.gov.in, databank.worldbank.org
    o Social websites: facebook.com, twitter.com
    o Blogs, websites, etc. via scraping
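
A minimal command-line sketch of the first source type, assuming the World Bank's public v2 API (the indicator code SP.POP.TOTL and the output file name are my choices, not from the slide):

    # Fetch India's total-population series as JSON from databank.worldbank.org's API.
    curl -s 'https://api.worldbank.org/v2/country/IN/indicator/SP.POP.TOTL?format=json&per_page=100' > population.json

Scraped sources start the same way at the fetch step; the difference is the parsing that follows.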
  • 5. Data cleaning
    o e.g. OpenRefine: OpenRefine (ex-Google Refine) is a powerful tool for working with messy data: cleaning it, transforming it from one format into another, extending it with web services, and linking it to databases like Freebase
    o openrefine.org
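
OpenRefine is interactive, but a rough first pass often happens in the shell before the data ever reaches it. A small sketch (the file names are placeholders):

    # Strip Windows line endings, trim surrounding whitespace, drop blank lines.
    sed -e 's/\r$//' -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' raw.csv | awk 'NF' > cleaned.csv

awk 'NF' prints only lines with at least one field, i.e. it drops the blank lines left behind by the trimming.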
  • 6. Classic Unix Tools
    o sed / awk
    o Shell scripts
    o GNU parallel
    o Examples:
      cat rands20M.txt | awk '{s+=$1} END {print s}'
      cat rands20M.txt | parallel --pipe awk '{s+=$1} END {print s}' | awk '{s+=$1} END {print s}'
      wc -l bigfile.txt
      cat bigfile.txt | parallel --pipe wc -l | awk '{s+=$1} END {print s}'
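
A gloss on the parallel variants (mine, not from the slide): --pipe splits stdin into chunks, runs the given command once per chunk, and the trailing awk reduces the per-chunk partial results (sums or line counts) into one total. The same tools combine with sed, which the slide names but does not demonstrate; a sketch with an invented file name:

    # Delete comment lines, squeeze runs of spaces, then count the remaining fields.
    sed -e '/^#/d' -e 's/  */ /g' records.txt | awk '{n+=NF} END {print n}'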
  • 7. Data Analysis
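
The transcript carries only this slide's title. In the spirit of the preceding Unix-tools slide, a summary-statistics one-liner one might start exploratory analysis with (the file name and column are assumptions, not the speaker's example):

    # Min, max, and mean of the first column of data.txt.
    awk 'NR==1 {min=max=$1} {s+=$1; if ($1<min) min=$1; if ($1>max) max=$1} END {print "min", min, "max", max, "mean", s/NR}' data.txt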
  • 8. Questions. Twitter: @thinrhino, email: me@adityalaghate.in