Exploratory Data Analysis
Aditya Laghate

Twitter: @thinrhino

Talk given at Gnunify 2014

Who am I?
• A pseudo geek
• Freelance software consultant
• Wildlife photographer

Agenda
• Data gathering
• Data cleaning
• Usage of classic Unix tools
• Data analysis

Data Gathering
• Public data websites
o data.gov.in
o databank.worldbank.org

• Social websites
o facebook.com
o twitter.com
• Blogs / websites, etc., via scraping
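
databank.worldbank.org also exposes a REST API, so "public data website" can mean a few lines of code rather than a manual download. A minimal Python sketch (the endpoint, indicator code, and response layout are assumptions based on the public v2 API; verify against the current docs before relying on them):

    import requests

    # Fetch India's total population series from the World Bank open data API.
    url = "https://api.worldbank.org/v2/country/IN/indicator/SP.POP.TOTL"
    resp = requests.get(url, params={"format": "json", "per_page": 60})
    resp.raise_for_status()

    # The v2 API returns [metadata, rows]; each row carries 'date' and 'value'.
    _, rows = resp.json()
    for row in rows:
        print(row["date"], row["value"])
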
Data Cleaning
• E.g. OpenRefine
o OpenRefine (ex-Google Refine) is a powerful tool for working with messy
  data: cleaning it, transforming it from one format into another,
  extending it with web services, and linking it to databases like Freebase
o openrefine.org
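
OpenRefine itself is GUI-driven, but the same first-pass cleanups can be sketched in pandas. This is a rough analogue of typical refine steps, not OpenRefine's API; the input file and column names are hypothetical:

    import pandas as pd

    # Hypothetical messy input; column names are made up for illustration.
    df = pd.read_csv("survey_raw.csv")

    # Trim stray whitespace and normalise case, like OpenRefine's common transforms.
    df["city"] = df["city"].str.strip().str.title()

    # Coerce a numeric column, turning junk values into NaN instead of failing.
    df["income"] = pd.to_numeric(df["income"], errors="coerce")

    # Drop exact duplicate rows, then rows missing the key field.
    df = df.drop_duplicates().dropna(subset=["city"])

    df.to_csv("survey_clean.csv", index=False)
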
Classic Unix Tools
• sed / awk
• Shell scripts
• GNU parallel
o Examples:
o cat rands20M.txt | awk '{s+=$1} END {print s}'
o cat rands20M.txt | parallel --pipe awk '{s+=$1} END {print s}' | awk '{s+=$1} END {print s}'
o wc -l bigfile.txt
o cat bigfile.txt | parallel --pipe wc -l | awk '{s+=$1} END {print s}'
o parallel --pipe splits stdin into blocks and runs one job per block; the final awk adds up the per-block partial sums
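
If a pipeline outgrows one-liners, the same chunk-and-combine pattern translates to Python. A minimal standard-library sketch (file name reused from the examples above; the block size is an arbitrary assumption):

    from itertools import islice
    from multiprocessing import Pool

    def partial_sum(lines):
        # Sum one chunk of lines, mirroring awk '{s+=$1} END {print s}'.
        return sum(float(line.split()[0]) for line in lines if line.strip())

    def chunks(path, size=100_000):
        # Yield blocks of lines, like parallel --pipe splitting stdin.
        with open(path) as f:
            while True:
                block = list(islice(f, size))
                if not block:
                    return
                yield block

    if __name__ == "__main__":
        with Pool() as pool:
            # Final sum over the partial sums, like the trailing awk stage.
            print(sum(pool.imap_unordered(partial_sum, chunks("rands20M.txt"))))
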
Data Analysis

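The slide carried no text beyond the title; as a placeholder, here is a minimal sketch of typical first-pass EDA in pandas (input file and column names are hypothetical, carried over from the cleaning sketch above):

    import pandas as pd
    import matplotlib.pyplot as plt

    # Hypothetical cleaned dataset from the previous step.
    df = pd.read_csv("survey_clean.csv")

    # First pass: shape, dtypes, summary statistics, missing values.
    print(df.shape)
    print(df.dtypes)
    print(df.describe(include="all"))
    print(df.isna().sum())

    # Quick visual check: distribution of one column.
    df["income"].hist(bins=50)
    plt.savefig("income_hist.png")
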
Questions
@thinrhino
me@adityalaghate.in
