Exploratory Data Analysis
Aditya Laghate

Twitter: @thinrhino

Who am I?
• A pseudo geek
• Freelance software consultant
• Wildlife photographer

Agenda
• Data gathering
• Data cleaning
• Usage of classic Unix tools
• Data analysis

Data Gathering
• Public data websites
o data.gov.in
o databank.worldbank.org

• Social websites
o facebook.com
o twitter.com

• Blogs / websites / etc. via scraping

Data cleaning
• E.g. OpenRefine
o OpenRefine (ex-Google Refine) is a powerful tool for working with messy data, cleaning it, transforming it from one format into another, extending it with web services, and linking it to databases like Freebase
o openrefine.org
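
For small cleaning jobs, the classic Unix tools listed in the agenda can do similar work on the command line. The following is a minimal sketch using sed and awk, not OpenRefine. The sample data in `messy` is hypothetical and chosen only to show stray whitespace, inconsistent case, and a blank line.

```shell
# Hypothetical messy input: stray whitespace, mixed case, one blank line
messy='  Mumbai , 120
pune,  45

 MUMBAI,30'

# sed trims leading/trailing whitespace and deletes lines left empty;
# awk trims each comma-separated field and lower-cases the first one
cleaned=$(printf '%s\n' "$messy" \
  | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d' \
  | awk -F',' '{
      gsub(/^[ \t]+|[ \t]+$/, "", $1)
      gsub(/^[ \t]+|[ \t]+$/, "", $2)
      print tolower($1) "," $2
    }')

printf '%s\n' "$cleaned"
```

This prints three rows: `mumbai,120`, `pune,45`, and `mumbai,30`.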

Classic Unix Tools
• sed / awk
• Shell scripts
• GNU parallel
o Examples:
o cat rands20M.txt | awk '{s+=$1} END {print s}'
o cat rands20M.txt | parallel --pipe awk '{s+=$1} END {print s}' | awk '{s+=$1} END {print s}'
o wc -l bigfile.txt
o cat bigfile.txt | parallel --pipe wc -l | awk '{s+=$1} END {print s}'
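
The `parallel --pipe` examples split the input into chunks, run the same command on each chunk, and combine the partial results with a final awk. The sketch below shows that split-then-combine idea without GNU parallel. It assumes a small generated file `nums.txt` as a hypothetical stand-in for `rands20M.txt`, cuts it into chunks with `split`, and processes the chunks one after another instead of concurrently.

```shell
# Work in a temporary directory so nothing is left behind
tmpdir=$(mktemp -d)

# Hypothetical stand-in input: the integers 1..100, one per line
seq 1 100 > "$tmpdir/nums.txt"

# Single pass: sum the first column of the whole file
total_single=$(awk '{s+=$1} END {print s}' "$tmpdir/nums.txt")

# Chunked: split into 25-line pieces, sum each piece,
# then sum the partial sums with a second awk
split -l 25 "$tmpdir/nums.txt" "$tmpdir/chunk_"
total_chunked=$(for f in "$tmpdir"/chunk_*; do
    awk '{s+=$1} END {print s}' "$f"
  done | awk '{s+=$1} END {print s}')

echo "single=$total_single chunked=$total_chunked"
rm -rf "$tmpdir"
```

Both totals come out as 5050. This works because addition can be applied to partial results in any grouping, which is also why line counts from `wc -l` can be summed across chunks.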

Data Analysis

Questions
@thinrhino
me@adityalaghate.in
