This tutorial demonstrates how to use Google Refine for data cleansing and enrichment. It shows how to import data, perform faceting to identify issues, remove redundancies, cluster similar values, use expressions to transform data, link data to external sources for augmentation, and export the refined data. Functions like numeric faceting, text faceting, and timeline faceting are covered. The tutorial also provides examples of using Refine to extract and structure information from Twitter and Facebook data.
General overview of the Big Data Concept.
Presentation of the Hierarchical Linear Subspace Indexing Method to perform exact similarity search in high-dimensional data.
Large-Scale Data Extraction, Structuring and Matching using Python and Spark - Deep Kayal
Matching data collections in order to augment and integrate the information for any data point that appears in two or more of those collections is a problem that arises often nowadays. Notable examples of such data points are scientific publications, whose metadata and data are kept in various repositories, and user profiles, whose metadata and data exist on several social networks or platforms.
In our case, the collections were as follows: (1) a large dump of compressed data files on S3 containing archives in the form of zips, tars, bzips, and gzips, which were expected to contain published papers as XML and PDF files, amongst other files, and (2) a large store of XML files, some of which were to be matched to Collection 1.
The problems, then, are: (1) How best to unzip the compressed archives and extract the relevant files? (2) How to extract meta-information from the XML or PDF files? (3) How to match the meta-information from the two different collections? And all of this must be done in a big-data environment.
The presentation describes the solution process and the use of Python and Spark in the large-scale unzipping and extraction of files from archives, and how metadata was then extracted from those files to perform the matches.
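A minimal sketch of the per-archive extraction step in plain Python (the function name and the idea of mapping it over an RDD of S3 keys are illustrative assumptions; the presentation's actual code is not shown here):

```python
import bz2
import gzip
import io
import tarfile
import zipfile


def extract_members(archive_bytes, name):
    """Yield (filename, bytes) for the XML/PDF members of one archive.

    Dispatches on the archive extension; in a Spark job this function
    would be mapped over a distributed collection of (key, bytes) pairs.
    """
    buf = io.BytesIO(archive_bytes)
    if name.endswith(".zip"):
        with zipfile.ZipFile(buf) as zf:
            for member in zf.namelist():
                if member.endswith((".xml", ".pdf")):
                    yield member, zf.read(member)
    elif name.endswith((".tar", ".tar.gz", ".tgz", ".tar.bz2")):
        with tarfile.open(fileobj=buf) as tf:
            for member in tf.getmembers():
                if member.isfile() and member.name.endswith((".xml", ".pdf")):
                    yield member.name, tf.extractfile(member).read()
    elif name.endswith(".gz"):
        # Single-file gzip: the payload is the decompressed file itself.
        yield name[:-3], gzip.decompress(archive_bytes)
    elif name.endswith(".bz2"):
        yield name[:-4], bz2.decompress(archive_bytes)
```

Keeping the extraction logic a pure function of bytes makes it straightforward to parallelise with Spark, since no local filesystem state is required.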
Big Data Day LA 2015 - Scalable and High-Performance Analytics with Distribut... - Data Con LA
"R is the most popular language in the data-science community, with 2+ million users and 6000+ R packages. R’s adoption evolved along with its easy-to-use statistical language, graphics, packages, tools, and active community. In this session we will introduce Distributed R, a new open-source technology that addresses the scalability and performance limitations of vanilla R, which is single-threaded and does not scale to large datasets. Distributed R efficiently shares sparse structured data, leverages multiple cores, and dynamically partitions data to mitigate load imbalance.
In this talk, we will show the promise of this approach by demonstrating how important machine learning and graph algorithms can be expressed in a single framework and are substantially faster under Distributed R. Additionally, we will show how Distributed R complements Vertica, a state-of-the-art columnar analytics database, to deliver a full-cycle, fully integrated, data “prep-analyze-deploy” solution."
Risk Analytics Using Knowledge Graphs / FIBO with Deep Learning - Cambridge Semantics
This EDM Council webinar, sponsored by Cambridge Semantics Inc. and featuring FI Consulting, explores the challenges common to a risk analytics pipeline, application of graph analytics to mortgage loan data and use cases in adjacent areas including customer service, collections, fraud and AML.
Finding The Perfect Donor Database In An Imperfect World - 4Good.org
There are hundreds of donor databases on the market. Each has its own strengths and weaknesses, fans and foes. The challenge is to find a system with strengths that meet your needs, weaknesses that won’t get in your way, and a price you can afford.
This workshop will cover the basic concepts you will need to evaluate your options and make an informed decision.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024 - Tobias Schneck
As AI technology pushes into IT, I found myself wondering, as an “infrastructure container Kubernetes guy”, how this fancy AI technology gets managed from an infrastructure operations view. Is it possible to apply our beloved cloud-native principles as well? What benefits could both technologies bring to each other?
Let me take these questions and guide you on a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premises strategy we may need to apply them to our own infrastructure and make them work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies that could benefit or limit your AI use cases in an enterprise environment. An interactive demo will give you some insights into the approaches I have already gotten working for real.
Epistemic Interaction - tuning interfaces to provide information for AI support - Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Neuro-symbolic is not enough, we need neuro-*semantic* - Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply doing machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy; those gains come only when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this is illustrated with link prediction over knowledge graphs, but the argument is general.
Key Trends Shaping the Future of Infrastructure.pdf - Cheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The talk covers the key trends across hardware, cloud, and open source, exploring how these areas are likely to mature and develop over the short and long term, and considering how organisations can position themselves to adapt and thrive.
PHP Frameworks: I want to break free (IPC Berlin 2024) - Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk aims to encourage a more independent approach to using PHP frameworks, and thereby more flexible, future-proof PHP development.
State of ICS and IoT Cyber Threat Landscape Report 2024 preview - Prayukth K V
The IoT and OT threat landscape report was prepared by the Threat Research Team at Sectrio using data from Sectrio’s cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
UiPath Test Automation using UiPath Test Suite series, part 4 - DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
The Art of the Pitch: WordPress Relationships and Sales - Laura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips and strategies for successful relationship building that leads to closing the deal.
UiPath Test Automation using UiPath Test Suite series, part 3 - DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation introduction
UI automation sample
Desktop automation flow
Speakers:
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
Google Refine tutorial
1. Vinod Gupta School of Management, IIT Kharagpur
Google Refine Analysis
A Business Perspective
April, 08 2012
Sathishwaran.R - 10BM60079
Vijaya Prabhu - 10BM60097
This Tutorial was created using Google Refine Version 2.5 on a Windows 7 platform
2. Data Cleansing
• Data cleansing is identifying wrong or inaccurate records in a data set and making appropriate corrections to those records.
• It involves identifying incomplete, inaccurate, and incorrect parts of the data and then either replacing them with correct data or deleting the incorrect data.
• Data cleansing results in data that is consistent with other standard data and is useful for performing various analyses.
• Errors in the data can be due to data-entry mistakes by the user, failures during transmission of the data, or improper data definitions.
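The steps above can be illustrated with a tiny Python sketch. The field names and the correction rules here are invented for illustration; Refine offers the same kinds of operations interactively:

```python
def cleanse(records):
    """Drop incomplete records and normalise the rest.

    Illustrative rules: a record must have a non-empty name (otherwise
    it is deleted), and country names are trimmed and upper-cased so
    they match a standard form.
    """
    cleaned = []
    for rec in records:
        name = (rec.get("name") or "").strip()
        if not name:  # incomplete record: delete it
            continue
        cleaned.append({
            "name": name,
            "country": (rec.get("country") or "").strip().upper(),
        })
    return cleaned


# Example: one messy-but-recoverable record, one incomplete record.
result = cleanse([
    {"name": " Flood ", "country": " india"},
    {"name": "", "country": "Chile"},
])
# → [{"name": "Flood", "country": "INDIA"}]
```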
3. Need for Data Cleansing
• Incorrect or inaccurate data may lead to false conclusions and, in finance, can cause investments to be misdirected.
• Governments also need accurate population and census data for directing funds to the areas that deserve them.
• Many organizations tap into customer information. If the data is not accurate, e.g. if an address is wrong, then the business runs the risk of sending wrong information and thus losing customers.
4. Challenges in Data Cleansing
• Loss of information: In many cases a record may be incomplete, so the whole record may have to be deleted, which leads to loss of information. This can become costly if a huge amount of data is deleted.
• Maintenance of data: Once the data is cleansed, any change in the data specification should affect only the new values. Hence data-management solutions should be designed so that the processes of data entry and retrieval are altered to provide correct data.
• Data cleansing is an iterative process that requires significant work in the exploration and correction of entries.
5. About Google Refine
• Google Refine is a powerful tool that can be effectively used for data cleansing.
• It helps in working with raw data: cleaning it up, transforming it from one format to another, extending it with web services, and linking it to databases.
• It is very easy to use and has a web interface.
• It is freely available and works well with any browser.
• Google Refine is a desktop application: it runs a small web server on your system, and you point your browser at that server to use Refine.
6. Getting Started - Installation
1. Download the zip file (the appropriate Windows, Mac, or Linux version) from http://code.google.com/p/google-refine/wiki/Downloads?tm=2
2. Uncompress the files from the zip file.
3. Run the “google-refine.exe” file.
4. A command window opens and Google Refine starts, taking the user to the home page in the default browser.
8. Importing Data
• Google Refine supports TSV, CSV, Excel (.xls and .xlsx), JSON, XML, and Google data document formats.
• Once imported, the data is in Google Refine’s own data format.
• For this tutorial we have used TSV data on disasters worldwide from 1900-2008, available from http://www.infochimps.com/datasets/disasters-worldwide-from-1900-2008
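For comparison, TSV data like this can also be read outside Refine with Python's csv module. The sample rows and column names below are invented stand-ins for the disasters data set:

```python
import csv
import io

# A small stand-in for the disasters TSV (tab-separated, with header).
SAMPLE = (
    "country\tyear\tdisaster\n"
    "India\t1999\tCyclone\n"
    "Chile\t1960\tEarthquake\n"
)


def read_tsv(text):
    """Parse tab-separated text into a list of dicts, one per data row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


rows = read_tsv(SAMPLE)
# rows[0] → {"country": "India", "year": "1999", "disaster": "Cyclone"}
```

Note that, unlike Refine's importer, csv.DictReader leaves every field as a string; numeric columns would need explicit conversion.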
13. Faceting
• Faceting is about seeing the big picture and filtering rows to work on the data you want to change in bulk.
• We can create a facet for a column to get details about that column and then filter to a subset of rows with a constraint.
• We can perform text facets, numeric facets, timeline facets, and scatterplot facets. Various customized facets can also be designed.
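What a text or numeric facet computes can be sketched in a few lines of Python (the column names and rows are invented for illustration):

```python
from collections import Counter

rows = [
    {"type": "Flood", "killed": 12},
    {"type": "flood", "killed": 300},
    {"type": "Earthquake", "killed": 5000},
]

# Text facet: the distinct values of a column, each with its row count.
text_facet = Counter(r["type"] for r in rows)
# → Counter({"Flood": 1, "flood": 1, "Earthquake": 1})

# Numeric facet: constrain a numeric column to a chosen range and
# filter down to the matching subset of rows.
subset = [r for r in rows if 100 <= r["killed"] < 10000]
# → the 300- and 5000-killed rows
```

Note how "Flood" and "flood" show up as separate facet values; spotting such near-duplicates in a facet is exactly what leads to Refine's clustering step.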
34. Data Augmentation
• The reconciliation option in Google Refine allows data to be linked to web pages. Suppose we want details on the country where a calamity has struck; we can perform the following steps.
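Under the hood, reconciliation sends the candidate values to a reconciliation service as a JSON batch of queries. A minimal sketch of building such a payload (the type identifier and limit are placeholder assumptions; the slide's actual steps are not shown in this extract):

```python
import json


def build_recon_queries(values, type_id):
    """Build the batched JSON queries a Refine-style reconciliation
    service expects: {"q0": {...}, "q1": {...}, ...}."""
    return {
        "q{}".format(i): {"query": v, "type": type_id, "limit": 3}
        for i, v in enumerate(values)
    }


payload = json.dumps(build_recon_queries(["India", "Chile"], "/location/country"))
```

The payload would then be POSTed to the reconciliation endpoint, which replies with ranked candidate matches per query.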