Google Refine tutorial

  1. Vinod Gupta School of Management, IIT Kharagpur. Google Refine Analysis: A Business Perspective. April 08, 2012. Sathishwaran.R - 10BM60079, Vijaya Prabhu - 10BM60097. This tutorial was created using Google Refine Version 2.5 on a Windows 7 platform.
  2. Data Cleansing
     • Data cleansing is the process of identifying wrong or inaccurate records in a data set and making appropriate corrections to them.
     • It involves identifying incomplete, inaccurate, and incorrect parts of the data and then either replacing them with correct data or deleting them.
     • Data cleansing yields data that is consistent with other standard data sets and is useful for various kinds of analysis.
     • Errors in the data can come from data entry mistakes by the user, failures during data transmission, or improper data definitions.
  3. Need for Data Cleansing
     • Incorrect or inaccurate data may lead to false conclusions and, in finance, can cause investments to be misdirected.
     • Governments need accurate population and census data to direct funds to the areas that deserve them.
     • Many organizations rely on customer information. If that data is inaccurate (for example, a wrong address), the business risks sending information to the wrong place and losing customers.
  4. Challenges in Data Cleansing
     • Loss of information: In many cases a record is incomplete and the whole record has to be deleted, which leads to loss of information. This can become costly if a large number of records are deleted.
     • Maintenance of data: Once the data is cleansed, any later change in the data specification should affect only the new values. Data management solutions should therefore be designed so that the data entry and retrieval processes themselves are adapted to provide correct data.
     • Data cleansing is an iterative process that needs significant work in exploring and correcting entries.
  5. About Google Refine
     • Google Refine is a powerful tool that can be used effectively for data cleansing.
     • It helps in working with raw data: cleaning it up, transforming it from one format to another, extending it with web services, and linking it to databases.
     • It is easy to use and has a web interface.
     • It is freely available and works well with any browser.
     • Google Refine is a desktop application. It runs a small web server on your own system, and you point your browser at that server to use it.
  6. Getting Started - Installation
     1. Download the zip file (the appropriate Windows, Mac, or Linux version) from http://code.google.com/p/google-refine/wiki/Downloads?tm=2
     2. Uncompress the files from the zip file.
     3. Run the “google-refine.exe” file.
     4. A command window opens and Google Refine starts, taking the user to the home page in the default browser.
  7. Google Refine Homepage
  8. Importing Data
     • Google Refine supports TSV, CSV, Excel (.xls and .xlsx), JSON, XML, and Google Data document formats.
     • Once imported, the data is stored in Google Refine’s own data format.
     • For this tutorial we have used TSV data on disasters worldwide from 1900-2008, available from http://www.infochimps.com/datasets/disasters-worldwide-from-1900-2008
  9. Importing Data
  10. Importing Data
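To make the TSV import concrete, here is a minimal Python sketch of what reading a tab-separated file into rows looks like outside Google Refine. It is not how Refine itself imports data; the sample text and the column names (Country, Type, Cost) are assumptions made for illustration and are not taken from the real disasters file.

```python
import csv
import io

# Hypothetical sample standing in for the disasters TSV. The column names
# (Country, Type, Cost) are assumptions, not the real file's header.
SAMPLE_TSV = "Country\tType\tCost\nIndia\tFlood\t1200\nJapan\tEarthquake\t\n"


def read_tsv(text):
    """Parse tab-separated text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)


rows = read_tsv(SAMPLE_TSV)
print(len(rows), rows[0]["Country"])
```

Each row becomes a dictionary keyed by column name, and an empty cell (the second row's Cost here) comes through as an empty string, which is the kind of value the later faceting slides surface as blank.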
  11. Creating Project (callout: Data uploaded)
  12. Creating Project (callout: Project created)
  13. Faceting
     • Faceting is about seeing the big picture of the data and then filtering down to the rows you want to change in bulk.
     • We can create a facet on a column to see the details of that column, and then filter to a subset of rows that satisfy a constraint.
     • We can create text facets, numeric facets, timeline facets, and scatterplot facets. Various customized facets can also be designed.
  14. Faceting
  15. Faceting (callout: The column Type has 18 unique options)
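The idea behind a text facet can be sketched in a few lines of Python: count how often each distinct value occurs in a column, then filter the rows to one chosen value. This is only an illustration of the concept, not Refine's implementation; the sample rows and the column name Type are assumptions.

```python
from collections import Counter


def text_facet(rows, column):
    """Count how many rows carry each distinct value of the given column."""
    return Counter(row[column] for row in rows)


# Hypothetical sample rows; the real data set is much larger.
rows = [
    {"Type": "Flood"},
    {"Type": "flood"},
    {"Type": "Earthquake"},
    {"Type": "Flood"},
]

facet = text_facet(rows, "Type")

# Selecting one facet value corresponds to filtering the rows to that value.
selected = [row for row in rows if row["Type"] == "Flood"]
print(len(facet), len(selected))
```

Note that "Flood" and "flood" are counted as separate options, which is the redundancy the following slides remove.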
  16. Removing Redundancy (callout: Even though they are of the same type, values show up as different options because of differences in case)
  17. Removing Redundancy
  18. Removing Redundancy
  19. Removing Redundancy
  20. Removing Redundancy (callout: Reduced to 15 unique options)
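The effect of normalizing case can be sketched in Python as follows. The sample values are assumptions, and title-casing with Python's `str.title()` is just one possible normalization; the slides do not state which transform was applied in Refine.

```python
def normalize_case(values):
    """Trim whitespace and title-case each value so case variants collapse."""
    return [value.strip().title() for value in values]


# Hypothetical values mimicking the case variants seen in the facet.
types = ["Flood", "flood", "FLOOD", "Earthquake", "earthquake ", "Storm"]

before = len(set(types))
after = len(set(normalize_case(types)))
print(before, after)
```

The number of distinct options drops once values that differ only in case or stray whitespace are mapped to one spelling, which mirrors the reduction from 18 to 15 options shown on the slides.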
  21. Numeric Faceting
  22. Numeric Faceting (callout: Highly clustered towards low values)
  23. Numeric Faceting
  24. Numeric Faceting
  25. Numeric Faceting (callout: Cost column is blank and has no value)
  26. Numeric Faceting (callout: Calamities with low cost)
  27. Numeric Faceting (callout: Calamities with high cost)
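A numeric facet can be thought of as a histogram over a column plus separate counts for blank and non-numeric cells. The Python sketch below illustrates that idea under stated assumptions: the sample cost values and the bucket size of 1000 are hypothetical, and the function is not Refine's implementation.

```python
def numeric_facet(values, bucket_size):
    """Return (bucket counts, blank count, non-numeric count) for raw cell text."""
    buckets = {}
    blanks = 0
    non_numeric = 0
    for raw in values:
        text = raw.strip()
        if not text:
            blanks += 1
            continue
        try:
            number = float(text)
        except ValueError:
            non_numeric += 1
            continue
        # Key each bucket by its lower bound, e.g. 0, 1000, 2000, ...
        start = int(number // bucket_size) * bucket_size
        buckets[start] = buckets.get(start, 0) + 1
    return buckets, blanks, non_numeric


# Hypothetical cost values, including blank and non-numeric cells.
costs = ["120", "80", "", "950", "15000", "n/a", "  ", "40"]
buckets, blanks, non_numeric = numeric_facet(costs, 1000)
print(buckets, blanks, non_numeric)
```

Most sample values fall into the lowest bucket, echoing the "highly clustered towards low values" callout, while blank cells are counted separately so they can be inspected on their own.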
  28. Clustering
     • Clustering is used to merge choices that look similar.
  29. Clustering
  30. Clustering (callout: Data merged)
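One common way to find choices that "look similar" is key collision: compute a normalized key for each value and group the values that share a key. The Python sketch below uses a simplified fingerprint (lowercase, strip punctuation, sort unique tokens). It is an illustration of the idea only; the slides do not say which clustering method was used, and the sample names are assumptions.

```python
import string
from collections import defaultdict


def fingerprint(value):
    """Simplified key: lowercase, drop punctuation, sort the unique tokens."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(values):
    """Group distinct values that share a fingerprint; keep groups of 2 or more."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Hypothetical values with spelling variants of the same choice.
names = ["Flash Flood", "flash flood", "Flood, Flash", "Earthquake", "Flash  flood."]
clusters = cluster(names)
print(clusters)
```

Each resulting cluster is a candidate for merging into a single spelling; a person still decides which clusters to merge and what the merged value should be.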
  31. Using Expressions
     • Expressions are used to transform existing data to create new data.
  32. Using Expressions
  33. Using Expressions
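The notion of an expression that derives new data from existing cells can be sketched in Python as a function applied to every row to fill a new column. The helper name `add_column`, the column names, and the sample rows are all hypothetical, and this is Python rather than the expression language used inside Refine.

```python
def add_column(rows, new_name, expression):
    """Return new rows with an extra column computed from each existing row."""
    return [{**row, new_name: expression(row)} for row in rows]


# Hypothetical rows; column names are assumptions for illustration.
rows = [
    {"Country": " india ", "Killed": "120"},
    {"Country": "Japan", "Killed": ""},
]

# Derive a cleaned country name from the existing Country column.
rows = add_column(rows, "CountryClean", lambda r: r["Country"].strip().title())

# Derive a numeric column, leaving blanks as None instead of failing.
rows = add_column(
    rows,
    "KilledNumber",
    lambda r: int(r["Killed"]) if r["Killed"].strip() else None,
)
print(rows[0]["CountryClean"], rows[0]["KilledNumber"])
```

The original columns are left untouched, so the transformation can be checked against the source values before the old column is dropped.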
  34. Data Augmentation
     • The Reconciliation option in Google Refine allows data to be linked to web pages. Suppose we want details on the country where a calamity struck; we can perform the following steps.
  35. Reconciliation
  36. Reconciliation
  37. Reconciliation
  38. Reconciliation
  39. Reconciliation
  40. Data Enrichment
  41. Data Enrichment
  42. Data Enrichment
  43. Data Enrichment
  44. Export
  45. How to Use Twitter Data: Step 1, Step 2
  46. Step 3
  47. Step 4, Step 5
  48. Step 6
  49. Step 7, Step 8
  50. Output
  51. Friends Events using Facebook data
  52. Friends Events using Facebook data
  53. Friends Events using Facebook data
  54. Friends Events using Facebook data
  55. Friends Events using Facebook data
  56. Friends Events using Facebook data
  57. Friends Events using Facebook data
  58. Friends Events using Facebook data
  59. Friends Events using Facebook data
  60. Friends Events using Facebook data
  61. Friends Events using Facebook data
     • After splitting the cell using the separator },{
  62. Friends Events using Facebook data
  63. Friends Events using Facebook data
     • After updating the other columns and rearranging them, we get the events as shown.
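The split on the separator },{ mentioned in slide 61 can be sketched in Python as follows. The sample cell and its field names (name, location) are assumptions, since the slides do not show the actual Facebook response, and the sketch only handles flat records without nested objects.

```python
import json


def split_records(cell, separator="},{"):
    """Split a cell of comma-joined flat JSON objects into parsed records."""
    records = []
    for part in cell.split(separator):
        text = part.strip()
        # Splitting removes the braces at each cut, so restore them.
        if not text.startswith("{"):
            text = "{" + text
        if not text.endswith("}"):
            text = text + "}"
        records.append(json.loads(text))
    return records


# Hypothetical cell content; field names are assumptions for illustration.
cell = (
    '{"name": "Alumni Meet", "location": "Kharagpur"},'
    '{"name": "Tech Fest", "location": "Kolkata"}'
)
events = split_records(cell)
print([event["name"] for event in events])
```

Splitting on a literal separator is fragile when records contain nested objects or the text },{ inside a string; in that case parsing the whole value as a JSON array is the safer route.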
  64. Liked / Disliked
  65. Thank You