Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Open refine to update and clean up your messy data

453 views

Published on

This presentation provides a brief introduction to the open source tool called Open Refine

Published in: Education
  • Be the first to comment

Open refine to update and clean up your messy data

  1. 1. OPEN REFINE TO PROFILE AND CLEAN UP YOUR MESSY DATA
  2. 2. THE WOES OF DATA NOTE ALL DATA ARE MADE EQUAL! DATA ARE CREATED FROM A VARIETY OF SOURCES: • AUTOMATICALLY GENERATED • MANUALLY CREATED AND MANAGED • SCRAPPED FROM THE WEB • TAKEN FROM ANOTHER SOURCE • ETC. EACH SOURCE TYPICALLY HAS ITS OWN METHOD OF HANDLING DATA. THEREFORE DATA SOURCES ARE OFTEN HETEROGENEOUS, WHICH CREATES INCONSISTENT AND INACCURATE DATA. FURTHER, THERE IS NO INFORMATION PROVIDED ABOUT THE DATA TO HELP BETTER UNDERSTAND WHAT THE DATA ARE ABOUT.
  3. 3. WHAT TO DO? ESSENTIALLY, YOU ARE LOOKING FOR CONSISTENT AND ACCURATE DATA. TO DO THIS, THERE ARE OPTIONS: • CREATE A SCRIPT TO CLEAN DATA • HAVE SOMEONE CLEAN YOUR DATA • USE AN APPLICATION TO HELP YOU CLEAN YOUR DATA TODAY, WE’RE GOING TO LOOK AT OPEN REFINE, AN APPLICATION THAT CAN HELP YOU CLEAN YOUR DATA.
  4. 4. WHAT IS OPEN REFINE? OPEN REFINE ORIGINATED WITH METAWEB TECHNOLOGIES AND THEN GOOGLE. FORMERLY KNOWN AS FREEBASE GRIDWORKS, IT BECAME KNOWN AS GOOGLE REFINE AND THEN OPEN REFINE. IT IS A FULLY OPEN SOURCE APPLICATION. YOU CAN DOWNLOAD IT AT: HTTP://OPEN REFINE.ORG/DOWNLOAD.HTML
  5. 5. OPEN REFINE AND THE WOES OF DATA WHETHER YOU HAVE HETEROGENEOUS DATA OR DATA YOU’RE UNFAMILIAR WITH, OPEN REFINE CAN HELP WITH: • DATA CLEANUP • CORRECT INCONSISTENCIES AND INACCURACIES • DATA PROFILING • CREATE AN ANALYSIS OF UNFAMILIAR DATA BY LEARNING WHAT THE DATA ARE
  6. 6. 2 BASIC OPERATIONS OF OPEN REFINE BOTH DATA CLEANUP AND DATA PROFILING RELY ON A DATABASE, CALLED FREEBASE, TO VISUALIZE AND MANIPULATE YOUR DATA. VISUALIZING YOUR DATA INCLUDES BEING ABLE TO SEE THE “TYPE” OF DATA. IS IT A DATE, NUMBER, TEXT? ARE ALL THE ENTRIES THE SAME? VISUALIZING ALSO INCLUDES BEING ABLE TO SEE TIMELINES FOR ENTRIES THAT ARE DATES. MANIPULATING DATA INCLUDES GLOBAL SEARCHING AND REPLACING, UPDATING ENTRIES SINGLY OR BY BATCH, ADDING NEW DATA, REMOVING DATA, CHANGING THE DATA TYPES, AND MORE!
  7. 7. LET’S DO THE TOUR • CREATE A PROJECT • DIFFERENT VIEWS • SEEING YOUR DATA • MANIPULATING ROWS/COLUMNS • CORRECTING ERRORS
  8. 8. HOW TO MAKE CHANGES TO DATA • CHANGES TO COLUMNS • SPLIT • ADD • REMOVE • RENAME • MOVE • CHANGES CELLS (TO DATA ANYWHERE) • EDIT CELLS USING CUSTOM OR DEFAULT TRANSFORMS, FILL DOWN/BLANK, SPLIT/JOIN CELLS, CLUSTER/EDIT
  9. 9. AVAILABLE TRANSFORMATIONS TRIM OR COLLAPSE WHITESPACES UNESCAPE HTML ENTITIES CHANGE THE CASE (TITLE, LOWER OR UPPER CASE) CHANGE THE DATA TO NUMERIC, DATE OR TEXT
  10. 10. CUSTOM TRANSFORMATIONS • JYTHON • HTTP://WWW.JYTHON.ORG/ • HTTPS://GITHUB.COM/OPENREFINE/OPENREFINE/WIKI/JYTHON • GREL (OPEN REFINE EXPRESSION LANGUAGE) • HTTPS://GITHUB.COM/OPENREFINE/OPENREFINE/WIKI/GREL-FUNCTIONS • REGULAR EXPRESSIONS • HTTP://EN.WIKIPEDIA.ORG/WIKI/REGULAR_EXPRESSION • HTTP://WWW.REGULAR-EXPRESSIONS.INFO/QUICKSTART.HTML
  11. 11. GREL • SLICE • EXAMPLE CHANGE “2010-05-31T01:10:0Z” TO “05/31/2010” • VALUE.SLICE(5,7) + ‘/’ + VALUE.SLICE(8,10) + ‘/’ + VALUE.SLICE(0,4) • YOU CAN ALSO USE: • ADD A PREFIX • “PREFIX” + VALUE • SPLIT AND JOIN • A:B:C:D:E -> B:C:D
  12. 12. REGULAR EXPRESSIONS • TEXT PATTERN THAT ONE CAN USE WITH MANY MODERN APPLICATIONS AND PROGRAMMING LANGUAGES • REGULAR EXPRESSIONS COME IN FLAVORS • .NET, JAVA, JAVASCRIPT, PERL, PYTHON, RUBY, … • METACHARACTERS: 12 PUNCTUATION CHARACTERS THAT MAKE REGULAR EXPRESSIONS WORK
  13. 13. REGULAR EXPRESSIONS • CHARACTER CLASS ABBREVIATIONS
  14. 14. REGULAR EXPRESSIONS • REMOVE THE “.” AT THE END OF A PHRASE • VALUE.REPLACE(/[.]$/, “”) • WHAT HAPPENS IF YOU JUST PUT /.$/ WITHOUT THE BRACKETS? • REMOVE A STRING AT THE BEGINNING OF THE PHRASE WHERE THE FIRST LETTER OF THE STRING IS UPPER OR LOWER CASE • VALUE.REPLACE(/^W+S/, “”) • TEST IT OUT • HTTP://REGEXPAL.COM/
  15. 15. QUESTIONS?
  16. 16. RESOURCES • USING OPEN REFINE … THE BOOK, QA76.9.D343 V47 2013 • OPEN REFINE DOCUMENTATION: HTTP://OPEN REFINE.ORG/DOCUMENTATION.HTML • OPEN REFINE COMMUNITY: HTTP://OPEN REFINE.ORG/COMMUNITY.HTML • CODE PROJECT ON REGULAR EXPRESSIONS: HTTP://WWW.CODEPROJECT.COM/ARTICLES/9099/THE-MINUTE- REGEX-TUTORIAL • REGULAR EXPRESSIONS CHEAT SHEET: HTTP://WWW.CHEATOGRAPHY.COM/DAVECHILD/CHEAT- SHEETS/REGULAR-EXPRESSIONS/ • REGULAR EXPRESSIONS TESTER: HTTP://REGEXPAL.COM/ , HTTP://REGEX101.COM/

×