IRE 2012 Unstructured Data Talk
  • Hi, I’m Eugene Wu. I was asked to talk about unstructured data, and after some thought, I figured I’ll...
  • Actually talk about structured data. In particular, I want you to walk away with three things: what structured data is and why you should care; how to think about structured data in contrast to unstructured data, specifically that data isn’t just …; and finally, a bunch of stories and visualizations, with quick accounts of how the authors went from unstructured to structured data. Let me start with an example before talking about what structure means.
  • Jeff Larson and Joaquin Sapien of ProPublica and Aaron Kessler of the Sarasota Herald-Tribune did a really nice data journalism piece on the impact of tainted drywall on homeowners. A lot of homes were built using drywall from China that emitted foul odors and frequently caused mysterious electronics failures. There were also health problems in residents. They produced a really nice visualization of the counties affected by the tainted drywall: darker blue = more tainted homes. Let’s walk through how they went from unstructured data to this visualization.
  • They started with court documents from class action lawsuits and tax forms
  • And extracted the plain text. For example, this is a partial list of plaintiffs. There were about 2000 in this document.
  • And they manually extracted the state and address information from the text.
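The notes say the addresses were pulled out by hand. As a rough sketch only, and assuming every paragraph follows the "located at <street>, <city>, <state> <zip>" wording seen in the sample complaint, a regular expression could propose candidate addresses for a person to check. The pattern and the function name `extract_addresses` are illustrative assumptions, not what the reporters used.

```python
import re

# Assumed wording: "located at <street>, <city>, <state> <5-digit zip>".
ADDRESS_PATTERN = re.compile(
    r"located at\s+(?P<street>[^,]+),\s*(?P<city>[^,]+),\s*"
    r"(?P<state>[A-Za-z ]+?)\s*(?P<zip>\d{5})"
)


def extract_addresses(text):
    """Return one dict (street, city, state, zip) per address found."""
    return [m.groupdict() for m in ADDRESS_PATTERN.finditer(text)]


complaint = (
    "1056. Plaintiffs - Intervenors, Robert and Tasha Lambert are citizens of "
    "Alabama and together own real property located at 541 Lynn Hurst Court, "
    "Montgomery, Alabama 36117. 1057. Plaintiff-Intervenor, Brenda Owens, is a "
    "citizen of Alabama and owns real property located at 2105 Lane Avenue, "
    "Birmingham, Alabama 35217."
)

for address in extract_addresses(complaint):
    print(address)
```

A pattern like this breaks as soon as a paragraph is worded differently, so the output is best treated as a first pass before manual review.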
  • They then geocoded the addresses to get latitude longitude information,
  • and finally the county that house belongs in. Doing this process for nearly 7000 addresses
  • reveals the number of tainted homes in each of the 150 counties. This table is imported into a visualization tool to construct…
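Once every address carries a county, the per-county table is just a count. A minimal sketch, assuming each geocoded row is a dict with `state` and `county` keys (the row layout, the function name `homes_per_county`, and the sample rows are all made up for illustration):

```python
from collections import Counter


def homes_per_county(geocoded_rows):
    """Count addresses per (state, county) pair."""
    return Counter((row["state"], row["county"]) for row in geocoded_rows)


# Placeholder rows: these addresses and county names are invented.
rows = [
    {"address": "1 Example St", "state": "Florida", "county": "County A"},
    {"address": "2 Example St", "state": "Florida", "county": "County A"},
    {"address": "3 Example St", "state": "Alabama", "county": "County B"},
]

for (state, county), count in homes_per_county(rows).most_common():
    print(state, county, count)
```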
  • The map that is shown on the ProPublica page. That was a fairly large number of steps.
  • If we take a quick look at their process, we can grossly simplify it down to the following steps: take text from the docs, specifically address information, and plot it on Google Maps.
  • And stepping back, to bring this back into the context of this talk, they start with unstructured information, extract specific structured data, and visualize it.
  • What we’ll talk about in this talk is how to go from unstructured information to structured data.
  • But the first thing to do is to describe…
  • why the heck we care, and what structured data is.
  • Who cares? Structured data makes your life easier in a number of ways. There’s lots of software (databases, PANDA) to help you store and analyze structured data.
  • In a similar vein, practically all visualization tools expect your data in some kind of structured format.
  • It can easily take a long time to extract structured data from your documents. But now that you’ve got structured data about tainted homes in each county, it can be easier to create mashups with other data. In contrast, there are not a lot of tools that work with unstructured data.
  • The canonical example of structured data is a table like this, which I’m sure you’ve seen either on the web in the wild or on sites like Google Fusion Tables. What makes structured data... structured?
  • For practical purposes, think of structured data
  • as a bunch of attributes, for example each of these 3 columns. Each attribute has a name and a data type.
  • Why are names important? Let’s say you want to create that ProPublica map of each county.
  • If I just stored the data in the table in a text file like on the right, Google Maps has no idea what it’s trying to plot. I can’t point a map at that text.
  • What I can say is “create a map and use county”. Since the attribute has a name, the map can easily get the county names.
  • The data type embodies the “meaning” of the attribute. It says “what does this attribute represent?” The more specific you can be, the better.
  • If the data type is a number, then we can sort it, or take the sum or average. If we know it’s a specific type of number (date/time), then we can use the hour or month. Lat, lon can be plotted on a map. Non-numeric but still important are structured strings.
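To make the point about names and types concrete, here is a small sketch using Python's standard `csv` module. The table is hypothetical: only the Lee County figure appears on the slides, and the other rows, the column names, and `load_rows` are assumptions for illustration.

```python
import csv
import io

# Hypothetical table; only the Lee County figure comes from the slides.
TABLE = """county,state,addresses
Lee,Florida,1518
County B,Florida,20
County C,Alabama,7
"""


def load_rows(text):
    """Parse the CSV text and give the addresses column a numeric type."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        row["addresses"] = int(row["addresses"])
    return rows


rows = load_rows(TABLE)
counties = [row["county"] for row in rows]      # pick an attribute by name
total = sum(row["addresses"] for row in rows)   # numbers can be summed
average = total / len(rows)                     # or averaged
largest_first = sorted(rows, key=lambda r: r["addresses"], reverse=True)
print(counties, total, average, largest_first[0]["county"])
```

Because the column has a name, a tool can be told "use county"; because `addresses` is an integer rather than text, sorting and summing behave as expected.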
  • Non-numeric but still important are structured strings. These are special because for any given thing like Florida, there’s only one way to spell it.
  • This is important because something like Florida could be spelled in numerous ways. The computer doesn’t know how to reconcile the differences. If we wanted the total number of tainted homes in Florida, we would end up with five separate counts instead of one.
  • Getting a program to extract Florida in a single unambiguous way is generally pretty hard, but it’s important.
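A sketch of what reconciling the spellings could look like, using the five variants and counts shown on the Florida slide. The lookup table `CANONICAL` and the function names are hypothetical; a hand-built mapping like this only covers spellings someone has already seen.

```python
from collections import Counter

# Hypothetical lookup from observed spellings to one canonical name.
CANONICAL = {
    "florida": "Florida",
    "fl": "Florida",
    "flroida": "Florida",
    "floridastate": "Florida",
    "florida's": "Florida",
}


def normalize_state(raw):
    """Map a raw spelling to its canonical form, or leave it unchanged."""
    key = raw.strip().lower().replace("\u2019", "'")
    return CANONICAL.get(key, raw.strip())


def totals_by_state(rows):
    """Sum counts after normalizing each state name."""
    totals = Counter()
    for state, count in rows:
        totals[normalize_state(state)] += count
    return dict(totals)


# Spellings and counts as they appear on the slide.
rows = [("FLORIDA", 5), ("FL", 10), ("Flroida", 1),
        ("FloridaState", 1), ("Florida\u2019s", 1)]
print(totals_by_state(rows))
```

Without the normalization step, the same rows would produce five separate totals.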
  • Finally, they should be consistent, in the sense that each row in your table, or each document in your dataset, contains these attributes. Sometimes your structured data may not be in this kind of tabular format, but rather data attached to individual documents.
  • Hopefully I’ve convinced you that structured data is a good idea. Now I want to describe how structured data relates to unstructured data…
  • Specifically, data isn’t simply unstructured or structured. It all lies on a continuum. I want to give you examples that span this spectrum and what data we may want out of them.
  • The name of the game is moving towards the right,
  • Concretely, let’s say we have a bunch of tweets and we want to understand how the weather reported by the twitterverse differs across geographic areas.
  • We want to extract two pieces of structured data. Weather is a string containing “sunny”. Location is a string corresponding to the location. Or we could extract an even more specific data type
  • by using a geocoding app to turn the string “texas” into latitude and longitude coordinates.
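The tweet example can be sketched with a simple pattern. This assumes tweets of the form "<weather word> ... in <place>" and a tiny made-up weather vocabulary; `WEATHER_WORDS`, `TWEET_PATTERN`, and `extract_weather` are hypothetical names. Geocoding is left out because it needs an external service.

```python
import re

# Hypothetical vocabulary; a real list would be much longer.
WEATHER_WORDS = ("sunny", "rainy", "cloudy", "snowy")

TWEET_PATTERN = re.compile(
    r"\b(?P<weather>" + "|".join(WEATHER_WORDS) + r")\b.*?\bin\s+"
    r"(?P<location>[A-Za-z ]+)",
    re.IGNORECASE,
)


def extract_weather(tweet):
    """Return a row with weather and location strings, or None if no match."""
    match = TWEET_PATTERN.search(tweet)
    if match is None:
        return None
    return {
        "tweet": tweet,
        "weather": match.group("weather").capitalize(),
        "location": match.group("location").strip().title(),
    }


print(extract_weather("It's sunny in texas"))
print(extract_weather("Stuck in traffic again"))
```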
  • I’ve summarized the process into something that helps calm my nerves, which is: when I have …, ask what structure I need. Is it dollars? Addresses? That helps target my search for finding…
  • I figure it would be nice to end with more examples.
  • http://www.wordle.net/create Last year, the Globe produced a word cloud of Obama’s State of the Union speech.
  • The structure is an attribute that represents a single word in the speech, perhaps with the punctuation removed.
  • So we would start from the speech text and
  • Construct this single attribute table
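Building that single-attribute table is a short exercise. A minimal sketch, assuming a word is any run of letters or apostrophes (the function name `word_table` is made up, and the speech excerpt is the opening line shown on the slide):

```python
import re
from collections import Counter


def word_table(speech):
    """Return one row per word, with punctuation stripped."""
    return re.findall(r"[A-Za-z']+", speech)


speech = (
    "Mr. Speaker, Mr. Vice President, members of Congress, "
    "distinguished guests, and fellow Americans:"
)
words = word_table(speech)
counts = Counter(word.lower() for word in words)
print(words)
print(counts.most_common(3))
```

The counts are what a word cloud would use to size each word.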
  • Twitter released this graphic of the number of tweets per second referencing Bin Laden when he was killed last year.
  • In this case tweets already contain the information we want – time.
  • Per capita availability of boneless, trimmed meat
  • We need to extract two pieces of info. Similar to the Iraq map, we need location information, but this time shapes of regions rather than single lat/lon coordinates. The nice part of this data is that it is often considered important, and can be found in a consistent location in the documents.
  • Another example is the Deadly Day in Baghdad visualization produced by Jacob Harris and others at the NYTimes, which depicts the distribution of deaths in Baghdad for a single day. The location of each circle is the lat/lon of where it happened. The size is how many people died.
  • This is an example of a WikiLeaks document the NYTimes had to work with.
  • KIA = killed in action. In this case, the NYTimes extracted the data by hand, and sometimes this may be the case. But if the documents all looked like this (KIA at the top, WHERE:), it _may_ be possible to use pattern matching to extract this data.
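Here is what that pattern matching might look like in Python, using the pattern from the Pattern Matching slide with the backslashes that the slide extraction appears to have dropped restored (that restoration is an assumption). The sample report text is invented; real documents would need checking against the original.

```python
import re

# Pattern from the slide, with backslashes restored (assumed):
# a 1-3 digit count, an optional short uppercase tag, then "KIA".
KIA_PATTERN = re.compile(r"(?P<n>\d{1,3})(\s[A-Z]{1,3})?\sKIA")


def count_kia(text):
    """Return every killed-in-action count found in the text."""
    return [int(m.group("n")) for m in KIA_PATTERN.finditer(text)]


# Made-up report text for illustration.
sample = "REPORT: 14 KIA, 12 WIA. LATER 3 CIV KIA NEAR CHECKPOINT."
print(count_kia(sample))
```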
  • Since much data about our lives is inexorably tied to where we live, we are often concerned with the regions that we live in. This visualization shows the number of households per 1000, in regions throughout MA, that have lived there for 3+ generations, as an indicator of commitment to the region.
  • We need to extract two pieces of info. Similar to the Iraq map, we need location information, but this time shapes of regions rather than single lat/lon coordinates.
  • In this case, we are starting with what looks like structured data, and further extracting info.
  • Person’s name. Extracting this type of information is called entity extraction, where an entity may be a business name, famous person, etc. This is typically quite difficult, and requires an existing dataset of “important entities”.
  • Finally, a popular analysis is to classify the unstructured documents, categorizing by topic or emotion. Twitinfo is a tool by marcua to analyze tweets about particular topics. One of its features is analyzing the sentiment of the tweets. Here are 4 example tweets from last year talking about the Christchurch earthquake. Blue = positive, red = negative. The pie chart shows that the tweets are overwhelmingly positive.
  • The structured data would then be happiness, and its type is a number between -1 and 1. There exist tools for specific types of analysis like sentiment or topic. However
  • Be really careful with these types of automatic categorization tools
  • In all of the examples until that last one, what we’ve talked about amounted to pattern matching. This is really good. There are tons of tools that do a good job.
  • For example, the extracted sentiment of tweets about the New Zealand earthquake was really positive! This is surprising because earthquakes are generally considered not so good. It happens because the tweets are all wishing the survivors the best, but these extractors don’t understand that.
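To see why a context-free extractor gets this wrong, here is a deliberately naive word-counting scorer. It is not how Twitinfo works; the word lists, the function name `happiness`, and the second sample tweet are made up for illustration.

```python
import re

# Tiny made-up word lists; real sentiment tools use far larger ones.
POSITIVE = {"great", "best", "hope", "love", "safe"}
NEGATIVE = {"terrible", "awful", "sad", "dead"}


def happiness(text):
    """Score text from -1 (negative) to 1 (positive) by counting listed words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(token in POSITIVE for token in tokens)
    neg = sum(token in NEGATIVE for token in tokens)
    if pos + neg == 0:
        return 0.0
    return (pos - neg) / (pos + neg)


print(happiness("Great, 7AM meeting"))
print(happiness("Wishing the best to everyone in Christchurch, hope you are safe"))
```

Both texts score as fully positive: the first is probably sarcasm and the second is about a disaster, which is the lack of context the slides warn about.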
  • You can give your pile of documents to a thousand people who will extract the data you want quickly and cheaply. MTurk and CrowdFlower have more of an “anonymous workers” approach, where someone will do your work but you don’t know who. oDesk is more like directly hiring a contractor. In both cases, you’ll need to train the workers and deal with quality issues.
  • If you have a bunch of the same forms, handwritten or not, Captricity is a new startup that will take your forms, extract the parts you care about, and return a nice, structured table containing the data.
  • If you are looking for people or places, Open Calais is a tool that automatically finds entities. For example: Mario Monti is prime minister of Italy.
  • But I’m going to give you a tip sheet later that also contains this and the other tools.
  • Just say the text!
  • Number of users, number of posts per day. Major posts that have been censored
  • ----- Meeting Notes (6/12/12 00:16) -----put chi chu here instead
  • Thankfully the journalism and media studies program ----- Meeting Notes (6/12/12 00:39) -----change tweet to post
  • Shorter. Bo Xilai falls from power.
  • We extract information such as the IP address of the post, the post contents, the post date, the deletion date, the poster, and other information.
  • The most difficult is completely unstructured data. For example, handwritten letters, where we want the sender and recipient names.
  • Or a scanned typewritten letter, where we want company and date information.
  • Or text files like the ProPublica example, where we want state and address data.
  • A non-text example would be scanned forms, in this case Federal election contribution reports, where we want the committee name and the donation amounts and dates.
  • Going towards the structured end, there is data that smells unstructured, but actually contains some structured data. For example, a tweet I wrote about trends in the database community contains more than just the text.
  • In addition to the tweet text, which is unstructured, the Twitter API provides structured information: the timestamp of when the tweet was posted, my username, the number of retweets, etc. These are all valuable to analyze without needing to process the actual text.
  • Similarly, emails contain structured data in the form of….
  • Subject, date, sender and tons more information. Later, Sudheendra will describe his email analysis tool that extracts specific pieces of structured data and visualizes them.
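Those email fields can be read without touching the message body. A minimal sketch using Python's standard `email` package; the header values are the ones on the email slide, while the body text and the function name `email_fields` are made up.

```python
from email import message_from_string

# Header values come from the email slide; the body text is made up.
RAW_EMAIL = (
    "Subject: Re: IRE conference in Boston\n"
    "Date: June 1, 3:08PM\n"
    "From: jaimi@ire.org\n"
    "\n"
    "See you at the conference.\n"
)


def email_fields(raw):
    """Pull the structured header fields out of a raw email message."""
    msg = message_from_string(raw)
    return {
        "subject": msg["Subject"],
        "date": msg["Date"],
        "sender": msg["From"],
    }


print(email_fields(RAW_EMAIL))
```

The date is kept as a plain string here; turning it into a real date/time type would be the next step toward the structured end of the continuum.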
  • Working directly with unstructured data is really really hard.Often times this requires manual work of analyzing documents one by one.
  • convince you that you can do a lot without messing too much with actual unstructured data.
  • Hello, my name is Eugene Wu. I’m actually a student right across the river at MIT. I study databases. It’s not part of my PhD, but what I’m interested in is how reporters are dealing with and analyzing their data.
  • When I was asked to talk about analyzing unstructured data for stories, I had a hard time coming up with a talk. This is a fairly open-ended topic, and I could talk about data scraping, visualization, extraction. The reason why there are so many techniques is that dealing with unstructured data is very difficult and computers are terrible at it.
  • I also didn’t want to talk about a single tool, because tools are often used for specific types of data or analyses. I was looking for something that is useful for a general audience. Then I thought, hey, I’m a database student, and we work with tables all the time!
  • The best ones are numerical data types. Computers are really, really good at processing numerical values. They can easily show you the sum, or average, or look for trends. In fact, pretty much every visualization tool and analysis program will expect numerical data.
  • If you can specify the type of numeric, even better. For example, if it’s lat, lon, then you can plot it on a map.
  • Next are structured strings. These are words where the meaning is different if the values are different. That is, there’s one way to say Florida: capitalized Florida. This is important when you want to ask “what’s the total number of addresses in Florida?”
  • Finally, there is random text. This is very akin to saying “this attribute is unstructured text”. Computers are horrible with this type of data because it’s so ambiguous. ----- Meeting Notes (6/12/12 18:11) ----- know is a number, we know we can sort them, lat lon we can put it on a map. stop.
  • Transcript of "IRE 2012 Unstructured Data Talk"

    1. 1. Analyzing Unstructured Data for Stories eugenewu@mit.edu
    2. 2. What Am I Talking About?• Example• Structured Data 101• Structured Data Continuum• More Examples
    3. 3. http://projects.propublica.org/drywall/
    4. 4. http://www.propublica.org/documents/item/drywall-plaintiffs-omnibus-class-action-complaint
    5. 5. 1056. Plaintiffs - Intervenors, Robert and Tasha Lambert are citizens of Alabama and together own real property located at 541 Lynn Hurst Court, Montgomery, Alabama 36117. Plaintiffs are participating as class representatives in the class and subclasses as set forth in the schedules accompanying this complaint which are incorporated herein by reference. 1057. Plaintiff-Intervenor, Brenda Owens, is a citizen of Alabama and owns real property located at 2105 Lane Avenue, Birmingham, Alabama 35217. Plaintiff is participating as a class representative in the class and subclasses as set forth in the schedules accompanying this complaint which are incorporated herein by reference. 1058. Plaintiffs-Intervenors, Daniel and Nicole Smith are citizens of Alabama and together own real property located at 766 Tabernacle Road, Monroeville, Alabama http://www.propublica.org/documents/item/drywall-plaintiffs-omnibus-class-action-complaint
    6. 6. 1056. Plaintiffs - Intervenors, Robert and Tasha Lambert are citizens of Alabama and together own real property located at 541 Lynn Hurst Court, Montgomery, Alabama 36117. Plaintiffs are participating as class representatives in the class and subclasses as set forth in the schedules accompanying this complaint which are incorporated herein by reference. 1057. Plaintiff-Intervenor, Brenda Owens, is a citizen of Alabama and owns real property located at 2105 Lane Avenue, Birmingham, Alabama 35217. Plaintiff is participating as a class representative in the class and subclasses as set forth in the schedules accompanying this complaint which are incorporated herein by reference. 1058. Plaintiffs-Intervenors, Daniel and Nicole Smith are citizens of Alabama and together own real property located at 766 Tabernacle Road, Monroeville, Alabama http://www.propublica.org/documents/item/drywall-plaintiffs-omnibus-class-action-complaint
    7. 7. 541 Lynn Hurst Court, Montgomery, Alabama 36117; 37.0625, -95.677068
    8. 8. 541 Lynn Hurst Court, Montgomery, Alabama 36117; 37.0625, -95.677068; Jefferson County
    9. 9. http://projects.propublica.org/drywall/
    10. 10. http://projects.propublica.org/drywall/
    11. 11. Scanned Documents; Addresses; Google Maps
    12. 12. Scanned Documents: Unstructured Information; Addresses: Structured Data; Google Maps: Visualization
    13. 13. Scanned Documents: Unstructured Information; Addresses: Structured Data; Google Maps: Visualization
    14. 14. Scanned Documents: Unstructured Information; Addresses: Structured Data; Google Maps: Visualization
    15. 15. Scanned Documents: Unstructured Information; Addresses: Structured Data; Google Maps: Visualization. Who cares? What is it?
    16. 16. Who Cares? Software, Visualization, Mashups. Store: Databases, PANDA. Analyze: Fusion tables, Excel, Databases, R/Python/Ruby
    17. 17. Who Cares? Software, Visualization, Mashups
    18. 18. Who Cares? Software, Visualization, Mashups. Tainted House Data + Economic Data + Health Stats + Crime Stats + Corruption Data
    19. 19. Structured Data
    20. 20. Structured Data: Attribute (Name, Data type); Consistent
    21. 21. Structured Data: Attribute (Name, Data type); Consistent
    22. 22. Structured Data: Attribute (Name, Data type); Consistent
    23. 23. Structured Data: Attribute (Name, Data type); Consistent. Florida’s Lee County has 1518 addresses
    24. 24. Structured Data: Attribute (Name, Data type); Consistent
    25. 25. Structured Data: Attribute (Name, Data type); Consistent
    26. 26. Structured Data: Attribute (Name, Data type); Consistent. Data types: Numeric (integers, dollars,…), Date/Time, Lat, Lon
    27. 27. Structured Data: Attribute (Name, Data type); Consistent. Data types: Numeric (integers, dollars,…), Date/Time, Lat, Lon, Structured strings (Florida)
    28. 28. Structured Data: Attribute (Name, Data type); Consistent. Spellings: FLORIDA, FL, Flroida, FloridaState, Florida’s
    29. 29. Structured Data: Attribute (Name, Data type); Consistent. Counts by spelling: FLORIDA 5, FL 10, Flroida 1, FloridaState 1, Florida’s 1
    30. 30. Structured Data: Attribute (Name, Data type); Consistent
    31. 31. What Am I Talking About?• Structured Data 101• Structured Data Continuum• More Examples
    32. 32. Continuum: unstructured to structured
    33. 33. Continuum (unstructured to structured): Images http://www.whatisstephenharperreading.ca/2010/03/01/book-number-76-one-day-in-the-life-of-ivan-denisovich-by-alexander-solzhenitsyn/
    34. 34. Continuum (unstructured to structured): Images, Text Blob. 1056. Plaintiffs - Intervenors, Robert and Tasha Lambert are citizens of Alabama and together own real property located at 541 Lynn Hurst Court, Montgomery, Alabama 36117. Plaintiffs are participating as class representatives in the class and subclasses as set forth in the schedules accompanying this complaint which are incorporated herein by reference. 1057. Plaintiff-Intervenor, Brenda Owens, is a citizen of Alabama and owns real property located at 2105 Lane Avenue, Birmingham, Alabama 35217. Plaintiff is participating as a class representative in the
    35. 35. Continuum (unstructured to structured): Images, Text Blob, Email
    36. 36. Continuum (unstructured to structured): Images, Text Blob, Email. Subject: Re: IRE conference in Boston. Date: June 1, 3:08PM. From: jaimi@ire.org
    37. 37. Continuum (unstructured to structured): Images, Text Blob, Email, Excel
    38. 38. Continuum (unstructured to structured): Images, Text Blob, Email, Excel
    39. 39. Continuum (unstructured to structured): Images, Text Blob, Email, Excel. “It’s sunny in texas”
    40. 40. Continuum (unstructured to structured): Images, Text Blob, Email, Excel. “It’s sunny in texas”. Tweet: It’s sunny in texas; Weather: Sunny; Location: Texas
    41. 41. Continuum (unstructured to structured): Images, Text Blob, Email, Excel. “It’s sunny in texas”. Tweet: It’s sunny in texas; Weather: Sunny; Location: (37.06, -95.67)
    42. 42. When: You have unstructured data. Ask: What structure do I need? Find: Attributes with simple types
    43. 43. What Am I Talking About?• Structured Data 101• Structured data continuum• More Examples
    44. 44. 2011 State of the Union http://www.boston.com/news/politics/specials/obama_state_of_the_union_word_cloud/
    45. 45. Name: Word. Type/Meaning: String
    46. 46. Mr. Speaker, Mr. Vice President, members of Congress, distinguished guests, and fellow Americans: Tonight I want to begin by congratulating the men and women of the 112th Congress, as well as your new Speaker, John Boehner. And as we mark this occasion, we’re also mindful of the empty chair in this chamber, and we pray for the health of our colleague -- and our friend -- Gabby Giffords. It’s no secret that those of us here tonight have had our differences over the last two years. The debates have been contentious; we have fought fiercely for our beliefs. And that’s a good thing.
    47. 47. Mr. Speaker, Mr. Vice President, members of Congress, distinguished guests, and fellow Americans: Tonight I want to begin by congratulating the men and women of the 112th Congress, as well as your new Speaker, John Boehner. And as we mark this occasion, we’re also mindful of the empty chair in this chamber, and we pray for the health of our colleague -- and our friend -- Gabby Giffords. It’s no secret that those of us here tonight have had our differences over the last two years. The debates have been contentious; we have fought fiercely for our beliefs. And that’s a good thing. Word: Mr, Speaker, Vice, President, Members, Congress, Distinguished, Guests, Americans, People, Jobs, New, years
    48. 48. Bin Laden Tweets/Sec http://www.flickr.com/photos/twitteroffice/5681263084/
    49. 49. Name: Time. Type/Meaning: Time
    50. 50. Deadly Day in Baghdad http://www.nytimes.com/interactive/2010/10/24/world/1024-surge-graphic.html?pagewanted=all
    51. 51. Name: Location, Type/Meaning: Lat, Lon. Name: Body Count, Type/Meaning: Number
    52. 52. http://www.nytimes.com/interactive/world/iraq-war-logs.html?pagewanted=all
    53. 53. 14, 12 Killed in Action Lat Lon http://www.nytimes.com/interactive/world/iraq-war-logs.html?pagewanted=all
    54. 54. Sentiment of NZ Earthquake http://twitinfo.csail.mit.edu/detail/4/
    55. 55. Name: Happiness. Type/Meaning: -1 to 1
    56. 56. Pattern Matching: “Great, 7AM meeting”: 7:00AM
    57. 57. Interpret Meaning: “Great, 7AM meeting”: Not Happy
    58. 58. Interpret Meaning: “Great, 7AM meeting”; “It’s still new”: Happy!
    59. 59. Interpret Meaning: “It’s still new”
    60. 60. Interpret Meaning: Earthquakes; “It’s still new”; Lack of context
    61. 61. Extracting meaning is by far the most difficult
    62. 62. What if it’s just unstructured?
    63. 63. CrowdSourcing: Lots of humans do tasks computers suck at. Training. Quality Issues
    64. 64. Dealing with Forms
    65. 65. Dealing with Forms
    66. 66. Entity Information
    67. 67. Pattern Matching• Regex – Describe and find patterns – Killed in action (?P<n>\d{1,3})(\s[A-Z]{1,3})?\sKIA
    68. 68. DBTruck demo?
    69. 69. Structure = Super Valuable
    70. 70. Structure = Super Valuable. When: You have unstructured data. Ask: What structure do I need? Find: Attributes with simple types
    71. 71. Structure = Super Valuable. When: You have unstructured data. Ask: What structure do I need? Find: Attributes with simple types. tinyurl.com/iredatatipsheet eugenewu@mit.edu @sirrice