Lecture 11 Unstructured Data and the Data Warehouse



Building the Data WareHouse http://it-slideshares.blogspot.com

Published in: Education
  • Integrating the two worlds is like matching different formats of electricity: alternating current (AC) and direct current (DC). The unstructured world operates on AC and the structured world operates on DC. Problems in integrating by text:
    ◦ Misspelling: What if “Chernobyl” is found in one environment and “Chernobile” in the other? Should a match be made between them? Do they refer to the same thing or to something different?
    ◦ Context: The term “bill” is found in the two worlds. Should the occurrences be matched? In one case, the reference is to a bird’s beak, and in the other case, the reference is to how much money a person is owed.
    ◦ Same name: The same name, “Bob Smith,” appears in both worlds. Do the occurrences refer to the same person, or to entirely different people who happen to have matching names?
    ◦ Nicknames: In one world, there appears the name “Bill Inmon.” In another world, there appears the name “William Inmon.” Should a match be made? Do they refer to the same person?
    ◦ Diminutives: Is 1245 Sharps Ct the same as 1245 Sharps Court? Is NY, NY, the same as New York, New York?
    ◦ Incomplete names: Is Mrs. Inmon the same as Lynn Inmon?
    ◦ Word stems: Should the word “moving” be connected and matched with the word “moved”?
  • The first basic edit is the removal of stop words. A stop word is a word that occurs so frequently as to be meaningless to the document. Typical stop words include the following: a, an, the, for, to, by, from, when, which… The second basic edit is the reduction of words back to their stem. For example, the following words all have the same grammatical stem: moving, moved, moves, mover, removing → “move”.
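The two edits described in this note can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the book: the stop-word set is only the short list quoted above, and `naive_stem` with its `SUFFIXES` list is a hypothetical suffix stripper, far cruder than a real stemming algorithm. It reduces "moving", "moved", "moves" and "mover" to the shared stem "mov" rather than the dictionary form "move", and because it ignores prefixes it leaves "removing" as "remov".

```python
import re

# Stop words quoted in the slide note; a production list would be much longer.
STOP_WORDS = {"a", "an", "the", "for", "to", "by", "from", "when", "which"}

# Hypothetical suffix list for a deliberately naive stemmer (assumption).
SUFFIXES = ("ing", "ed", "es", "er", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text):
    """Lowercase the text, drop stop words, and stem the remaining words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(w) for w in words if w not in STOP_WORDS]


if __name__ == "__main__":
    print(normalize("The mover moved to the moving van"))
```

Text from both environments would be passed through the same `normalize` step before any comparison, so that "moving" in a document and "moved" in a record can meet on a common stem.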
  • In a probabilistic match, as much data as possible that might identify the “Bob Smith” you are looking for is gathered and used as the basis for a match against similar data found wherever other “Bob Smiths” are located. Then, all the data that intersects is used to determine whether a match on the name is valid.
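A probabilistic match of this kind can be sketched as a comparison of the attributes that two records have in common. Everything below is an assumption made for illustration: the attribute names, the sample values, the scoring rule (fraction of shared attributes that agree) and the 0.75 threshold are hypothetical and do not come from the slides.

```python
def match_score(known, candidate, name_key="name"):
    """Fraction of attributes present in both records (other than the name) that agree."""
    shared = [k for k in known if k in candidate and k != name_key]
    if not shared:
        return 0.0  # nothing intersects, so there is no supporting evidence
    agree = sum(
        1
        for k in shared
        if str(known[k]).strip().lower() == str(candidate[k]).strip().lower()
    )
    return agree / len(shared)


def is_probable_match(known, candidate, threshold=0.75):
    """Accept the name match only when enough intersecting data agrees."""
    return match_score(known, candidate) >= threshold


if __name__ == "__main__":
    # Hypothetical records: both are called "Bob Smith".
    known = {"name": "Bob Smith", "city": "Denver", "employer": "Acme", "birth_year": 1960}
    same = {"name": "Bob Smith", "city": "denver", "employer": "Acme", "birth_year": 1960}
    other = {"name": "Bob Smith", "city": "Boston", "employer": "Initech"}
    print(is_probable_match(known, same), is_probable_match(known, other))
```

The name itself is excluded from the score because it matches by construction; only the surrounding data can indicate whether the two "Bob Smiths" are the same person.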
  • The accounting theme would contain words and phrases such as the following: receivable, payable, cash on hand, asset, debit, due date, account… The finance theme would contain such information as the following: price, margin, discount, gross sale, net sale, interest rate, carrying loan, balance due. There can be many industrially recognized themes for word collections. Some of the word themes might be the following: sales, marketing, finance, human resources, engineering, accounting, distribution…
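One way to picture a match against industrially recognized themes is to count how often each theme's words and phrases occur in a document. The sketch below uses only the accounting and finance terms listed in this note; the function names and the use of plain substring counting (so "assets" also counts toward "asset") are simplifying assumptions.

```python
# Theme vocabularies copied from the slide note; real theme lists would be larger.
THEMES = {
    "accounting": {
        "receivable", "payable", "cash on hand", "asset",
        "debit", "due date", "account",
    },
    "finance": {
        "price", "margin", "discount", "gross sale", "net sale",
        "interest rate", "carrying loan", "balance due",
    },
}


def theme_hits(text):
    """Count occurrences of each theme's terms in the text (substring match)."""
    lowered = text.lower()
    return {
        theme: sum(lowered.count(term) for term in terms)
        for theme, terms in THEMES.items()
    }


def best_theme(text):
    """Return the theme with the most hits, or None when no term occurs."""
    hits = theme_hits(text)
    theme = max(hits, key=hits.get)
    return theme if hits[theme] > 0 else None


if __name__ == "__main__":
    print(best_theme("The asset was recorded as a debit before the due date."))
```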
  • In an organization by “natural” themes, the unstructured data is collected on a document-by-document basis. Once the data is collected, the words and phrases inside each document are ranked by number of occurrences. The theme of the document is then formed from the words and phrases that occur most often.
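Ranking by number of occurrences maps directly onto a frequency counter. The following sketch handles single words only (not phrases), reuses the short stop-word list from the earlier note, and treats the `top_n` cutoff as an arbitrary assumption.

```python
import re
from collections import Counter

# Stop words from the earlier slide note (assumed sufficient for this sketch).
STOP_WORDS = frozenset({"a", "an", "the", "for", "to", "by", "from", "when", "which"})


def natural_theme(text, top_n=3):
    """Return the top_n most frequent non-stop words with their counts."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]
    return Counter(words).most_common(top_n)


if __name__ == "__main__":
    doc = "Fire alarm. The fireman saw fire, then smoke. Fire spread; the alarm rang."
    print(natural_theme(doc, top_n=2))
```

A document whose top-ranked words have high counts has a recognizable theme; one whose words all occur once or twice does not.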
  • Raw match of data: if a word is found anywhere in the structured environment and the word is part of the theme of a document, the unstructured document is linked to the structured record. But such a match is not very meaningful and may actually be misleading.
  • In Figure 11-11, data in the structured environment includes such people as Bill Jones, Mary Adams, Wayne Folmer, and Susan Young. All of these people exist in records of data that have a data element called “Name.” Put another way, data exists at two levels in the structured environment: the abstract level and the actual occurrence level. Figure 11-12 shows this relationship of data. In Figure 11-12, data exists at an abstract level (the metadata level) and at the occurrence level, where the actual occurrences of data reside.
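The two levels described here can be modelled as an index from each actual occurrence to the metadata element it belongs to. The sketch below uses the four names from the note; the record layout, the function names and the case-insensitive substring lookup are assumptions made for illustration.

```python
# Occurrence-level data: records that carry the data element "Name".
STRUCTURED_RECORDS = [
    {"Name": "Bill Jones"},
    {"Name": "Mary Adams"},
    {"Name": "Wayne Folmer"},
    {"Name": "Susan Young"},
]


def build_occurrence_index(records):
    """Map each occurrence (lowercased value) to its metadata element."""
    index = {}
    for record in records:
        for element, value in record.items():
            index[str(value).lower()] = element
    return index


def link_by_metadata(text, index):
    """Return (occurrence, metadata element) pairs found in unstructured text."""
    lowered = text.lower()
    return [(occ, element) for occ, element in index.items() if occ in lowered]


if __name__ == "__main__":
    index = build_occurrence_index(STRUCTURED_RECORDS)
    print(link_by_metadata("Email from Mary Adams about the audit", index))
```

Because each hit carries its metadata element, the link says that "Mary Adams" in the email is a Name, which is more informative than a raw word match.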
  • The data found in the unstructured data warehouse is in many ways similar to the data found in the structured data warehouse. Consider the following when looking at data in the unstructured environment: it exists at a low level of granularity; it has an element of time attached to it; and it is typically organized by subject area or “theme.”
  • The data that can be stored in each section includes the following: ◦ The first n bytes of the document ◦ The document itself (optional) ◦ The communication itself (optional) ◦ Context information ◦ Keyword information
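The items listed above suggest a simple record layout for an entry in the unstructured data warehouse. The class and field names below are hypothetical, as is the default of 64 bytes for n; the slides do not specify a schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class UnstructuredEntry:
    """One entry in a section of the unstructured data warehouse (hypothetical layout)."""
    first_bytes: bytes                        # the first n bytes of the document
    document: Optional[bytes] = None          # the document itself (optional)
    communication: Optional[bytes] = None     # the communication itself (optional)
    context: dict = field(default_factory=dict)    # context information
    keywords: list = field(default_factory=list)   # keyword information


def make_entry(raw, n=64, keep_document=False, context=None, keywords=None):
    """Build an entry, storing the full document only when asked to."""
    return UnstructuredEntry(
        first_bytes=raw[:n],
        document=raw if keep_document else None,
        context=dict(context or {}),
        keywords=list(keywords or []),
    )


if __name__ == "__main__":
    entry = make_entry(b"Quarterly report ..." * 10, n=16, keywords=["report"])
    print(entry.first_bytes, entry.document is None, entry.keywords)
```

Keeping the full document optional reflects the trade-off on the transcript's slide 15: document count, size and reachability decide whether the warehouse stores the text itself or only the leading bytes plus context and keywords.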
  • An identifier is an occurrence of data that serves to specifically identify a record. Close identifiers are identifiers where there is a good probability that a solid identification has been made.
  • Lecture 11 Unstructured Data and the Data Warehouse

    1. Building Data WareHouse by Inmon. Chapter 11: Unstructured Data and the Data Warehouse. http://it-slideshares.blogspot.com/
    2. Contents • Overview • Integrating the Two Worlds • A Themed Match • A Two-Tiered Data Warehouse • A Self-Organizing Map (SOM) • Fitting the Two Environments Together • Summary
    3. Overview. Unstructured data ◦ Casual, informal activities such as those found on the personal computer and the Internet ◦ Ex: emails, spreadsheets, text files, documents, Portable Document Format (.PDF) files, Microsoft PowerPoint (.PPT) files. Structured data ◦ Standard DBMSs, reports, indexes, databases, fields, records, and the like
    4. Overview (cont’d). The primary differences between structured data and unstructured data
    5. Integrating the Two Worlds. Text: The Common Link. Plenty of problems arise: • Misspelling • Context • Same name • Nicknames • Diminutives • Incomplete names • Word stems
    6. Integrating the Two Worlds (cont’d). A Fundamental Mismatch ◦ The unstructured environment represents documents and communications. ◦ The structured environment represents transactions. Matching Text across the Environments ◦ Remove extraneous stop words ◦ Reduce words back to their stem
    7. Integrating the Two Worlds (cont’d). A Probabilistic Match
    8. Integrating the Two Worlds (cont’d). Matching All the Information
    9. A Themed Match. Industrially Recognized Themes ◦ The unstructured data is analyzed according to the existence of words that relate to industrially recognized themes.
    10. A Themed Match. Naturally Occurring Themes • fire: 296 occurrences • fireman: 285 occurrences • hose: 277 occurrences • firetruck: 201 occurrences • alarm: 199 occurrences • smoke: 175 occurrences • heat: 128 occurrences • Rock Springs, WY: 2 • alabaster: 1 • angel: 2 • Rio Grande river: 1 • beaver dam: 1
    11. A Themed Match. Linkage through Themes and Themed Words
    12. A Themed Match. Linkage through Abstraction and Metadata ◦ Is another way to link the two environments.
    13. A Two-Tiered Data Warehouse. Two-Tiered Data Warehouse ◦ One tier of the data warehouse is for unstructured data and another tier of the data warehouse is for structured data.
    14. A Two-Tiered Data Warehouse. Dividing the Unstructured Data Warehouse ◦ Unstructured communications ◦ Documents and libraries
    15. A Two-Tiered Data Warehouse. Documents in the Unstructured Data Warehouse. Factors that determine whether or not the actual document is stored in the data warehouse: ◦ How many documents are there? ◦ What is the size of the documents? ◦ How critical is the information in the document? ◦ Can the document be easily reached if it is not stored in the warehouse? ◦ Can subsections of the document be captured?
    16. A Two-Tiered Data Warehouse. Visualizing Unstructured Data ◦ Unstructured visualization is the counterpart to structured visualization. ◦ Structured visualization is known as Business Intelligence. ◦ The essence of structured visualization is the display of numbers.
    17. A Two-Tiered Data Warehouse. A Self-Organizing Map (SOM) ◦ Produces a display that appears to be a topographical map ◦ Shows how different words and documents are clustered and displayed according to themes
    18. A Themed Match. The Unstructured Data Warehouse ◦ Is divided into two basic organizations: one part for documents and another part for communications
    19. A Themed Match. Volumes of Data and the Unstructured Data Warehouse ◦ Volumes of data are an issue ◦ Mitigate the volumes of data that can collect in the unstructured data warehouse
    20. Fitting the Two Environments Together ◦ The unstructured environment may contain data that is incompatible with data from the structured environment ◦ However, there are ways that the two environments can be related
    21. Fitting the Two Environments Together
    22. Summary. The world of information technology is really divided into two worlds: structured data and unstructured data. The common bond between the two worlds is text. The structured environment and the unstructured environment can be matched at: ◦ the identifier level ◦ the close identifier level using a probabilistic match ◦ the keyword to metadata or repository level