Lecture 11 Unstructured Data and the Data Warehouse

2,307 views

Building the Data Warehouse http://it-slideshares.blogspot.com

Published in: Education

  1. Building the Data Warehouse by Inmon
     Chapter 11: Unstructured Data and the Data Warehouse
     http://it-slideshares.blogspot.com/
  2. Contents
     ◦ Overview
     ◦ Integrating the Two Worlds
     ◦ A Themed Match
     ◦ A Two-Tiered Data Warehouse
     ◦ A Self-Organizing Map (SOM)
     ◦ Fitting the Two Environments Together
     ◦ Summary
  3. Overview
     Unstructured data
     ◦ Casual, informal activities such as those found on the personal computer and the Internet
     ◦ Ex: emails, spreadsheets, text files, documents, Portable Document Format (.PDF) files, Microsoft PowerPoint (.PPT) files
     Structured data
     ◦ Standard DBMSs, reports, indexes, databases, fields, records, and the like
  4. Overview (cont'd)
     The primary differences between structured data and unstructured data
  5. Integrating the Two Worlds
     Text: The Common Link
     Plenty of problems arise:
     • Misspelling
     • Context
     • Same name
     • Nicknames
     • Diminutives
     • Incomplete names
     • Word stems
  6. Integrating the Two Worlds (cont'd)
     A Fundamental Mismatch
     ◦ The unstructured environment represents documents and communications.
     ◦ The structured environment represents transactions.
     Matching Text across the Environments
     ◦ Remove extraneous stop words
     ◦ Reduce words back to their stem
  7. Integrating the Two Worlds (cont'd)
     A Probabilistic Match
  8. Integrating the Two Worlds (cont'd)
     Matching All the Information
  9. A Themed Match
     Industrially Recognized Themes
     ◦ The unstructured data is analyzed according to the existence of words that relate to industrially recognized themes.
  10. A Themed Match
     Naturally Occurring Themes
     • fire: 296 occurrences
     • fireman: 285 occurrences
     • hose: 277 occurrences
     • firetruck: 201 occurrences
     • alarm: 199 occurrences
     • smoke: 175 occurrences
     • heat: 128 occurrences
     • Rock Springs, WY: 2
     • alabaster: 1
     • angel: 2
     • Rio Grande river: 1
     • beaver dam: 1
  11. A Themed Match
     Linkage through Themes and Themed Words
  12. A Themed Match
     Linkage through Abstraction and Metadata
     ◦ Another way to link the two environments.
  13. A Two-Tiered Data Warehouse
     Two-Tiered Data Warehouse
     ◦ One tier of the data warehouse is for unstructured data and another tier is for structured data.
  14. A Two-Tiered Data Warehouse
     Dividing the Unstructured Data Warehouse
     ◦ Unstructured communications
     ◦ Documents and libraries
  15. A Two-Tiered Data Warehouse
     Documents in the Unstructured Data Warehouse
     Several factors determine whether or not the actual document is stored in the data warehouse:
     • How many documents are there?
     • What is the size of the documents?
     • How critical is the information in the document?
     • Can the document be easily reached if it is not stored in the warehouse?
     • Can subsections of the document be captured?
  16. A Two-Tiered Data Warehouse
     Visualizing Unstructured Data
     ◦ Unstructured visualization is the counterpart to structured visualization.
     ◦ Structured visualization is known as Business Intelligence.
     ◦ The essence of structured visualization is the display of numbers.
  17. A Two-Tiered Data Warehouse
     A Self-Organizing Map (SOM)
     ◦ Produces a display that appears to be a topographical map
     ◦ Shows how different words and documents are clustered and displayed according to themes
  18. A Themed Match
     The Unstructured Data Warehouse
     ◦ Is divided into two basic organizations: one part for documents and another part for communications
  19. A Themed Match
     Volumes of Data and the Unstructured Data Warehouse
     ◦ Volumes of data are an issue
     ◦ Mitigate the volumes of data that can collect in the unstructured data warehouse
  20. Fitting the Two Environments Together
     ◦ The unstructured environment may contain data that is incompatible with data from the structured environment.
     ◦ However, there are ways that the two environments can be related.
  21. Fitting the Two Environments Together
  22. Summary
     http://it-slideshares.blogspot.com/
     The world of information technology is really divided into two worlds: structured data and unstructured data.
     The common bond between the two worlds is text.
     The structured environment and the unstructured environment can be matched at:
     ◦ the identifier level
     ◦ the close identifier level using a probabilistic match
     ◦ the keyword to metadata or repository level
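
The text-matching steps named in the slides (removing stop words, reducing words to their stems, and a probabilistic match on close identifiers) can be sketched roughly as follows. This is a minimal illustration, not the method from the book: the stop-word list, the suffix-stripping `stem` function, the 0.8 threshold, and all function names are assumptions, and the standard library's `difflib.SequenceMatcher` similarity ratio stands in for a real match probability.

```python
from difflib import SequenceMatcher

# Hypothetical stop-word list; a production list would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "was", "for"}

# Naive suffix list standing in for a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def stem(word):
    """Strip one common suffix, keeping at least three leading characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text):
    """Lowercase, drop punctuation and stop words, and reduce words to stems."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [stem(w) for w in words if w and w not in STOP_WORDS]


def match_score(name_a, name_b):
    """Similarity in [0, 1] between two names, ignoring case."""
    return SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()


def best_match(name, candidates, threshold=0.8):
    """Return (candidate, score) for the closest candidate, or None.

    None is returned when there are no candidates or when the best
    score falls below the (assumed) threshold.
    """
    if not candidates:
        return None
    scored = [(c, match_score(name, c)) for c in candidates]
    candidate, score = max(scored, key=lambda pair: pair[1])
    return (candidate, score) if score >= threshold else None
```

For example, `best_match("Jon Smith", ["John Smith", "Jane Doe"])` selects "John Smith", which is the kind of close-identifier link that tolerates the misspelling and nickname problems listed on slide 5. The threshold controls how many false links are accepted and would need tuning against real data.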

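
The naturally occurring themes on slide 10 come from counting how often each word appears across the documents: frequent words such as "fire" and "hose" suggest a theme, while words that appear once or twice do not. A minimal sketch of that counting, assuming a hypothetical stop-word list and an arbitrary cut-off (neither is given in the slides):

```python
from collections import Counter

# Hypothetical stop-word list; not taken from the slides.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "was", "for"}


def word_counts(documents):
    """Count occurrences of each non-stop word across all documents."""
    counts = Counter()
    for doc in documents:
        for raw in doc.split():
            word = raw.strip(".,;:!?\"'()").lower()
            if word and word not in STOP_WORDS:
                counts[word] += 1
    return counts


def themes(counts, min_occurrences):
    """Return words seen at least min_occurrences times, most frequent first."""
    return [w for w, n in counts.most_common() if n >= min_occurrences]
```

With the counts from slide 10, a cut-off of 100 would keep "fire" through "heat" and discard "alabaster" and "beaver dam". The cut-off itself is a judgment call that depends on the size of the document collection.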