Slide notes
  • What if… you never got any spam, all your email was automatically categorized, and the related business processes and approvals were taken care of for you.
  • What if… after an accident, you could, with one phone call, reach the towing company, reschedule your appointments, call the hospital (hopefully not), book a flight an hour later, and call a cab… and it all worked.
  • “More data will be produced in the next 3 years than in all of recorded history!” - UCB
  • Research in Information and Interaction deals with management, retrieval and analytics on different data modalities -- records, XML, text, audio, video, images, etc. -- as well as integration of these technologies into business processes and user interactions. Activities are in such areas as digital media, information management of structured and unstructured data, and user interface technologies. Our projects span databases, XML and querying, management of terabytes to petabytes of content, digital rights management, natural language processing (text search, deep text analytics, machine translation, analytics of internet and intranet content), machine learning, video compression, delivery, caching, search and analysis, speech (recognition, text-to-speech, conversational systems), and biometrics.
  • It is cheaper than ever to gather all sorts of data about the world around us: surveillance cameras are everywhere, medical procedures are based on extensive tests, our manufacturing plants are fully wired, our stores are monitored for security and customer-relationship management. Not only has the amount -- the scale -- of data increased by many orders of magnitude, but the heterogeneity of the data and the ability to act on it have changed. The new types of data are no longer tables of ordered numbers. They consist of a heterogeneous collection of sensor data, in all different forms and qualities. Both the changing scale of this data and its heterogeneous form mean that it is much harder to extract meaning from the data. Information accessibility through the World Wide Web has raised the bar on user expectations, and has changed the nature of individual business activities. Addressing ease of access, breadth of relevant results and 'usability of results' issues will be key factors in enabling agile business decision making. On-demand environments will require access to multiple types of data wherever and whenever needed. Making daily business decisions increasingly relies on the ability to analyze large volumes of both structured and unstructured data. Our research in Information and Interaction is addressing these issues...
  • FSS: TBD. Banking: TBD, add special info on banking. Financial Markets: TBD, add special info on financial markets. Insurance: TBD, add special info on insurance.
  • An example of multichannel self-service, a la Duncan Ross: I'm a banking customer sitting at home on internet banking, moving money between accounts and altering personal profile preferences. By profile preferences I mean that the internet banking interface is configurable... "I'd like always to see my three account balances as my first screen" is the choice I decide to set. The system pops up a message: "Mr Ross, would you like this change to apply to your phone banking too?" I click on "yes" and finish my transactions. I leave home, get in the car, and forget that another transaction was required. I phone into self-service phone banking, interacting entirely by speech. The system plays me my three account balances, just as I had asked for 10 minutes earlier on the web channel. Also, the balances are the updated balances, reflecting the changes I made on my internet banking. The reason the system was able to instantly take what it learned from me on the web and apply it on the phone was that the bank used well-architected IBM integration middleware, WebSphere. I believe not many banks would be in a position to make this scenario real.
  • Distillery provides a platform for adaptively assembling, configuring, managing, scheduling and running analytics. UIMA provides a development framework for building, describing & integrating component analytics.
  • Contact: Dave Ferrucci / Arthur Ciccolo, 3/03. IBM’s Unstructured Information Management Architecture (UIMA) is being developed under a world-wide Research division effort to support the rapid integration and deployment of a wide variety of analytical techniques to assist in the integration and processing of large volumes of structured and unstructured information. Implementations of UIMA provide a middleware which will be used to build applications that need to sift through huge quantities of structured and unstructured information to discover, translate, summarize, categorize and organize the knowledge relevant to important applications ranging from national and business intelligence to bioinformatics. Core components include a powerful semantic search engine, a central document and meta-data store, and an Analysis Engine framework -- all communicating through XML and web services. Current development is focused on the deployment of text analysis engines; however, the architecture is being extended to other unstructured media including voice, audio and video. The primary design objectives of IBM’s UIMA are twofold: first, to provide a solid infrastructure for composing advanced search and analysis applications from reusable components which may be embedded in product platforms or delivered as highly distributed and scalable stand-alone applications or services. The second primary design objective is to facilitate the rapid combination of analytical techniques in support of the Combination Hypothesis. This is the hypothesis that significant scientific advances can be made in the precision of analysis and search results if independently developed techniques with different strengths and weaknesses can be quickly combined to produce superior and otherwise uncharted solutions. Different unstructured artifacts -- examples include text documents, video, audio or voice files -- are gathered and organized into collections. The application, with the help of the analysis engine directory services, selects the types of analysis engines that should be applied to the collection. [Example analysis engines include language translators, document summarizers, document classifiers, scene detectors, geography detectors, glossary builders, etc. Each analysis engine specializes in discovering relevant concepts (or "semantic entities") otherwise unidentified in the document text (or video image, for example).] Applications feed the collection through the collection processing manager (CPM), whose primary responsibility is to apply the selected analysis engine(s) to each element (e.g., document, video, etc.) in the collection and to further process the results, ensuring, for example, that the results of the analysis may be associated as meta-data with the element in the document meta-data store. [The analysis of each collection element is captured in a specialized XML structure called the CAS, or Common Analysis Structure (not indicated in the diagram). The CAS holds all the results of analysis in a common representation medium for communication between UIMA components and for storing key meta-data.] [Analysis engines, to do their job, may consult a wide variety of structured knowledge sources. They do this in a uniform way through Structured Knowledge Access components called Knowledge Source Adapters (KSAs). These objects manage the technical communication and semantic mapping necessary to deliver knowledge encoded in databases, dictionaries, knowledge bases and other structured sources to the analysis engines in a uniform way and in a language the analysis engine can understand. Different KSAs may be discovered, based on the type of knowledge they provide, through the KSA Directory Service.] The CPM may also be configured by the application to extract a variety of elements from the analysis (i.e., the CAS) and index these results in the semantic search engine. Not only may tokens (i.e., simple words) be indexed for efficient search, but the concepts (or "semantic entities") discovered as part of analysis may also be indexed and therefore made accessible at search time. UIMA calls for an advanced search capability that allows for querying for artifacts based on a combination of the concepts and tokens discovered by analysis engines. After the CPM has processed the collection, the application can use the results of analysis stored in the meta-data store, as well as those captured in the rich index of tokens and concepts processed by the search engine, to deliver the key knowledge required by the user, and in the ideal form. At this stage the application may additionally request translations, summaries, or concept highlighting of the results. Post-analysis may also be performed on the fly to further analyze input dynamically at the request of the application.
  • The UIMA Collection Processing Architecture supports the application of per-document Analysis Engines to collections of documents and the building of analysis results over entire collections. The top-level aggregate structure in this architecture is the Collection Processing Engine. Let's take a look at how basic UIMA component aggregation and encapsulation builds up from a simple annotator to a collection processing engine. We start with a simple annotator. <click> This gets encapsulated into a pluggable container called the analysis engine. <click> And these may be aggregated into a workflow and encapsulated to form an aggregate analysis engine with the same interface as its more primitive counterpart. <click> This resultant capability may be inserted as the core analysis in a collection processing engine. <click> This component is an aggregate that includes a Collection Reader, to acquire documents from an external source, and CAS Consumers <click> to use the results of per-document analysis to build structured resources, like databases and indices. And of course behind the scenes is the Common Analysis Structure (CAS), providing all these components with shared access to the artifact and the evolving analysis. Worth noting, and as we shall see in detail later, all these UIMA components have what we call component descriptors, which contain declarative metadata describing their structure and functional capabilities. This component metadata is key for supporting composition, workflow validation or optimization, discovery, reuse, and tooling. (Collection-level results include, for example, search engine indices over all the tokens in the collection, term frequency tables, or collection-wide entity co-reference chains -- all types of results that reflect inferences over the collection as a whole and are built up from per-document analyses.)
  • Another way of looking at the annotation process is in terms of the structure that is built up by each step in the flow. This animation helps illustrate how annotators iterate over annotations, infer new annotations and add them to a common data structure we call the Common Analysis Structure, or CAS. A parser, for example, looks at tokens and infers and records grammatical structure annotations like <click> “Noun Phrase”, “Verb Phrase” and “Prepositional Phrase”. These are added to the CAS as stand-off annotations. Stand-off annotations are data structures that are not embedded in the original document, like inline XML tags, but rather point into the document, indicating the span of text which they label, as illustrated in this picture. (The community established the benefits of stand-off annotation structures prior to our work on UIMA -- they do not corrupt the original, and they allow different interpretations and overlapping spans, for example.) Next, <click> a named entity detector may get access to the CAS. It would consider the grammatical structure annotations, as well as perhaps the tokens, to infer and record named-entity annotations <click> like “Government Title”, “Person”, “Government Official”, “Country”, etc. A relationship annotator might iterate over named entities to infer and record relationships between them, <click> like the “Located In” or what we often call the “at” relation. This is itself an object in the CAS that has properties linking the relation annotation to its argument annotations, the entity and the location.
  • Contact: Arthur Ciccolo. The Combination Hypothesis states that significant advances can be realized in all aspects of Knowledge Management (KM) -- finding, organizing, and discovering knowledge -- if independently developed techniques with different strengths and weaknesses are made combinable and interoperable on large-scale corpora. The idea of combining independent techniques as a scientific approach is not at all new: the best analogy is multiple drug therapy, with each drug having a completely different mechanism of action, which was successful in treating such diseases as tuberculosis and AIDS. However, historically, KM tools have been focused on one aspect of the problem: search, clustering, natural language parsing, machine translation, and so on. This seems to be the result of both scientific domain fragmentation -- scientists build systems focused on their area of expertise -- and the limited scope of existing applications. Furthermore, existing tools are not interoperable in any non-trivial way. On the other hand, at this point, a combined approach has become necessary from both an economic and a technical point of view. The web revolution has exposed hundreds of millions of people to the experiences of search and taxonomy browsing, and has reshaped their expectations of the knowledge retrieval experience outside the web, in their workplaces. But study after study shows that at the enterprise level, these expectations are not met. Knowledge management in the enterprise setting, and even simply document search, are universally perceived as disappointing. Why is that so? Search technology per se has made enormous strides: web search engines return excellent results on one-word queries over a 15 TB corpus, even though not so long ago this would have been labeled impossible in principle, not just as a matter of computational cost. Moreover, a number of techniques from Natural Language Processing (NLP), such as statistical text analysis, computational linguistics, speech recognition, machine learning, and taxonomy generation and classification, have been combined with classic search methods and have shown significant benefit. As a result, there is growing confidence that many of them will move from cutting-edge research to commercial application in the near term. For example, the augmentation of standard search query methods with lexical information, such as thesauri, has been shown to improve precision by approximately 50% for short queries (Mandala et al., Proc. SIGIR (1999); W. A. Woods et al., Proc. 6th ANLP Conf. (2000)). Automatic document categorization and classification became more accurate than human processing in the late 1990's and is now considered an essential means of organizing large corpora for knowledge management systems. Automated summarization of documents based upon information extraction techniques has been demonstrated to improve search efficiency by supporting more focused examination of retrieved documents. Finally, statistical machine translation, while still far below the capabilities of skilled human translators, may be good enough to support cross-lingual information retrieval on the Web or across enterprise document collections. Although the computational demand of some of these technologies might, at this point in time, be too high for application to "the entire web", this should be less of a problem in the enterprise, where corpora are usually smaller. So it seems that search and knowledge management in the enterprise should be easier than on the web. The demand is there. The technologies are there. What is the missing part? Where is the problem? The answer lies in part in the essential differences between the public Web and the internal environment of the enterprise. For example, one factor is that although enterprise corpora are smaller, they lack the highly hyper-linked nature of the web, and thus some of the most successful techniques for the web, based on link analysis, do not apply in the enterprise. This results in lower relevancy of retrieved documents. Another factor is that in the enterprise there are additional security, reliability, and performance issues that complicate the problem. But the most important factor is independent of the public Web versus the enterprise, and rests with a fundamental character of the technologies. The advanced technologies described above, for the most part, simply do not work together. Typically, each one of these technologies has a completely different view of the world, represents the underlying documents in different ways, and is concerned with performance in different areas. This situation arises in part from their developers being "algorithm-centric". The computational requirements of these technologies are so great that their developers tended to engage in "programming-in-the-small". That is to say, they built highly integrated, optimized, and hence closed and rigid, narrow applications of their core competency. To build end systems to be used by consumers of information, not by programmers, such narrow applications are usually piled on top of each other. If there is any cooperation at all, it takes the form of one narrow application that consumes documents, performs its magic on their contents, and produces new documents as output. That output is then consumed by another narrow application, which starts by repeating much of the text parsing, tokenization, and so on, to convert the data to its own representation. And so on, cascading inefficiency on inefficiency. Hence the synergies postulated by our Combination Hypothesis are lost, either through sheer inefficiency or because the information gained in one stage cannot be communicated to the next. Thus the first challenge in testing the Combination Hypothesis is to develop an infrastructure that allows "plug-and-play" combinations of these advanced KM technologies. This is the goal of IBM’s Unstructured Information Management Architecture (UIMA), which is being developed under a world-wide Research division effort to support the rapid integration and deployment of a wide variety of analytical techniques for processing large volumes of structured, unstructured, and semi-structured data. The Combination Hypothesis is not a foregone conclusion: it might well be that some aspects of Knowledge Management will not be advanced by the combination treatment; for instance, maybe classification accuracy is as good as it can be, maybe search cannot be helped by part-of-speech tagging. However, building the right infrastructure will allow us to know, rather than guess, the answers to these questions, and if even only a few particular combinations offer significant improvements over the current state of the art, the marketplace impact would be considerable.
  • Engagement workbook: Notes-based repository for the account team and the SD (services delivery) team. Ledger: financial information. Claim tool: labor tool; utilization of labor against contract. Accounts receivable. Accounts payable. Lessons learned. Source: web-based repository; presentation materials, documents, workscopes…. If utilized, can drive reuse.
  • This foil gives the general flavor of what metadata is by providing examples. Metadata exists both in structured information and in unstructured information, as shown by the columns of the table. People Proxies is a title topic from the 2004 GTO. The presenter may want to add specific examples which are relevant to the audience. The presenter may want to change the title to “Metadata is everywhere – can you see it”, after the IBM commercial for middleware running in 2004.
  • EXIF - EXchangeable Image File format for Digital Still Cameras (http://www.exif.org/Exif2-2.PDF). In addition to camera/image-related properties, the EXIF standard also covers properties associated with GPS, which is beginning to find its way into cameras and camera-enabled cell phones. This foil is useful in demonstrating the multitude of metadata tags found in everyday data -- many of the tags are captured automatically by the camera, while others, such as title and comments, can be added manually. The camera-related metadata, such as aperture, lighting conditions and focus distance, can be used by printing equipment or enhancement software to improve the quality of the final photos. Source: Andrei Broder (the image is a private image from Andrei).
  • Hierarchic Data Model: hierarchic databases supported only queries that were designed by the original data administrator. IMS, the first commercial hierarchic database, appeared in 1968 (An Introduction to Database Systems, C. J. Date, 1985, p. 503). Relational Data Model: DB2 first version 1984? Extensible Data Model (XML): the XML 1.0 standard appeared in 1998. XML is the acronym for eXtensible Markup Language, the universal format for structured documents and data on the Web. XML is an industry-standard protocol administered by the World Wide Web Consortium (W3C). Domain-Specific Ontologies: the OWL standard appeared in 2004. OWL is the Web Ontology Language. OWL builds on RDF and RDF Schema and adds more vocabulary for describing properties and classes: among others, relations between classes (e.g. disjointness), cardinality (e.g. "exactly one"), equality, richer typing of properties, characteristics of properties (e.g. symmetry), and enumerated classes (more info: http://www.w3.org/2004/OWL). A small sketch of this OWL vocabulary follows these notes.
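To make the OWL vocabulary described in the note above concrete, here is a minimal sketch using the Python rdflib library (assuming rdflib is installed; the vehicle-themed class and property names are invented for illustration and do not come from the slides):

```python
# Minimal sketch of the OWL vocabulary described above (class relations,
# disjointness, cardinality), built with rdflib. Names are illustrative only.
from rdflib import Graph, Namespace, BNode, Literal
from rdflib.namespace import OWL, RDF, RDFS, XSD

EX = Namespace("http://example.org/vehicles#")
g = Graph()
g.bind("ex", EX)
g.bind("owl", OWL)

# Two classes declared disjoint: nothing can be an instance of both.
g.add((EX.Car, RDF.type, OWL.Class))
g.add((EX.Motorcycle, RDF.type, OWL.Class))
g.add((EX.Person, RDF.type, OWL.Class))
g.add((EX.Car, OWL.disjointWith, EX.Motorcycle))

# A property with richer typing: its domain and range are constrained.
g.add((EX.hasOwner, RDF.type, OWL.ObjectProperty))
g.add((EX.hasOwner, RDFS.domain, EX.Car))
g.add((EX.hasOwner, RDFS.range, EX.Person))

# A cardinality restriction: every Car has exactly one registration number.
restriction = BNode()
g.add((restriction, RDF.type, OWL.Restriction))
g.add((restriction, OWL.onProperty, EX.registrationNumber))
g.add((restriction, OWL.cardinality, Literal(1, datatype=XSD.nonNegativeInteger)))
g.add((EX.Car, RDFS.subClassOf, restriction))

print(g.serialize(format="turtle"))
```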

Transcript

  • 1. Top Five Data Challenges for the Next Decade. Dr. Pat Selinger, IBM Fellow and VP, Area Strategist
  • 2. The World of Data is Changing
    • Hardware gives us more choices than ever before
    • Cost of labor is rising
    • Data isn’t all (or even mostly) in the database
    • Data access paradigms evolving
    • Customers want integration and FAST access to the data they want
  • 3. Research Challenges - Examples: SPAM; keyword-based search engines
  • 4. Research Challenges – Examples
  • 5. Research Challenge – Examples Q: Can you spell your name please ? A: P.A.T. Q: One more time please… A: P..… A..… T..… Q: Sorry… connecting you to a live operator… one moment, please.
  • 6. The World of Data is Changing
    • Hardware gives us more choices than ever before
    • Cost of labor is rising
    • Data isn’t all (or even mostly) in the database
    • Data access paradigms evolving
    • Customers want integration and FAST access to the data they want
  • 7. Issue: HW and SW systems have changed since RDB was invented. Information mgmt architecture hasn’t kept pace
    • 1975
      • 1 MIPS processor
      • Mainframe uniprocessor
      • 14 inch disks
      • 24 bit addresses
      • 256K real memory
      • Channel to channel connections
      • Strings and numbers
    • Today
      • 2+ GigaHertz processors
      • 32 and 64-way SMPs
      • RAID disks, logical volume managers
      • 64 bit addresses
      • 100+ GB real memory
      • Gigabit Ethernet, Infiniband supporting clusters of systems
      • Rich data (audio, documents, XML, …)
  • 8. Issue: Data Volumes Exploding
    • Workload data volumes in 2005, with projected growth by 2010: 1s TB (10X), 100s TB (100X), 1s TB (100X), 10s GB (1,000X), 1s GB (10,000X)
    • The world produces 250 MB of information every year for every man, woman and child on earth. 85% of the data is unstructured.
    • Common database sizes: Transactions 100-500 GB; Warehouses 100s GB – 10s TB; Marts 1-50 GB; Mobile 100s MB; Pervasive 100s KB
  • 9.  
  • 10. Storage Trends Aid this Data Explosion: storage areal density CGR continues at 100% per year, to >100 Gbit/in2. The price of storage is now significantly cheaper than paper.
  • 11. Issue: CPU performance growing by 100% per year, I/O performance by 5% per year (chart: Disk vs. CPU)
  • 12. Solution: Overlapping, deferring or avoiding I/O. Examples :
    • Multi-dimensional Clustering
    • Multiple bufferpools
    • Prefetching into Bufferpools
    • Page Cleaners use Async I/O
    • Indexes with added columns or tables in indexes
    • Index anding and oring
    • Pushdown predicates
    • Function-shipping on clusters
    • Materialized Query Tables
    • Compression
    • And much more….
  • 13. Multi-Dimensional Clustering via cells and blocks (diagram: data clustered along year, nation and color dimensions; each unique combination of dimension values, such as (1997, Canada, yellow), defines a cell, and each cell contains one or more blocks)
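Multi-dimensional clustering (this slide) and materialized query tables (from the list on slide 12) can be sketched in DB2-style DDL. This is a hedged illustration only: the table and column names are invented, and the statements are printed rather than executed against a database.

```python
# Illustrative DDL only: a DB2-style multi-dimensional clustering (MDC) table
# and a materialized query table (MQT). Table/column names are hypothetical;
# the strings are printed rather than executed, so no database is required.

mdc_table = """
CREATE TABLE sales (
    order_id  INTEGER NOT NULL,
    year      INTEGER,
    nation    VARCHAR(30),
    color     VARCHAR(20),
    amount    DECIMAL(12,2)
)
ORGANIZE BY DIMENSIONS (year, nation, color)
"""

# An MQT precomputes an aggregate so queries can avoid re-scanning base data.
mqt = """
CREATE TABLE sales_by_year_nation AS (
    SELECT year, nation, SUM(amount) AS total_amount
    FROM sales
    GROUP BY year, nation
)
DATA INITIALLY DEFERRED REFRESH DEFERRED
"""

for stmt in (mdc_table, mqt):
    print(stmt.strip(), end="\n\n")  # run through a DB2 client (e.g. the CLP) to create
```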
  • 14. Research Challenge #1 Scalability: Massive Growth in Multiple Dimensions
    • Scaling directions:
      • Petabytes of storage
      • “Fire hose” of data continuously loading
      • Millions of users
      • Millions of processors
      • Larger and more complex data objects
      • Systems being only partly online
      • Partial answers, relevancy ranking
    (Chart: scaling from 1,000 processors to 10**6 processors and beyond, towards unlimited CPUs)
  • 15. Research Challenge #1
    • Design our DBMSs to keep pace with HW, SW, data changes
    • Scale without sacrificing user-visible availability or performance.
    • While always inventing new techniques to “cover up” the ever-increasing gap between processor speeds and disk speeds, e.g. exploit large memories
  • 16. The World of Data is Changing
    • Hardware gives us more choices than ever before
    • Cost of labor is rising
    • Data isn’t all (or even mostly) in the database
    • Data access paradigms evolving
    • Customers want integration and FAST access to the data they want
  • 17. Cost of Labor Increasing While Demands Rising
    • Labor-intensive management effort
    • Scarcity of skilled DBAs
    Rising costs; changing ecosystem: lower-cost storage, tighter integration, more dynamic workloads, low-cost clusters, structured and unstructured data, higher availability
  • 18. Autonomic Computing: Deliver significantly lower total cost of ownership
    • Cost of application development, time to solution delivery
    • Labor cost and skills availability for database administration and management
  • 19. Autonomic Capabilities Available Today in DB2 for Linux, Unix, Windows
    • Available in v8.1.x
    • Up and running
      • Configuration advisor
        • Sets dozens of the most critical parameters in seconds. Heaps, process model, optimizer, and more.
    • Automated physical database design
      • Design Advisor
        • Automated index selection.
    • Runtime
      • Industry leading query optimizer,
        • automatic high quality plan selection.
      • Query Patroller workload manager.
        • Policy controlled management of SQL/ODBC.
        • Query throughput control with QP query classes
        • Usage trending reports with QP Historical Analysis
        • Real-Time monitoring and control of current running queries
      • Self tuning LOAD
      • Adaptive utility throttling for Backup
        • Allows maintenance to consume as much resource as possible without impacting the user workload throughput beyond a policy-specified constraint.
      • Control Center scheduler
        • Task Center (within CC) can schedule/automate execution of OS or DB2 scripts.
    • Self healing, availability and diagnostics
      • Health Monitor
        • Ensures proper database operation by constantly monitoring key indicators.
        • Notification of alerts by e-mail, page, CLP, GUI, SQL.
        • Health Center tooling provides graphical tools to drill down on details.
      • Fault monitor
        • Automatically restarts DB2
      • Automatic Index Reorganization
        • Automatically defragments leaf pages.
      • Automatic continual I/O consistency checking
    • New in DB2 UDB v8.2
    • Automated physical database design
      • Design Advisor extensions
        • Combined (or individual) recommendations for indexes, MQTs, MDC, and DPF partitioning.
        • Automatic workload compression.
        • 4 workload capture techniques. (package cache, Query Patroller, event monitor, text file)
        • Exploits sampling and multi-query optimization
    • Runtime
      • Automated database maintenance
        • Automation of Backup, Runstats, Reorg
        • Statistics collection is online, throttled, with new locking protocols for non-intrusive collection.
        • Policy expression lets users select subset of schema, and available times of day.
        • Advanced algorithms detect “when” maintenance is really needed.
      • Automatic statistics profiling
        • Determines what statistics should be collected.
        • Automatic detection of column groups allows query optimizer to model correlation.
        • First industrial version of “LEO” technology.
      • eWLM integration
        • Performance analysis for the IBM stack.
      • Utility throttling for Backup, Runstats, Rebalance .
        • The v8.1.2 BACKUP throttling technology is extended to a broader set of administrative utilities.
      • Self tuning BACKUP
        • Up to 4x faster than v8.1.x defaults
      • Simplified memory management
        • Heaps automatically grow when constrained
    • Self healing, availability and diagnostics
      • Common Logging across IBM software products.
      • HADR with automatic client reroute
      • Extensions to Health Monitor
        • Increased recommendations for user response to alerts.
    • Self protecting
      • Data Encryption
      • Common Criteria Certification
      • Enhanced Security for Windows users
  • 20. Example: DB2 Design Advisor
    • Makes recommendations for:
      • Indexes on the base tables
      • Materialized Query Tables
      • Indexes on the Materialized Query Tables
      • Converting non Multi-Dimensional Clustering tables to Multi-Dimensional Clustering tables
      • Partitioning existing tables
  • 21. Research Challenge #2: Examine radically simpler architectures and address total cost of ownership (chart: application characteristics from simple/understood to complex/unknown versus business segment from small businesses to enterprise; small DB engines and open source cover the simple end, while current products, autonomic efforts and high-end DBMSs cover the enterprise end; the research challenge is zero administration for complex apps with enterprise-class scale and performance)
  • 22. The World of Data is Changing
    • Hardware gives us more choices than ever before
    • Cost of labor is rising
    • Data isn’t all (or even mostly) in the database
    • Data access paradigms evolving
    • Customers want integration and FAST access to the data they want
  • 23. Nature of “Interesting” Data is Changing: how do we process these in an integrated way? Unstructured information management; information from multi-modal interactions, e.g. speech; classic information management -- relational databases.
    • 85% of data is unstructured and not in a DBMS (diagram: sample Employee and Department tables illustrating classic structured data; autonomous sources?)
  • 24. Addressing the Changing Characteristics of Data: increasing need to manage and analyze new data types (chart axes: scale, heterogeneity, actionability; examples: transactions, text and web, satellite and surveillance images and video, gene sequences, protein folding)
  • 25. Changing Characteristics of Data: volume growth versus semantics per unit of data. Transactions and structured data -- a seat on an airplane: easy to find, structured data. Actionability: high; scale: low; heterogeneity: low.
  • 26. Changing Characteristics of Data: text and other human data. Hard work to extract the pearl, but you know where to look. Actionability: medium; scale: medium; heterogeneity: medium.
  • 27. Changing Characteristics of Data: machine-generated data. There is gold somewhere in the pile, and you need to keep sifting. Actionability: low; scale: high; heterogeneity: high.
  • 28. Extending “Mission-Critical” to Unstructured Data
    • XML View Of Relational Data
      • SQL data viewed and updated as XML
        • Done via document shredding and composition
      • DTD and Schema Validation
    • XML Documents As Monolithic Entities
      • Atomic Storage And Retrieval
      • Search Capabilities
    • Next: XML As A Rich Datatype
      • Full storage and indexing
      • Powerful querying capabilities
    XML has become the “data interchange” format.
  • 29. Example: XML Strategy for DB2 UDB -- native XML capabilities inside the engine (diagram: customer client applications and data management clients reach the DB2 server through SQL(X) and XQuery interfaces, backed by both XML and relational storage)
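A sketch of the two query styles named in the diagram. The SQL(X) statements below use standard SQL/XML functions (XMLQUERY, XMLEXISTS) of the kind DB2 later shipped with native XML storage; the table, columns and XPath expressions are invented for illustration, and nothing is executed here.

```python
# Illustrative SQL/XML (SQL(X)) and XQuery statements only; the table,
# column and XPath names are hypothetical, and nothing is executed here.

create = """
CREATE TABLE customers (
    id   INTEGER NOT NULL PRIMARY KEY,
    info XML                        -- XML as a rich datatype, stored natively
)
"""

# SQL with embedded XPath: customers whose XML document places them in Toronto.
sqlx_query = """
SELECT id,
       XMLQUERY('$d/customer/name' PASSING info AS "d") AS name
FROM customers
WHERE XMLEXISTS('$d/customer/addr[city = "Toronto"]' PASSING info AS "d")
"""

# The same intent expressed as a standalone XQuery over the XML column.
xquery = """
for $c in db2-fn:xmlcolumn('CUSTOMERS.INFO')/customer
where $c/addr/city = "Toronto"
return $c/name
"""

for stmt in (create, sqlx_query, xquery):
    print(stmt.strip(), end="\n\n")
```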
  • 30. Content Management Solutions - Capability. The IBM Content Management portfolio spans content solutions, information integration, workflow / business process management / collaboration, archiving, document management, web content management, output/report management, multimedia management, imaging, digital asset management, content integration, digital rights management, and regulatory compliance / records management.
  • 31. Enterprise Content Management
    • Cross Industry
      • Customer Service
      • Human Resources
      • Accounts Payable
      • Records Management
      • Marketing Communications
      • Online Report Viewing
      • E-mail Archival
      • Business Continuity
    • Financial
      • Loan Origination, Signature Verification
      • Credit Card Dispute Handling
      • Retirement Account Management
      • Mutual Fund Processing
      • Leasing and Contract Management
    • Insurance
      • Claims, Underwriting, Policy Service
      • Agent Management
    • Government
      • Law Enforcement and Land Records
      • Permits, Licensing, Vital Records
      • Constituent Correspondence & Services
      • Tax Form Capture
    • Manufacturing
      • Engineering Documentation, Change Management and ISO 9000 Cert.
      • Product Management
      • Customer and Channel Service
      • SAP Data Archiving and Document Management
    • Retail/Distribution
      • Vendor Management
      • Claims and Loyalty Management Programs
      • Web Site Content Mgmt.
      • Digital Content Commerce
    • Transportation:
      • Proof of Deliveries, Service
      • Driver Management
    … Content-enabled Business Processes…Electronic Statements … e-Mail Management…e-Records Management… Transforming Processes with Digital Content
  • 32. Classic Data and Content Management Converging
    • Content Manager provides more “Data Management” services
      • Transactional and referential integrity
      • Optimized query
      • Scalable storage
    • RDB users want more “Content Management” services
      • Check-in, check-out and versioning
      • Integrated hierarchical storage management
      • Non-normal (i.e. hierarchical) metamodel
    • XML is accelerating this convergence
      • Sometimes it’s data – other times it’s content
  • 33. So, are we done? No!
  • 34. Research Challenge #3
    • Every one of us should know Content APIs as well as we do SQL
    • Content Management has VERY different requirements than the classic database toolkit of
      • Short atomic transactions with two phase commit
      • Two phase locking
      • B-tree indexing
      • Cursors
      • … .
  • 35. Research Challenge #3
    • We need to learn what managing content is all about, what is needed and forge new models:
        • Query and client interaction
        • Versioning
        • Foldering
        • Sub-document authorization
        • Sub-document checkin/out
        • Text search and analytics
  • 36. The World of Data is Changing
    • Hardware gives us more choices than ever before
    • Cost of labor is rising
    • Data isn’t all (or even mostly) in the database
    • Data access paradigms evolving
    • Customers want integration and FAST access to the data they want
  • 37. Research Challenge #4: Data Interaction Paradigms – What’s Next? (chart: richness of data, from strings and numbers to text to audio, video and sensor data, versus ease of data access, from programs and spreadsheets to relational DBs, the Web and search engines; next step: speech enhanced with semantics?)
  • 38. Embracing richer data types and functionality in information management middleware: speech technology will enable new and easier applications (diagram: customers reach the business through integrated interaction channels -- IM, web, kiosks, email, SMS, mail, fax, voice, call center, IVR, branch office, face to face -- feeding contact points, scheduling and coordination, business processes and the workforce; analytics across data types -- web logs, speech transcriptions, call logs -- and business intelligence run on shared infrastructure and business logic)
  • 39. Speech Recognition Technology Evolution: towards superhuman speech recognition across channel, domain and environment. 1997-2001: cooperative user, immediate feedback, high-bandwidth microphone. 2007-2010: multiple channels, variable noise, overlapping talkers, accented speech, multiple domains; transparent to the user, no feedback; graded challenges, discrimination, fusion of multiple systems, recognizer adaptation. Approach: data driven with careful statistical modeling, a wide variety of test data, regular benchmarks of human performance, basic principles and new techniques. Goal: surpass human ability to accurately transcribe speech across multiple domains and environments. IBM value: this level of performance is required to achieve truly pervasive conversational technologies.
  • 40. Text To Speech Generation Technology Impressive quality Can you guess what is TTS and what is recorded speech?
  • 41. Analytics bridge the Unstructured & Structured worlds. Unstructured information (text, chat, email, audio, video) is high-value, the most current content, and the fastest growing -- BUT it is buried in huge volumes with lots of noise, implicit semantics, and inefficient search. Structured information (indices, DBs, KBs) offers explicit structure, explicit semantics, efficient search and focused content. UIMA bridges the two:
    • Identify Semantic Entities, Induce Structure
      • Chats, Phone Calls, Transfers
      • People, Places, Org, Events
      • Times, Topics, Opinions, Relationships
      • Threats, Plots, etc.
    UIMA - The Big Picture
  • 42. Unstructured Information Management Architecture
    • Common Research infrastructure for advancing Text Analysis and NLP capability
      • Promotes re-use of best-of-breed components
      • Promotes combination hypothesis through ease of integration
    Diagram: UIMA building blocks, turning unstructured information and structured data into relevant application knowledge
      • UIM solutions: question answering, e-commerce, national and business intelligence, bioinformatics, technical support
      • Standard and specialized application libraries: provide basic functions common to a broad class of application libraries and applications (e.g. glossary extraction, taxonomy generation, classification, translation, etc.)
      • Semantic search engine: token and concept indexing; queries over keywords, concepts, spans and ranges return a ranked hit list
      • Document & meta-data store: documents with meta-data based on key-value pairs; enables view and collection management
      • (Text) Analysis Engines (TAEs): combinations of analysis engines employing a variety of analytical techniques and strategies
      • Structured knowledge access: Knowledge Source Adapters (KSAs) deliver content from many structured knowledge sources according to central ontologies
      • Collection Processing Manager; KSA Directory Service (dynamic query and delivery of KSAs); TAE Directory Service (dynamic query and delivery of TAEs)
  • 43. UIMA Component Architecture from “Source to Sink” (diagram: a Collection Processing Engine contains a Collection Reader and CAS Initializer that ingest text, chat, email, audio and video; an Aggregate Analysis Engine built from Analysis Engines and Annotators that work over the CAS; and CAS Consumers that feed ontologies, indices, DBs and knowledge bases)
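The flow in this diagram can be sketched in a few lines of Python. This is a conceptual illustration only, not the real UIMA SDK (which is Java); the class and method names are invented to mirror the roles named on the slide: a collection reader produces CASes, analysis engines annotate them, and CAS consumers build structured resources.

```python
# Conceptual "source to sink" sketch of a UIMA-style Collection Processing
# Engine. Names are illustrative, not the actual UIMA SDK API.
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class CAS:
    """Common Analysis Structure: the artifact plus its evolving analysis."""
    text: str
    annotations: List[dict] = field(default_factory=list)


class CollectionReader:
    """Acquires documents from an external source and wraps each in a CAS."""
    def __init__(self, documents: Iterable[str]):
        self.documents = documents

    def read(self) -> Iterable[CAS]:
        for doc in self.documents:
            yield CAS(text=doc)


class AnalysisEngine:
    """Wraps an annotator function that adds annotations to the CAS."""
    def __init__(self, annotator):
        self.annotator = annotator

    def process(self, cas: CAS) -> CAS:
        self.annotator(cas)
        return cas


class CASConsumer:
    """Uses per-document analysis results to build a structured resource."""
    def __init__(self):
        self.index = {}

    def consume(self, cas: CAS):
        for ann in cas.annotations:
            self.index.setdefault(ann["type"], []).append(ann["covered_text"])


def tokenizer(cas: CAS):
    """A trivial annotator: records each whitespace-separated word as a Token."""
    for word in cas.text.split():
        cas.annotations.append({"type": "Token", "covered_text": word})


# A minimal Collection Processing Engine: reader -> analysis engines -> consumer.
reader = CollectionReader(["UIMA bridges unstructured and structured data"])
engines = [AnalysisEngine(tokenizer)]
consumer = CASConsumer()

for cas in reader.read():
    for engine in engines:
        engine.process(cas)
    consumer.consume(cas)

print(consumer.index)
```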
  • 44. What can analytics do?
    • Language, Speaker Identifiers
    • Part of Speech Detectors
    • Document Structure Detectors
    • Tokenizers, Parsers, Translators
    • Named-Entity Detectors
    • Sentiment Detectors
    • Face Recognizers
    • Relationship Detectors
    • Classifiers
  • 45. Basic Building Blocks: Annotators iterate over a document to discover new annotations based on existing ones and update the Common Analysis Structure (CAS). (Diagram: for an example sentence about Governor Jones visiting an embassy in Japan, a parser adds NP, VP and PP annotations; a named entity annotator adds Gov Title, Person, Gov Official and Country annotations; and a relationship annotator adds a Located In relation with Arg1:Entity and Arg2:Location.)
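A small sketch of the stand-off annotation idea described on this slide and in the notes above: annotations are records that point into the document by character span rather than being embedded in it, and later annotators read earlier annotations to add new ones. The example sentence, offsets and type names are illustrative, not the real UIMA CAS API.

```python
# Stand-off annotations: records that label character spans of the original
# text without modifying it. Types and the example sentence are illustrative.
from dataclasses import dataclass


@dataclass
class Annotation:
    type: str    # e.g. "Token", "NP", "Person", "LocatedIn"
    begin: int   # character offset where the span starts
    end: int     # character offset where the span ends (exclusive)

    def covered_text(self, text: str) -> str:
        return text[self.begin:self.end]


text = "Governor Jones visits the embassy in Japan"
cas = []  # the shared structure all annotators read from and write to

# A named-entity style annotator records the spans it recognizes.
cas.append(Annotation("GovernmentTitle", 0, 8))                     # "Governor"
cas.append(Annotation("Person", 9, 14))                             # "Jones"
cas.append(Annotation("Country", text.index("Japan"), len(text)))   # "Japan"

# A relationship annotator reads existing annotations and adds a new one
# linking two of them (here, simply spanning from the person to the country).
person = next(a for a in cas if a.type == "Person")
country = next(a for a in cas if a.type == "Country")
cas.append(Annotation("LocatedIn", person.begin, country.end))

for a in cas:
    print(f'{a.type:16s} -> "{a.covered_text(text)}"')
```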
  • 46. Research: The Combination Hypothesis -- if intimately integrated, various KM technologies (data mining, information retrieval, string & graph algorithms, UI / human factors, privacy & security, machine learning, text analytics & NLP) will provide higher quality results (accuracy, recall, etc.). (Diagram: independent analyzers -- a source feeding an indexer, entity extractor and classifier separately, each producing its own result for the application -- versus analyzers combined via a common annotation structure (UIMA), where the source flows through the entity extractor, classifier and indexer to a single result for the application.)
  • 47. Research Challenge #4
    • Include speech and text data and derived text analytics and context in our scope of data research work. How does that change:
      • Access techniques,
      • Search and optimization algorithms
      • Result sets and interaction mechanisms
      • Storage and indexing
      • Models of data
      • Framework for derived information, ways to query and search it
      • System architecture
  • 48. The World of Data is Changing
    • Hardware gives us more choices than ever before
    • Cost of labor is rising
    • Data isn’t all (or even mostly) in the database
    • Data access paradigms evolving
    • Customers want integration and FAST access to the data they want
  • 49. Data Heterogeneity in Enterprises Today data is in disparate locations; it is not easily accessible nor harnessed for key information What data do we have? Where is it? How can I find it? What format is it in? Is it searchable? Does “customer” mean the same in each system? How do I reconcile differences? What applications feed data to other applications? If I change something, what breaks?
    • Proposals
    • Contracts
    • Offerings
    • Historical
    (Diagram: repositories including the Engagement workbook, Ledger, Lessons Learned, Intellectual Capital, Marketing, Delivery, Claim, CCMT and Source, held in Notes databases and .DOC, .LWP, .XLS and .123 files)
    • Competitors
    • Pricing
    • Demand
    • Offerings
    CLIENT DATA: demographics, configurations, current costs, financial, legal, existing contracts, RFI, etc. (in .DOC, .LWP, .XLS and .123 files; Sage 123 654…)
    • Acct. Teams
    • SD
    .DOC .LWP
    • Proposals
    • Contracts
    • Negotiations
    Engagement Workbook
  • 50. Metadata: Today and Tomorrow
    • Identifying (current focus): store, search, discover
    • Integrating (current challenge): linkages within domains, linkages across domains
    • Understanding (opportunity): definitions, taxonomies, complex relationships, sophisticated semantics
  • 51. Metadata: Spectrum. Metadata describes and adds meaning to data and business processes: information structures, vocabularies and concepts, and information about applications, processes and resources. The examples below span unstructured information (text/documents, images, rich streaming media) and structured information (relational data, XML documents, software assets, web services, people proxies, system information):
    • Text/Documents
      • Names, locations, phone numbers, language, …
    • Images
      • Name, date, time
      • Characteristics
    • Rich, Streaming Media
      • Location, timing, scene identification, participants, actions, ...
      • Formats
    • Relational data
      • Column attributes
      • Table values (domains)
    • XML Documents
      • Tags, XML Schema
    • Software Assets
      • Date, version, …
      • Interface definitions
    • Web Services
      • Name, attributes, …
      • Interface definitions
    • People Proxies
      • Name, location, serial no. …
    • System Information
      • System resource
      • Operating environment
  • 52. Metadata Associated With Digital Camera Pictures
    • Future camera-generated metadata:
    • Latitude
    • Longitude
    • Altitude
    • GPS time (atomic clock)
    • GPS satellites used for measurement
    • Measurement precision
    • Speed of GPS receiver
    • Direction of movement
    • Direction of image
    • Name of GPS processing method
    • GPS differential correction
    • ~ 30 Tags
    Camera-generated metadata (~100 tags, plus an associated audio file): File Name: 102-0299_IMG.JPG; Camera Model Name: Canon PowerShot S400; Shooting Date/Time: 1/12/2004 12:12:07 PM; Shooting Mode: Auto; Tv (Shutter Speed): 1/200; Av (Aperture Value): 2.8; Metering Mode: Evaluative; Focal Length: 7.4mm; Flash: Off; White Balance: Auto; AF Mode: Single AF; Drive Mode: Single-frame shooting
    • Additional tags (User input)
    • Title, Comments, Favorite Picture, Keywords, PrintMe, Categories, PrintOrder, etc
    EXIF 2.2 Standard -- Exchangeable Image File for Digital Cameras
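Most of the camera-generated tags listed above can be read programmatically. A short sketch using the Pillow library (assuming Pillow is installed; the file path is a placeholder):

```python
# Read EXIF tags from a JPEG using Pillow and print them with readable names.
# The path is a placeholder; user-entered tags (title, comments) and GPS data
# appear only if the camera or user actually recorded them.
from PIL import Image
from PIL.ExifTags import TAGS

path = "102-0299_IMG.JPG"  # placeholder: any EXIF-bearing JPEG

with Image.open(path) as img:
    exif = img.getexif()  # mapping of numeric tag id -> value
    for tag_id, value in exif.items():
        name = TAGS.get(tag_id, tag_id)  # translate the tag id to a name
        print(f"{name}: {value}")
```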
  • 53. Metadata Explosion: Case Study at Customer
    • 1000s of systems in silos; only one third are interconnected.
    • Those “connected systems” create over 2000 interfaces.
    • Significant maintenance or enhancement efforts.
    • Significant replication required thru “monolith” apps to accomplish any sharing.
    • No archiving or sharing strategy for decommissioned systems … data and “lessons lost”.
    • No common convention for enterprise elements … we are speaking different “languages”.
    • No traceability for security for data access. We have a huge vulnerability.
    • Access, update, backup and recovery, synchronization are provided through a complex tapestry of processes and technologies.
    • Inconsistent development, deployment, monitoring, tooling.
    • Cost prohibitive, redundant and/or specialized skills.
    • Escalating and redundant storage costs for HW/SW/People.
    • If we do nothing … it will only become worse!
    • Lag times in order to cash and order taking
    • Inability to handle any call from any center
    • Incomplete or errant customer information at point of contact
    • Unable to perform real time integration of data at point of use regardless of data form
    • Unable to provide complete / correct information to drive decisions; even in batch
    • No single view of our customers
    • Degraded (or errant) business decision making due to data corruption, data access (depth and breadth) and poor synchronization (event driven)
    • The problem is so big that it has to be approached incrementally!
  • 54. Evolution of Metadata (timeline, 1970 to 2010, with increasing business value of metadata): hierarchical data model -- rigid metadata, single application; relational data model -- rigid metadata, integration within the enterprise; extensible data model (XML) -- flexible metadata, integration within an industry; domain-specific ontologies -- flexible metadata, cross-industry integration. The shift is from syntactic annotations of data (what this data represents) to semantic annotations of data (what this data means).
  • 55. Metadata from UNSTRUCTURED data is growing exponentially (same timeline as the previous slide). How to make metadata from unstructured data ACTIONABLE?
  • 56. Research Challenge #5
    • Treat metadata as a first class research area
      • Access
      • Search
      • Sharing
      • Distribution
      • Consolidating
      • Aggregating
      • Deriving and Discovering new Metadata
      • Querying
      • ……
    • Don’t ignore existing in-place data sources and metadata
  • 57. The World of Data is Changing
    • Hardware gives us more choices than ever before
    • Cost of labor is rising
    • Data isn’t all (or even mostly) in the database
    • Data access paradigms evolving
    • Customers want integration and FAST access to the data they want
  • 58. Five Data Research Challenges for the Next Decade:
    • Reexamine DBMS architecture and invent ways to scale more, without sacrificing user-visible availability or performance
    • Address Total Cost of Ownership
    • Learn what managing content is all about, what is needed and forge new models
    • Include speech and text data and derived text analytics and context in our scope of data research work
    • Treat metadata as a first class research area