Innovations in Data and Information Mining



  1. Innovations in Data and Information Mining. MBA Technology Conference, March 28, 2007. Linda C. Simmons, IBM Global Business Services.
  2. Who is IBM Research?
     • Unparalleled: the largest private research institution in the world
     • Annual budget of almost $5B
     • Eight labs across the world on all continents
     • Over 3,000 researchers
     • 5 Nobel Prize winners, 4 US National Medals of Technology, 3 National Medals of Science, 19 memberships in the National Academy of Sciences and more than 47 members of the National Academy of Engineering
     • Skills in mathematics, computer science, physics, operations research and many more
     • Over 30,000 US patents since 1993
  3. Data Mining Research
     Goals
     • Develop effective tools and techniques for enabling a wide variety of Business Intelligence applications and solutions.
     • Develop techniques for extracting actionable insights from structured (data) and unstructured (text) information.
     • Enable analytics for data and text within large-scale data and computing infrastructure environments.
     • Work with clients to drive our research agenda for developing novel data mining solutions.
     • Have data mining impact business and industry problem-solving in new and unique ways.
     Current Activities
     • Basic Research: Cost-Sensitive Learning, Active Learning, Reinforcement Learning, Regularization Methods.
     • Systems Research: developing highly scalable and fully automated predictive modeling capabilities
         • Data-parallel architectures for leveraging database systems
         • Compute-parallel architectures for leveraging grid computing
     • Solutions and Services: Customer Insights, Business Forecasting, Risk Management, etc.
  4. Multi-faceted Approach to our Data Mining Research Agenda
     [Diagram; labels as recovered from the slide]
     • Industries: Retail, Manufacturing, Banks and Insurance, Travel and Transport, Government
     • Customer Interaction: GBS - ODIS
     • IBM Businesses: service and maintenance; manufacturing, procurement, distribution; product design, forecasting, pricing, and fulfillment
     • Theory Advancement: Academic Community, Professional Societies, Open Source
     • Software Development: Parallel Databases, OLAP Analytics, Parallel Data Mining, Software Group, Unstructured Information Management
     • Text Analytics and Mining, Database Management, Knowledge Discovery and Data Mining, Knowledge Management, Natural Language Processing, Information Retrieval
  5. Emerging Innovation from Research
     • Supply chain solutions: Optimize, plan, model and analyze supply chain and transportation processes.
     • Advanced call center automation: Design and help deploy natural language voice-recognition and voice-mining solutions.
     • Advanced networking services: Apply cutting-edge models, algorithms, software and expertise to help design, monitor and optimize enterprise networks and networked applications, e.g. storage area networks and IP telephony.
     • Business optimization and analytics: Optimize, plan, model, analyze and transform businesses to on demand models.
     • Collaboration: Realize the value of collaboration through a skilled assessment of the current environment for collaboration, methodologies that document end-user requirements for collaboration, strategic design for visualizing future collaborative states, and tools that support human communication.
     • Security and privacy: Assess, design and implement enhanced security processes and tools.
  6. Emerging Innovation in Research (2)
     • e-business systems and architecture: Design and help deploy applications, middleware and Web content.
     • Grid and autonomic solutions: Apply cutting-edge models, software, designs and expertise to help quickly and accurately evaluate, design, pilot and optimize grid and autonomic capability in client distributed-computing systems.
     • Information mining and management: Gain business insight from structured and unstructured data, text, voice, video and more.
     • Mobile enablement: Apply new wireless and pervasive technology to improve security, reliability and integration.
     • Product lifecycle management: Improve product development processes through better tools, methodologies and collaboration.
     • Technology-based learning: Deploy prototype learning technology that can help improve learning effectiveness, increase accountability and boost productivity.
  7. Continual Optimization
     Client: Big City Coach, a high-end car service company, has a few hundred cars and drivers (more drivers than cars), which may service 1000 rides/day in several big cities nationwide.
     Challenge: This ground transportation leader wanted to increase vehicle and driver utilization, push customer service to new levels, and lower operating costs. The mathematical optimization concept came from discussions with our Research Math team.
     Solution: Developed a Fleet Optimization System (FOS), which gathers offline and real-time information from a variety of internal and external sources and produces a near-optimal staffing plan. FOS made possible real-time adjustment of schedules and resource allocation.
     Benefits:
     • Increased vehicle utilization through better visibility of scheduling information
     • Increased efficiency: less downtime for drivers, more effective use of partner resources
     • Improved customer service and satisfaction due to real-time reallocation of cars and drivers
     • Better resource management, especially during peak traffic times, bad weather, and delays
  8. Text Analytics for a Financial Communicator
     Client: A leading financial communications powerhouse which prides itself on providing an unequaled mix of electronic trading, data, analytics, calculation engines, and straight-through processing.
     Challenge: The company was interested in validating its hypotheses around text analytics, which enable computers to read documents and derive value from the output. The intent was to use text analytics to automate the data collection and analysis process.
     Solution: Strategy and Change Consulting, powered by computational linguists, performed in-depth analyses around the new technologies and solutions that the firm had been evaluating.
     Benefits: The firm now has validated and enhanced new product plays that it can leverage; in addition, it is realizing staff efficiencies that enable it to do more with the same number of people. Other benefits include data quality and time-to-market improvements. Overall, it can now better compete in the marketplace.
  9. Underwriting Profitability Analysis
     Client: Famous Group, a subsidiary of A Big Finance Company.
     Challenge: Automatic discovery of all credible and actionable risk groups in auto insurance policyholders to improve premium pricing, underwriting rules, and new business development.
     Solution: A data warehouse was put together that stored four years of 300 historical factors on 2 million policyholders, claims, and insured assets (autos). A new predictive modeling technology was developed that was optimized for discovering homogeneous risk groups from this data. The generated models were represented as if-then rules.
     Benefits: Of all the rules that were generated, 43 were statistically significant and not known before. Marketing benefits analysis of 6 of these 43 discoveries suggested a $2 million profit enhancement over a 2 million policyholder base.
  10. Customer Insight: Personalization of Product Recommendations
     Client: A Big UK Grocery.
     Challenge: Cross-sell and up-sell services to consumers with handheld PDAs for anytime, anywhere shopping.
     Solution: A solution was developed in which recommendations are generated by matching products to customers based on the expected appeal of the product and the previous spending of the customer. A combination of associations mining in the product domain and clustering in the customer domain is used for developing customer-specific recommendations.
     Benefits: In a pilot program with several hundred customers, a 1.8% boost in revenue was observed as a result of purchases made directly from the list of recommended products.
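
The product-domain half of the recommendation idea on slide 10 can be illustrated with a minimal sketch. It assumes pairwise association confidence (how often item B appears in baskets that contain item A) as the appeal score, and it omits the customer clustering step entirely. The function names, the basket data, and the scoring rule are hypothetical and are not taken from the deployed system.

```python
from collections import Counter
from itertools import permutations


def pair_confidence(baskets):
    """Estimate confidence(a -> b) = count(a and b) / count(a) from baskets."""
    item_count = Counter()
    pair_count = Counter()
    for basket in baskets:
        items = set(basket)  # ignore repeated items within one basket
        item_count.update(items)
        pair_count.update(permutations(items, 2))  # all ordered pairs (a, b)
    return {(a, b): c / item_count[a] for (a, b), c in pair_count.items()}


def recommend(history, confidence, top_n=3):
    """Rank products the customer has not bought by summed confidence
    from the products already in the customer's purchase history."""
    scores = Counter()
    for (a, b), conf in confidence.items():
        if a in history and b not in history:
            scores[b] += conf
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


# Hypothetical baskets, for illustration only.
baskets = [
    ["tea", "milk"],
    ["tea", "milk", "biscuits"],
    ["tea", "biscuits"],
    ["bread", "butter"],
]
confidence = pair_confidence(baskets)
suggestions = recommend({"milk"}, confidence)
```

A fuller version in the spirit of the slide would first cluster customers by previous spending and compute the associations per cluster.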
  11. Customer Insight: Lifetime Value Management
     Client: A Fifth Avenue Retailer.
     Challenge: Optimize cross-channel customer messaging to maximize customer lifetime value.
     Solution: A reinforcement-learning-based methodology was developed to model enterprise-customer interactions. The methodology discovers customer responses on one channel that result from a contact on another channel. The technology is highly scalable, so it can handle the large volumes of data that are typically available in a cross-channel scenario.
     Benefits: The system was benchmarked against the retailer’s current methodology for customer relationship management in the direct mail and store channels. Initial results suggest a 7-8% increase in store revenues.
  12. Passenger-Based Airline No-show Prediction
     Client: Air Elsewhere.
     Challenge: Using detailed information on each passenger, predict the number of passengers who will not show for a flight. Accurate no-show forecasts are an essential input to airline revenue-management systems.
     Solution: Two different predictive models were built using passenger-based features extracted from over 1M passenger records. The first model used a segmented Naïve Bayes approach (ProbE) to estimate each passenger’s probability of not showing. The second model predicted the no-show fraction directly using a novel aggregation method for an ensemble of probabilistic models.
     Benefits: Various evaluation metrics demonstrated that the passenger-based models are more accurate than conventional history-based statistical models. A simple revenue model suggested that use of these models could produce between 0.4% and 3.2% revenue gain over the conventional model.
     [Figure: fraction of PNR no-shows versus fraction of booked PNRs (sorted by no-show probability). Curves: Passenger-Level [ProbE], Passenger-Level [APMR], Passenger-Level [C4.5], Historical Model [Statistical], Random.]
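
The first modeling idea on slide 12 (estimate a no-show probability per passenger, then aggregate to a flight-level forecast) can be sketched as follows. This is a plain categorical Naive Bayes with Laplace smoothing, not the segmented ProbE model named on the slide, and the feature name `ticket` and the training records are hypothetical.

```python
import math
from collections import defaultdict


def train_naive_bayes(records, labels, alpha=1.0):
    """Count class and (feature, value, class) frequencies.

    records: list of dicts mapping feature name -> categorical value
    labels:  list of 0/1 flags, where 1 means the passenger did not show
    """
    class_counts = {0: 0, 1: 0}
    value_counts = defaultdict(int)  # (feature, value, label) -> count
    values_seen = defaultdict(set)   # feature -> distinct values observed
    for rec, y in zip(records, labels):
        class_counts[y] += 1
        for f, v in rec.items():
            value_counts[(f, v, y)] += 1
            values_seen[f].add(v)
    return class_counts, dict(value_counts), dict(values_seen), alpha


def no_show_probability(model, rec):
    """Posterior P(no-show | features) under the Naive Bayes assumption."""
    class_counts, value_counts, values_seen, alpha = model
    total = class_counts[0] + class_counts[1]
    log_post = {}
    for y in (0, 1):
        lp = math.log((class_counts[y] + alpha) / (total + 2 * alpha))
        for f, v in rec.items():
            n_values = max(len(values_seen.get(f, ())), 1)
            num = value_counts.get((f, v, y), 0) + alpha
            den = class_counts[y] + alpha * n_values
            lp += math.log(num / den)
        log_post[y] = lp
    m = max(log_post.values())  # subtract the max for numerical stability
    e0 = math.exp(log_post[0] - m)
    e1 = math.exp(log_post[1] - m)
    return e1 / (e0 + e1)


def expected_no_shows(model, passengers):
    """Flight-level forecast: the sum of per-passenger probabilities."""
    return sum(no_show_probability(model, p) for p in passengers)


# Hypothetical training data, for illustration only.
records = [{"ticket": "refundable"}, {"ticket": "refundable"},
           {"ticket": "nonrefundable"}, {"ticket": "nonrefundable"}]
labels = [1, 1, 0, 0]
model = train_naive_bayes(records, labels)
forecast = expected_no_shows(model, [{"ticket": "refundable"},
                                     {"ticket": "nonrefundable"}])
```

Summing the probabilities gives the expected number of no-shows for a flight; dividing by the number of booked passengers gives the no-show fraction.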
  13. Call Center Text Mining
     [Diagram: TAKMI text analysis]
     • Inputs: vast amounts of textual data (internal reports, patents, call center records, customers’ messages, etc.)
     • Knowledge acquisition by analysts: hidden regularities and facts, trends in contents, features of specific topics, relationships with other knowledge
     • Notes taken by CSRs become actionable information:
         • What customers are asking about and what they need to know: better self-service and lower costs for the business
         • Information about customers’ experiences with products or services: better products and services and increased customer satisfaction for the business
     IBM Research leads in Speech Recognition, Natural Language Understanding, Dialog Management, Language Generation and Speech Synthesis. Our approach combines the use of advanced statistical and machine learning techniques with sophisticated grammars, digital dictionaries and encyclopedias.
  14. We have new approaches to Segmentation-based predictive modeling
     • Tree structure is obtained by a recursive procedure using the best univariate splits at each stage
     • The leaf nodes define a non-overlapping, exhaustive partition of the input space
     • Final model is a collection of segments with their associated segment model in each leaf node
     • Splitting condition is based on minimizing the negative log-likelihood using search algorithms
     • Final tree is determined by a stopping condition based on test set or cross-validation error
     • Out-of-memory row-scan based procedure
     • Data-partitioned parallelism
     [Figure: example tree of interior and leaf nodes with splits Rec < 6m, Rec < 3m, #kids < 2, Spend < $150, #delinq < 2, Ret < $20.]
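
One step of the tree-building procedure on slide 14, choosing the best univariate split by minimizing negative log-likelihood, might look like the sketch below. It assumes the simplest possible segment model (a constant Bernoulli probability per segment) and an exhaustive threshold search; the slide does not say which segment models or search algorithms are actually used, and the function names and data are hypothetical.

```python
import math


def bernoulli_nll(labels):
    """Negative log-likelihood of 0/1 labels under their own mean rate."""
    n = len(labels)
    k = sum(labels)
    if n == 0 or k == 0 or k == n:
        return 0.0  # pure or empty segment: the fitted model is exact
    p = k / n
    return -(k * math.log(p) + (n - k) * math.log(1 - p))


def best_univariate_split(rows, labels):
    """Return (feature_index, threshold, nll) for the split x[j] < t that
    minimizes the summed negative log-likelihood of the two segments.
    Returns None when no feature has two distinct values."""
    best = None
    for j in range(len(rows[0])):
        values = sorted({r[j] for r in rows})
        # Candidate thresholds: midpoints between adjacent distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[j] < t]
            right = [y for r, y in zip(rows, labels) if r[j] >= t]
            nll = bernoulli_nll(left) + bernoulli_nll(right)
            if best is None or nll < best[2]:
                best = (j, t, nll)
    return best


# Hypothetical rows: (recency in months, spend in dollars); label 1 = responded.
rows = [(1, 200), (2, 50), (8, 180), (9, 60)]
labels = [1, 1, 0, 0]
split = best_univariate_split(rows, labels)
```

Applying this step recursively to each resulting segment, and stopping when held-out or cross-validation error no longer improves, yields the kind of tree the slide describes.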
  15. Traditional Predictive Mining Process and Current Predictive Mining Research
     Traditional Predictive Mining Process
     • Mine historical data to train patterns/models that can predict future behavior
     • Behaviors: Response to Direct Mail, Product Quality (Defects), Declining Activity, Credit Risk, Delinquency, Likelihood to buy specific products, Profitability
     • Score with models to reflect likelihood to exhibit the modeled behavior
     • Act to optimize business objectives based on these scores
     Current Predictive Mining Research
     • Automated methods for embedding in solutions
     • Integrating structured and unstructured data
     • Absorbing new ideas from learning theory and computational statistics for addressing typical issues with business data: Missing values, Data sparseness, High Dimensionality
     • Techniques: Support Vector Machines, Predictive Rule Induction, Regularization, etc.
     • Streaming data mining: Online and incremental mining of streaming data
     • Outlier Detection: Detecting anomalies and abnormalities in data
  16. Security and Privacy Initiatives: Financial Services
     • Secure Hardware Embedded Analytics: Leveraging cryptographic secure processing technology
     • Sovereign Information Integration: Need-to-know information sharing
     • Privacy Preserving Data Mining: Assumes no trusted third party
  17. Secure Federated Mining Architecture
     • Secure processor → Ultimate data security.
     • Memory-light data mining → Sophisticated analytics can run inside processor.
     • Memory-light DB2 → Secure data federation and query processing capabilities across multiple data sources.
     [Diagram: databases at Enterprise 1 through Enterprise N send encrypted data to a secure processor running memory-light data mining and memory-light DB2. Data are only decrypted inside the processor, which enables data analysis inside the secure processor.]
  18. Intra-bank Service Center Scenarios
     Scenarios: Anti-Money Laundering, Credit Risk Rating, CRM
     • Guarantees confidentiality
         1. Analyzing data from different LOBs together to know customers.
         2. Legislation limiting data sharing among LOBs.
     • Guarantees that data will only be used for specialized purposes.
         • Customers are more likely to allow banks to share their data among LOBs with this condition.
     • Data federation allows multiple LOBs to share data without having a central data warehouse.
     [Diagram: LOB 1 through LOB N send encrypted data to an Intra-Bank Data Centralizer running Secure Federated Mining.]
  19. EPAL: Enterprise Privacy Authorization Language
     The Enterprise Privacy Authorization Language (EPAL) is a formal language to specify fine-grained enterprise privacy policies. It concentrates on the core privacy authorization while abstracting from all deployment details such as data model or user authentication.
     Implementing Privacy Management Using EPA
     [Diagram; labels as recovered from the slide: Enterprise, Customer, Employee, CPO, Privacy Management Console, Privacy Management Submission Monitor, Legacy Applications, Privacy Audit Log, E-P3P Manager, Data Server, Policy, Consent, Privacy Management Enforcement Monitors, Obligations Queue, Web Data, Legacy Data.]
     ► EPAL specs published (07/2003)
     ► Java ref implementation of EPAL & XACML
         ■ On alphaWorks
     ► P3P ↔ EPAL mapping
     ► WS Privacy specs and bindings: ongoing
  20. Hippocratic Database
     • Vision: Database systems that take responsibility for the privacy and ownership of data they manage, while not impeding the flow of information.
     • Architectural principles derived from principles behind current legislation.
     [Diagram; labels as recovered from the slide: Privacy Policy, Data Collection, Queries, Other; Attribute Access Control, Privacy Constraint Validator, Data Collection Analyzer, Privacy Metadata Creator, Query Intrusion Detector, Data Retention Manager, Data Accuracy Analyzer, Audit Info, Encryption Support, Record Access Control, Privacy Metadata, Audit Trail, Store.]
     Example table shown on the slide:
         #   Name      Age   Phone
         1   Adams     10    111-1111
         3   -         -     333-3333
         4   Daniels   40    -
     [Chart: query execution time (seconds) versus application selectivity (0.01 to 1) for original queries and rewritten queries. Table size: 10 million, no index.]
  21. Thank you! Contact Information: Linda C Simmons, IBM Global Business Services. Office 904.491.0410. Mobile 904.610.3723.