Data Mining: Crossing the Chasm Rakesh Agrawal IBM Almaden Research Center
  1. Data Mining: Crossing the Chasm
     Rakesh Agrawal, IBM Almaden Research Center
  2. Thesis
     - The greatest challenge facing data mining is to make the transition from being an early-market technology to a mainstream technology
     - We have the opportunity to make this transition successful
  3. Outline
     - Chasm in the technology adoption life cycle, à la Geoffrey Moore †
     - Experience with Quest/Intelligent Miner
     - Ideas for successful chasm crossing
     † Geoffrey A. Moore. Crossing the Chasm. Harper Business. http://www.chasmgroup.com
  4. Technology Adoption Life Cycle
     [Figure: adoption curve divided into five groups]
     - Innovators (techies): Try it!
     - Early Adopters (visionaries): Get ahead of the herd!
     - Early Majority (pragmatists): Stick with the herd!
     - Late Majority (conservatives): Hold on!
     - Laggards (skeptics): No way!
     The psychographic profile of each group is different.
  5. Innovators: Technology Enthusiasts
     - Intrigued by any fundamental advance in technology
     - Like to alpha test new products
     - Can ignore the missing elements
     - Want access to top technologists
     - Want no-profit pricing (preferably free)
     Gatekeepers to early adopters
  6. Early Adopters: Visionaries
     - Driven by vision of dramatic competitive advantage via revolutionary breakthroughs
     - Great imagination for strategic applications
     - Not so price-sensitive
     - Want rapid time to market
     - Demand high degree of customization
     Fund the development of the early market
  7. Early Majority: Pragmatists
     - Want sustainable productivity improvement through evolutionary change
     - Astute managers of mission-critical apps
     - Understand real-world issues and tradeoffs
     - Focus on proven applications; want to see the solution in production
     Bulwark of the mainstream market
  8. Late Majority: Conservatives
     - Want to stay even with the competition
     - Risk averse
     - Price sensitive
     - Need completely pre-assembled solutions
     Extend technology life cycles
  9. Laggards: Skeptics
     - Driven to maintain status quo
     - Good at debunking marketing hype
     - Disbelieve productivity-improvement arguments
     - Can be formidable opposition to early adoption of a technology
     Retard the development of high-tech markets
  10. Crack in the curve
      [Figure: adoption curve with a chasm separating the early market from the mainstream market]
      The greatest peril in the development of a high-tech market lies in making the transition from an early market dominated by a few visionaries to a mainstream market dominated by pragmatists.
  11. Visionaries vs. Pragmatists
      Visionaries               Pragmatists
      Adventurous               Prudent
      First-strike capability   Staying power
      Early buy-in              Wait-and-see
      State of the art          Industry standard
      Think big                 Manage expectations
      Spend big                 Spend to budget
  12. Is data mining following this curve?
      - Yes!!!
      - My personal viewpoint based on Quest/Intelligent Miner experience
  13. Quest
      - Started as a skunkworks project in the early nineties
      - Inspired by needs articulated by industry visionaries:
        - Transaction data collected over a long period
        - Current tools/SQL don't cut it
        - About ready to throw the data away
  14. Approach
      - Examine "real" applications
      - Identify operations that cut across applications
      - Design fast, scalable algorithms for each operation
      - Develop applications by composing operations
  15. Operations
      - New operations (emphasis: completeness, scalability)
        - Associations
        - Sequential patterns
        - Similar time series
      - Adopted from statistics/learning (emphasis: scalability)
        - Classification
        - Clustering
        - Deviations
      http://www.almaden.ibm.com/cs/quest
  16. Bringing Quest to market
      - Visionaries who inspired Quest did not become first customers:
        - Wanted evidence that the technology "worked"
      - Frustrating attempts to interest major IBM customers:
        - Integration with existing applications
        - Too-far-out technology
        - Resistance from in-house analytic groups
  17. First hits
      - Small information-based companies who provided data in exchange for free results
      - CIO who wanted to be seen as the technology pioneer in his industry
      - CIO who wanted the success story to feature in the company's annual report
      Led to the formation of a group offering services using Quest
  18. Characteristics of engagements
      - Mostly associations and sequential patterns
      - Completeness a big plus
      - Unanticipated uses
      - Feedback for further development
  19. Into the product land
      - Formation of a small "out-of-plan" product group to productize Quest
      - Facilitated by a closet mathematician
      - Successes of the services group used for market validation
      - Continued development and infusion of technology
  20. Intelligent Miner
      - Serious product
      - Integrates technologies from various groups
      - Fast, scalable, runs on multiple platforms
      - Several "early market" success stories
      http://www.software.ibm.com/data/iminer/
  21. Are we in the chasm?
      - Perceived to be sophisticated technology, usable only by specialists
      - Long, expensive projects
      - Stand-alone, loosely coupled with data infrastructures
      - Difficult to infuse into existing mission-critical applications
  22. Chasm Crossing
      - Personal speculations on some technical challenges
      - Do not imply IBM research/product directions
  23. XML-based Data Mining Standard (1)
      - Model Building:
        - A pair of standard DTDs for each operation
        - Interchangeable library of operator implementations
      [Figure labels: Operator, Model, Parameters, Data Specs, Standard DTD (twice), Library]
      Ack: Mattos, Pirahesh, Schwenkries
  24. XML-based Data Mining Standard (2)
      - Model Deployment:
        - Mapping XML object provides mapping between names and format in the model object and the data record
        - Model could have been developed on a different system
      [Figure labels: Application, Result, Mapping, Standard DTDs, Standard DTD, Library, Model, Data Record]
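The deployment idea on the slide above can be sketched roughly as follows. This is only an illustration: the element and attribute names (`model`, `rule`, `if`, `then`, `mapping`, `field`) and the record columns are hypothetical, not taken from any actual standard DTD. The sketch assumes a model exported elsewhere as XML and a separate mapping document that ties the model's field names to the local data record.

```python
import xml.etree.ElementTree as ET

# Hypothetical model document: one association rule exported by another system.
MODEL_XML = """
<model operation="associations">
  <rule confidence="0.8">
    <if field="item_a"/>
    <then field="item_b"/>
  </rule>
</model>
"""

# Hypothetical mapping document: model field name -> column in the local data record.
MAPPING_XML = """
<mapping>
  <field model="item_a" record="bought_bread"/>
  <field model="item_b" record="bought_butter"/>
</mapping>
"""

def load_mapping(xml_text):
    """Build a dict from model field names to data record column names."""
    root = ET.fromstring(xml_text)
    return {f.get("model"): f.get("record") for f in root.findall("field")}

def apply_model(model_xml, mapping, record):
    """Return (predicted column, confidence) for each rule whose antecedent holds."""
    root = ET.fromstring(model_xml)
    results = []
    for rule in root.findall("rule"):
        antecedents = [mapping[e.get("field")] for e in rule.findall("if")]
        if all(record.get(col) for col in antecedents):
            consequent = mapping[rule.find("then").get("field")]
            results.append((consequent, float(rule.get("confidence"))))
    return results

mapping = load_mapping(MAPPING_XML)
record = {"bought_bread": True, "bought_butter": False}
predictions = apply_model(MODEL_XML, mapping, record)
print(predictions)
```

Because the application only sees the mapping and the model document, the model could in principle come from any system that emits the agreed format.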
  25. Implications
      - Standard interfaces for application developers to incorporate data mining
      - Coupling with relational databases
        - Mappings from DTDs to relational schemas
        - Implementation using existing infrastructure
  26. Data Mining Benchmarks
      - UC Irvine repository
      - Generating synthetic benchmarks modeled after real data sets is a hard problem
        - How to map names into meaningful literals
        - How to preserve empirical distributions
      Ack: Srikant, Ullman
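As a rough illustration of the "preserve empirical distributions" point, the sketch below draws a synthetic column whose value frequencies follow those of a real column. The function name `synthesize_column` and the sample data are hypothetical. It deliberately covers only the easy part: it preserves a single column's marginal distribution and ignores correlations between columns, which is where the problem becomes hard.

```python
import random
from collections import Counter

def synthesize_column(real_values, n, rng):
    """Sample n synthetic values with the same relative frequencies as real_values."""
    counts = Counter(real_values)
    values = list(counts)
    weights = [counts[v] for v in values]
    return rng.choices(values, weights=weights, k=n)

rng = random.Random(1)                      # fixed seed so the run is repeatable
real = ["a"] * 70 + ["b"] * 30              # hypothetical real column: 70% a, 30% b
synthetic = synthesize_column(real, 10000, rng)
share_a = synthetic.count("a") / len(synthetic)
print(round(share_a, 2))
```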
  27. Auto-focus data mining
      - Automatic parameter tuning
      - Automatic algorithm selection (à la join method selection in database query optimization)
      Ack: Andreas Arning
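One simple way to picture automatic parameter tuning is to let the system choose a minimum-support threshold for an associations run so that the user is not flooded with patterns. The sketch below is an assumption-laden toy, not a description of any product feature: the helper names, the candidate thresholds, and the "at most `max_patterns` pairs" criterion are all hypothetical.

```python
from itertools import combinations

def frequent_pairs(transactions, min_support):
    """Item pairs occurring in at least a min_support fraction of transactions."""
    counts = {}
    for t in transactions:
        for pair in combinations(sorted(set(t)), 2):
            counts[pair] = counts.get(pair, 0) + 1
    n = len(transactions)
    return {p: c / n for p, c in counts.items() if c / n >= min_support}

def tune_min_support(transactions, max_patterns,
                     candidates=(0.1, 0.2, 0.3, 0.5, 0.7, 0.9)):
    """Pick the lowest candidate threshold that yields at most max_patterns pairs."""
    for s in sorted(candidates):
        if len(frequent_pairs(transactions, s)) <= max_patterns:
            return s
    return max(candidates)

transactions = [
    ["a", "b", "c"],
    ["a", "b"],
    ["a", "c"],
    ["b", "c"],
    ["a", "b"],
]
chosen = tune_min_support(transactions, max_patterns=1)
print(chosen, frequent_pairs(transactions, chosen))
```

The user states an outcome (how many patterns they can digest) and the threshold is derived from it, which is the spirit of the auto-focus idea.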
  28. Web: Greatest opportunity
      - Huge collection of data (e.g. Yahoo collecting ~50GB every day)
      - Universal digital distribution medium makes data mining results actionable in fundamentally new ways
      - But watch for privacy pitfall
  29. Privacy-preserving data mining
      - Technical vs. legislated solutions
      - Implication for data mining algorithms when some fields of a data record have been fudged according to the user's privacy sensitivity
      Ack: R. Srikant
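To make the "fudged fields" question concrete, here is a minimal sketch assuming one particular fudging scheme: each user adds zero-mean uniform noise to a numeric field, with a noise width chosen by that user. The scheme, the names, and the data are illustrative assumptions only. The point is that individual values become unreliable while an aggregate such as the mean can still be estimated, so algorithms would need to work from aggregates rather than from exact records.

```python
import random

def perturb(values, spreads, rng):
    """Add uniform noise in [-spread, +spread] to each value, one spread per user."""
    return [v + rng.uniform(-s, s) for v, s in zip(values, spreads)]

rng = random.Random(0)                      # fixed seed so the run is repeatable
n = 10000
true_ages = [rng.randint(20, 70) for _ in range(n)]
# Hypothetical privacy sensitivities: 0 = share exact value, 50 = heavy fudging.
spreads = [rng.choice([0, 10, 50]) for _ in range(n)]
fudged_ages = perturb(true_ages, spreads, rng)

true_mean = sum(true_ages) / n
estimated_mean = sum(fudged_ages) / n
print(round(true_mean, 1), round(estimated_mean, 1))
```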
  30. Personalization
      - Internet might provide, for the first time, tools necessary for users to capture information about themselves and to selectively release this information †
      - Will we be providing these tools?
      † John Hagel, Marc Singer. Net Worth. Harvard Business School Press.
  31. What about Association Rules?
      - Very long patterns
      - Separating wheat from chaff
      - Principled introduction of domain knowledge
  32. What else?
      - Formal foundations of data mining
  33. Summary
      - Closely couple data mining with database systems
      - Embed data mining into applications
      - Focus on web
      - Standard interfaces
      - Benchmarks
      - Auto-focusing
      - Personalization
      - Privacy
  34. Concluding remarks
      - Data mining, a great technology
        - Combination of intriguing theoretical questions with large commercial interest in the technology
      - Poised for transitioning into mainstream technology
      - Will we rise to the challenge as a community?
  35. Acknowledgments