
VMTeknikleri



  1. 1. Data Mining Techniques (Veri Madenciliği Teknikleri) — Ali Alkan, ali.alkan@infora.com.tr, www.infora.com.tr, December 2011
  2. 2. Information matters. Why do Charles Schwab and TD Waterhouse sponsor golf tournaments? Because 70% of investors are golf-playing men over 40. Why does Sony Music advertise rap albums in magazines aimed at the elderly? Because a significant share of older people buy rap albums for their grandchildren.
     Agenda:
     1. Introduction to Data Mining
     2. Fundamental Concepts of Data Mining
     3. Data Mining Techniques in Marketing and Customer Relationship Management
     4. CRISP-DM as a Data Mining Methodology
     5. Oracle Data Miner (ODM)
     6. Basic Statistical Techniques Used in Data Mining
     7. Linear Regression
     8. Decision Trees
     9. Logistic Regression
     10. The Naive Bayes Algorithm
     11. Support Vector Machines (SVM)
     12. Basket Analysis and Association Rules
     13. Clustering and Segmentation Analysis
     14. Performance Measurement in Data Mining
     15. Project Planning Techniques and a Corporate Approach for Successful Data Mining Applications
     16. Data Cleaning and Quality Analysis for Data Mining Applications
     17. Putting Successful Data Mining Models into Production
  3. 3. 1. Introduction to Data Mining — Data mining is the process of extracting previously unknown, valid, and actionable knowledge from data.
  4. 4. The most famous data mining story. Data mining: why now? The need for data mining, and the technological advances that make data mining possible.
  5. 5. The changing dynamics of the business world: changing consumer behavior, saturated markets, rising production, the failure of traditional marketing approaches, shortening product life cycles, and intensifying competition with growing risks.
     New orientations of the business world:
     Focus on the customer — Which classes of customers do I have? How can I sell more to my existing customers? Will my customers honor their obligations to me?
     Focus on data assets
     Focus on the competition — Can I anticipate my competitors' potential strategies? Can I foresee their tactical moves?
  6. 6. Developments that make data mining possible: the spread of data warehouses, new information-technology solutions, new advances in artificial intelligence and statistics, and the growing flow of electronic data.
     Data Mining & Business Intelligence: business intelligence is the umbrella term for all processes, techniques, and tools that support managerial decision-making through information technology.
  7. 7. Data Mining & Business Intelligence — [pyramid diagram] From bottom to top: data sources (paper records, files, information providers, database systems) feed data warehouses, managed by the database administrator; querying, reporting, OLAP, and statistical data exploration serve the data analyst; data mining and knowledge discovery serve the business analyst; visualization and presentation techniques deliver results to the decision maker.
     Customer Relationship Management (CRM)
  8. 8. Customer Relationship Management (CRM) — An organization or firm that wants to build a learning relationship with its customers should: 1. Notice what the customer is doing; 2. Remember what the firm and the customer have done over time; 3. Learn from what it remembers; 4. Act On what it has learned to make the customer more profitable for the firm.
     Based on "transaction" data
  9. 9. The evolution of science: before the 1600s, empirical science; 1600-1950, theoretical science — each discipline grew a theoretical component, and theoretical models, driven by experiment and generalization, shaped our understanding. Sir Isaac Newton, 1643-1727.
  10. 10. The evolution of science (cont'd): 1950-1990, computational science — many disciplines grew a third, computational branch (physics, finance, astronomy, linguistics, etc.); computational science broadly means simulation, developed to solve complex mathematical models that have no closed-form solution.
     1990-present, data science: data pours in from new scientific instruments and simulations; storing and managing petabytes of data has become economically feasible; the Internet and computer networks make these huge data archives globally accessible; data mining has become one of our civilization's great challenges.
  11. 11. The evolution of database technology:
     1960s: data collection, database creation, the development of IMS and DBMS
     1970s: the relational data model, relational DBMS implementations
     1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.), application-oriented DBMS (spatial, scientific, engineering, etc.)
     1990s: data mining, data warehousing, multimedia databases, and Web databases
     2000s: stream data management and mining, data mining and its applications, Web technology (XML, data integration) and global information systems
  12. 12. A brief history of data mining: its roots go back 50 years. In the 1960s, data mining amounted to statistical analysis, and its pioneers were the firms building statistical software. In the late 1980s, these traditional techniques were enriched with new algorithms such as decision trees, artificial neural networks, and fuzzy logic.
     2. Fundamental Concepts of Data Mining
  13. 13. Introduction: data are at the heart of most companies' core business processes; data are generated by transactions regardless of industry (retail, insurance, ...); in addition to this internal data, there are tons of external data sources (credit ratings, demographics, etc.); data mining's promise is to find patterns in the "gazillions" of bytes.
     What is Data? A collection of data objects and their attributes. An attribute is a property or characteristic of an object (examples: eye color of a person, temperature); an attribute is also known as a variable, field, characteristic, or feature. A collection of attributes describes an object; an object is also known as a record, point, case, sample, entity, or instance. Example data set:
     Tid  Refund  Marital Status  Taxable Income  Cheat
     1    Yes     Single          125K            No
     2    No      Married         100K            No
     3    No      Single          70K             No
     4    Yes     Married         120K            No
     5    No      Divorced        95K             Yes
     6    No      Married         60K             No
     7    Yes     Divorced        220K            No
     8    No      Single          85K             Yes
     9    No      Married         75K             No
     10   No      Single          90K             Yes
  14. 14. Attribute Values: attribute values are numbers or symbols assigned to an attribute. Distinguish attributes from attribute values: the same attribute can be mapped to different value sets (height can be measured in feet or meters), and different attributes can be mapped to the same set of values (ID and age are both integers, but their properties differ — ID has no limit, while age has a minimum and maximum).
     Types of Attributes:
     − Nominal — examples: ID numbers, eye color, zip codes
     − Ordinal — examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
     − Interval — examples: calendar dates, temperatures in Celsius or Fahrenheit
     − Ratio — examples: temperature in Kelvin, length, time, counts
  15. 15. Discrete and Continuous Attributes• Discrete Attribute − Has only a finite or countably infinite set of values − Examples: zip codes, counts, or the set of words in a collection of documents − Often represented as integer variables. − Note: binary attributes are a special case of discrete attributes• Continuous Attribute − Has real numbers as attribute values − Examples: temperature, height, or weight. − Practically, real values can only be measured and represented using a finite number of digits. − Continuous attributes are typically represented as floating- point variables.Data Quality• What kinds of data quality problems?• How can we detect problems with the data?• What can we do about these problems?• Examples of data quality problems: − Noise and outliers − missing values − duplicate data 15
  16. 16. Noise: noise refers to modification of original values — examples: distortion of a person's voice on a poor phone connection, or "snow" on a television screen. [Figure: two sine waves, and the same two sine waves with noise added.]
     Outliers: outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set.
  17. 17. Missing Values — reasons: information is not collected (e.g., people decline to give their age and weight); attributes may not be applicable to all cases (e.g., annual income is not applicable to children). Handling missing values: eliminate data objects; estimate missing values; ignore the missing value during analysis; or replace with all possible values (weighted by their probabilities).
     Duplicate Data: a data set may include data objects that are duplicates, or near-duplicates, of one another — a major issue when merging data from heterogeneous sources (example: the same person with multiple email addresses). Data cleaning is the process of dealing with duplicate-data issues.
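Two of the missing-value strategies above — eliminating data objects and estimating missing values — can be sketched in a few lines. This is a minimal illustration, not a library API; the function names and the tiny data set are invented for the example.

```python
# Sketch of two missing-value strategies: drop incomplete records,
# or impute a numeric field with the mean of its known values.
def drop_incomplete(records):
    """Eliminate data objects that contain any missing (None) value."""
    return [r for r in records if None not in r.values()]

def impute_mean(records, field):
    """Estimate missing values for one numeric field using the field's mean."""
    known = [r[field] for r in records if r[field] is not None]
    mean = sum(known) / len(known)
    return [dict(r, **{field: mean if r[field] is None else r[field]})
            for r in records]

data = [
    {"age": 25, "income": 40_000},
    {"age": None, "income": 55_000},   # age missing
    {"age": 40, "income": None},       # income missing
]
print(len(drop_incomplete(data)))          # only the complete record survives
print(impute_mean(data, "age")[1]["age"])  # missing age becomes (25 + 40) / 2
```

Which strategy is appropriate depends on how much data you can afford to lose and whether the missing values are random or systematic.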
  18. 18. Data Mining’s Biggest Challenge• The largest challenge a data miner may face is the sheer volume of data in the data warehouse.• It is quite important, then, that summary data also be available to get the analysis started.• A major problem is that this sheer volume may mask the important relationships the data miner is interested in.• The ability to overcome the volume and be able to interpret the data is quite important.Data Mining main species• Directed – Attempts to explain or categorize some particular target field such as income or response.• Undirected – Attempts to find patterns or similarities among groups of records without the use of a particular target field or collection of predefined classes. 18
  19. 19. Data Mining Tasks• Classification – example: Credit scoring• Prediction – example: predict credit card balance transfer average amount• Affinity Grouping – Example: people who buy X, often buy Y also with probability Z%• Clustering – similar to classification but no predefined classes• Description and Profiling – behavior begets an explanation such as “More guys prefer In-n-Out Burger than do girls.”Data Mining Examples in Enterprises• US Government − FBI – track down criminals (SD Police also) − Treasury Dept – suspicious int’l funds transfer• Phone companies• Supermarkets & Superstores (Vons, Albertsons, Wal-Mart, Costco)• Mail-Order, On-Line Order (L.L. Bean, Victoria’s Secret, Lands End)• Financial Institutions (BofA, Wells Fargo, Charles Schwab)• Insurance Companies (USAA, Allstate, State Farm)• Tons of others… 19
  20. 20. What Does All of This Mean?• Nothing is free, however, and the benefits do come with a cost.• The value of a data warehouse and subsequent data mining is a result of the new and changed business processes it enables – competitive advantage also.• There are limitations, though - A Data Warehouse cannot correct problems with its data, although it may help to more clearly identify them.Data Mining…Easy?• Marketing literature makes it look easy!!! − Just apply automated algorithms created by great minds, such as: • Neural networks • Decision trees • Genetic algorithms• “Poof”…magic happens!!!• Not So…Data Mining is an iterative, learning process• DM takes conscientious, long-term hard work and commitment• DM’s Reward: Success transforms a company from being reactive to being proactive 20
  21. 21. Data Mining's Virtuous Cycle: 1. Identify the business opportunity; 2. Mine the data to transform it into actionable information; 3. Act on the information; 4. Measure the results.
     Why have a Methodology? A DM methodology that includes DM best practices helps to avoid: learning things that are not true, and learning things that are true but not useful. Learning things that are not true is the more dangerous of the two. Why is that? ...
  22. 22. Learning Things that are not True• Patterns may not represent any underlying rule• Sample may not reflect its parent population, hence bias• Data may be at the wrong level of detail (granularity; aggregation) Examples?Learning Things that are True, but notUseful • Learning things that are already known Examples? • Learning things that cannot be used Examples? 22
  23. 23. Profiling and Prediction• Profiling − describes what is in the data − Demographic variables − Inability to distinguish cause and effect (eg. Beer drinkers and males) − Focus is on the past to explain it (timing = past)• Prediction − Finding patterns in data from prior period(s) that are capable of explaining or anticipating outcomes in a later period (timing = future) − Predictive models require separation in time between the model inputs and output.Data Mining Uses Data from the Pastto Effect Future Action “Those who do not remember the past are condemned to repeat it.” – George Santayana Analyze available data (from the past) Discover patterns, facts, and associations Apply this knowledge to future actions 23
  24. 24. Examples: Prediction uses data from the past to make predictions about future events ("likelihoods" and "probabilities"). Profiling characterizes past events and assumes that the future is similar to the past ("similarities"). Description and visualization find patterns in past data and assume that the future is similar to the past.
     Models: a model is an explanation or description of how something works that reflects reality well enough that it can be used to make inferences about the real world. We use models every day... Examples? DM builds models from a body of data called the model set; the new data the model is later applied to is called the score set. The model set includes: a training set, used to build the DM models; a validation set, used to choose the best DM model; and a test set, used to determine how the chosen model performs.
  25. 25. More about the Model and Score Sets: the model set can be partitioned into three subsets — the model is trained on pre-classified data called the training set; it is refined, in order to prevent memorization, using the test set; and the performance of competing models is compared on a third subset called the evaluation or validation set. The model is then applied to the score set to predict the (unknown) future.
     We Want a Stable Model: a stable model works (nearly) as well on unseen data as on the data used to build it. Stability is more important than raw performance for most applications — we want a car that performs well on real roads, not just on test tracks. Stability is a constant challenge.
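The three-way partition of the model set can be sketched as below. The 60/20/20 proportions and the function name are assumptions for illustration — the slides do not fix exact split sizes.

```python
import random

def partition_model_set(records, seed=0):
    """Split pre-classified data into training / validation / test subsets.
    Uses a common 60/20/20 split; proportions are an assumption here."""
    rng = random.Random(seed)     # fixed seed so the split is reproducible
    shuffled = records[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    train = shuffled[: int(n * 0.6)]
    validation = shuffled[int(n * 0.6): int(n * 0.8)]
    test = shuffled[int(n * 0.8):]
    return train, validation, test

train, validation, test = partition_model_set(list(range(100)))
print(len(train), len(validation), len(test))  # 60 20 20
```

Shuffling before splitting matters: if the records are ordered (say, by date or region), a straight slice would give the model a biased sample — exactly the "sample may not reflect its parent population" trap mentioned earlier.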
  26. 26. Is the Past Relevant? Does past data contain the important business drivers? e.g., demographic data Is the business environment from the past relevant to the future? in the ecommerce era, what we know about the past may not be relevant to tomorrow users of the web have changed since late 1990s Are the data mining models created from past data relevant to the future? have critical assumptions changed?Data Mining is about Creating Models A model takes a number of inputs, which often come from databases, and it produces one or more outputs Sometimes, the purpose is to build the best model The best model yields the most accurate output Such a model may be viewed as a black box Sometimes, the purpose is to better understand what is happening This model is more like a gray box 26
  27. 27. Models: [timeline diagram] in the past, the data ends; in the present, models are built (using data from the past, where outcomes are already known) and applied (scored); in the future, actions take place, where outcomes are not yet known.
     Often, the purpose is to assign a score to each customer. Scores are assigned to rows using models; some scores may be the same; the scores may represent the probability of some outcome.
     #  ID    Name   State  Score
     1  0102  Will   MA     0.314
     2  0104  Sue    NY     0.159
     3  0105  John   AZ     0.265
     4  0110  Lori   AZ     0.358
     5  0111  Beth   NM     0.979
     6  0112  Pat    WY     0.328
     7  0116  David  ID     0.446
     8  0117  Frank  MS     0.897
     9  0118  Ethel  NE     0.446
  28. 28. Common Examples of What a Score Could Mean: likelihood to respond to an offer; which product to offer next; estimate of customer lifetime; likelihood of voluntary churn; likelihood of forced churn; which segment a customer belongs to; similarity to some customer profile; which channel is the best way to reach the customer.
     The scores provide a ranking of the customers — sorting the table by score, descending:
     ID    Name   State  Score
     0111  Beth   NM     0.979
     0117  Frank  MS     0.897
     0116  David  ID     0.446
     0118  Ethel  NE     0.446
     0110  Lori   AZ     0.358
     0112  Pat    WY     0.328
     0102  Will   MA     0.314
     0105  John   AZ     0.265
     0104  Sue    NY     0.159
  29. 29. This ranking gives rise to quantiles (terciles, quintiles, deciles, etc.):
     high   — Beth 0.979, Frank 0.897, David 0.446
     medium — Ethel 0.446, Lori 0.358, Pat 0.328
     low    — Will 0.314, John 0.265, Sue 0.159
     Data mining prefers "customer signatures": often, the data come from many different sources, and relational database technology allows us to construct a customer signature from these multiple sources. The customer signature includes all the columns that describe a particular customer — the primary key is a customer id, the target columns contain the data we want to know more about (e.g., predict), and the other columns are input columns.
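The sort-then-cut step behind the tercile grouping above can be sketched directly from the nine scores in the table (the `terciles` helper is an illustrative name, not from the slides):

```python
# Assign quantiles (here terciles) by ranking customers on model score.
scores = {"Will": 0.314, "Sue": 0.159, "John": 0.265, "Lori": 0.358,
          "Beth": 0.979, "Pat": 0.328, "David": 0.446, "Frank": 0.897,
          "Ethel": 0.446}

def terciles(score_by_name):
    """Rank names by descending score and cut the ranking into thirds."""
    ranked = sorted(score_by_name, key=score_by_name.get, reverse=True)
    k = len(ranked) // 3
    return {"high": ranked[:k], "medium": ranked[k:2 * k], "low": ranked[2 * k:]}

groups = terciles(scores)
print(groups["high"])  # ['Beth', 'Frank', 'David']
```

Note the tie: David and Ethel both score 0.446, and the cut between "high" and "medium" falls between them — a reminder that quantile boundaries on tied scores are somewhat arbitrary.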
  30. 30. Stability Challenge: Memorizing the Training Set — [figure: error rate vs. model complexity on the training data] decision trees and neural networks can memorize nearly any pattern in the training set.
     Danger: Overfitting — [figure: error rate vs. model complexity on training and test data; the model we want sits where the test error bottoms out] once the model has overfit the training data, performance on test data deteriorates as model complexity grows.
  31. 31. Experiment to Find the Best Model for Your Data: try different modeling techniques; try oversampling at different rates; tweak the parameters; add derived variables; and remember to focus on the business problem.
     It is often worthwhile to combine the results from multiple models:
     ID    Name   State  Mod 1  Mod 2  Mod 3
     0102  Will   MA     0.111  0.314  0.925
     0104  Sue    NY     0.121  0.159  0.491
     0105  John   AZ     0.133  0.265  0.211
     0110  Lori   AZ     0.146  0.358  0.692
     0111  Beth   NM     0.411  0.979  0.893
     0112  Pat    WY     0.510  0.323  0.615
     0116  David  ID     0.105  0.879  0.298
     0117  Frank  MS     0.116  0.502  0.419
     0118  Ethel  NE     0.152  0.446  0.611
  32. 32. Multiple-Model Voting: multiple models are built using the same input data; then a vote — often a simple majority or plurality-rules vote — is used for the final classification. This requires that the models be compatible; it tends to be robust and can return better results.
     Segmented Input Models: segment the input data (by customer segment, by recency) and build a separate model for each segment. This requires that model results be compatible, and it allows different models to focus and to use richer data.
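Multiple-model voting can be sketched in a few lines. The three "models" here are stand-in rules, not trained classifiers, and the field names are invented for the example — the point is only the plurality vote over compatible class labels.

```python
from collections import Counter

def vote(models, record):
    """Final classification by plurality vote over the models' ballots."""
    ballots = [model(record) for model in models]
    return Counter(ballots).most_common(1)[0][0]

# Three toy "models" that each classify the same record.
models = [
    lambda r: "respond" if r["income"] > 50_000 else "ignore",
    lambda r: "respond" if r["age"] < 40 else "ignore",
    lambda r: "respond" if r["visits"] > 3 else "ignore",
]

customer = {"income": 60_000, "age": 45, "visits": 5}
print(vote(models, customer))  # two of three models say "respond"
```

The compatibility requirement from the slide shows up here concretely: the vote only makes sense because every model emits labels from the same set ("respond"/"ignore").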
  33. 33. Combining Models: what is the response to a mailing from a non-profit raising money (1998 data set)? Exploring the data revealed that the more often someone gives, the less money they contribute each time — so the best customers are not always the most frequent. Thus, two models were developed: who will respond? and how much will they give?
     Compatible Model Results: in general, the score refers to a probability — for decision trees, the score may be the actual density of a leaf node; for a neural network, the score may be interpreted as the probability of an outcome. However, the probability depends on the density of the model set, and the density of the model set depends on the oversampling rate.
  34. 34. 3. Data Mining Techniques in Marketing and Customer Relationship Management
     Relationship Marketing: relationship marketing is a process — communicating with your customers and listening to their responses. Companies take actions: marketing campaigns, new products, new channels, new packaging.
  35. 35. Relationship Marketing Customers and prospects respond most common response is no response This results in a cycle data is generated opportunities to learn from the data and improve the process emerge CRM is Revolutionary Grocery stores have been in the business of stocking shelves Banks have been in the business of managing the spread between money borrowed and money lent Insurance companies have been in the business of managing loss ratios Telecoms have been in the business of completing telephone calls Key point: More companies are beginning to view customers as their primary asset 35
  36. 36. Why Now? [chart: representative growth in a maturing market — number of customers (1000s) vs. year, with curves for total customers, new customers, and churners] In the region of rapid growth, building infrastructure is more important than CRM; as growth flattens out, exploiting existing customers becomes more important.
     CRM Requires Learning and More: form a learning relationship with your customers — notice their needs (on-line transaction processing systems), remember their preferences (decision-support data warehouse), learn how to serve them better (data mining), and act to make customers more profitable.
  37. 37. The Importance of Channels Channels are the way a company interfaces with its customers Examples Direct mail Email Banner ads Telemarketing Billing inserts Customer service centers Messages on receipts Key data about customers come from channelsChannels Channels are the source of data Channels are the interface to customers Channels enable a company to get a particular message to a particular customer Channel management is a challenge in organizations CRM is about serving customers through all channels 37
  38. 38. Where Does Data Mining Fit In? Hindsight — analysis and reporting (OLAP); insight — statistical modeling; foresight — data mining.
     Holding on to Good Customers: data mining was used to help a major cellular company figure out who is at risk of attrition, and why they are at risk. They built predictive models to generate call lists for telemarketing; the result was a better focused, more effective retention campaign.
  39. 39. Weeding out Bad Customers Default and personal bankruptcy cost lenders millions of dollars Figuring out who are your worst customers can be just as important as figuring out who are your best customers many businesses lose money on most of their customersThey Sometimes get Their Man The FBI handles numerous, complex cases such as the Unabomber case Leads come in from all over the country The FBI and other law enforcement agencies sift through thousands of reports from field agents looking for some connection Data mining plays a key role in FBI forensics 39
  40. 40. Anticipating Customer Needs Clustering is an undirected data mining technique that finds groups of similar items Based on previous purchase patterns, customers are placed into groups Customers in each group are assumed to have an affinity for the same types of products New product recommendations can be generated automatically based on new purchases made by the group This is sometimes called collaborative filteringCRM Focuses on the Customer The enterprise has a unified view of each customer across all business units and across all channels This is a major systems integration task The customer has a unified view of the enterprise for all products and regardless of channel This requires harmonizing all the channels 40
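The group-based recommendation idea on this slide can be sketched as below. This is a deliberately minimal stand-in: the fixed `groups` dictionary replaces real clustering of purchase patterns, and all names are invented for the example.

```python
# Minimal collaborative-filtering sketch: place a customer in the group
# whose purchases overlap theirs most, then recommend products the group
# buys that the customer does not yet own.
groups = {
    "gardeners": {"seeds", "gloves", "hose", "trowel"},
    "gamers": {"console", "headset", "controller"},
}

def recommend(purchases, groups):
    """Pick the group with the largest purchase overlap; suggest the rest."""
    best = max(groups, key=lambda g: len(groups[g] & purchases))
    return sorted(groups[best] - purchases)

print(recommend({"seeds", "gloves"}, groups))  # ['hose', 'trowel']
```

Real systems cluster customers from data (an undirected technique, as the slide says) rather than hand-writing groups, but the recommendation step — group affinity minus what the customer already owns — is the same.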
  41. 41. A Continuum of Customer Relationships Large accounts have sales managers and account teams E.g., Coca-Cola, Disney, and McDonalds CRM tends to focus on the smaller customer --the consumer But, small businesses are also good candidates for CRMWhat is a Customer A transaction? An account? An individual? A household? The customer as a transaction purchases made with cash are anonymous most Web surfing is anonymous we, therefore, know little about the consumer 41
  42. 42. A Customer is an Account More often, a customer is an account Retail banking checking account, mortgage, auto loan, … Telecommunications long distance, local, ISP, mobile, … Insurance auto policy, homeowners, life insurance, … Utilities The account-level view of a customer also misses the boat since each customer can have multiple accountsCustomers Play Different Roles Parents buy back-to-school clothes for teenage children children decide what to purchase parents pay for the clothes parents “own” the transaction Parents give college-age children cellular phones or credit cards parents may make the purchase decision children use the product It is not always easy to identify the customer 42
  43. 43. The Customer’s Lifecycle Childhood birth, school, graduation, … Young Adulthood choose career, move away from parents, … Family Life marriage, buy house, children, divorce, … Retirement sell home, travel, hobbies, … Much marketing effort is directed at each stage of lifeThe Customer’s Lifecycle is Unpredictable It is difficult to identify the appropriate events graduation, retirement may be easy marriage, parenthood are not so easy many events are “one-time” Companies miss or lose track of valuable information a man moves a woman gets married, changes her last name, and merges her accounts with spouse It is hard to track your customers so closely, but, to the extent that you can, many marketing opportunities arise 43
  44. 44. Customers Evolve Over Time: customers begin as prospects; prospects indicate interest (fill out credit card applications, apply for insurance, visit your website); they become new customers; after repeated purchases or usage, they become established customers; eventually, they become former customers, either voluntarily or involuntarily.
     Business processes organize around the customer lifecycle: [diagram] Prospect → (acquisition) New Customer → (activation) Established Customer — managed through relationship management as high value, high potential, or low value → Former Customer, via voluntary churn or forced churn, with winback campaigns targeting former customers.
  45. 45. Different Events Occur Throughout the Lifecycle Prospects receive marketing messages When they respond, they become new customers They make initial purchases They become established customers and are targeted by cross-sell and up-sell campaigns Some customers are forced to leave (cancel) Some leave (cancel) voluntarily Others simply stop using the product (e.g., credit card) Winback/collection campaignsDifferent Models are Appropriate at Different Stages Prospect acquisition Prospect product propensity Best next offer Forced churn Voluntary churn Bottom line: We use data mining to predict certain events during the customer lifecycle 45
  46. 46. No Substitute for Human Intelligence Data mining is a tool to achieve goals The goal is better service to customers Only people know what to predict Only people can make sense of rules Only people can make sense of visualizations Only people know what is reasonable, legal, tasteful Human decision makers are critical to the data mining processA Long, Long Time Ago There was no marketing There were few manufactured goods Distribution systems were slow and uncertain There was no credit Most people made what they needed at home There were no cell phones There was no data mining It was sufficient to build a quality product and get it to market 46
  47. 47. Then and Now Before supermarkets, a typical grocery store carried 800 different items A typical grocery store today carries tens of thousands of different items There is intense competition for shelf space and premium shelf space In general, there has been an explosion in the number of products in the last 50 years Now, we need to anticipate and create demand (e.g., e-commerce) This is what marketing is all aboutEffective Marketing Presupposes High quality goods and services Effective distribution of goods and services Adequate customer service Marketing promises are kept Competition direct (same product) “wallet-share” Ability to interact directly with customers 47
  48. 48. How Data Mining Helps in MarketingCampaigns Improves profit by limiting campaign to most likely responders Reduces costs by excluding individuals least likely to respond AARP mails an invitation to those who turn 50 they excluded the bottom 10% of their list response rate did not sufferHow Data Mining Helps in Marketing Campaigns Predicts response rates to help staff call centers, with inventory control, etc. Identifies most important channel for each customer Discovers patterns in customer data 48
  49. 49. Marketing Campaigns and CRM The simplest approach is to optimize the budget using the rankings that models produce Campaign optimization determines the most profitable subset of customers for a given campaign, but it is sensitive to assumptions Customer optimization is more sophisticated It chooses the most profitable campaign for each customer 99Prospecting• Prospect − Noun – someone/something with possibilities − Verb – to explore• > 7B people worldwide − Relatively few are prospects for a company − Exclusion based on geography, age, ability to pay, need for product/service, etc.• Data mining can help in prospecting: − Identifying good prospects − Choosing appropriate communication channels − Picking suitable messages 49
  50. 50. Data Mining & Advertising: who fits the profile for this nationwide publication? A naïve first attempt scores Mike and Nancy by taking the readership percentage (YES) for each attribute they have and its complement (NO) for each attribute they lack:
                 Readership  YES score  NO score  Mike         Nancy
     BS or >     58%         0.58       0.42      Yes → 0.58   No → 0.42
     Prof/Exec   46%         0.46       0.54      Yes → 0.46   No → 0.54
     $ > $75k    21%         0.21       0.79      Yes → 0.21   No → 0.79
     $ > $100k   7%          0.07       0.93      No  → 0.93   No → 0.93
     Total                                        2.18         2.68
     But that might be a bit naïve; instead, compare readership to the US population, then score Mike and Nancy with indexes (index = readership share / US-population share):
                 YES readership  US pop  Index   NO readership  US pop  Index
     BS or >     58%             20.3%   2.86*   42%            79.7%   0.53**
     Prof/Exec   46%             19.2%   2.40    54%            80.8%   0.67
     $ > $75k    21%             9.5%    2.21    79%            90.5%   0.87
     $ > $100k   7%              2.4%    2.92    93%            97.6%   0.95
     * 58% / 20.3%    ** 42% / 79.7%
     Mike's score: 8.42 (2.86 + 2.40 + 2.21 + 0.95); Nancy's score: 3.02 (0.53 + 0.67 + 0.87 + 0.95)
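The indexed scoring can be reproduced in a few lines: each index is the readership share divided by the US-population share, and a person's score sums the indexes of the YES or NO rows that match them. The helper name and data layout are illustrative.

```python
# (yes_readership, yes_us_pop, no_readership, no_us_pop) per attribute,
# taken from the indexed table above.
rows = {
    "BS or >":   (0.58, 0.203, 0.42, 0.797),
    "Prof/Exec": (0.46, 0.192, 0.54, 0.808),
    "$ > $75k":  (0.21, 0.095, 0.79, 0.905),
    "$ > $100k": (0.07, 0.024, 0.93, 0.976),
}

def profile_score(answers):
    """Sum readership/population indexes over a person's yes/no answers."""
    total = 0.0
    for attr, yes in answers.items():
        y_read, y_pop, n_read, n_pop = rows[attr]
        total += (y_read / y_pop) if yes else (n_read / n_pop)
    return round(total, 2)

mike = {"BS or >": True, "Prof/Exec": True, "$ > $75k": True, "$ > $100k": False}
nancy = {attr: False for attr in rows}
print(profile_score(mike), profile_score(nancy))  # 8.42 3.02
```

Computing from the raw shares rather than the pre-rounded 2.86/2.40/... values still lands on the slide's totals of 8.42 and 3.02.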
  51. 51. TIP: when comparing customer profiles (Mike and Nancy), it is important to keep in mind the profile of the population as a whole. For this reason, using indexes (the second table) is often better than using raw values (the first table).
     A Catalog Response Model: [chart: cumulative response curve for a decision tree model of response to a catalog mailing, plotted against the no-model diagonal] Using the model, mailing the top 30% of the list reaches 65% of the likely responders instead of just 30% — a lift of 65% / 30% = 2.17.
  52. 52. A Profitable Mailing — assumptions: overall population response rate of 1%; $45 net revenue per responder; $1 cost per item mailed; $20,000 overhead:
     DECILE  GAINS   CUM   LIFT   SIZE       SIZE(YES)  SIZE(NO)  PROFIT
     0%      0.00%   0%    -      0          -          -         ($20,000)
     10%     30.00%  30%   3.000  100,000    3,000      97,000    $15,000
     20%     20.00%  50%   2.500  200,000    5,000      195,000   $5,000
     30%     15.00%  65%   2.167  300,000    6,500      293,500   ($27,500)
     40%     13.00%  78%   1.950  400,000    7,800      392,200   ($69,000)
     50%     7.00%   85%   1.700  500,000    8,500      491,500   ($137,500)
     60%     5.00%   90%   1.500  600,000    9,000      591,000   ($215,000)
     70%     4.00%   94%   1.343  700,000    9,400      690,600   ($297,000)
     80%     4.00%   98%   1.225  800,000    9,800      790,200   ($379,000)
     90%     2.00%   100%  1.111  900,000    10,000     890,000   ($470,000)
     100%    0.00%   100%  1.000  1,000,000  10,000     990,000   ($570,000)
     Marketing Campaign: the goal is to change behavior (to help drive revenue). How do we know if we did? Control group — randomly receives the mailing; test group — selected by the model to receive the mailing; holdout group — selected by the model but does not receive the mailing; then compare the responses of the groups.
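The PROFIT column follows from the stated assumptions: cumulative responders times net revenue, minus $1 per piece mailed, minus the $20,000 overhead. A small sketch (function name is illustrative; $45 net revenue per responder is the figure the PROFIT column is consistent with):

```python
# Profit of mailing the top portion of a ranked list of 1,000,000 people
# with a 1% overall response rate (10,000 responders in total).
NET_REVENUE, COST_PER_PIECE, OVERHEAD = 45, 1, 20_000
TOTAL, RESPONDERS = 1_000_000, 10_000

def mailing_profit(cum_response_share, mailed_share):
    """Profit when mailing `mailed_share` of the list captures
    `cum_response_share` of all responders (per the gains table)."""
    mailed = round(TOTAL * mailed_share)
    responders = round(RESPONDERS * cum_response_share)
    return responders * NET_REVENUE - mailed * COST_PER_PIECE - OVERHEAD

print(mailing_profit(0.30, 0.10))  # 15000: mail the top decile only
print(mailing_profit(0.50, 0.20))  # 5000: top two deciles
print(mailing_profit(1.00, 1.00))  # -570000: mail everyone
```

This makes the table's message concrete: with this model, only the top two deciles are profitable to mail, and mailing the whole list loses $570,000.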
  53. 53. Differential Response Analysis: how do we know whether the responders actually responded because of our campaign, or would have responded anyway? Answer: differential response analysis (DRA). DRA starts with control and treated groups — the control group gets no "mailing", the treated group receives the "mailing" — then compares the results to see if there is any "uplift":
              Control Group      Treated Group
              Young    Old       Young    Old
     Women    0.8%     0.4%      4.1%     4.6%
     Men      2.8%     3.3%      6.2%     5.2%
     DM "meets" CRM: matching campaigns to customers; segmenting the customer base; reducing exposure to credit risk; determining customer value; cross-selling and up-selling; retention and churn ([in]voluntary attrition); different kinds of churn models — predicting who will leave, and predicting how long one will stay.
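The uplift computation behind DRA is simply the treated-group response rate minus the control-group response rate for each segment, using the rates from the table above (segment keys are an illustrative encoding):

```python
# Response rates per segment, from the DRA table.
control = {("women", "young"): 0.008, ("women", "old"): 0.004,
           ("men", "young"): 0.028, ("men", "old"): 0.033}
treated = {("women", "young"): 0.041, ("women", "old"): 0.046,
           ("men", "young"): 0.062, ("men", "old"): 0.052}

# Uplift = treated rate - control rate, per segment.
uplift = {seg: round(treated[seg] - control[seg], 3) for seg in control}
best = max(uplift, key=uplift.get)

print(uplift[("women", "old")])  # 0.042
print(best)                      # ('women', 'old')
```

Note how the raw rates mislead: men respond at the highest absolute rates in both groups, but the campaign moved older women the most (+4.2 points), while older men — already likely to respond anyway — show the smallest uplift (+1.9 points).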
  54. 54. Customer Relationship Management: A Databased Approach. Instructor's Presentation Slides, Chapter Ten: Data Mining 54
  55. 55. Yapi Kredi - Define Business Objectives• YAPI KREDI’s B-type mutual funds, characterized by – Being low risk investment instruments based on fixed income securities – Easily purchased via the ATM, Web, and Telephone channels• Offer to two customer groups: – Customers already having invested into B-type mutual funds to stimulate an increase of the assets – Customers not yet owning any B-type fund to help increase product ratio and attract new moneyYapi Kredi-Define business objectives (contd. )• Communication channels: two-channel approach• Campaign sizing: Contact 3000 customers by branch based out-bound calls and active marketing during customer branch visits• Campaign: Two-step – Customers were first contacted with the B-type mutual fund offer – Positive responders received a follow up call if they had not purchased until one week after their initial positive response• Evaluation of results: Based on response and purchase rates by contact channel (branch or call center) 55
  56. 56. Yapi Kredi- Get Raw Data & Identify Relevant Variables • Get Raw Data: – Data mart with data extracted from more than 50 source system tables – About 20 database tables were produced with 30 Giga Bytes of disk space for the initial project phase • Identify Relevant Variables - customer attributes describing: – Demographics – Product Ownership – Product Usage – Channel usage – Assets – Liabilities – Profitability Yapi Kredi - Gain Customer Insight • Based on six months of historical customer data, five different predictive models were developed • Best model: logistic regression – Yielding a lift value of 2.9 and a cumulative response rate of 14 % for the top customer percentile – Reaches 2.9 times more responders for the top customer percentile than a random selection of the same size – A set of 4200 customers with the highest propensity to purchase was selected as the target group for the pilot campaign 56
  57. 57. Yapi Kredi - Act• A subset of 3000 customers was assigned to the 16 branches holding the responsibility for the respective relationships• The remaining 1200 customers were assigned to the call center• The target list with the corresponding channel assignment was made available to the campaign management system Yapi Kredi - Result• Result: – Impressive response rates of 6.5% and 12.2% were obtained with the branch based part of the campaign and the call center based part of the campaign respectively – The pilot campaign acquired more than € 1 million into B-type mutual funds 57
  58. 58. Summary
• Data Mining can assist in selecting the right target customers or in identifying previously unknown customers with similar behavior and needs
• A good target list is likely to increase purchase rates and have a positive impact on revenue
• In the context of CRM, the individual customer is often the central object analyzed by means of data mining methods
• A complete data mining process comprises assessing and specifying the business objectives, data sourcing, transformation and creation of analytical variables, building analytical models using techniques such as logistic regression and neural networks, scoring customers, and obtaining feedback from the field
• Learning and refining the data mining process is the key to success

5. CRISP-DM as a Data Mining Methodology (CRoss Industry Standard Process for Data Mining) 58
  59. 59. Why Should There be a Standard Process? The data mining process must be reliable and repeatable by people with little data mining background.Why Should There be a Standard Process?• Framework for recording experience − Allows projects to be replicated• Aid to project planning and management• “Comfort factor” for new adopters − Demonstrates maturity of Data Mining − Reduces dependency on “stars” 59
  60. 60. Process Standardization
• Initiative launched in late 1996 by three "veterans" of the data mining market: Daimler Chrysler (then Daimler-Benz), SPSS (then ISL), NCR
• Developed and refined through a series of workshops (1997-1999)
• Over 300 organizations contributed to the process model
• Published CRISP-DM 1.0 (1999)
• Over 200 members of the CRISP-DM SIG worldwide
 - DM Vendors: SPSS, NCR, IBM, SAS, SGI, Data Distilleries, Syllogic, etc.
 - System Suppliers / consultants: Cap Gemini, ICL Retail, Deloitte & Touche, etc.
 - End Users: BT, ABB, Lloyds Bank, AirTouch, Experian, etc.
www.crisp-dm.org

CRISP-DM
• Non-proprietary
• Application/Industry neutral
• Tool neutral
• Focus on business issues
 − As well as technical analysis
• Framework for guidance
• Experience base
 − Templates for Analysis 60
  61. 61. CRISP-DM: Overview • Data Mining methodology • Process Model • For anyone • Provides a complete blueprint • Life cycle: 6 phasesCRISP-DM: Phases• Business Understanding Project objectives and requirements understanding, Data mining problem definition• Data Understanding Initial data collection and familiarization, Data quality problems identification• Data Preparation Table, record and attribute selection, Data transformation and cleaning• Modeling Modeling techniques selection and application, Parameters calibration• Evaluation Business objectives & issues achievement evaluation• Deployment Result model deployment, Repeatable data mining process implementation 61
  62. 62. Phases and Tasks
• Business Understanding: Determine Business Objectives; Assess Situation; Determine Data Mining Goals; Produce Project Plan
• Data Understanding: Collect Initial Data; Describe Data; Explore Data; Verify Data Quality
• Data Preparation: Select Data; Clean Data; Construct Data; Integrate Data; Format Data
• Modeling: Select Modeling Technique; Generate Test Design; Build Model; Assess Model
• Evaluation: Evaluate Results; Review Process; Determine Next Steps
• Deployment: Plan Deployment; Plan Monitoring & Maintenance; Produce Final Report; Review Project

Phase 1. Business Understanding
• Statement of Business Objective
• Statement of Data Mining Objective
• Statement of Success Criteria
Focuses on understanding the project objectives and requirements from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives 62
  63. 63. Phase 1. Business Understanding• Determine business objectives- thoroughly understand, from a business perspective, what the client really wants to accomplish- uncover important factors, at the beginning, that can influence the outcome of the project- neglecting this step is to expend a great deal of effort producing the right answers to the wrong questions• Assess situation- more detailed fact-finding about all of the resources, constraints, assumptions and other factors that should be considered- flesh out the detailsPhase 1. Business Understanding• Determine data mining goals- a business goal states objectives in business terminology- a data mining goal states project objectives in technical terms ex) the business goal: “Increase catalog sales to existing customers.” a data mining goal: “Predict how many widgets a customer will buy, given their purchases over the past three years, demographic information (age, salary, city) and the price of the item.”• Produce project plan- describe the intended plan for achieving the data mining goals and the business goals- the plan should specify the anticipated set of steps to be performed during the rest of the project including an initial selection of tools and techniques 63
  64. 64. Phase 2. Data Understanding • Explore the Data • Verify the Quality • Find Outliers Starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data or to detect interesting subsets to form hypotheses for hidden information.Phase 2. Data Understanding• Collect initial data- acquire within the project the data listed in the project resources- includes data loading if necessary for data understanding- possibly leads to initial data preparation steps- if acquiring multiple data sources, integration is an additional issue, either here or in the later data preparation phase• Describe data- examine the “gross” or “surface” properties of the acquired data- report on the results 64
  65. 65. Phase 2. Data Understanding• Explore data - tackles the data mining questions, which can be addressed using querying, visualization and reporting including: distribution of key attributes, results of simple aggregations relations between pairs or small numbers of attributes properties of significant sub-populations, simple statistical analyses - may address directly the data mining goals - may contribute to or refine the data description and quality reports - may feed into the transformation and other data preparation needed• Verify data quality - examine the quality of the data, addressing questions such as: “Is the data complete?”, Are there missing values in the data?”Phase 3. Data Preparation•Takes usually over 90% of the time - Collection - Assessment - Consolidation and Cleaning - Data selection - Transformations Covers all activities to construct the final dataset from the initial raw data. Data preparation tasks are likely to be performed multiple times and not in any prescribed order. Tasks include table, record and attribute selection as well as transformation and cleaning of data for modeling tools. 65
  66. 66. Phase 3. Data Preparation• Select data- decide on the data to be used for analysis- criteria include relevance to the data mining goals, quality and technical constraints such as limits on data volume or data types- covers selection of attributes as well as selection of records in a table• Clean data- raise the data quality to the level required by the selected analysis techniques- may involve selection of clean subsets of the data, the insertion of suitable defaults or more ambitious techniques such as the estimation of missing data by modelingPhase 3. Data Preparation• Construct data- constructive data preparation operations such as the production of derived attributes, entire new records or transformed values for existing attributes• Integrate data - methods whereby information is combined from multiple tables or records to create new records or values• Format data- formatting transformations refer to primarily syntactic modifications made to the data that do not change its meaning, but might be required by the modeling tool 66
  67. 67. Phase 4. Modeling• Select the modeling technique (based upon the data mining objective)• Build model (Parameter settings)• Assess model (rank the models) Various modeling techniques are selected and applied and their parameters are calibrated to optimal values. Some techniques have specific requirements on the form of data. Therefore, stepping back to the data preparation phase is often necessary.Phase 4. Modeling• Select modeling technique- select the actual modeling technique that is to be used ex) decision tree, neural network- if multiple techniques are applied, perform this task for each techniques separately• Generate test design- before actually building a model, generate a procedure or mechanism to test the model’s quality and validity ex) In classification, it is common to use error rates as quality measures for data mining models. Therefore, typically separate the dataset into train and test set, build the model on the train set and estimate its quality on the separate test set 67
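The test design described above (hold out part of the data, build on the train set, estimate error on the test set) can be sketched in a few lines. The dataset and the "model" here are purely hypothetical, a trivial majority-class predictor, just to show the mechanics of estimating an error rate on held-out data:

```python
import random
from collections import Counter

# Toy labeled dataset: (feature vector, class label). Hypothetical values.
data = [((i, i % 3), "yes" if i % 4 == 0 else "no") for i in range(100)]

# Split into train and test sets, as the CRISP-DM test design suggests.
random.seed(42)
random.shuffle(data)
split = int(len(data) * 0.7)
train, test = data[:split], data[split:]

# "Model": always predict the majority class seen in the training data.
majority = Counter(label for _, label in train).most_common(1)[0][0]

# Error rate = fraction of test records the model gets wrong,
# estimated on data the model never saw during "training".
errors = sum(1 for _, label in test if label != majority)
error_rate = errors / len(test)
print(majority, round(error_rate, 2))
```

A real project would substitute a decision tree, neural network, etc. for the majority-class rule, but the split/build/score-on-holdout skeleton stays the same.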
  68. 68. Phase 4. Modeling• Build model- run the modeling tool on the prepared dataset to create one or more models• Assess model- interprets the models according to his domain knowledge, the data mining success criteria and the desired test design- judges the success of the application of modeling and discovery techniques more technically- contacts business analysts and domain experts later in order to discuss the data mining results in the business context- only consider models whereas the evaluation phase also takes into account all other results that were produced in the course of the projectPhase 5. Evaluation• Evaluation of model - how well it performed on test data• Methods and criteria - depend on model type• Interpretation of model - important or not, easy or hard depends on algorithm Thoroughly evaluate the model and review the steps executed to construct the model to be certain it properly achieves the business objectives. A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached 68
  69. 69. Phase 5. Evaluation• Evaluate results - assesses the degree to which the model meets the business objectives - seeks to determine if there is some business reason why this model is deficient - test the model(s) on test applications in the real application if time and budget constraints permit - also assesses other data mining results generated - unveil additional challenges, information or hints for future directionsPhase 5. Evaluation• Review process- do a more thorough review of the data mining engagement in order to determine if there is any important factor or task that has somehow been overlooked- review the quality assurance issues ex) “Did we correctly build the model?”• Determine next steps- decides how to proceed at this stage- decides whether to finish the project and move on to deployment if appropriate or whether to initiate further iterations or set up new data mining projects- include analyses of remaining resources and budget that influences the decisions 69
  70. 70. Phase 6. Deployment• Determine how the results need to be utilized• Who needs to use them?• How often do they need to be used• Deploy Data Mining results by Scoring a database, utilizing results as business rules, interactive scoring on-line The knowledge gained will need to be organized and presented in a way that the customer can use it. However, depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process across the enterprise.Phase 6. Deployment• Plan deployment- in order to deploy the data mining result(s) into the business, takes the evaluation results and concludes a strategy for deployment- document the procedure for later deployment• Plan monitoring and maintenance- important if the data mining results become part of the day-to-day business and it environment- helps to avoid unnecessarily long periods of incorrect usage of data mining results- needs a detailed on monitoring process- takes into account the specific type of deployment 70
  71. 71. Phase 6. Deployment• Produce final report- the project leader and his team write up a final report- may be only a summary of the project and its experiences- may be a final and comprehensive presentation of the data mining result(s)• Review project- assess what went right and what went wrong, what was done well and what needs to be improvedSummary• Why CRISP-DM? The data mining process must be reliable and repeatable by people with little data mining skills CRISP-DM provides a uniform framework for - guidelines - experience documentation CRISP-DM is flexible to account for differences - Different business/agency problems - Different data 71
  72. 72. 6. Oracle Data MinerOracle Data Mining 11gR2 Copyright 2010 Oracle Corporation 72
  73. 73. Traditional Analytics (SAS) Environment
Source Data (Oracle, DB2, SQL Server, TeraData, etc.) → SAS Work Area (SAS Datasets) → SAS Processing (Statistical functions / Data mining) → Process Output (SAS Work Area) → Target (e.g. Oracle, Ext. Tables)
Hours, Days or Weeks
• SAS environment requires:
 • Data movement
 • Data duplication
 • Loss of security
Copyright 2010 Oracle Corporation

Traditional Analytics (SAS) Environment
Secs, Mins or Hours
• Oracle environment:
 • Eliminates data movement
 • Eliminates data duplication
 • Preserves security
Copyright 2010 Oracle Corporation 73
  74. 74. In-Database Data Mining
Traditional Analytics (Hours, Days or Weeks): Data Extraction → Data Import → Data Preparation → Data Prep & Transformation → Data Mining Model Building → Model "Scoring"
Oracle Data Mining (Secs, Mins or Hours): Data remains in the Database → Embedded Data Prep → Model Building → Model "Scoring" → Results
• Faster time for "Data" to "Insights"
• Lower TCO—Eliminates:
 • Data Movement
 • Data Duplication
• Maintains Security
• Data remains in the Database
• Embedded data preparation
• Cutting edge machine learning algorithms inside the SQL kernel of the Database
• SQL—Most powerful language for data preparation and transformation
Copyright 2010 Oracle Corporation

Oracle Data Mining Option
Copyright 2010 Oracle Corporation 74
  75. 75. Oracle Data Mining 11g
Oracle 11g (Server): Data Warehousing, ETL, OLAP, Statistics, Data Mining
• Data Mining API Functions
 • PL/SQL
 • Java Data Mining
• Oracle Data Miner (optional GUI)
• Wide range of DM algorithms (12)
 • Anomaly detection
 • Association rules (Market Basket analysis)
 • Attribute importance
 • Classification & regression
 • Clustering
 • Feature extraction (NMF)
 • Structured & unstructured data (text mining)
• Predictive Analytics
 • "1-click/automated data mining" (EXPLAIN, PREDICT, PROFILE)
Copyright 2010 Oracle Corporation

Oracle Data Mining Algorithms
Problem | Algorithm | Applicability
Classification | Logistic Regression (GLM) | Classical statistical technique
Classification | Decision Trees | Popular / Rules / transparency
Classification | Naïve Bayes | Embedded app
Classification | Support Vector Machine | Wide / narrow data / text
Regression | Multiple Regression (GLM) | Classical statistical technique
Regression | Support Vector Machine | Wide / narrow data / text
Anomaly Detection | One Class SVM | Fraud & Intrusion Detection
Attribute Importance | Minimum Description Length (MDL) | Attribute reduction; Identify useful data; Reduce data noise
Association Rules | Apriori | Market basket analysis; Link analysis
Clustering | Hierarchical K-Means; Hierarchical O-Cluster | Product grouping; Text mining; Gene and protein analysis; Text analysis
Feature Extraction | Support Vector Machine | Feature reduction
Copyright 2010 Oracle Corporation 75
  76. 76. Oracle Data Miner 11gR1 (GUI) [ODM’r “Classic”] Copyright 2010 Oracle CorporationOracle Data Miner 11gR1 GUI Copyright 2010 Oracle Corporation 76
  77. 77. Oracle Data Miner 11gR1 GUI Oracle Data Miner guides the analyst through the data mining process Copyright 2010 Oracle CorporationOracle Data Miner 11gR1 GUI Oracle Data Mining builds a model that differentiates HI_VALUE_CUSTOMERS from others Copyright 2010 Oracle Corporation 77
  78. 78. Oracle Data Mining + OBI EETargeting High Value Customers Oracle Data Mining creates a prioritized list of customer who likely to be high value Copyright 2010 Oracle CorporationOracle Data Miner 11gR2 (GUI) [ODM’r “New”] Copyright 2010 Oracle Corporation 78
  79. 79. Copyright 2010 Oracle CorporationCopyright 2010 Oracle Corporation 79
  80. 80. Copyright 2010 Oracle CorporationCopyright 2010 Oracle Corporation 80
  81. 81. Copyright 2010 Oracle CorporationCopyright 2010 Oracle Corporation 81
  82. 82. Copyright 2010 Oracle CorporationCopyright 2010 Oracle Corporation 82
  83. 83. Copyright 2010 Oracle CorporationCopyright 2010 Oracle Corporation 83
  84. 84. Copyright 2010 Oracle CorporationCopyright 2010 Oracle Corporation 84
  85. 85. Copyright 2010 Oracle CorporationOracle Data Mining APIs (SQL & Java) Copyright 2010 Oracle Corporation 85
  86. 86. In-Database Analytics Example: Launch & Evaluate a Marketing Campaign
1. Given a previously built response model, …predict who will respond to a campaign, …and why
2. …find out how much each customer spent 3 months before and after the campaign
3. …how much for just DVDs?
4. Is the success statistically significant?

select responder, cust_region, count(*) as cnt,
       sum(post_purch - pre_purch) as tot_increase,
       avg(post_purch - pre_purch) as avg_increase,
       stats_t_test_paired(pre_purch, post_purch) as significance
from (
  select cust_id, prediction(campaign_model using *) as responder,
         sum(case when purchase_date < '15-Apr-2005'
                  then purchase_amt else 0 end) as pre_purch,
         sum(case when purchase_date >= '15-Apr-2005'
                  then purchase_amt else 0 end) as post_purch
  from customers, sales, products@PRODDB
  where sales.cust_id = customers.cust_id
    and purchase_date between '15-Jan-2005' and '14-Jul-2005'
    and sales.prod_id = products.prod_id
    and contains(prod_description, 'DVD') > 0
  group by cust_id, prediction(campaign_model using *) )
group by rollup (responder, cust_region)
order by 4 desc;
Copyright 2010 Oracle Corporation

More Interesting SQL (Missing Value Imputation Example)
Select the 10 customers who are most likely to attrite based solely on: age, gender, annual_income, and zipcode. In addition, since annual_income is often missing, perform null/missing value imputation for the annual_income attribute using all of the customer demographics.

SELECT * FROM (
  SELECT cust_name, cust_contact_info,
         rank() over (ORDER BY PREDICTION_PROBABILITY(attrition_model, 'attrite'
                        USING age, gender, zipcode,
                              NVL(annual_income,
                                  PREDICTION(estim_income USING *)) as annual_income)
                      DESC) as cust_rank
  FROM customers)
WHERE cust_rank < 11;
Copyright 2010 Oracle Corporation 86
  87. 87. Example of Embedded Predictive SQLPowers Next Generation Predictive Marketing Tools Letter personalized with embedded predictive analytics Copyright 2010 Oracle CorporationEmbedded Data PreparationAutomatically applied when scoring Attribute Expression income salary + bonus value case when revenue < 100 then ‘low’ when revenue < 500 then ‘med’ else ‘high’ end age age / 100 Copyright 2010 Oracle Corporation 87
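The expression table above maps source columns to derived model inputs: a sum, a banded categorical, and a rescaled numeric. A sketch of the same three transformations in plain Python (the column names and binning thresholds are the ones shown on the slide; the sample record is hypothetical):

```python
def prepare(record):
    """Apply the slide's three example transformations to one input record."""
    # income = salary + bonus
    income = record["salary"] + record["bonus"]

    # value = banded revenue: < 100 -> low, < 500 -> med, else high
    revenue = record["revenue"]
    if revenue < 100:
        value = "low"
    elif revenue < 500:
        value = "med"
    else:
        value = "high"

    # age = age / 100, a simple rescaling
    age = record["age"] / 100

    return {"income": income, "value": value, "age": age}

row = {"salary": 50_000, "bonus": 5_000, "revenue": 250, "age": 43}
print(prepare(row))  # → {'income': 55000, 'value': 'med', 'age': 0.43}
```

The point of embedding such transformations in the model is that they are re-applied automatically at scoring time, so callers never hand-prepare inputs.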
  88. 88. Oracle Data Mining and Unstructured Data
• Oracle Data Mining mines unstructured, i.e. "text", data
• Include free text and comments in ODM models
• Cluster and Classify documents
• Oracle Text used to preprocess unstructured text
Copyright 2010 Oracle Corporation

Performing a Moving Average
The following query computes the moving average of the sales amount between the current month and the previous three months:

SQL> SELECT month, SUM(amount) AS month_amount,
       AVG(SUM(amount)) OVER
         (ORDER BY month ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)
         AS moving_average
     FROM all_sales
     GROUP BY month
     ORDER BY month;

     MONTH MONTH_AMOUNT MOVING_AVERAGE
---------- ------------ --------------
         1     58704.52       58704.52
         2      28289.3       43496.91
         3     20167.83       35720.55
         4      50082.9     39311.1375
         5     17212.66     28938.1725
         6     31128.92     29648.0775
         7     78299.47     44180.9875
         8     42869.64     42377.6725
         9     35299.22     46899.3125
        10     43028.38     49874.1775
        11     26053.46      36812.675
        12     20067.28      31112.085

12 rows selected.
Copyright 2010 Oracle Corporation 88
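The window clause `ROWS BETWEEN 3 PRECEDING AND CURRENT ROW` averages each month with up to three earlier months (fewer at the start of the series). The same computation in plain Python, using the monthly amounts from the output above, reproduces the MOVING_AVERAGE column:

```python
# Monthly sales amounts (months 1..12) from the query output above.
amounts = [58704.52, 28289.3, 20167.83, 50082.9, 17212.66, 31128.92,
           78299.47, 42869.64, 35299.22, 43028.38, 26053.46, 20067.28]

# For each month, average the current value with up to 3 preceding values,
# mirroring ROWS BETWEEN 3 PRECEDING AND CURRENT ROW.
moving_avg = []
for i in range(len(amounts)):
    window = amounts[max(0, i - 3):i + 1]
    moving_avg.append(round(sum(window) / len(window), 4))

print(moving_avg[:4])  # → [58704.52, 43496.91, 35720.55, 39311.1375]
```

Note how the first three months average over shrinking windows of 1, 2, and 3 rows; only from month 4 onward is the window a full 4 rows wide.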
  89. 89. Complex SQL Transform
-- For each customer, compute the amount sold to the customer in the past three months and the three months prior to that.
-- If the increase is greater than 25%, mark the customer as G(rowing).
-- If the decrease is greater than 25%, mark the customer as S(hrinking).
-- Otherwise, mark the customer as U(nchanged).
-- Add special handling for old_sales of 0 by replacing the denominator with new_sales/2, which will yield an increase of more than 25% in the calculation, which is the desired result.
select cust_id,
       case when changed_sales > 0.25 then 'G'
            when changed_sales < -0.25 then 'S'
            else 'U' end as cust_value
from (
  select cust_id,
         (new_sales - old_sales) /
           decode(old_sales, 0, decode(new_sales, 0, 1, new_sales/2), old_sales)
           as changed_sales
  from (
    select cust_id,
           sum(case when time_id < add_months((select max(time_id) from sh.sales), -3)
                    then amount_sold else 0 end) as old_sales,
           sum(case when time_id >= add_months((select max(time_id) from sh.sales), -3)
                    then amount_sold else 0 end) as new_sales
    from sh.sales
    where time_id >= add_months((select max(time_id) from sh.sales), -6)
    group by cust_id ) );
Copyright 2010 Oracle Corporation

Real-time Prediction
On-the-fly, single record apply with new data (e.g. from call center):

with records as (select 78000 SALARY,
                        250000 MORTGAGE_AMOUNT,
                        6 TIME_AS_CUSTOMER,
                        12 MONTHLY_CHECKS_WRITTEN,
                        55 AGE,
                        423 BANK_FUNDS,
                        'Married' MARITAL_STATUS,
                        'Nurse' PROFESSION,
                        'M' SEX,
                        4000 CREDIT_CARD_LIMITS,
                        2 N_OF_DEPENDENTS,
                        1 HOUSE_OWNERSHIP
                 from dual)
select s.prediction prediction, s.probability probability
from (select PREDICTION_SET(INSUR_CUST_LT58218_DT, 1 USING *) pset
      from records) t,
     TABLE(t.pset) s;
Copyright 2010 Oracle Corporation 89
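The CASE/DECODE logic above, including the special handling when old_sales is 0, translates directly into a few lines of procedural code. A sketch in Python (the function name is hypothetical; thresholds and the zero-denominator rule are the ones from the SQL comments):

```python
def customer_value(old_sales, new_sales):
    """Classify a customer as G(rowing), S(hrinking) or U(nchanged),
    mirroring the SQL above: >25% increase -> G, >25% decrease -> S.
    When old_sales is 0, divide by new_sales/2 (or 1 if both are 0),
    so any growth from zero registers as more than +25%."""
    if old_sales == 0:
        denom = new_sales / 2 if new_sales != 0 else 1
    else:
        denom = old_sales
    changed = (new_sales - old_sales) / denom
    if changed > 0.25:
        return "G"
    if changed < -0.25:
        return "S"
    return "U"

print(customer_value(100, 150), customer_value(100, 60),
      customer_value(0, 50), customer_value(100, 110))  # → G S G U
```

The new_sales/2 trick is the interesting design choice: a customer going from 0 to any positive amount gets a ratio of exactly +2 (200%), safely above the +25% growth threshold, without a division-by-zero error.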
  90. 90. ODM & Exadata Copyright 2010 Oracle CorporationPrediction Multiple Models/Optimization with records as (select 178255 ANNUAL_INCOME, 30 AGE, Bach. EDUCATION, On-the-fly, multiple models; Married MARITAL_STATUS, Male SEX, then sort by expected revenues 70 HOURS_PER_WEEK, 98 PAYROLL_DEDUCTION from dual) select t.* from ( select CAR_MODEL MODEL, s1.prediction prediction, s1.probability probability, s1.probability*25000 as expected_revenue from ( select PREDICTION_SET(NBMODEL_JDM, 1 USING *) pset from records ) t1, TABLE(t1.pset) s1 UNION select MOTOCYCLE_MODEL MODEL, s2.prediction prediction, s2.probability probability, s1.probability*2000 as expected_revenue from ( select PREDICTION_SET(ABNMODEL_JDM, 1 USING *) pset from records ) t2, TABLE(t2.pset) s2 UNION select TRICYCLE_MODEL MODEL, s3.prediction prediction, s3.probability probability, s1.probability*50 as expected_revenue from ( select PREDICTION_SET(TREEMODEL_JDM, 1 USING *) pset from records ) t3, TABLE(t3.pset) s3 UNION select BICYCLE_MODEL MODEL, s4.prediction prediction, s4.probability probability, s1.probability*200 as expected_revenue from ( select PREDICTION_SET(SVMCMODEL_JDM, 1 USING *) pset from records ) t4, TABLE(t4.pset) s4 ) t order by t.expected_revenue desc; Copyright 2010 Oracle Corporation 90
  91. 91. Oracle Data Mining + Exadata• In 11gR2, SQL predicates and Oracle Data Mining models are pushed to storage level for execution For example, find the US customers likely to churn: select cust_id from customers Scoring function executed in Exadata where region = ‘US’ and prediction_probability(churnmod,‘Y’ using *) > 0.8; Copyright 2010 Oracle Corporation R-ODM Interface Copyright 2010 Oracle Corporation 91
  92. 92. R Interface to Oracle Data Mining • The R Interface to Oracle Data Mining ( R-ODM) allows R users to access the power of Oracle Data Minings in- database functions using R syntax. • R-ODM provides a powerful environment for prototyping data analysis and data mining methodologies. • R-ODM is especially useful for: • Quick prototyping of vertical or domain-based applications where the Oracle Database supports the application • Scripting of "production" data mining methodologies • Customizing graphics of ODM data mining results (examples: classification, regression, anomaly detection) Copyright 2010 Oracle Corporation RODM Interface R environment PL/SQL ODMlibrary(RODM) ODBCRODM_open_dbms_connection(…)RODM_create_dbms_table(…) table …RODM_create_nb_model(…)sqlQuery(DB, "BEGIN create_model dbms_data_mining.create_model……”)RODM_apply_model(…) sqlFetch(DB, out_table) tableRODM_drop_model(…)RODM_drop_dbms_table(…)RODM_close_dbms_connection(…) Copyright 2010 Oracle Corporation 92
  93. 93. Excel Add-In Copyright 2010 Oracle CorporationExcel Add-In Copyright 2010 Oracle Corporation 93
  94. 94. Integration with OBIEE Copyright 2010 Oracle CorporationIntegration with Oracle BI EE Oracle Data Mining results available to Oracle BI EE administrators Oracle BI EE defines results for end user presentation Copyright 2010 Oracle Corporation 94
  95. 95. Example: Better Information for OBI EE Reports and Dashboards
ODM's predictions & probabilities are available in the Database for reporting using Oracle BI EE and other reporting tools
Copyright 2010 Oracle Corporation

Agenda
1. Introduction to Data Mining
2. Basic Concepts of Data Mining
3. Data Mining Techniques in Marketing and Customer Relationship Management
4. CRISP-DM as a Data Mining Methodology
5. CRoss Industry Process Model for Data Mining (CRISP-DM)
6. InforSense Platform – A Data Mining Platform
7. Basic Statistical Techniques Used in Data Mining
8. Linear Regression
9. Logistic Regression
10. Decision Trees
11. Artificial Neural Networks
12. Basket Analyses and Association Rules
13. Clustering and Segmentation Analyses
14. Performance Measurement in Data Mining
15. Project Planning Techniques and an Enterprise Approach for Successful Data Mining Applications
16. Data Cleaning and Quality Analyses for Data Mining Applications
17. Deploying Successful Data Mining Models 95
  96. 96. 7. A Few Important Ideas From Statistics
Important Ideas
• Population, Sample, Statistics
• Mean, Median, Mode, Range
• Probability
• Expected value
• Distributions
 − Probability density functions (PDF)
 − Cumulative distribution functions (CDF)
• Confidence
• Significance
• Variance, Standard Deviation 96
  97. 97. Why Does a Manager (or You) Need to Know Some Basics about Statistics?
• To know how to properly present information
• To know how to draw conclusions about populations based on sample information
• To know how to improve processes
• To know how to obtain reliable forecasts

Statistics vs Data Mining
• Statistics don't lie, but liars use statistics!
• Statistics developed as a discipline to help scientists make sense of observations and experiments; hence the scientific method
• The problem has often been too little data for statisticians
• DM is faced with too much data
• Many of the techniques & algorithms used are shared by both statisticians and data miners 97
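Most of the quantities in the "Important Ideas" list are one-liners in any statistics library. A quick sketch using Python's standard `statistics` module, on a small hypothetical sample:

```python
import statistics as st

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

print("mean:  ", st.mean(sample))       # central tendency
print("median:", st.median(sample))     # middle value
print("mode:  ", st.mode(sample))       # most frequent value
print("range: ", max(sample) - min(sample))
print("population variance:", st.pvariance(sample))
print("population std dev: ", st.pstdev(sample))
# Estimating from a sample uses the n-1 denominator instead:
print("sample variance:", st.variance(sample))
```

Note the population vs. sample distinction from the list above: `pvariance` divides by n (treating the data as the whole population), while `variance` divides by n-1 (treating it as a sample drawn from a larger population).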
