Introduction to Data Mining


An intro lecture on data mining.

Published in: Data & Analytics, Technology

  1. DM Intro (Integrated Knowledge Solutions)
  2. What is Data? ©IKSINC
     Data is a set of facts, observations, or measurements about objects, events, or processes of interest.
  3. What is Information?
     Information is processed data that is useful in some way, for example for decision making or communication. While the data is fixed, the information derived from it can differ based on needs.
  4. What is Knowledge?
     Patterns of relationships in data and information that exhibit a high degree of certainty.
  5. What Is Data Mining?
     Data mining is essentially a process of data-driven extraction of not-so-obvious but useful information from large databases. The entire process is interactive and iterative.
     Data mining also goes under various other names, such as knowledge discovery in databases (KDD), knowledge extraction, data/pattern analysis, data analytics, and business intelligence.
  6. Latest buzzword
  7. Why Data Mining?
     • The Data Glut
     • Data-rich but information-poor businesses
     • Estimates of data doubling every 20 months
     • The average Fortune 500 company manages over a terabyte of data every day
     • Convergence of Technologies
     • Competitive Edge
     • Mass marketing versus targeted marketing
  9. Changing Business Direction
  10. Typical Business Applications of Data Mining
      • Market Segmentation
      • Customer Targeting and Retention
      • Product Design and Placement
      • Credit Card Fraud Detection
      • Web Advertising
      • Recommendation Systems
  11. Other Applications of Data Mining
      • Stock Market Trends
      • Text and Multimedia Data Mining
      • Sports Scouting
      • Medical Outcomes Analysis
      • Scientific Data Mining
  12. Data Classification
      • Structured Data: data consisting of well-defined fields of numeric or alphanumeric values
  13. Data Classification
      • Unstructured Data
      • No well-defined fields of information
      • Requires extensive processing to extract content information
      • Examples include blogs, news reports, images, videos, tweets, etc.
      • Fastest-growing data segment
  14. Data Classification
      • Semi-Structured Data: data with partial structure (medical reports, executive summaries, interview scripts, web documents, etc.)
  15. Core Technologies for Data Mining
      [Diagram labels: Data Mining; Machine Learning; Database Technology; Statistics; Visualization; Information Retrieval]
  16. Data Mining and Business Intelligence
      [Pyramid diagram; potential to support business decisions increases toward the top. Layers, top to bottom:]
      • Making Decisions
      • Data Presentation: Visualization Techniques
      • Data Mining: Information Discovery
      • Data Exploration: OLAP, MDA, Statistical Analysis, Querying and Reporting
      • Data Warehouses / Data Marts
      • Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
      [Roles shown alongside, top to bottom: End User, Business Analyst, Data Analyst, DBA]
  17. Architecture of a Typical Data Mining System
      [Diagram labels: Databases; Data Warehouse; Data cleaning & data integration; Filtering; Database or data warehouse server; Data mining engine; Pattern evaluation; Graphical user interface; Knowledge base]
  18. Data Mining and Data Warehousing
      [Diagram labels: Customers, Vendors, Orders, Etc.; Enterprise “Database”; Transactions; Copied, organized, summarized; Data Warehouse; Data Mining; Data Miners]
      A data warehouse is a data repository set up to support strategic decision making.
  19. Data Mart
      • A Data Mart is a smaller, more focused Data Warehouse (a mini-warehouse).
      • A Data Mart typically reflects the business rules of a specific business unit within an enterprise.
  20. Accessing Information in a Data Warehouse
      • Structured Query Language (SQL)-based Reporting and Query Tools
      • Good for extracting shallow, non-dimensional information. For example, “Find all credit card customers holding a balance payment of $500 or more.”
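
The example query on this slide can be sketched with Python's built-in sqlite3 module. This is only an illustration: the `customers` table, its `name` and `balance` columns, and the sample rows are hypothetical and do not come from the slides.

```python
import sqlite3

def customers_with_balance(conn, minimum):
    """Return names of customers whose balance is at least `minimum`."""
    rows = conn.execute(
        "SELECT name FROM customers WHERE balance >= ? ORDER BY name",
        (minimum,),
    ).fetchall()
    return [name for (name,) in rows]

# Hypothetical in-memory table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, balance REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Avery", 750.0), ("Blake", 120.0), ("Casey", 500.0)],
)
print(customers_with_balance(conn, 500))  # prints ['Avery', 'Casey']
```

The query states a constraint and returns exactly the rows that satisfy it, which is the "shallow" kind of answer the slide attributes to SQL tools.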
  21. Accessing Information in a Data Warehouse
      • On-Line Analytical Processing (OLAP)-based Reporting and Query Tools
      • Good for extracting shallow, dimensional information. For example, “Find all credit card customers who live in the Midwest, drive a luxury car, and hold a balance payment of $500 or more.”
  22. On-Line Analytical Processing (OLAP)
      • OLAP tools organize data in a dimensional representation called a cube. This permits data to be viewed from any user-specified angle to allow slicing and dicing of the data.
      • SQL can easily answer ‘who?’ and ‘what?’ questions; however, the ability to answer ‘what if?’ and ‘why?’ questions distinguishes OLAP from general-purpose query tools.
  23. [Cube figure with three dimensions. Time: 08, 09, 10, 11. Regions: Northeast, Midwest, South, West. Products: A, B, C, D]
  24. OLAP Operations
      [Figure labels: Single Cell; Multiple Cells; Slice; Dice; Roll Up; Drill Down]
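
The slice, dice, and roll-up operations named on these slides can be sketched over a tiny in-memory cube. This is a minimal illustration in plain Python, not how an OLAP server works internally; the fact rows, the `sales` measure, and the function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical fact rows: one sales figure per (region, year, product) cell.
FACTS = [
    {"region": "Midwest", "year": 2010, "product": "A", "sales": 10},
    {"region": "Midwest", "year": 2011, "product": "A", "sales": 20},
    {"region": "Midwest", "year": 2011, "product": "B", "sales": 5},
    {"region": "South", "year": 2011, "product": "A", "sales": 7},
]

def slice_cube(facts, dimension, value):
    """Slice: fix one dimension to a single value."""
    return [row for row in facts if row[dimension] == value]

def dice(facts, **allowed):
    """Dice: keep cells whose value on each named dimension is in the allowed set."""
    return [
        row for row in facts
        if all(row[dim] in values for dim, values in allowed.items())
    ]

def roll_up(facts, *dimensions):
    """Roll up: total sales grouped by the dimensions that are kept."""
    totals = defaultdict(int)
    for row in facts:
        totals[tuple(row[d] for d in dimensions)] += row["sales"]
    return dict(totals)

print(roll_up(FACTS, "region"))  # prints {('Midwest',): 35, ('South',): 7}
```

Drilling down is the reverse of rolling up: calling `roll_up(FACTS, "region", "year")` keeps one more dimension and so shows finer-grained totals.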
  25. Different OLAP Tools
      • Desktop OLAP (DOLAP): limited-capability PC-based tools
      • Relational OLAP (ROLAP): server-based tools that let a user assemble into a cube a subset of data from a relational database
      • Multidimensional OLAP (MOLAP): server-based tools that use pre-computed cubes of data for faster response
  26. PowerPivot
  27. Accessing Information in a Data Warehouse
      • Data Mining Tools
      • Good for extracting hidden or not-so-obvious information. For example, “Find all credit card customers who are likely to declare bankruptcy.”
  28. DBMS, OLAP, and Data Mining
      Task
      • DBMS: Extraction of detailed and summary data
      • OLAP: Summaries, trends and forecasts
      • Data Mining: Knowledge discovery of hidden patterns and insights
      Type of result
      • DBMS: Information
      • OLAP: Analysis
      • Data Mining: Insight and Prediction
      Method
      • DBMS: Deduction (Ask the question, verify with data)
      • OLAP: Multidimensional data modeling, Aggregation, Statistics
      • Data Mining: Induction (Build the model, apply it to new data, get the result)
      Example question
      • DBMS: Who purchased mutual funds in the last 3 years?
      • OLAP: What is the average income of mutual fund buyers by region by year?
      • Data Mining: Who will buy a mutual fund in the next 6 months and why?
  29. Data Mining and SQL
      • SQL is good for queries that impose a constraint on data to extract an answer
      • SQL extracts shallow knowledge from a database
      • Use SQL when we know exactly what we are looking for
      • Data mining is good for exploratory queries
      • Data mining extracts hidden information from a database
      • Use data mining when we are in a fishing mode
  30. Data Mining and Data Warehousing: A Mutually Reinforcing Relationship
      • Data mining provides a good ROI for data warehousing
      • A data warehouse or data mart provides clean, well-formatted historical data for mining
  31. Data Mining Process
      [Process diagram stages: Domain Understanding; Data Selection; Cleaning and Preprocessing; Discovering Patterns; Interpretation; Reporting]
  32. Domain Understanding Stage
      • Learning the business goals
      • Gathering relevant prior knowledge
      • Best executed by a team of business and IT persons
      • Good understanding of the domain avoids discovering irrelevant patterns and minimizes the chances of garbage in, garbage out
  33. Example Data Sources
      • Point-of-sale data
      • Credit card charge records
      • Warranty claims
      • Medical insurance claims
      • Direct mail response data
      • Telephone call records
      • Web activity data
      • Economic data
      • Utility charges
      • Census returns
      • Magazine subscription records
  34. Data Cleaning and Preprocessing Stage
      • Data comes from many sources, internal and external
      • Data comes in many forms and formats: hierarchical databases, flat files, COBOL data sets
      • Data is never clean
      • Most important stage. Typically consumes about 60-80% of the total data mining effort
  35. Business Data Corruption Examples
      • Duplication: a common problem with direct mailers and credit card companies
      • Missing and Confusing Data Fields
      • Outliers: generally present due to incorrect entry/coding of a data field
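
The three corruption types listed above (duplicates, missing fields, outliers) can each be screened with a simple check. The sketch below is one possible approach under stated assumptions: the `name` and `age` fields, the plausible age range, and the function name are hypothetical, and real cleaning rules depend on the data at hand.

```python
def clean_records(records, required=("name", "age"), age_range=(0, 120)):
    """Drop duplicates, then set aside records with missing or implausible fields."""
    seen = set()
    clean, rejected = [], []
    for record in records:
        # Normalize the name so trivially different spellings count as duplicates.
        key = (str(record.get("name", "")).strip().lower(), record.get("age"))
        if key in seen:
            continue  # duplicate of a record already processed
        seen.add(key)
        if any(record.get(field) in (None, "") for field in required):
            rejected.append((record, "missing field"))
        elif not age_range[0] <= record["age"] <= age_range[1]:
            rejected.append((record, "outlier"))
        else:
            clean.append(record)
    return clean, rejected

# Hypothetical sample records, for illustration only.
records = [
    {"name": "Ann Lee", "age": 34},
    {"name": "ann lee ", "age": 34},   # duplicate after normalization
    {"name": "Bo", "age": None},       # missing field
    {"name": "Cy", "age": 430},        # likely a data-entry error
]
clean, rejected = clean_records(records)
```

Rejected records are returned with a reason rather than silently dropped, so a person can decide whether to correct or discard them.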
  36. Data Preprocessing
      This step is also known as data transformation. The aim here is to map data fields into representations suitable for the data discovery stage.
      Examples:
      Month/Date/Year ===> Age Groups
      Customer Address ===> Geographic Zone Code
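
The first transformation above, from a date to an age group, might look like the following sketch. It assumes the date is a date of birth; the band boundaries and labels are illustrative choices, not taken from the slides.

```python
from datetime import date

def age_group(birth_date, today):
    """Map a date of birth to a coarse age group (band edges are illustrative)."""
    # Subtract one year if this year's birthday has not happened yet.
    age = today.year - birth_date.year - (
        (today.month, today.day) < (birth_date.month, birth_date.day)
    )
    if age < 18:
        return "under 18"
    if age < 35:
        return "18-34"
    if age < 55:
        return "35-54"
    return "55+"

print(age_group(date(1990, 6, 15), date(2020, 6, 14)))  # prints 18-34
```

Passing `today` explicitly keeps the function deterministic, which makes the mapping easy to test and to re-run on historical data.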
  37. Pattern Discovery Stage
      • Discovery Model?
      • Discovery Methodology?
  38. Discovery Models
      • Association Model
      • Classification Model
      • Clustering Model
      • Regression Model
      • Sequential Model
      • Visual Model
  39. Association Model
      90% of customers who subscribe to at least three premium channels also subscribe to pay-per-view events.
      Also known as Market Basket Analysis
  40. Association Model: Application 1
      • Marketing and Sales Promotion:
      • Let the rule discovered be {Bagels, …} --> {Potato Chips}
      • Potato Chips as consequent => Can be used to determine what should be done to boost its sales.
      • Bagels in the antecedent => Can be used to see which products would be affected if the store discontinues selling bagels.
      • Bagels in antecedent and Potato Chips in consequent => Can be used to see what products should be sold with Bagels to promote sale of Potato Chips!
  41. Association Model: Application 2
      • Supermarket shelf management.
      • Goal: To identify items that are bought together by sufficiently many customers.
      • Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items.
      • A classic rule:
      • If a customer buys diapers and milk, then he is very likely to buy beer.
      • So, don’t be surprised if you find six-packs stacked next to diapers!
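
A rule such as {Bagels} --> {Potato Chips} is usually scored with two standard measures, support and confidence. The slides do not name these measures, so treat the sketch below as a common way to quantify such rules rather than the deck's own method; the baskets are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent); assumes the antecedent occurs at least once."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Hypothetical market baskets, for illustration only.
baskets = [
    {"bagels", "potato chips"},
    {"bagels", "potato chips", "milk"},
    {"bagels", "milk"},
    {"milk"},
]
print(support(baskets, {"bagels"}))  # prints 0.75
```

Here `confidence(baskets, {"bagels"}, {"potato chips"})` is 2/3: of the three baskets containing bagels, two also contain potato chips. A statement like "90% of customers who subscribe to three premium channels also subscribe to pay-per-view" is a confidence figure of this kind.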
  42. Classification Model
      • If Annual_Income > 40,000 AND Home-Owner, Then Credit-Risk → Medium
      • If [expression in Annual_Income, Avg_Monthly_CreditCardBalance, and Mortgage; original layout not recoverable] > 25, Then Loan-Approval → Yes
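
The first rule on this slide translates directly into code. This sketch only applies the stated rule; the function and parameter names are hypothetical, and the slide gives no outcome for cases where the rule does not fire, so the function returns None there.

```python
def credit_risk(annual_income, home_owner):
    """Apply the slide's first rule; return None when the rule does not fire."""
    if annual_income > 40_000 and home_owner:
        return "Medium"
    return None  # the slide states no rule for other cases

print(credit_risk(50_000, True))  # prints Medium
```

In a data mining setting such rules are not written by hand; they are learned from labelled historical records and then applied to new cases in this way.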
  43. Classification: Application 1
      • Direct Marketing
      • Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
      • Approach:
      • Use the data for a similar product introduced before.
      • We know which customers decided to buy and which decided otherwise. This {buy, don’t buy} decision forms the class attribute.
      • Collect various demographic, lifestyle, and company-interaction related information about all such customers.
      • Type of business, where they stay, how much they earn, etc.
      • Use this information as input attributes to learn a classifier model.
  44. Classification: Application 2
      • Fraud Detection
      • Goal: Predict fraudulent cases in credit card transactions.
      • Approach:
      • Use credit card transactions and the information on the account holder as attributes.
      • When does a customer buy, what does he buy, how often he pays on time, etc.
      • Label past transactions as fraud or fair transactions. This forms the class attribute.
      • Learn a model for the class of the transactions.
      • Use this model to detect fraud by observing credit card transactions on an account.
  45. Classification: Application 3
      • Customer Attrition/Churn:
      • Goal: To predict whether a customer is likely to be lost to a competitor.
      • Approach:
      • Use detailed records of transactions with each of the past and present customers to find attributes.
      • How often the customer calls, where he calls, what time of the day he calls most, his financial status, marital status, etc.
      • Label the customers as loyal or disloyal.
      • Find a model for loyalty.
  46. Clustering Model
      Clustering models are similar to classification models except that no a priori information is available for classes.
  47. Clustering: Application 1
      • Market Segmentation:
      • Goal: Subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
      • Approach:
      • Collect different attributes of customers based on their geographical and lifestyle related information.
      • Find clusters of similar customers.
      • Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.
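
The "find clusters of similar customers" step could be carried out with many algorithms; the slides do not name one. The sketch below uses basic k-means as one common choice. The two customer attributes (age, purchases per month), the sample data, and the naive initialization are assumptions made for illustration.

```python
def k_means(points, k, iterations=10):
    """Cluster 2-D points with a basic k-means loop; returns (centroids, labels)."""
    centroids = list(points[:k])  # naive initialization: the first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(
                range(k),
                key=lambda c: (x - centroids[c][0]) ** 2 + (y - centroids[c][1]) ** 2,
            )
            for x, y in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = (
                    sum(px for px, _py in members) / len(members),
                    sum(py for _px, py in members) / len(members),
                )
    return centroids, labels

# Hypothetical customers as (age, purchases per month).
customers = [(20, 1), (22, 2), (21, 1), (60, 9), (62, 10), (61, 8)]
centroids, labels = k_means(customers, k=2)
print(labels)  # prints [0, 0, 0, 1, 1, 1]
```

No class labels are supplied: the two segments emerge from the attribute values alone, which is the distinction from classification drawn on the Clustering Model slide. In practice attributes on different scales would normally be standardized first.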
  48. Clustering: Application 2
      • Document Clustering:
      • Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.
      • Approach: Identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
      • Gain: Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents.
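
One way to form "a similarity measure based on the frequencies of different terms" is cosine similarity over term-frequency vectors. The slides do not specify a measure, so this is an assumed choice; the tokenization by whitespace and the sample documents are also simplifications for illustration.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Count lower-cased, whitespace-separated terms in a document."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors (0.0 if either is empty)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical documents, for illustration only.
doc1 = term_frequencies("data mining finds patterns")
doc2 = term_frequencies("data mining finds rules")
doc3 = term_frequencies("stock market trends")
```

With these documents, `cosine_similarity(doc1, doc2)` is 0.75 (three of four terms shared) and `cosine_similarity(doc1, doc3)` is 0.0 (no shared terms), so a clustering step would group the first two together.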
  49. Regression Model
      Log(Peak_Load) = 300 + 1.6(|Temp - 65|)**2 + 2.4(Rel_Humidity - 80)
      Unlike classification models, which produce only discrete outcomes, a regression model generates a numerical score as its output.
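
The regression formula on this slide can be evaluated directly. The sketch below simply transcribes it; the function and argument names are hypothetical, and the slide does not state units for temperature or humidity, so none are assumed here.

```python
def log_peak_load(temp, rel_humidity):
    """Evaluate the slide's formula for Log(Peak_Load)."""
    return 300 + 1.6 * abs(temp - 65) ** 2 + 2.4 * (rel_humidity - 80)

print(log_peak_load(65, 80))  # prints 300.0
```

For example, with Temp = 70 and Rel_Humidity = 85 the terms are 1.6 × 25 = 40 and 2.4 × 5 = 12, giving 352. The output is a continuous number rather than a class label, which is the contrast with classification drawn on the slide.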
  50. Sequential Model
      Similar to association models except that sequences of events are considered. For example:
      “80% of customers who buy a product X are likely to buy product Y in the next six months”
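
A figure like the "80%" in the example rule can be estimated by counting, for customers who bought X, how many bought Y within the following six months. This is a minimal sketch under assumptions: the purchase-history layout, the function name, and the use of 183 days as an approximation of six months are all hypothetical.

```python
from datetime import date, timedelta

def follow_on_rate(purchases, first, second, window_days=183):
    """Fraction of buyers of `first` who buy `second` within the window afterwards.

    `purchases` maps a customer id to a list of (product, purchase_date) pairs.
    """
    window = timedelta(days=window_days)
    buyers = followers = 0
    for history in purchases.values():
        first_dates = [d for product, d in history if product == first]
        if not first_dates:
            continue  # this customer never bought `first`
        buyers += 1
        start = min(first_dates)  # earliest purchase of `first`
        if any(product == second and start < d <= start + window
               for product, d in history):
            followers += 1
    return followers / buyers if buyers else 0.0

# Hypothetical purchase histories, for illustration only.
purchases = {
    "c1": [("X", date(2020, 1, 10)), ("Y", date(2020, 3, 1))],   # Y soon after X
    "c2": [("X", date(2020, 1, 10)), ("Y", date(2021, 1, 10))],  # Y too late
    "c3": [("Y", date(2020, 1, 1)), ("X", date(2020, 2, 1))],    # Y before X
    "c4": [("Z", date(2020, 1, 1))],                             # never bought X
    "c5": [("X", date(2020, 5, 1)), ("Y", date(2020, 5, 20))],   # Y soon after X
}
print(follow_on_rate(purchases, "X", "Y"))  # prints 0.5
```

Customer c3 is not counted as a follower because Y came before X: order matters here, which is what separates a sequential model from a plain association rule.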
  51. Visual Model
  52. Geographical Patterns and Map Visualization
  53. Interpretation Stage
      • Evaluate the quality of the discovered patterns
      • Determine the value of the discovery to the business
  54. Value of Mined Information
      • Perceptive gap
      [Diagram labels: The Organization; Organization’s View of the Business; Data Mined View of the Business; Gap]
  55. Value of Mined Information
      • Dollar gap
      [Diagram labels: The Organization; Organization’s View of the Business; Data Mined View of the Business; Action 1; Action 2; Gap]
  56. Reporting Stage
      • Reporting the discovery to higher management
      • Transforming the discovery to new actions or products
  57. Challenges of Data Mining
      • Scalability
      • Dimensionality
      • Complex and Heterogeneous Data
      • Data Quality
      • Data Ownership and Distribution
      • Privacy Preservation
  58. Thank you for Viewing my Presentation
      For questions, you can contact me at
      Also visit my blog “From Data to Decisions” at