25 June 2013 - Advanced Data Analytics - an Introduction - Paul Kennedy



  1. 1. Advanced Data Analytics: An Introduction. Data Mining in Advanced Analytics. Dr Paul Kennedy, paul.kennedy@uts.edu.au, Centre for Quantum Computation & Intelligent Systems, School of Software, Faculty of Engineering & IT. Friday, 5 July 2013
  2. 2. Outline • What is Data Analytics (DA)? • Motivation for DA • Main approaches • DA professionals • Links to other topics • Overview of techniques
  3. 3. What is Data Analytics?
  4. 4. • Data Analytics is the analysis of large databases to find novel, commercially valuable and exploitable patterns. • Aim: discover meaningful insights and knowledge from data. • Discoveries are expressed as models. • Data mining = the process of building models.
  5. 5. Models • A model • Captures the essence of the discovered knowledge. • Can assist in understanding the world. • Can be used to make predictions.
  6. 6. Where applied? • Who by? • Business, government, financial services, biology, medicine, risk and intelligence, science and engineering. • Data collected about • Businesses, customers, human resources, products, manufacturing processes, suppliers, business partners, local and international markets & competitors. • Why? • To better support managers, find fraudulent behaviour, understand scientific processes and find opportunities.
  7. 7. Motivation for DA
  8. 8. Collecting Data • We have always collected, checked and organised data. • 5500 years ago Sumerians marked tax records onto dried mud tablets. • Scientists have looked through microscopes and telescopes and drawn what they saw. • Market researchers ran surveys or used TV diaries. • Medical laboratories take dozens of measurements per patient.
  9. 9. Data • Analysing • Since then, people have sought ways to use the recorded information to improve their lives (finances, health, ...) • Understanding • People can understand data in these amounts. • But nowadays, there is a data explosion.
  10. 10. Data explosion • Most data now goes straight to computers without humans seeing it. • Tax records are submitted electronically. • Telescopes are operated remotely and digital images go to computer files. • Market and POS data go to data warehouses. • High-throughput technology makes simultaneous measurements of 1000s of genes per patient. • This deluge of data is useless to unaided people!
  11. 11. Big Data ... • Huge global interest currently. • In 2012 the Obama administration announced $200m for Big Data R&D in the US. • The TechAmerica Foundation (Federal Big Data Commission) released the report "Demystifying Big Data: A Practical Guide to Transforming the Business of Government", describing the “transformational” power of Big Data and recommending training for the huge number of data scientists & analysts urgently needed. • Source: http://www.techamericafoundation.org/bigdata
  12. 12. Is it really an “explosion”? • 2011: 1.8 zettabytes of information created globally, expected to double each year. • = 200 billion 2-hour HD movies, which one person could watch for 47 million years straight! • From sensors, satellites, social media, mobile comms, email, RFID and enterprise applications. • Source: Demystifying Big Data, TechAmerica Foundation, 2012.
  13. 13. Data Analytics Successes
  14. 14. Helping to catch the backpacker killer • Australia’s most notorious serial murder case. • Early 1990s: 7 young backpackers murdered. • Police had developed a profile. • A huge dataset was generated from vehicle records, gym memberships, gun licensing and police records. • Link analysis software from Sydney company NetMap Analytics narrowed the list of suspects from 18 million to 32, which included the murderer: Ivan Milat.
  15. 15. Predicting the 2012 US election result • Nate Silver used predictive analytics & statistics to correctly predict the outcomes in 50 out of 50 states from polling and related data. • Republican pundits were confident in their landslide-win predictions. Democrat pundits predicted a razor-thin victory. • Shows the power of a data-centric approach over “gut feeling”.
  16. 16. How does it fit to business?
  17. 17. Fitting to the business • Understanding the business context and, more strongly, framing a business question. • Translating the business question into a data analytics question. • Collecting, understanding and processing data from across the business and possibly externally. • Building models and evaluating them. • Deploying the results in the business to deliver benefits. • Iterative process.
  18. 18. Fitting to the business [Diagram: Mathematical Model] • Predict ‘class’ of unseen rows, e.g. customers to target • Find relationships between rows or columns, e.g. customer groups
  19. 19. Two main approaches • Unsupervised methods • Model tries to make sense of the data set or characterise it. • Supervised methods • Model learns a relationship between inputs and outputs from historical data. • Model can then be used to predict output for new data.
  20. 20. Fitting to the business [Diagram: Mathematical Model] • Predict ‘class’ of unseen rows, e.g. customers to target • Find relationships between rows or columns, e.g. customer groups
  21. 21. Data Warehousing to Data Mining • Data Warehouse: organisation-wide integrated access to a centralised repository + data models • On-Line Analytical Processing (OLAP): • statistical summaries and basic analytical modeling • build and cache fixed ‘cubes’ (business intelligence) • restructure data for efficient analysis • fast summarisation and aggregation at different levels
  22. 22. Data Mining to Knowledge Discovery • Data: raw uninterpreted facts, e.g. Tom, 20 years old, student • Information relates items of Data together, e.g. Tom is 20 years old • Knowledge relates items of Information together, e.g. Tom is 20 years old → Tom pays > $1500 insurance • Modeling the world (= generalising), e.g. [18 - 25] years old → P(accident) = high
  23. 23. Data Mining - a Business Intelligence view [Diagram labels: Data Mining; Data mining problem(s); Patterns; Business Intelligence; Business Problem]
  24. 24. Data Mining - a Business Intelligence view [Diagram labels: Data Mining; Data mining problem(s); Patterns; Business Intelligence; Business Problem; Domain; Domain]
  25. 25. Data Mining - a Business Intelligence view [Diagram labels: Data Mining; Data mining problem(s); Patterns; Business Intelligence; Business Problem; Domain; Domain; Data & Information Visualisation; Data Warehousing; Methods and Frameworks; Knowledge Discovery Techniques]
  26. 26. CRISP-DM view. Source: Kenneth Jensen / Wikimedia Commons / Public Domain
  27. 27. DA professionals
  28. 28. The rising profession of Data Analyst • “Data mining as a profession is definitely growing because data is growing. Data is becoming more and more usable because of data warehousing (where information from many locations can be centrally mined). So the only way is up.” - Eugene Dubossarsky (Ernst & Young) • “If you are looking for a career where your services will be in high demand, you should find something where you provide a scarce, complementary service to something that is getting ubiquitous and cheap. So what’s getting ubiquitous and cheap? Data. And what is complementary to data? Analysis.” - Prof. Hal Varian, UC Berkeley, Chief Economist at Google • “The ATO has a network of 30+ data miners working with another 70 or so analytics staff.” - Dr Warwick Graco, Australian Taxation Office
  29. 29. Data Miners / Data Analysts • “Typical data mining jobs pay six-figure salaries. The required blend of skills makes good data miners a rare breed.” - Ronnie Chan, senior IT specialist, IBM's DB2 team • Data miners are the SAS of the IT industry, and it's not a job for beginners. Demand is strong for people who have the technical skills combined with business knowledge. “To produce useable results, data miners must draw on advanced analytical approaches such as predictive modelling, association discovery and sequence discovery.” - Peter Norris, Business Manager, Computer Associates
  30. 30. 10 Hot IT Skills for 2013 • ComputerWorld, 24/9/12 • #5 Business Intelligence / Analytics • “Big data is one of the top priorities for many companies, but getting the right people to analyze all that information is challenging, says Jerry Luftman, managing director at the Global Institute for IT Management and a leader in the Society for Information Management. • The best candidates have technical know-how, business knowledge and strong statistical and mathematical backgrounds -- an uncommon mix of skills, Luftman says. In fact, some companies are hiring statisticians and teaching them about technology and business.”
  31. 31. Gartner Top 10 Strategic Technology Trends for 2013 • Gartner identifies the Top 10 Strategic Technology Trends for 2013, October 23, 2012 • Of the 10 strategic trends, two were for data analytics. • Strategic Big Data • Actionable Analytics
  32. 32. Gartner Top 10 Strategic Technology Trends for 2013 • Strategic Big Data • “Big Data is moving from a focus on individual projects to an influence on enterprises’ strategic information architecture. Dealing with data volume, variety, velocity and complexity is forcing changes to many traditional approaches. This realization is leading organizations to abandon the concept of a single enterprise data warehouse containing all information needed for decisions. Instead they are moving towards multiple systems, including content management, data warehouses, data marts and specialized file systems tied together with data services and metadata, which will become the "logical" enterprise data warehouse.”
  33. 33. Gartner Top 10 Strategic Technology Trends for 2013 • Actionable Analytics • “Analytics is increasingly delivered to users at the point of action and in context. With the improvement of performance and costs, IT leaders can afford to perform analytics and simulation for every action taken in the business. The mobile client linked to cloud-based analytic engines and big data repositories potentially enables use of optimization and simulation everywhere and every time. This new step provides simulation, prediction, optimization and other analytics, to empower even more decision flexibility at the time and place of every business process action.”
  34. 34. Institute of Analytics Professionals of Australia • “Our mission is to unite, inform, support and promote analytics professionals in Australia. We provide information sources, a virtual community, a networking hub and a professional identity. We promote the benefits of analytics in modern business.” • www.iapa.org.au
  35. 35. Privacy • Privacy is important and it is an ethical concern for data analysts. • Laws directly govern data mining in Australia and overseas. • Some basic principles from the OECD: • Collection limitation: data should be obtained lawfully and fairly. • Data quality: data should be relevant to the stated purposes, accurate, complete and up-to-date. • Purpose specification: the purpose for using the data should be stated, and data should be destroyed once it no longer serves that purpose. • Use limitation: use of data for purposes other than those specified is forbidden.
  36. 36. Some examples
  37. 37. Market analysis & management • Data sources? • Credit card transactions, loyalty cards, discount coupons, customer complaint calls, social media, plus (public) lifestyle studies • Target marketing • Find clusters of ‘model’ customers who share the same characteristics: interest, income level, spending habits, etc. • Determine customer purchasing patterns over time • e.g. conversion of single to joint bank account: marriage, ... • Cross-market analysis • Associations / correlations between product sales • Prediction based on the association information.
  38. 38. Market analysis & management (cont’d) • Customer profiling • Data analytics can tell you what types of customers buy what products (clustering or classification) • Identifying customer requirements • Identifying the best products for different customers • Use prediction to find what factors will attract new customers. • Provide summary information • Various multidimensional summary reports • Statistical summary information (mean and variance ...)
  39. 39. Links to other topics
  40. 40. The Knowledge Discovery Process: Databases → (Data Cleaning & Integration) → Data Warehouse → (Data Selection) → Task-relevant Data → (Data Mining) → Patterns → (Pattern Evaluation) → Knowledge. Note: iterative process, not waterfall!
  41. 41. The KDD Process • Learn the application domain (prior knowledge & goals) • Create target data set: data selection • Data cleaning and preprocessing (may take 60% of effort!) • Data reduction and transformation • Find useful features, dimensionality/variable reduction, invariant representation • Choose functions of data mining: the “data mining problem” • Summarisation, classification, regression, association, clustering • Choose the data mining algorithm(s) • Data Mining: find patterns of interest • Pattern evaluation and knowledge presentation • Visualisation, transformation, remove redundant patterns, ... • Use of discovered knowledge
  42. 42. Data Mining [Diagram labels: Information Science; Visualisation; Artificial Intelligence; Statistics; Database Technology; Other Disciplines • HCI • High Performance Computing • Software Engineering]
  43. 43. Database technology • OLTP → OLAP → OLAM • Data Warehouses • Subject-oriented, integrated, time-variant, non-volatile • Excellent starting point for data mining • Data Marts: specialised, smaller data store • OLAP: drill-down, roll-up, slice-n-dice, data cubes
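The OLAP operations named on this slide can be illustrated with a small sketch. It is only a toy under stated assumptions: the `SALES` fact table, its three dimensions and the helper names `roll_up` and `slice_rows` are all hypothetical and do not come from the slides, and a real OLAP engine would precompute and cache such aggregates rather than scan rows.

```python
from collections import defaultdict

# Hypothetical fact table: (year, region, product, sales). Values are made up.
SALES = [
    (2012, "NSW", "DVD", 120),
    (2012, "NSW", "MP3 player", 80),
    (2012, "VIC", "DVD", 60),
    (2013, "NSW", "DVD", 90),
    (2013, "VIC", "MP3 player", 150),
]
DIMENSIONS = ("year", "region", "product")


def roll_up(rows, keep):
    """Sum sales over every dimension that is not named in `keep`."""
    idx = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)


def slice_rows(rows, dimension, value):
    """Keep only the rows where one dimension has a fixed value."""
    i = DIMENSIONS.index(dimension)
    return [r for r in rows if r[i] == value]


# Roll-up: coarse summary by year only.
by_year = roll_up(SALES, ["year"])
# Drill-down: the same totals split one level finer, by year and region.
by_year_region = roll_up(SALES, ["year", "region"])
# Slice: fix region = NSW, then summarise by product.
nsw_by_product = roll_up(slice_rows(SALES, "region", "NSW"), ["product"])

print(by_year)
print(by_year_region)
print(nsw_by_product)
```

Each query here states exactly what is wanted, which is the "emphasis on query" side of the OLAP versus data mining contrast.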
  44. 44. OLAP vs Data Mining • OLAP (On-Line Analytical Processing) • Emphasis on query • Generally know what you want to find • Expressible in SQL • Drill-down, data cubes • Data Mining • Emphasis on exploration • General idea of target but not how to find it • Let the machine drive the exploration
  45. 45. Statistics • Data, Counting, Probabilities, Hypothesis Testing • Correlation and regression analyses • Exploratory data analysis • Predictive models • CART: Classification And Regression Trees • MARS: Multivariate Adaptive Regression Splines • TreeNet • Random Forest • Important foundations for data mining and knowledge discovery • Ensemble methods • Computational requirements → Sampling
  46. 46. Artificial Intelligence (AI) • Brings to data analytics • The inductive approach (machine learning) - the design cycle for predictive modeling • Knowledge representation • Inference • Generalisation: everyone who drank beer in Sydney in 1900 is now dead. • Inference: Therefore, beer is fatal. • Warning: it’s easy to get into a similar situation in data analytics! • Uses Data Analytics • e.g. as supporting components in multi-agent systems. • e.g. in multi-agent electronic markets: negotiation agents request information about their opponents & text mining bots deliver that kind of information.
  47. 47. Artificial Intelligence (AI) • The design cycle for predictive modeling: Collect data → Select features → Select model type → “Train” classifier → Evaluate classifier • Issues: • Algorithms developed for toy datasets (< few hundred points) • Prior knowledge (e.g. bias) • Model deviation from true model • Sampling distributions • Computational complexity
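The five steps of the design cycle on this slide can be walked through in a few lines. This is a minimal sketch, not the method used in the talk: the `RECORDS` data are invented, the nearest-centroid model is just one simple choice of "model type", and every function name is hypothetical.

```python
# Step 1, collect data: hypothetical (age, annual km in thousands, had_accident) records.
RECORDS = [
    (19, 22, 1), (23, 18, 1), (21, 25, 1), (24, 20, 1),
    (45, 10, 0), (52, 8, 0), (38, 12, 0), (60, 6, 0),
    (22, 21, 1), (48, 9, 0),
]


def select_features(record):
    """Step 2, select features: keep age and distance, drop the label."""
    age, km, _label = record
    return (age, km)


def train(rows):
    """Steps 3 and 4: model type is nearest centroid, so training
    computes one mean feature vector per class label."""
    sums = {}
    for r in rows:
        feats = select_features(r)
        total, n = sums.get(r[-1], ((0,) * len(feats), 0))
        sums[r[-1]] = (tuple(t + f for t, f in zip(total, feats)), n + 1)
    return {label: tuple(t / n for t in total) for label, (total, n) in sums.items()}


def classify(model, feats):
    """Predict the label whose centroid is closest (squared Euclidean distance)."""
    return min(model, key=lambda lab: sum((a - b) ** 2 for a, b in zip(feats, model[lab])))


def evaluate(model, rows):
    """Step 5, evaluate: fraction of rows classified correctly."""
    hits = sum(classify(model, select_features(r)) == r[-1] for r in rows)
    return hits / len(rows)


# Hold out the last two records so evaluation uses data the model never saw.
train_rows, test_rows = RECORDS[:8], RECORDS[8:]
model = train(train_rows)
accuracy = evaluate(model, test_rows)
print(model, accuracy)
```

A perfect score on two held-out rows says very little, which echoes the slide's caution about algorithms developed on toy datasets.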
  48. 48. Visualisation • Deals with visual presentation of the data. • “A picture is worth a thousand words” - true? • Taps into human strengths • In Data Analytics • Understanding data • Visualising the process • Visualising and communicating the results
  49. 49. Overview of approaches
  50. 50. Data Analytics: Techniques (unsupervised) • Association analysis (correlation and causality) • Identify attribute-value conditions that frequently occur in the data • Examples: • age(P,“20..29”) ^ income(P,“20..29K”) → buys(P,“DVDs”) [support = 2%, confidence = 60%] • contains(T,“MP3 player”) → contains(T,“sound processing software”) [1%, 75%] • Support: fraction of the data in which both sides of the rule hold (the ‘attribute’ condition and the ‘value’). • Confidence: fraction of the data matching the ‘attribute’ condition for which the rule holds (i.e. where attribute → value).
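Support and confidence for a rule such as contains(T, “MP3 player”) → contains(T, “sound processing software”) can be computed directly from a list of transactions. The sketch below assumes each transaction is a set of item names; the `BASKETS` data and the function names are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction
    that also contain the consequent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the rule never applies, so report zero confidence
    return support(transactions, set(antecedent) | set(consequent)) / base


# Hypothetical shopping baskets (made-up data).
BASKETS = [
    {"MP3 player", "sound processing software"},
    {"MP3 player", "sound processing software", "DVD"},
    {"MP3 player", "DVD"},
    {"MP3 player", "sound processing software"},
    {"DVD"},
    {"DVD", "headphones"},
    {"headphones"},
    {"MP3 player", "sound processing software", "headphones"},
]

rule_support = support(BASKETS, {"MP3 player", "sound processing software"})
rule_confidence = confidence(BASKETS, {"MP3 player"}, {"sound processing software"})
print(rule_support, rule_confidence)
```

In this toy data 4 of the 8 baskets contain both items (support 0.5), and 4 of the 5 baskets with an MP3 player also contain the software (confidence 0.8).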
  51. 51. Data Analytics: Techniques (unsupervised) • Clustering (cluster analysis) • Identify groups within data where data points in the group are similar to one another but different to those in other groups. • Identify groups within data that maximise intraclass similarity and minimise interclass similarity. • Examples: • cluster crime locations based on characteristics of the crimes. • cluster students based on their marks in assignments for all the core subjects of their degree. • Building models from unlabelled data: unsupervised learning
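One common way to find such groups is k-means (Lloyd's algorithm), sketched below for the student-marks example. The slide does not name an algorithm, so this is an illustrative choice; the `MARKS` data, the starting centroids and the function names are assumptions, and the starting centroids are fixed by hand to keep the run deterministic.

```python
def assign(points, centroids):
    """Put each point into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        clusters[dists.index(min(dists))].append(p)
    return clusters


def kmeans(points, centroids, iterations=10):
    """Alternate assignment and centroid update for a fixed number of rounds."""
    for _ in range(iterations):
        clusters = assign(points, centroids)
        centroids = [
            tuple(sum(dim) / len(members) for dim in zip(*members)) if members else c
            for members, c in zip(clusters, centroids)  # empty cluster keeps its centroid
        ]
    return centroids, assign(points, centroids)


# Hypothetical (assignment 1, assignment 2) marks for six students.
MARKS = [(45, 50), (50, 48), (52, 55), (85, 90), (88, 84), (91, 87)]

centroids, clusters = kmeans(MARKS, centroids=[(45, 50), (91, 87)])
print(centroids)
print(clusters)
```

No labels are supplied: the two groups emerge only from how close the marks are to one another, which is what makes this unsupervised.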
  52. 52. Data Analytics: Techniques (supervised) • Classification and Prediction • Using historical data, find a model that describes and distinguishes data classes or concepts, so that the model can classify or predict the class of unknown entities. • Examples: • Build a model to classify countries based on climate or cars based on engine efficiency and on-road behaviour. • Build a model to predict whether customers are likely to purchase a download of a particular music file. • Build a model to predict the grade (Z, P, C, D, H) of a student based on students who previously did a subject. • Building models from labelled data: supervised learning.
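The grade-prediction example can be sketched with a k-nearest-neighbours classifier, one of the simplest supervised methods. The slide does not prescribe a technique, so this is only an illustration: the `HISTORY` records (two marks and a final grade per past student) are invented, and `predict_grade` is a hypothetical name.

```python
from collections import Counter

# Hypothetical labelled history: ((assignment mark, quiz mark), final grade).
HISTORY = [
    ((35, 40), "Z"), ((42, 38), "Z"),
    ((55, 52), "P"), ((58, 60), "P"),
    ((68, 66), "C"), ((70, 72), "C"),
    ((78, 80), "D"), ((80, 76), "D"),
    ((90, 92), "H"), ((94, 88), "H"),
]


def predict_grade(history, marks, k=3):
    """Majority vote among the k past students with the most similar marks."""
    by_distance = sorted(
        history,
        key=lambda rec: sum((a - b) ** 2 for a, b in zip(rec[0], marks)),
    )
    votes = Counter(grade for _, grade in by_distance[:k])
    return votes.most_common(1)[0][0]


print(predict_grade(HISTORY, (69, 70)))
print(predict_grade(HISTORY, (92, 90)))
print(predict_grade(HISTORY, (38, 40)))
```

The labels in `HISTORY` are what make this supervised: the model can only predict grades it has seen attached to earlier students.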
  53. 53. Data Analytics: Techniques • Outlier analysis • Identify entities that are different to other entities or to a model of data. • i.e. Find exceptions to the rule! • Example: odd patterns can be easily hidden among 10 million transactions, but may indicate fraud. • Statistics usually treats them as noise or exceptions. • Data analytics: rare and unusual events or items are generally interesting. • Time-series analysis • Identify similar patterns over time - trends, deviation, sequential patterns, periodicity analysis • Example: predicting trends in share prices
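A simple way to flag transactions that differ from the rest is a robust z-score based on the median and the median absolute deviation (MAD). This is one rule of thumb among many and is not taken from the slides; the `AMOUNTS` data, the function name and the 3.5 cut-off are assumptions made for illustration.

```python
from statistics import median


def find_outliers(values, threshold=3.5):
    """Return values whose modified z-score exceeds `threshold`.
    The score is 0.6745 * |x - median| / MAD, which is less distorted
    by the outliers themselves than a mean and standard deviation."""
    centre = median(values)
    mad = median(abs(v - centre) for v in values)
    if mad == 0:
        return []  # no spread to measure against, so flag nothing
    return [v for v in values if 0.6745 * abs(v - centre) / mad > threshold]


# Hypothetical transaction amounts with one suspicious entry.
AMOUNTS = [52, 48, 50, 47, 51, 49, 53, 50, 980, 46]

print(find_outliers(AMOUNTS))
```

A flagged value is only a candidate for investigation: whether it is fraud or a legitimate rare event still has to be decided with domain knowledge.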
  54. 54. [Diagram: spectrum from “Understandable by Humans” to “‘Understandable’ by Computers”, with labels Association Rules; Bayesian Networks; Decision Trees; Neural Networks]
  55. 55. Questions ...