From Mining Raw Data to Story Visualization

Presentation for TechCamp Cyprus

Published in: Technology
From Mining Raw Data to Story Visualization

  1. Storytelling through Data: From Mining Raw Data to Story Visualization
       Demetris Trihinas (trihinas.d@unic.ac.cy), Department of Computer Science, University of Nicosia, Cyprus
       Tutorial | TechCamp Cyprus | 12/12/2018
  2. Full-Time Faculty Member, University of Nicosia
       “Developing scalable and self-adaptive tools for data management, exploration and visualization”
       @dtrihinas | http://dtrihinas.info | https://ailab.unic.ac.cy/
  3. Today’s Talk: raw bits n’ bytes, structured data, knowledge, story
       State | Unemployment
       ------+-------------
       NY    | 1.72
       CA    | 2.43
       DC    | 3.54
       …
  4. Data Collection: “Tapping” into data sources
  5. Data Collection
       • The world’s data sources (e.g., social media, news outlets) often permit restricted access to their data.
       • Web Crawling: methodically scrape website content.
       • Application Programming Interfaces (APIs): “ASK for permission and GET access to resource(s)”.
       • So… turn the “tap” of a data source (a coding task) and store the data somewhere (data warehousing).
  6. Web Crawling
  7. Data Collection via API
       • Data Collection asks: “GET access to tweets.”
       • The reply: “You can have 1% for free with this access token.”
       • Data Collection then asks: “GET tweets from @dtrihinas or with #data_mining. Also, ask for #cyprus.”
       • The tweet sink: the Data Warehouse.
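The collect-and-store flow on this slide can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the endpoint URL, the access token, the query syntax and the sample tweets are hypothetical placeholders rather than the real Twitter API, no network request is actually sent, and an in-memory SQLite database stands in for the data warehouse.

```python
import sqlite3
from urllib.parse import urlencode

# Hypothetical endpoint and token, used only for illustration.
API_URL = "https://api.example.com/search"
ACCESS_TOKEN = "my-access-token"


def build_request(query):
    """Return the URL and headers for a GET request (nothing is sent)."""
    url = API_URL + "?" + urlencode({"q": query})
    headers = {"Authorization": "Bearer " + ACCESS_TOKEN}
    return url, headers


def store_tweets(conn, tweets):
    """Append collected tweets to the warehouse table (the "tweet sink")."""
    conn.execute("CREATE TABLE IF NOT EXISTS tweets (user TEXT, text TEXT)")
    conn.executemany("INSERT INTO tweets VALUES (?, ?)", tweets)
    conn.commit()


# Assumed query syntax: tweets from one account or with one hashtag.
url, headers = build_request("from:dtrihinas OR #data_mining")

# Made-up records standing in for the API response.
sample = [("dtrihinas", "Talking about #data_mining"),
          ("someone", "Hello from #cyprus")]
conn = sqlite3.connect(":memory:")
store_tweets(conn, sample)
stored = conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]
```

In a real collector the request would be sent with an HTTP client and the response parsed before being written to the sink; those steps depend on the specific API and are left out here.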
  8. Twitter Search: behind the scenes is the Twitter API
  10. Data Overview
       • Trawling through a couple of articles manually is easy.
       • But… what about thousands of news articles from multiple news outlets? Humans are slow; computers are fast!
       • Get the data, store it and then mine it!
  11. Big Data refers to datasets that are too large or complex for traditional data-processing application software to adequately deal with.
  12. Big Data is… a Volume Problem
  13. The Internet’s Digital Footprint
  14. Sensory Data
       • A Boeing 787 generates 40TB of data per hour in flight.
       • Google’s self-driving car generates 1GB of data per minute.
  15. The Internet of Things: 21 billion devices by 2020, accounting for 12% of the digital universe.
  16. That’s a lot of data!
  17. Big Data is… a Velocity Problem
  18. Batch Data
       • Assumes that the data is available when and if we want it (e.g., reading and parsing data from a file or database).
       • The application knows the dataset in advance and controls the input rate of the data.
       • Example: the Application fetches data from the Database and counts events by color: <red, 3> <yellow, 1> <blue, 2> <green, 2>.
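The "count events by color" example on this slide can be sketched in a few lines. The event list is made up for illustration; in a real batch job it would be read in full from a file or database before counting starts.

```python
from collections import Counter

# Made-up batch of events; the whole dataset is known before processing begins.
events = ["red", "blue", "red", "green", "yellow", "blue", "green", "red"]

# One pass over the complete batch gives the final answer.
counts = Counter(events)
for color, n in counts.items():
    print(f"<{color}, {n}>")
```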
  19. Data Streams
       • Unbounded data -> the volume of the data is overwhelming.
       • Conceptually an infinite sequence of data items.
       • Push model -> data arrives at high velocity and at different rates.
       • Potentially multiple sources pushing data to the application at different rates (the data distribution changes over time).
       • [Diagram: sources src1, src2 and src3 push data to the Application; plot of input rate over time t]
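In contrast to a batch, a stream cannot be read to the end before answering. A minimal sketch, assuming a single simulated source (the `event_stream` generator is a hypothetical stand-in for real sources pushing data): the counts are updated on every arrival and a current answer is available at any time.

```python
from collections import Counter
from itertools import islice
import random


def event_stream(seed=0):
    """Conceptually infinite source: yields one color at a time, forever."""
    rng = random.Random(seed)
    while True:
        yield rng.choice(["red", "yellow", "blue", "green"])


def running_counts(stream):
    """Update the counts per item and yield a snapshot after every arrival."""
    counts = Counter()
    for color in stream:
        counts[color] += 1
        yield dict(counts)


# The stream never ends, so we only look at the first 100 snapshots.
snapshots = list(islice(running_counts(event_stream()), 100))
latest = snapshots[-1]
```

The application here does not control the input rate; it only reacts to whatever arrives, which is the key difference from the batch case.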
  20. US Presidential Elections 2016: per-minute emotions (happiness, anger) for Clinton and Trump during the first debate, at 200K tweets/min. https://qz.com/810092
  21. Big Data is… a Value Problem
  22. Data Mining: from bits and bytes to knowledge
  23. Data Warehousing
       • Data warehousing provides data storage and management capabilities.
       • Memory and storage have never been cheaper: 1MB today is 10 times cheaper than 5 years ago!
  24. Marketing Mantra
       • Collect whatever data you can, whenever and wherever possible.
       • The expectation is that collected data will have value either for the purpose collected or for a purpose not yet envisioned.
  25. Data Mining
       • Data is useless unless you can convert it to structured information and ultimately into knowledge.
       • So… data mining provides you with the intelligence to convert data into knowledge.
  26. Confluence of Multiple Disciplines
  27. “We are drowning in data but starved for knowledge…” John Naisbitt, 1982
  28. What is NOT Data Mining
       • Any question you can ask and get an immediate and concrete answer to from a database.
       • How many sofa models does IKEA currently have in stock?
       • How many sofas did IKEA sell in Sweden last month?
       • Which IKEA customers bought a sofa worth more than 500 euros this year?
  29. Data Mining Techniques
       • Classification
       • Clustering
       • Pattern Discovery
       • Associations
       • Regression
       • Outlier Detection
  30. Classification
       • Develop models (or functions) that assign data items, each described by a collection of attributes, to classes.
       • “Give a label to your data!”
       • Should the IKEA sofa model S be added to this month’s discount items (yes, no)?
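To make "give a label to your data" concrete, here is a minimal sketch of one simple classifier, a nearest-neighbor rule. It is not the method used in the talk; the two attributes (price and units sold) and all the numbers are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical labeled history: (price in euros, units sold last month) -> discounted?
training = [
    ((300, 40), "no"),
    ((350, 35), "no"),
    ((800, 5), "yes"),
    ((750, 8), "yes"),
]


def classify(item):
    """Label an item with the class of its nearest labeled neighbor."""
    nearest = min(training, key=lambda pair: math.dist(pair[0], item))
    return nearest[1]


label_s = classify((780, 6))   # an expensive, slow-selling sofa
label_t = classify((320, 38))  # a cheap, fast-selling sofa
```

In practice the attributes would usually be rescaled first, since here the price dominates the distance simply because its numbers are larger.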
  31. Predicting a Person’s Creditworthiness: attribute values are mapped to classes {Yes, No}
  32. Google News
       • Classify by type
       • Classify by country
  33. Clustering
       • Develop models that group data items so that items in the same group are similar to each other and dissimilar to items in other groups.
       • Group IKEA customers based on how much disposable income they have, or how often they tend to shop at a particular IKEA branch.
       • Similar to classification but with unknown classes.
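As one possible illustration of grouping by similarity, the sketch below runs a very small k-means on a single attribute. The incomes are hypothetical, the choice of k-means is an assumption (the slides do not name an algorithm), and the function expects at least two clusters.

```python
def kmeans_1d(values, k=2, iterations=10):
    """Very small k-means for one-dimensional data (requires k >= 2)."""
    # Deterministic start: spread the initial centers over the sorted values.
    ordered = sorted(values)
    centers = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: each value joins its closest center.
        groups = [[] for _ in range(k)]
        for v in values:
            idx = min(range(k), key=lambda i: abs(v - centers[i]))
            groups[idx].append(v)
        # Update step: move each center to the mean of its group.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


# Hypothetical monthly disposable income of six customers, in euros.
incomes = [500, 650, 700, 2900, 3100, 3300]
centers, groups = kmeans_1d(incomes)
```

No labels are given to the algorithm; the two groups emerge from the data alone, which is the "unknown classes" point made on the slide.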
  34. Customer Demographics: customers of this group usually buy sofa S, so let’s send customer X an email with a discount for S.
  35. Google News: similar articles clustered together
  36. Google News
       • Article clustering based on similarity
       • Cluster classification with automated label generation
  37. Pattern Discovery
       • One of the most basic techniques in data mining is learning to recognize patterns in the data.
       • This usually means recognizing some aberration in your data that happens at regular intervals, or an ebb and flow of a certain variable over time.
       • Examples: sales of a certain product spike just before the holidays, or warmer weather drives more people to your website.
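A recurring spike such as the pre-holiday one can be spotted with very simple means. The sketch below, using made-up monthly sales for two years, averages the sales per calendar month and flags months that sit well above the overall level; the 1.5x threshold is an arbitrary assumption.

```python
from collections import defaultdict

# Hypothetical monthly unit sales for two years (January..December, twice).
sales = [20, 22, 21, 23, 22, 24, 23, 22, 25, 30, 45, 60,
         21, 23, 22, 24, 23, 25, 24, 23, 26, 32, 48, 63]

by_month = defaultdict(list)
for i, units in enumerate(sales):
    by_month[i % 12 + 1].append(units)   # 1 = January, ..., 12 = December

monthly_avg = {m: sum(v) / len(v) for m, v in by_month.items()}
overall_avg = sum(sales) / len(sales)

# Months whose average is well above the overall level recur every year.
peak_months = [m for m, avg in monthly_avg.items() if avg > 1.5 * overall_avg]
```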
  38. IKEA Sofa Sales Forecast ???
  39. Association
       • Association is related to tracking patterns, but is more specific to dependently linked attributes.
       • A model is developed to look for specific events or attributes that are highly correlated with another event or attribute.
       • Example: when your customers buy a specific item, they also often buy a second, related item.
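The "customers who buy one item often buy a related one" idea is commonly quantified with support and confidence. A minimal sketch, with hypothetical shopping baskets:

```python
# Hypothetical shopping baskets.
baskets = [
    {"sofa", "cushion", "lamp"},
    {"sofa", "cushion"},
    {"sofa", "rug"},
    {"lamp", "rug"},
    {"sofa", "cushion", "rug"},
]


def support(items):
    """Fraction of baskets that contain all the given items."""
    return sum(items <= b for b in baskets) / len(baskets)


def confidence(antecedent, consequent):
    """Of the baskets with the antecedent, the fraction that also have the consequent."""
    return support(antecedent | consequent) / support(antecedent)


sup = support({"sofa", "cushion"})        # 3 of the 5 baskets
conf = confidence({"sofa"}, {"cushion"})  # 3 of the 4 sofa baskets
```

A high confidence for "sofa, then cushion" is the kind of evidence behind a "people also bought" suggestion.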
  40. People Also…
  42. Outlier Detection
       • Particular data points do not comply with the general behavior (pattern) of the rest of the data.
       • We call them outliers.
       • Credit card fraud from irregular buying patterns.
       • Patient health from irregular symptoms.
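One simple way to flag points that do not follow the rest of the data is a z-score rule. The sketch below uses made-up card purchases; the threshold of 2 standard deviations is an assumption, and more robust rules exist for data with several extreme values.

```python
import statistics

# Hypothetical card purchases in euros; one of them looks irregular.
purchases = [25, 30, 28, 32, 27, 29, 31, 26, 950]

mean = statistics.mean(purchases)
sd = statistics.stdev(purchases)

# Flag values that lie more than 2 standard deviations from the mean.
outliers = [p for p in purchases if abs(p - mean) / sd > 2]
```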
  43. Regression
       • Used primarily as a form of modeling to estimate the value of a certain variable, given the values of other variables.
       • Project a certain price, based on other factors like availability, consumer demand, and competition.
       • How much should we sell the new IKEA sofa for?
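A minimal sketch of projecting a price from one other factor, using ordinary least squares on a single variable. The choice of factor (number of seats) and the prices are hypothetical; a real model would use several factors such as availability, demand and competition.

```python
# Hypothetical history: number of seats of existing sofas and their prices in euros.
seats = [1, 2, 3, 4]
prices = [200, 290, 410, 500]

mean_x = sum(seats) / len(seats)
mean_y = sum(prices) / len(prices)

# Least-squares slope and intercept for the line price = intercept + slope * seats.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(seats, prices))
         / sum((x - mean_x) ** 2 for x in seats))
intercept = mean_y - slope * mean_x


def predict(n_seats):
    """Projected price for a sofa with the given number of seats."""
    return intercept + slope * n_seats


new_sofa_price = predict(5)
```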
  44. House Price Projection
  45. Beware… Data mining is NOT about fitting the model to the answer YOU want!
  46. Correlation
       • Correlation is a statistical technique that tells us how strongly pairs of variables are related.
       • But… correlation does not tell us the why and how behind the relationship.
       • So… correlation just says that a relationship exists.
  47. Ice-Cream and Sunglass Sales: as ice-cream sales increase, so do sunglass sales.
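How strongly two variables move together is typically measured with the Pearson correlation coefficient, which ranges from -1 to 1. A minimal sketch with made-up weekly sales figures:

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical weekly sales of ice cream and sunglasses.
ice_cream = [10, 20, 30, 40, 50]
sunglasses = [12, 25, 29, 41, 55]

r = pearson(ice_cream, sunglasses)
```

A value of `r` close to 1 says the two series rise and fall together; it says nothing about why they do.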
  48. Causation
       • Causation denotes that any change in the value of one variable will cause a change in the value of another variable.
       • This means that one variable makes the other happen.
  49. Exercise and Calories
       • When a person is exercising, the amount of calories burned increases every minute.
       • The former (exercise) is causing the latter (calories burned) to happen.
  50. Ice-Cream and Homicides in New York
       • A study in the 90’s showed that ice-cream sales are the cause of homicides in New York.
       • As the sales of ice-cream rise and fall, so does the number of homicides -> correlation.
       • But… does the consumption of ice-cream actually cause the death of people in NY? https://www.nytimes.com/2009/06/19/nyregion/19murder.html
  51. Correlation Does NOT Imply Causation
       • No… the two things are correlated.
       • But this does NOT mean that one causes the other.
       • Correlation is what we fall back on when we can’t see under the covers: the less information we have, the more we are forced to rely on observed correlations.
  52. Confidence Intervals
       • How many football games do US citizens go to?
       • To get an exact answer (100% correct), you must ask everyone in the US (more than 320M people) -> not practical!
       • Use a random sample, meaning ask (many) fewer people -> but we won’t be 100% correct.
  53. Confidence Intervals
       • What we try to achieve: get an interval that we are confident the actual answer lies within.
       • “I am 95% confident that the number of football games people in the U.S. go to lies between 10 and 12.”
       • So basically, CIs describe the level of uncertainty associated with a sample estimate.
  54. Random Sample Selection
       • Random… means random!
       • You cannot just select 1000 people from one city; the sample won’t represent the whole US.
       • You cannot just send FB messages to 1000 random people; you will get a representation of US FB users, and of course not all US citizens use FB.
  55. Random Sample Distribution
       • Without going into a lot of statistics, a perfectly random sample distribution should look like this: [Figure: distribution plot]
       • Assuming that you actually selected a random sample.
  56. Random Sample Distribution
       • Without going into a lot of statistics, a perfectly random sample distribution should look like this: [Figure: distribution plot with the central 95% marked]
  57. Confidence Intervals
       • Random sample: 1000 US citizens.
       • The average is 11 games and the standard deviation (SD) is 5 games.
       • Let’s say we want a 95% confidence interval.
       • With some statistics we get an interval of 1 game around the average of 11 for the 95% CI: we are 95% confident that the average US citizen watches between 10 and 12 games a year.
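The "some statistics" step can be sketched with the usual normal approximation, where the margin of error is z times the standard deviation divided by the square root of the sample size. This is an assumption about the method, which the slide does not spell out. With the slide's numbers the margin comes out at roughly 0.31 games, so the 10 to 12 range quoted above is a generous rounding that comfortably contains the computed interval.

```python
import math

n = 1000        # sample size
mean = 11.0     # average games per person in the sample
sd = 5.0        # sample standard deviation
z = 1.96        # z-value for a 95% confidence level (normal approximation)

# Margin of error: z * sd / sqrt(n)
margin = z * sd / math.sqrt(n)
low, high = mean - margin, mean + margin
print(f"95% CI: {low:.2f} to {high:.2f} games")
```

Because the margin shrinks with the square root of the sample size, asking four times as many people only halves the uncertainty.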
  58. Data Visualization: visually communicate analysis results
  59. “A picture is worth a thousand words…” Chinese proverb
  60. Unemployment Data in the US
  61. Unemployment Data in the US
  62. Seismic Activity in California
  63. Seismic Activity in California
  64. Why Visualize Your Results? Large volumes of data are easier to interpret because the human eye can immediately focus on the main information.
  66. Interactiveness: we can focus even more on the information that we care about and perform “real-time” queries on the data.
  67. Big Data Challenges: the human eye can no longer find the information that we care about…
  68. Big Data Challenges: data navigation through interactiveness either does not work or is not “real-time” anymore…
  69. Data Science Process
       • Stages: Data Collection, Data Warehousing, Data Mining, Data Visualization
       • Artifacts: Raw Data, Struct Info, Insights, Story
  70. Data Science Process
       • Stages: Data Collection, Data Warehousing, Data Preprocessing, Data Mining, Data Visualization
       • Artifacts: Raw Data, Struct Info, Preprocessed Info, Insights, Story
  71. Data Preprocessing
       • Data mining, especially on big data, is an expensive process in both compute and time.
       • Data preprocessing can significantly increase performance if performed before mining.
       • Data Cleaning
       • Data Reduction
       • Data Transformation
       • Preprocessing can take around 60% of your effort, but it is totally worth it!
  72. That’s a lot of data, but… how much is actually useful?
  73. Data Cleaning
       • You would assume that data stored in a database is ready for analysis, but… “dirty data”.
       • Removing duplicate, erroneous or NA data.
       • Statistically imputing missing data.
       Erroneous value (age MUST be a number):
       id   | name | age   | score
       1000 | Anna | 42    | 84.7
       1001 | John | fifty | 89.5
       Missing value:
       id   | name | age | score
       1000 | Anna | 42  | 84.7
       1001 | John | 50  | 89.5
       1002 | Mat  | 29  |
       Mat was sick on test day but is a C- average student, so let’s assume he would have scored 72.0.
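The two cleaning steps on this slide can be sketched as follows. The rows combine the two tables above; note that the imputation here uses the plain mean of the known scores, which is a simpler assumption than the slide's domain-knowledge guess of 72.0 for Mat.

```python
rows = [
    {"id": 1000, "name": "Anna", "age": "42", "score": 84.7},
    {"id": 1001, "name": "John", "age": "fifty", "score": 89.5},
    {"id": 1002, "name": "Mat", "age": "29", "score": None},
]


def clean_age(value):
    """Return the age as an int, or None when it is not a number."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


# Simple mean imputation for missing scores (an assumption, see the note above).
known = [r["score"] for r in rows if r["score"] is not None]
fallback = sum(known) / len(known)

cleaned = [
    {**r,
     "age": clean_age(r["age"]),
     "score": r["score"] if r["score"] is not None else round(fallback, 1)}
    for r in rows
]
```

Whether an invalid age should become a missing value, be corrected by hand, or cause the row to be dropped depends on the analysis; here it is only marked as missing.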
  74. Data Transformation
       • Reshape, sort and combine data into suitable format(s) for analysis.
       Input table 1:
       id   | name | age | score
       1000 | Anna | 42  | 84.7
       1001 | John | 50  | 89.7
       1002 | Mat  | 29  | 72.0
       Input table 2:
       id   | name | Eats Breakfast
       1000 | Anna | Yes
       1001 | John | yes
       1002 | Mat  | no
       Combined and sorted by score:
       id   | name | age | score | Breakfast
       1001 | John | 50  | 90    | 1
       1000 | Anna | 42  | 85    | 1
       1002 | Mat  | 29  | 72    | 0
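The transformation shown in the tables (combine the two inputs, round the scores, encode yes/no as 1/0, sort by score) can be sketched in plain Python. The field names are taken from the slide; the lower-case `breakfast` key is an illustrative choice.

```python
scores = [
    {"id": 1000, "name": "Anna", "age": 42, "score": 84.7},
    {"id": 1001, "name": "John", "age": 50, "score": 89.7},
    {"id": 1002, "name": "Mat", "age": 29, "score": 72.0},
]
breakfast = {1000: "Yes", 1001: "yes", 1002: "no"}

# Combine the two inputs on id, round the score, and encode yes/no as 1/0.
combined = [
    {**row,
     "score": round(row["score"]),
     "breakfast": 1 if breakfast[row["id"]].lower() == "yes" else 0}
    for row in scores
]

# Sort by score, highest first.
combined.sort(key=lambda r: r["score"], reverse=True)
```

Lower-casing the answer before comparing handles the inconsistent "Yes" and "yes" spellings in the second table.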
  75. Data Reduction
       • Filter out data that is not needed for the analysis, to consume fewer resources and less time.
            Example: the analysis will be performed on US citizens, so remove the others.
       • Use only a sample of the data to get an approximate, but quick, answer.
            Example: create a random sample of 1K rows instead of using 1M rows.
       • Reduce the dimensionality of the problem.
            Example: the field age is not relevant to the analysis.
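The three reduction steps can be sketched on a synthetic table. The data is randomly generated for illustration, the country codes are arbitrary, and 10,000 rows are used instead of the slide's 1M to keep the sketch quick to run.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical raw table: 10,000 rows with a country, an age and a score.
rows = [
    {"country": rng.choice(["US", "CY", "SE"]),
     "age": rng.randint(18, 80),
     "score": rng.uniform(0, 100)}
    for _ in range(10_000)
]

# 1. Filtering: keep only the rows the analysis is about.
us_rows = [r for r in rows if r["country"] == "US"]

# 2. Sampling: work on a random subset for an approximate but quick answer.
sample = rng.sample(us_rows, 1_000)

# 3. Dimensionality reduction: drop a field that is not relevant.
reduced = [{k: v for k, v in r.items() if k != "age"} for r in sample]
```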
  76. Kepler.gl: interactive maps
  77. Kepler.gl: dimensionality reduction through “layering”.
  78. Kepler.gl: filter data through “real-time” queries.
  79. Data Visualization: putting everything together!
  81. Storytelling through Data: From Mining Raw Data to Story Visualization
       Demetris Trihinas (trihinas.d@unic.ac.cy), Department of Computer Science, University of Nicosia, Cyprus
