Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Myths and challenges in knowledge extraction and analysis from human-generated content


Published on

For centuries, science (in German "Wissenschaft") has aimed to create ("schaften") new knowledge ("Wissen") from the observation of physical phenomena, their modelling, and empirical validation. Recently, a new source of knowledge has emerged: not (only) the physical world any more, but the virtual world, namely the Web with its ever-growing stream of data materialized in the form of social network chattering, content produced on demand by crowds of people, messages exchanged among interlinked devices in the Internet of Things. The knowledge we may find there can be dispersed, informal, contradicting, unsubstantiated and ephemeral today, while already tomorrow it may be commonly accepted. The challenge is once again to capture and create knowledge that is new, has not been formalized yet in existing knowledge bases, and is buried inside a big, moving target (the live stream of online data). The myth is that existing tools (spanning fields like semantic web, machine learning, statistics, NLP, and so on) suffice to the objective. While this may still be far from true, some existing approaches are actually addressing the problem and provide preliminary insights into the possibilities that successful attempts may lead to.
The talk explores the mixed realistic-utopian domain of knowledge extraction and reports on some tools and cases where digital and physical world have brought together for better understanding our society.

Published in: Data & Analytics
  • $25 per hour jobs on Facebook, now hiring! ●●●
    Are you sure you want to  Yes  No
    Your message goes here

Myths and challenges in knowledge extraction and analysis from human-generated content

  1. 1. Myths and Challenges in Knowledge Extraction and Analysis from Human-generated Content Marco Brambilla @marcobrambi
  2. 2. Knowledge, Behaviour and Feature Extraction with Big Data Science
  3. 3. Your Data, My Problem
  4. 4. Problem 1. The Complexity of Knowledge
  5. 5. There are more things In heaven and earth, Horatio, Than are dreamt of in your philosophy. Shakespeare (Hamlet Act 1, scene 5)
  6. 6. The Answer to the Great Question... Of Life, the Universe and Everything Data Information Knowledge WisdomContext independence Understanding Understanding relations Understanding patterns Understanding principles
  7. 7. Formalizing evolving knowledge is hard Only high frequency emerges The long tail challenge
  8. 8. The Evolving Knowledge known social factoid a c ¬c bpotentially emerging potentially decaying actual and solid d
  9. 9. Information and Knowledge Extraction
  10. 10. Heaven and Earth Are they so different?
  11. 11. The Digital “Heaven” Vs. The “physical” Earth
  12. 12. Heaven and Earth How to peer into the world through an effective window? INGREDIENTS Social media, IoT, … – the data Domain experts – the context
  13. 13. 13 [photo:] The digital reflection of our life is sharpening
  14. 14. 14 Data source Around for Frequency Delay Census data 100s year years months Newspaper 100s year days 1 day Weather sensors 10s year hours/minutes hours/minutes TV news 10s years hours minutes Traffic sensors years 15 minutes minutes Call Data Records years 15 minutes hours Social media years seconds seconds IoT recently milliseconds milliseconds Source:EmanueleDellaValle The data evolution
  15. 15. Data piles up without easing decision making I have to decide: A or B? Why not C? What if D? Source:EmanueleDellaValle
  16. 16. But, we would like to … fusing all those data sources making sense of the fused information Definitely E! Source:EmanueleDellaValle
  17. 17. The MacroScope Joël de Rosnay, The Macroscope, 1979
  18. 18. Problem 2. Cognitive Bias (of the observer)
  19. 19. the streetlamp effect The bias of the observer
  20. 20. Strategy and Inaccuracy
  21. 21. Use Case: City
  22. 22. Model of social media and reality sensing
  23. 23. Model of social media and reality sensing
  24. 24. Model of social media and reality sensing
  25. 25. Problem 3. Data Quality
  26. 26. Data Quality Issue Gartner Report In 2017, 33% of the largest global companies will experience an information crisis due to their inability to adequately value, govern and trust their enterprise information. If you torture the data long enough, it will confess to anything – Darrell Huff
  27. 27. The Vicious Cycle of Bad Data Bad Data Incorrect Analysis Invalid Insights Wrong Decisions Poor Outcome
  28. 28. Conventional Definition of Data Quality • Accuracy • The data was recorded correctly. • Completeness • All relevant data was recorded. • Uniqueness • Entities are recorded once. • Timeliness • The data is kept up to date (and time consistency is granted). • Consistency • The data agrees with itself.
  29. 29. Why is Data “Dirty” ? • Dummy Values, • Absence of Data, • Multipurpose Fields, • Cryptic Data, • Contradicting Data, • Shared Field Usage, • Inappropriate Use of Fields, • Violation of Business Rules, • Reused Primary Keys, • Non-Unique Identifiers, • Data Integration Problems
  30. 30. Data Wrangling a.k.a. • Data Preprocessing • Data Preparation • Data Cleansing • Data Scrubbing • Data Munging • Data Transformation • Data Fold, Spindle, Mutilate… • (good old) ETL
  31. 31. Foursquare • Check-ins explicitly performed in venues all around the world • Data set: Geo-localized Foursquare venues, collected through a query every 50m with radius >50m over: • Milan area: 20km x 17,5km • Some numbers • Total n° of venues: 90K (dirty) • Total n° of valid venues: 43K
  32. 32. Isn’t data science sexy?
  33. 33. College & University 0 200 400 600 800 1000 1200 1400 weekend we eke nd we eke nd we eke nd we eke nd No access No access No access
  34. 34. Event 0 10 20 30 40 50 60 70 wee kend wee kend wee kend wee kend wee kend eve nts Eve nts
  35. 35. The skeptic approach
  36. 36. The Pragmatic Approach
  37. 37. The (pseudo) Practitioner Approach
  38. 38. Problem 4. Content Bias (of the source)
  39. 39. Data vs. Question • Are they aligned? • The usual problem of representativeness of the sample… • At a different scale • With much less control • Example: the different pictures of the city
  40. 40. Foursquare Checkins Copyright © Milano-Hub project @Politecnico di Milano
  41. 41. Flickr Copyright © Milano-Hub project @Politecnico di Milano
  42. 42. Instagram Copyright © Milano-Hub project @Politecnico di Milano
  43. 43. Instagram Copyright © Milano-Hub project @Politecnico di Milano
  44. 44. 44 Cities into cities, by language
  45. 45. Bias of the Source • Technology • Audience / Users / Adopters • Behaviour
  46. 46. Problem 5. Granularity (time, space, …)
  47. 47. Example. Space Granularity: the Grid • Regular squared grid • Irregular grid with official business-driven meaning • Irregular grid with data-driven definition 12/4
  48. 48. Cities into cities
  49. 49. But other dimensions matter too • Time • Categories • Economical value • …
  50. 50. Problem 6. Availability & Access
  51. 51. Google Places Only in the UI (scraping) Via API
  52. 52. Problem 7. Consistency
  53. 53. Bringing Things Together Space-text similarity btw. Google - Foursquare
  54. 54. Problem 8. Size
  55. 55. Data is big! 1 GigaByte of Data (109) or, strictly, 230 bytes
  56. 56. 1 ZettaByte of Data one sextillion (1021) or, strictly, 270 bytes
  57. 57. The Fashion Week in Milano #MFW
  58. 58. • Mobile Phone Calls & Msgs: 5 to 10 MLN per day in a city like Milan • Trackable user events (incl. data traffic): 1,000 per user per day Mobile Phone Data
  59. 59. IoT Sensors • People counters: 1 event per second (or less) • 86K+ events per day per sensor • Industrial machine sensors: 100 measurements per second
  60. 60. Human computation and crowdsourcing
  61. 61. … and now … Examples and Cases
  62. 62. Use Case #1: Fashion The Milano Fashion Week
  63. 63. Response of Social Media #MFW • MILANO FASHION WEEK #MFW • We have 2 signals: • The first coming from the social media (in this case we will talk about only Instagram) • The second derived from the official calendar events
  64. 64. Research Questions “Are live events still relevant? Can online visibility be described simply by how famous is the brand? Do space and time still matter? Can we predict how people behave in time/space within events?
  65. 65. Discover more about the #MFW case • behaviour-during-live-events-the-milano-fashion-week-mfw-case- www2017/ (INCLUDING SLIDES)
  66. 66. Use Case #2: Design The Milano Design Week & FuoriSalone
  67. 67. •Fuorisalone Official database • events/locations/itineraries • Fuorisalone Official App • GPS positions1 of the App users • Events inserted in the agenda on the App • Private social post (Facebook) of App users2 • SocialMedia Listener • Keyword-based public social post (Twitter/Instagram) • Semantic analysis • 1 when the App was running • 2 to use some App features the users had to perform a social login Data sources of the analysis
  68. 68. • Data elements are georeferenced and aggregate by citypixel (100 x 100 mt squares) • Merging multiple data sources makes it possible to infer information: • Which events attract more visitors? • Which areas have the larger presence of visitors? • Do people talk on the social networks about the events they are interested in? • Do people use social networks while visiting the events? • ... Fusing the data
  69. 69. Use case #3: Como smartcity
  70. 70. Approach City-scale: mobile telephone and (gross-grain geo-located) social media data Street/square: people counting & profiling IoT sensors Point of Interest: people counting sensor, WiFi log analysis, beacons and (fine grain geo- located) social media Descriptive, predictive, privacy-preserving and, when needed, real-time analysis of a variety of (fused) data sources
  71. 71. Integration Personalized information/offers, city loyalty cards, digital coupons, and polling Proximity detection via NFC or BLE/Beacons
  72. 72. Measuring People counting and profiling via Mobile Data 24.512 People present 41% 71% 63% 59% tourists citizens 29% female male 37% private business 10 20 30 40 50 60 70 age More people than usual
  73. 73. Measuring People counting via 3D camera
  74. 74. Dashboards Why people is there CrowdInsights
  75. 75. Dashboards Why people is there CrowdInsights
  76. 76. 7 1 6 2 3 4 5 7 Areas 1. Città murata 2. Lago sponda Viale Geno 3. Lago 4. Lago sponda di Villa Olmo 5. Zona industriale 6. Brunate 7. Business e università Phone data
  77. 77. Social
  78. 78.
  79. 79. Use Case #4: Knowledge Updater
  80. 80. Overview
  81. 81. Famous Emerging …
  82. 82. Knowledge Enrichment Setting HF Entity1 HF Entity5 HF Entity2 HF Entity4 HF Entity3 LF Entity1 ?? LF Entity2 LF Entity4 LF Entity3 ?? High Frequency Entities Low Frequency Entities ?? ?? ???? ?? Type1 Type11 Type2 Type111 Instances Types <<instanceof>> <<instanceof>> <<instanceof>> <<instanceof>> <<instanceof>> <<instanceof>> ?? ?? ?? ?? ?? Seed Entity Seed Type Type of interest Legend Expert inputs Enrichment problems Property2 Relations HF - LF entities Relations LF - LF entities Typing of LF entities Extraction of new LF entities Property1 ?? ?? ?? Finding attribute values
  83. 83. Emerging Knowledge Harvesting
  84. 84. Discover more knowledge-from-social-media-www2017/ (SLIDES INCLUDED)
  85. 85. Concluding.. Plenty of issues And also plenty of application scenarios where to benchmark ideas!
  86. 86. THANKS! QUESTIONS? Myths and Challenges in Extraction of Emerging Knowledge from Human-generated Content Marco Brambilla @marcobrambi