Will Artificial Intelligence help to further automate search?

Talk by Jan Scholtes at the VOGIN-IP-lezing, Amsterdam, 9 March 2017

Published in: Internet

  1. 1. Will Artificial Intelligence help to further automate search? 3 March 2017, Prof. dr. ir. Jan Scholtes
  2. 2. Of course there is very good search software… Search across all structured and unstructured information, and all combinations of the two
  3. 3. Regular Expressions • + matches the preceding element one or more times • {m} matches the preceding element exactly m times • {m,} matches the preceding element at least m times • {m,n} matches the preceding element at least m times but not more than n times • m ≤ n, with m, n ∈ ℕ₀ = {0, 1, 2, …} • The element can be a literal, literal range, escaped wildcard, ? wildcard, or number Examples: • [abcte]+ = (cab or cat or bat or bet or tab …) • appl[a-t]+ = (apple or apples or application or …) • 10+ = (10 or 100 or 1000 or 100000 ...) • [a-t]{0,10}dam = (amsterdam or dam or rotterdam or …) • [0-9]{3}-[0-9]{4} = (123-4567 or 435-1539 or …) • bo{1,}k{1,}* = (book or bookkeeper or Boké ...)
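The quantifier examples on this slide can be checked with any regular-expression engine. The following is a minimal sketch using Python's standard `re` module. The patterns and word lists come from the slide; the helper name `all_match` is an illustrative assumption. The last slide example (`bo{1,}k{1,}*`) is left out because its trailing `*` appears to be a search-tool wildcard rather than standard regex syntax.

```python
import re

# Patterns and sample matches taken from the slide.
EXAMPLES = {
    r"[abcte]+": ["cab", "cat", "bat", "bet", "tab"],
    r"appl[a-t]+": ["apple", "apples", "application"],
    r"10+": ["10", "100", "1000", "100000"],
    r"[a-t]{0,10}dam": ["amsterdam", "dam", "rotterdam"],
    r"[0-9]{3}-[0-9]{4}": ["123-4567", "435-1539"],
}


def all_match(pattern, words):
    """Return True if every word matches the pattern over its full length."""
    return all(re.fullmatch(pattern, word) is not None for word in words)


for pattern, words in EXAMPLES.items():
    print(pattern, all_match(pattern, words))
```

Using `re.fullmatch` rather than `re.search` makes the check strict: "apply" does not match `appl[a-t]+`, because "y" falls outside the range `a-t`.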
  4. 4. For the enthusiast: … Building a backtracking NFA. Matches: Mississippi, mission, missing
  5. 5. Still, the era of traditional search is more or less over • Too much data, too many hits, and no relevance ranking that always works best; • You never know exactly what you get and what you miss; • Too many (geographically dispersed) sources; • Too many languages; • All kinds of spelling variations; • More and more non-textual formats.
  6. 6. Artificial Intelligence can help us further
  7. 7. What is Artificial Intelligence? “State of the art”: • Intelligent search • Information detection and extraction • Classification of information • Representing knowledge • Transferring knowledge • Reasoning with knowledge • Machine learning
  8. 8. Examples of AI for improving search • Intelligent search • Intelligent analysis of the content of documents • Identifying and extracting relevant information • Classification of information • Learning and storing knowledge about specific topics • Machine translation • Audio and video search
  9. 9. Integrated machine translation
  10. 10. Phonetic Audio Search
  11. 11. What is text mining? Ursus Wehrli: http://www.kunstaufraeumen.ch/en
  12. 12. Information Extraction Hierarchy • Entities: the basic units that can be found in a text, for example people, companies, locations, products, medicines, and genes. • Attributes: the properties of the found entities, for example a job title, a person’s age and social security number, addresses of locations, quantities of products, car registration numbers, and the type of organisation. • Facts: relationships between entities, for example a contractual relationship between a company and a person. • Events: interesting events or activities that involve entities, such as “one person speaks to another person”, “a person travels to a location”, and “a company transfers money to another company”. • Concepts, Sentiments or Emotions: abstract entities such as problems, requests, sentiments, and emotions.
  13. 13. Example of information extraction
  14. 14. Searching for patterns instead of words: PERSON [visits | meets | lunches] PERSON; PERSON | COMPANY | ORGANIZATION [pays | wires | transfers] PERSON | COMPANY | ORGANIZATION • Search at a higher (semantic) level. • Automatic conjugation of verbs. • Automatic resolution of co-references and personal pronouns. • No need to maintain very long queries any more.
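To illustrate pattern search over entities, here is a minimal sketch. It assumes an upstream extractor has already labelled each token with a tag and a lemma; here those labels are filled in by hand. The `Token` layout, the function `find_pattern`, and the names in the sample sentence are hypothetical and not taken from the talk. Matching on the verb lemma is what lets "wired" and "met" satisfy patterns written with "wires" and "meets".

```python
from typing import List, Set, Tuple

# Each token is (surface text, lemma, tag). Tags and lemmas are assumed to
# come from an entity extractor; they are hand-written for this sketch.
Token = Tuple[str, str, str]


def find_pattern(tokens: List[Token], left_tags: Set[str],
                 verb_lemmas: Set[str], right_tags: Set[str]):
    """Return (left, verb, right) surface triples for ENTITY verb ENTITY."""
    hits = []
    for left, verb, right in zip(tokens, tokens[1:], tokens[2:]):
        if (left[2] in left_tags and verb[1] in verb_lemmas
                and right[2] in right_tags):
            hits.append((left[0], verb[0], right[0]))
    return hits


# Hypothetical, hand-tagged sentence.
sentence: List[Token] = [
    ("Acme Corp", "Acme Corp", "COMPANY"),
    ("wired", "wire", "VERB"),
    ("John Smith", "John Smith", "PERSON"),
    ("after", "after", "OTHER"),
    ("Mary Jones", "Mary Jones", "PERSON"),
    ("met", "meet", "VERB"),
    ("John Smith", "John Smith", "PERSON"),
]

PARTY = {"PERSON", "COMPANY", "ORGANIZATION"}
payments = find_pattern(sentence, PARTY, {"pay", "wire", "transfer"}, PARTY)
meetings = find_pattern(sentence, {"PERSON"}, {"visit", "meet", "lunch"}, {"PERSON"})
print(payments)
print(meetings)
```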
  16. 16. What are the main goals of document classification? • Automatically classify documents as relevant or not relevant. • Classify documents into various conceptual categories. • Maximize recall (> 80%). • Save search time. • Find relevant documents automatically without depending too much on the search skills of an end user. • Find things without knowing exactly what you are looking for.
  17. 17. How does this compare with other search techniques? • Machine Learning: Supervised Document Classification, Topic Modeling • Advanced Processing: OCR of bitmaps, Audio Search, Text Mining & Regular Expressions, Visual Classification • Advanced Search: Fuzzy & Wildcard Search, Quorum & Proximity Search, Ranking, Regular Expressions • Metadata Search: Document Properties, File Properties, Collection Properties • Standard Search: Boolean Search • Axis: recall from 0% to 100%, with the labels Rules Based TAR and Machine Learning
  18. 18. Which technologies are used? • Protocols supported: Random Start, Search Start (Continuous Active Learning) and Start with Topic Modeling, or a combination of all methods • Supervised machine learning algorithm: Support Vector Machines (SVM) • Classifier type: binary • Document representation: Term Frequency–Inverse Document Frequency (TF-IDF) on full text or on extracted semantic document features* (entities) • Evaluations: 11-point precision/recall measurements in combination with 10-fold cross validation * Patented by ZyLAB
  19. 19. Term-Document Matrix: for 800,000 Reuters documents this is a 1.2 million × 800,000 matrix
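A matrix of 1.2 million × 800,000 has about 9.6 × 10^11 cells, nearly all of them zero, so in practice only the non-zero counts are stored. The sketch below shows one simple sparse layout, a dictionary of rows. The function name `term_document_matrix` and the toy documents are assumptions for illustration, not the representation used in the talk.

```python
from collections import Counter


def term_document_matrix(docs):
    """Map each term to a {document index: count} row.

    Cells that are absent from a row are implicit zeros.
    """
    matrix = {}
    for doc_id, tokens in enumerate(docs):
        for term, count in Counter(tokens).items():
            matrix.setdefault(term, {})[doc_id] = count
    return matrix


# Toy corpus of three already-tokenized documents (hypothetical).
docs = [
    ["merger", "agreement", "merger"],
    ["football", "match"],
    ["merger", "match"],
]
matrix = term_document_matrix(docs)

stored = sum(len(row) for row in matrix.values())  # non-zero cells
total_cells = len(matrix) * len(docs)              # size of the dense matrix
print(stored, "of", total_cells, "cells are non-zero")

# Size of the dense matrix quoted on the slide.
DENSE_CELLS = 1_200_000 * 800_000
print(DENSE_CELLS)
```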
  20. 20. Term Frequency (TF)–Inverse Document Frequency (IDF) • The TF-IDF weight of a term is the product of its TF weight and its IDF weight. • Best-known weighting scheme in information retrieval. • Increases with the number of occurrences within a document. • Increases with the rarity of the term in the collection. • Automatically down-weights non-discriminating terms.
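As a worked example of the weighting, the sketch below uses one common variant: raw term count for TF and log(N / df) for IDF, where N is the number of documents and df the number of documents containing the term. The slide does not say which variant is meant, so this choice, the function name `tf_idf`, and the toy documents are assumptions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    TF is the raw count in the document; IDF is log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Toy corpus (hypothetical): "the" occurs in every document.
docs = [
    ["the", "merger", "merger"],
    ["the", "match"],
    ["the", "merger"],
]
weights = tf_idf(docs)
for row in weights:
    print(row)
```

With this variant, "the" gets weight 0 in every document because log(3/3) = 0, which is how a term that discriminates nothing drops out. "match" (one document) outweighs a single "merger" (two documents).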
  21. 21. Support Vector Machines (SVM) • Best-known text classifier so far. • Implements automatic feature selection: selects the most discriminating features automatically. • SVMs support high-dimensional spaces, as seen in text classification. • SVMs have been reported to work better for text classification. • ZyLAB uses a linear SVM, which makes it very fast. • ZyLAB uses the SVM as a binary classifier: one classifier per issue. Multi-topic classification is possible by using multiple classifiers (one per issue) at the same time. • An SVM classifier returns a classification value between 0 and 1: 0 is a 100% non-match, 1 is a 100% match. This is known as a confidence value.
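The sketch below shows why a linear classifier is fast at scoring: one sparse dot product per document and per issue. It also shows one possible way to map the signed margin to a value between 0 and 1. The slide does not say how ZyLAB derives its confidence value; the logistic squash used here (in the spirit of Platt scaling) is an assumption, and all weights, biases, and issue names are made up for illustration.

```python
import math


def decision_value(weights, bias, features):
    """Signed margin of a linear classifier: w . x + b over sparse dicts."""
    return sum(weights.get(term, 0.0) * value
               for term, value in features.items()) + bias


def confidence(margin, scale=1.0):
    """Squash a signed margin into (0, 1); this mapping is an assumption."""
    return 1.0 / (1.0 + math.exp(-scale * margin))


# One binary classifier per issue (hypothetical weights, not trained).
classifiers = {
    "merger": ({"merger": 1.5, "football": -2.0}, -0.5),
    "sports": ({"football": 2.0, "merger": -1.0}, -0.5),
}

# Sparse feature vector of one document, e.g. TF-IDF weights.
document = {"merger": 2.0}

scores = {
    issue: confidence(decision_value(w, b, document))
    for issue, (w, b) in classifiers.items()
}
print(scores)
```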
  22. 22. Now imagine 1.2 million dimensions … (shown: 2-dimensional and 3-dimensional examples)
  23. 23. Automatic classification of documents: example from M&A
  24. 24. Defensibility according to international standards
  25. 25. Clustering to start the process
  26. 26. Machine Learning in practice • Find relevant documents using standard search techniques. • Review documents for correctness, best matching first. • After every X new correct documents, build a classifier from the manually reviewed documents to recognize similar documents. • Find potentially relevant documents by matching the classifier against all non-reviewed documents in the data. • Calculate precision and recall of the classifier using 10-fold cross validation on the training set. Calculate the precision of the return set. • Stop if the precision and recall of the training set or the return set are larger than a pre-agreed quality level (typically 70-80%). • Otherwise, return the best-matching documents for review.
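The two measurements in this loop can be sketched as follows: precision and recall from predicted versus reviewed labels, and a 10-fold split of the training set. This is a minimal illustration, not the product's implementation; the function names and the round-robin way of assigning items to folds are assumptions.

```python
def precision_recall(predicted, actual):
    """Precision and recall for boolean relevance labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)        # correctly flagged
    fp = sum(1 for p, a in pairs if p and not a)    # wrongly flagged
    fn = sum(1 for p, a in pairs if not p and a)    # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def k_fold_indices(n_items, k=10):
    """Yield (train, held_out) index lists; each item is held out once."""
    folds = [list(range(start, n_items, k)) for start in range(k)]
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(n_items) if i not in held]
        yield train, held_out


# Four reviewed documents: classifier output versus reviewer judgment.
p, r = precision_recall([True, True, False, True],
                        [True, False, True, True])
print(p, r)

splits = list(k_fold_indices(25, k=10))
print(len(splits))
```

In a full cross validation, a classifier would be trained on each `train` list and scored on the matching `held_out` list, and the ten precision and recall values would then be averaged.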
  27. 27. What is a stop condition? The classifier is good enough to classify the remaining documents automatically. “Good enough” can mean: • Precision and recall of the classifier are both consistently > 80%. • Precision of the classification of new documents is > 80%. • Precision of the classification of new documents is < 10% after it first rose to > 80%, which suggests that few relevant documents remain.
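The three stop conditions can be written as a single check, sketched below. The 80% and 10% thresholds come from the slide. Reading "consistently" as "in each of the last three rounds" is an assumption, as are the function name `stop_reason` and its parameters.

```python
def stop_reason(cv_history, batch_precisions,
                threshold=0.80, floor=0.10, window=3):
    """Return why review can stop, or None to continue.

    cv_history: (precision, recall) of the classifier per training round.
    batch_precisions: precision on each batch of newly classified documents.
    window: rounds that must all pass to count as "consistent" (assumed).
    """
    recent = cv_history[-window:]
    if len(recent) == window and all(
            p > threshold and r > threshold for p, r in recent):
        return "classifier consistently above threshold"
    if batch_precisions:
        latest = batch_precisions[-1]
        if latest > threshold:
            return "new-document precision above threshold"
        if latest < floor and any(
                b > threshold for b in batch_precisions[:-1]):
            return "precision collapsed after peak"
    return None


print(stop_reason([(0.85, 0.82), (0.90, 0.81), (0.88, 0.86)], [0.50]))
print(stop_reason([(0.50, 0.40)], [0.85, 0.40, 0.05]))
print(stop_reason([(0.50, 0.40)], [0.30]))
```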
  28. 28. Simulation on the Reuters document set • 806,791 articles in total • War, Civil War (GVIO): 32,615 articles (4.04%): 90% are found after reviewing only 45,000 documents, which is only 5.6% of the full corpus. • Sports (GSPO): 35,317 articles (4.38%): 90% are found after reviewing only 32,000 documents, which is only 4% of the full corpus.
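The percentages on this slide can be reproduced with a few lines of arithmetic. The sketch below also derives the implied review yield, meaning the share of reviewed documents that turned out to be relevant. The slide does not state that figure; it follows from the quoted numbers if "90% found" is taken to mean 90% of the category size. The helper names are assumptions.

```python
CORPUS_SIZE = 806_791  # total number of Reuters articles on the slide


def share_of_corpus(count, corpus_size=CORPUS_SIZE):
    """Percentage of the corpus that `count` documents represent."""
    return 100.0 * count / corpus_size


def review_yield(category_size, reviewed, recall=0.90):
    """Fraction of reviewed documents that were relevant at the given recall."""
    return recall * category_size / reviewed


# Figures taken from the slide.
gvio_size, gvio_reviewed = 32_615, 45_000   # War, Civil War
gspo_size, gspo_reviewed = 35_317, 32_000   # Sports

print(round(share_of_corpus(gvio_size), 2), round(share_of_corpus(gvio_reviewed), 1))
print(round(share_of_corpus(gspo_size), 2), round(share_of_corpus(gspo_reviewed), 1))
print(review_yield(gvio_size, gvio_reviewed))
print(review_yield(gspo_size, gspo_reviewed))
```

For comparison, reviewing documents in random order would on average require reading about 90% of the corpus to find 90% of a category.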
  29. 29. Evolution of the quality of a classifier
  30. 30. Are there big differences depending on how you start the process? Not really…
  31. 31. What if the trainer makes mistakes, is that a problem? Not really either…
  32. 32. Predicting how much longer it will take before you have a good classifier
  33. 33. Present information in facets
  34. 34. How can we present information even better for optimal accessibility?
  35. 35. Question, with the entities or patterns that address it and the visualization options • Who is it about? Entities: PERSON, COMPANY, ORGANIZATION, EMAIL ADDRESS. Visualization: pie chart, bar graph. • What is it about? Entities: result of topic modeling (NMF) or document-term correlation matrix (A·Aᵀ). Visualization: word cloud, word wheel. • When did it happen? Entities: DATE, TIME, MONTH, DAY, WEEK, YEAR. Visualization: time line with bar graph. • Where did it happen? Entities: ADDRESS, CITY, COUNTRY, CONTINENT, DEPARTMENT and other geo-locations. Visualization: geographical mappings. • Why did it happen? Entities: sentiments, emotions and cursing. Visualization: word cloud, word wheel on emotions and sentiments. • How did it happen? Entities: custom patterns to recognize events, holistic OBJECT-PREDICATE-SUBJECT and RDF extractions. Visualization: relation graphs. • How much/often did it happen? Entities: quantitative measures such as amounts, currencies, and other numbers; also frequencies and averages of entity occurrences. Visualization: bar graphs.
  36. 36. Cross-table of the questions Who, When, Where, Why, What, How, and How Much against each other, with visualization options per combination (cells per row, in slide order) • Who: Centrality (Eigenvalue), Link Networks; Timeline; Geo-mapping; Centrality (Eigenvalue), Link Networks; Count, Average, Bar Graph • When: Time Line; Topic Rivers; Count, Average, Bar Graph • Where: Count, Average, Bar Graph • Why: Centrality (Eigenvalue), Link Networks; Count, Average, Bar Graph • What: Topic Rivers, Automatic Correlation, Detection of synonyms; Count, Average, Bar Graph • How: Count, Average, Bar Graph • How Much: Count, Average, Bar Graph; Count, Average, Bar Graph
  40. 40. Will you take on the AI challenge, or …
  41. 41. Want to learn more and see hands-on demos? Sign up for the customer day on Thursday 30 March: “Automatic answers to all your research questions”. Location: Amsterdam, WTC, 9:00 – 14:00. Keynote by crime reporter Peter R. de Vries: “Looking for what is not in the file. The first lead”. www.zylab.nl/relatiedag
