II-SDV 2014 Search and Data Mining Open Source Platforms (Patrick Beaucamp - Bpm-Conseil, France)

Published in: Software, Technology

  1. Search and Data Mining Open Source Platforms
     Patrick Beaucamp, Founder of the Vanilla Project
     Mail: Patrick.beaucamp@bpm-conseil.com
     II-SDV, Nice, 15th April 2014
  2. Presentation Agenda
     Open Source Search Engine & Search Platform
     - Some interesting platforms
     - Features expected for search engines
     - Features expected for search platforms (interface)
     Open Source Data Mining & Machine Learning Platform
     - Data mining subject & usage type
     - Some interesting platforms
     - Hadoop & data mining: the future of learning machines
     If you don’t find it, it doesn’t exist … so maybe an engine will find it for you?
  3. Search & Data Mining Together?
     - Search interface
     - Enhanced search interface
  4. Search & Data Mining Together?
     Data mining: it searches for you!
     • Example: correlation tells you how strongly data items are linked together
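The correlation example on the slide above can be sketched as follows. This is a minimal illustration using the Pearson coefficient, which is one common correlation measure; the slide does not say which measure the author has in mind, and the two data series are made up for the example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: documents filed per year vs. searches run per year.
filings = [10, 20, 30, 40, 50]
searches = [12, 24, 33, 46, 55]
r = pearson(filings, searches)  # close to 1: the two series move together
```

A value near 1 or -1 indicates a strong linear link, and a value near 0 indicates none; it says nothing about which variable drives the other.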
  5. Confusion: Search Engines & Platforms
     - Search platforms per domain:
       - News (Yahoo)
       - Jobs (JobServe)
       - Real estate
       - Human relations
       - Craigslist
       - Patent industry
       - …
     A business case: Facebook
       - Basic search interface
       - Proposals for connections
       - Advanced graphical search
  6. Search Engines
     Search engine providers in Open Source
     - Lucene
       - Lucene is a Java-based indexing and search API
       - Solr/Lucene is the leading server extension of Lucene. Two companies, LucidWorks and ElasticSearch, provide packaging and extensions on top of Lucene and Solr
     - Search landscape
       • Lucene: http://lucene.apache.org
       • Solr/Lucene: http://lucene.apache.org/solr/
       • Platform OpenSearch: http://www.open-search-server.com
       • Platform Katta: http://katta.sourceforge.net
       • Platform LucidWorks: http://www.lucidworks.com
       • Platform ElasticSearch: http://www.elasticsearch.com
       • Sphinx: http://sphinxsearch.com/
  7. Before Search Engines
     Before indexing your document base, you need to access it!
     - Apache Nutch is a highly extensible and scalable open source web crawler.
     - Reference: http://nutch.apache.org/
  8. Search Engines: Focus
     - Sphinx: runs inside a database such as MariaDB (MySQL)
     - Lucene: retrieval software library
     - Use an existing search infrastructure such as Solr/Lucene (Vanilla certified): http://www.lucidworks.com/ or http://www.elasticsearch.org/
  9. Search Engines Basics (1/2)
     - Synonyms
       - A search can be extended to synonyms if they are listed in a glossary. For example, a search for the word “TV” then also finds articles containing synonyms of “TV”.
     - Metadata
       - A dictionary listing the searchable keywords
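The synonym mechanism described above can be sketched as query-time expansion against a glossary. This is only an illustration of the idea; the glossary contents, the function names and the sample documents are assumptions made for the example, and it is not how Solr or Lucene implement synonyms internally.

```python
# Hypothetical glossary: each group lists terms treated as equivalent.
SYNONYM_GROUPS = [
    {"tv", "television", "telly"},
    {"car", "automobile"},
]

def expand_query(terms, groups=SYNONYM_GROUPS):
    """Return the query terms plus every synonym listed in the glossary."""
    expanded = set()
    for term in terms:
        term = term.lower()
        expanded.add(term)
        for group in groups:
            if term in group:
                expanded |= group
    return expanded

def search(query, documents, groups=SYNONYM_GROUPS):
    """Return ids of documents containing any expanded query term."""
    wanted = expand_query(query.split(), groups)
    return [doc_id for doc_id, text in documents.items()
            if wanted & set(text.lower().split())]

# Hypothetical document base.
docs = {
    1: "New television models announced",
    2: "Used car prices are rising",
    3: "Weather report for Nice",
}
hits = search("TV", docs)  # document 1 matches through the synonym "television"
```

Expanding at query time keeps the index unchanged when the glossary is edited; expanding at index time is the alternative trade-off.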
  10. Search Engines Basics (2/2)
      - Reserved Words, Protected Words
        - Indexing usually uses stemming, which reduces words to their root: for example, reducing to the root “develop” so that a search for the word “development” also finds items containing the word “develop”. However, stemming sometimes goes wrong and indexes under one lemma two words that have no relation. It is possible to prevent the stemming of such words by listing them in a file, protwords.txt.
      - StopWords
        - Stopwords are words that carry no meaning for the search. A word considered insignificant is ignored. Note that some words are insignificant only in some contexts, and others have meaningful homonyms. For example, the French word “été” can refer to the summer season (rather meaningful) or to the past participle of the verb “to be” (relatively insignificant). Stopwords are listed in the file stopwords.txt.
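The interplay of stopwords, stemming and protected words can be sketched as a tiny analysis chain. This is an assumption-laden toy: the suffix rules below are not a real stemmer, and the two word sets merely stand in for what stopwords.txt and protwords.txt might contain.

```python
STOPWORDS = {"the", "a", "of", "is", "to", "and"}  # stand-in for stopwords.txt
PROTECTED = {"apartment"}                          # stand-in for protwords.txt
SUFFIXES = ("ment", "ing", "ed", "s")              # toy rules, not a real stemmer

def stem(word):
    """Strip the first matching suffix unless the word is protected or too short."""
    if word in PROTECTED:
        return word
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

def analyze(text):
    """Lowercase, drop stopwords, then stem the remaining tokens."""
    tokens = text.lower().split()
    return [stem(t) for t in tokens if t not in STOPWORDS]

# "development" and "developing" end up under the same root "develop".
tokens = analyze("The development of developing tools")
```

Without the protected-words entry, the toy rules would reduce “apartment” to “apart”, indexing two unrelated words under one root, which is the adverse case the slide warns about.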
  11. Search Engines: Current Limits
      - Multi-language support (this is where commercial search engines still have more to offer customers), even though there is now support for Asian languages (Hindi, Thai, Chinese, …)
      - Elision:
        - Elisions are a feature of French: words like “le” or “la” are contracted when they are followed by a vowel. Example: “le” + “avion” (aircraft) gives “l’avion” (the aircraft). It is possible to remove these elisions using a lexicon.
      - Full-text search interface (query language with the search engine)
      - SubQuery support: now better with Solr 4.7
      - Scalability (this is where Solr is gaining a technical advantage)
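Removing elisions with a lexicon, as mentioned above, can be sketched with a regular expression built from a list of elided prefixes. The lexicon below is an assumption chosen for illustration, not the list shipped with any particular search engine.

```python
import re

# Hypothetical lexicon of elided French articles and particles.
ELISIONS = ("l", "d", "j", "m", "n", "s", "t", "c", "qu")

# An elided prefix at the start of a word, followed by a straight or
# typographic apostrophe and then at least one more word character.
_ELISION_RE = re.compile(r"\b(?:%s)['’](?=\w)" % "|".join(ELISIONS), re.IGNORECASE)

def strip_elisions(text):
    """Remove elided prefixes so that "l'avion" is indexed as "avion"."""
    return _ELISION_RE.sub("", text)

cleaned = strip_elisions("L’avion de l'aéroport")
```

The word-boundary check leaves words such as “aujourd'hui” untouched, because the “d” there is not at the start of a word.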
  12. Hadoop Search Platforms for Big Data
      - Cloudera with Solr/Cloud (Solr/Lucene)
      - MapR with ElasticSearch (Lucene code)
      - Hortonworks with LucidWorks (Solr/Lucene)
  13. Search Interface: Expectations (1/3)
      - Advanced indexing and querying tools.
      - Distributed searching capabilities, to prevent any single server from becoming a bottleneck.
      - Generation of document excerpts (snippets) that summarise each search result.
      - Relevance ranking: displays extracts from the documents based on the query.
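Relevance ranking and snippet generation can be sketched in a few lines. This is a simplified illustration that scores documents with a basic TF-IDF weight; the scoring formula, the function names and the sample documents are assumptions, and production engines such as Lucene use more elaborate scoring.

```python
import math
from collections import Counter

def tokenize(text):
    """Split text into lowercase whitespace-separated tokens."""
    return text.lower().split()

def rank(query, documents):
    """Rank document ids by a simple TF-IDF score for the query terms."""
    doc_tokens = {doc_id: Counter(tokenize(text)) for doc_id, text in documents.items()}
    n_docs = len(documents)
    scores = {}
    for term in set(tokenize(query)):
        df = sum(1 for counts in doc_tokens.values() if term in counts)
        if df == 0:
            continue
        idf = math.log(1 + n_docs / df)  # rarer terms weigh more
        for doc_id, counts in doc_tokens.items():
            if term in counts:
                scores[doc_id] = scores.get(doc_id, 0.0) + counts[term] * idf
    return sorted(scores, key=lambda d: (-scores[d], d))

def snippet(query, text, width=3):
    """Return the words around the first query term found in the text."""
    words = text.split()
    terms = set(tokenize(query))
    for i, word in enumerate(words):
        if word.lower() in terms:
            start = max(0, i - width)
            return " ".join(words[start : i + width + 1])
    return ""

# Hypothetical document base.
docs = {
    1: "Solr is a search server built on Lucene",
    2: "Lucene is a Java search library and Lucene powers Solr",
    3: "Weka is a data mining toolkit",
}
ranking = rank("lucene search", docs)
excerpt = snippet("lucene", docs[1], width=2)
```

Document 2 ranks first because it mentions “lucene” twice; document 3 matches no query term and is not returned.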
  14. Search Interface: Expectations (2/3)
      - Duplicate document detection, including fuzzy near-duplicates
      - Rich document parsing and indexing without using database indexing.
      - Ranking control: carry out a targeted ranking of individual documents.
      - Search grouping by type / tag / category (general pages, documents, images)
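Fuzzy near-duplicate detection can be sketched with word shingles and Jaccard similarity. This is one well-known approach picked for illustration; the slide does not name a technique, and the threshold, shingle size and sample texts are assumptions.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in the text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i : i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity between two sets (1.0 means identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def near_duplicates(documents, threshold=0.5, k=3):
    """Return pairs of document ids whose shingle sets overlap above the threshold."""
    sets = {doc_id: shingles(text, k) for doc_id, text in documents.items()}
    ids = sorted(sets)
    return [(a, b) for i, a in enumerate(ids) for b in ids[i + 1:]
            if jaccard(sets[a], sets[b]) >= threshold]

# Hypothetical documents: 1 and 2 differ only in the last word.
docs = {
    1: "open source search engines are growing fast",
    2: "open source search engines are growing quickly",
    3: "data mining finds patterns in large data sets",
}
pairs = near_duplicates(docs)
```

Comparing every pair is quadratic in the number of documents, so a real system would add a hashing step (for example MinHash) before comparing candidates.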
  15. Search Interface: Expectations (3/3)
      - Multi-criteria support
      - Ranking
      - Natural language support
      - Apps support (Android, iPad)
  16. Migration to Open Source
      - Only a few projects so far, but the number is growing
      SAME FUNCTIONALITY, MUCH LOWER COSTS
      - http://www.searchtechnologies.com/fast-to-solr-migration.html
      - http://fr.slideshare.net/janhoy/migrating-fast-to-solr
  17. Datamining
      - Work on your data with data mining components such as:
        R: http://www.revolutionanalytics.com/
        Weka: http://www.cs.waikato.ac.nz/ml/weka/
        RapidMiner: http://rapidminer.com/
        KNIME: http://www.knime.org/
        Apache Mahout: http://mahout.apache.org/
  18. Many Definitions
      - Non-trivial extraction of implicit, previously unknown and potentially useful information from data
      - Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
  19. What is (not) Data Mining?
      What is Data Mining?
      – Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in the Boston area)
      – Grouping together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com)
      What is not Data Mining?
      – Looking up a phone number in a phone directory
      – Querying a Web search engine for information about “Amazon”
  20. Origins of Data Mining
      - Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
      - Traditional techniques may be unsuitable due to:
        - Enormity of data
        - High dimensionality of data
        - Heterogeneous, distributed nature of data
      [Diagram labels: Machine Learning/Pattern Recognition; Statistics/AI; Data Mining; Database systems]
  21. Data Mining Tasks
      - Description methods: find human-interpretable patterns that describe the data.
      - Prediction methods: use some variables to predict unknown or future values of other variables.
  22. Data Mining Tasks...
      - Classification [Predictive]
      - Clustering [Descriptive]
      - Association Rule Discovery [Descriptive]
      - Sequential Pattern Discovery [Descriptive]
      - Regression [Predictive]
      - Deviation Detection [Predictive]
  23. Classification: Definition
      - Given a collection of records (the training set):
        - Each record contains a set of attributes; one of the attributes is the class.
      - Find a model for the class attribute as a function of the values of the other attributes.
      - Goal: previously unseen records should be assigned a class as accurately as possible.
        - A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
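The train/test workflow in the definition above can be sketched with a very small classifier. A 1-nearest-neighbour rule is swapped in here purely for brevity (it is not the J48 decision tree used with Weka in the talk), and the records, attribute meanings and class labels are made up for the example.

```python
import math

def predict(train, attributes):
    """Return the class of the training record closest to the given attributes."""
    nearest = min(train, key=lambda rec: math.dist(rec[0], attributes))
    return nearest[1]

def accuracy(train, test):
    """Fraction of test records whose predicted class matches the true class."""
    correct = sum(1 for attrs, label in test if predict(train, attrs) == label)
    return correct / len(test)

# Hypothetical records: ((age, income in thousands), class attribute).
records = [
    ((25, 20), "no"), ((30, 25), "no"), ((28, 22), "no"), ((35, 30), "no"),
    ((45, 80), "yes"), ((50, 90), "yes"), ((48, 85), "yes"), ((55, 95), "yes"),
]
train = records[0:3] + records[4:7]  # used to build the model
test = [records[3], records[7]]      # held out to validate it
acc = accuracy(train, test)
```

The held-out records never influence the model, so the accuracy measured on them estimates how well previously unseen records would be classified.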
  24. Classification with Weka – J48
  25. Classification: Application 1
      Direct Marketing
      - Goal: reduce the cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
      - Approach:
        - Use the data for a similar product introduced before.
        - We know which customers decided to buy and which decided otherwise. This {buy, don’t buy} decision forms the class attribute.
        - Collect various demographic, lifestyle, and company-interaction related information about all such customers: type of business, where they stay, how much they earn, etc.
        - Use this information as input attributes to learn a classifier model.
  26. Classification: Application 2
      Fraud Detection
      - Goal: predict fraudulent cases in credit card transactions.
      - Approach:
        - Use credit card transactions and the information on the account holder as attributes: when the customer buys, what he buys, how often he pays on time, etc.
        - Label past transactions as fraudulent or fair. This forms the class attribute.
        - Learn a model for the class of the transactions.
        - Use this model to detect fraud by observing credit card transactions on an account.
  27. Classification: Application 3
      Customer Attrition/Churn
      - Goal: predict whether a customer is likely to be lost to a competitor.
      - Approach:
        - Use detailed records of transactions with each of the past and present customers to find attributes: how often the customer calls, where he calls, what time of day he calls most, his financial status, marital status, etc.
        - Label the customers as loyal or disloyal.
        - Find a model for loyalty.
  28. Clustering Definition
      - Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
        - Data points in one cluster are more similar to one another.
        - Data points in separate clusters are less similar to one another.
      - Similarity measures:
        - Euclidean distance if attributes are continuous.
        - Other problem-specific measures.
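Clustering with Euclidean distance as the similarity measure can be sketched with a plain k-means loop. This is a minimal version for illustration: the initial centroids are passed in explicitly to keep the result deterministic, and the sample points are invented.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    centroids = [tuple(c) for c in centroids]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(tuple(sum(axis) / len(cluster) for axis in zip(*cluster)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids, clusters

# Hypothetical 2-D points forming two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
```

Each pass reduces the distances between points and their own centroid, which is the "intracluster distances are minimized" idea; the outcome depends on the starting centroids, so real tools usually try several starts.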
  29. Clustering with Weka – K-Means
  30. Illustrating Clustering
      Euclidean distance based clustering in 3-D space:
      - Intracluster distances are minimized
      - Intercluster distances are maximized
  31. Clustering: Application 1
      Market Segmentation
      - Goal: subdivide a market into distinct subsets of customers, where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
      - Approach:
        - Collect different attributes of customers based on their geographical and lifestyle related information.
        - Find clusters of similar customers.
        - Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.
  32. Clustering: Application 2
      Document Clustering
      - Goal: find groups of documents that are similar to each other based on the important terms appearing in them.
      - Approach: identify frequently occurring terms in each document, form a similarity measure based on the frequencies of the different terms, and use it to cluster.
      - Gain: information retrieval can use the clusters to relate a new document or search term to clustered documents.
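The document clustering approach above (term frequencies, then a similarity measure, then grouping) can be sketched as follows. Cosine similarity and the greedy one-pass grouping are assumptions chosen to keep the example short; the slide does not prescribe a specific measure or clustering algorithm, and the documents are invented.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Count how often each lowercase term occurs in the text."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def group_similar(documents, threshold=0.5):
    """Greedy grouping: join the first group whose first member is similar enough."""
    vectors = {doc_id: term_frequencies(text) for doc_id, text in documents.items()}
    groups = []
    for doc_id in documents:
        for group in groups:
            if cosine_similarity(vectors[doc_id], vectors[group[0]]) >= threshold:
                group.append(doc_id)
                break
        else:
            groups.append([doc_id])
    return groups

# Hypothetical search results for the ambiguous query "amazon".
docs = {
    1: "amazon rainforest trees river",
    2: "amazon online shopping books",
    3: "rainforest trees river wildlife",
    4: "online shopping books delivery",
}
groups = group_similar(docs)
```

The shared word “amazon” alone is not enough to merge documents 1 and 2, so the rainforest and shopping contexts end up in separate groups.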
  33. Association Rule Discovery: Definition
      Marketing and Sales Promotion
      - Let the rule discovered be {Bagels, … } --> {Potato Chips}
      - Potato chips as consequent => can be used to determine what should be done to boost its sales.
      - Bagels in the antecedent => can be used to see which products would be affected if the store discontinues selling bagels.
      - Bagels in the antecedent and potato chips in the consequent => can be used to see what products should be sold with bagels to promote the sale of potato chips!
  34. Association Rules with Weka – Apriori
  35. Association Rule Discovery: Application 1
      Supermarket Shelf Management
      - Goal: identify items that are bought together by sufficiently many customers.
      - Approach: process the point-of-sale data collected with barcode scanners to find dependencies among items.
      - A classic rule: if a customer buys diapers and milk, then he is very likely to buy beer. So don’t be surprised if you find six-packs stacked next to diapers!
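How strongly a rule such as {diapers, milk} --> {beer} holds is usually measured by its support and confidence. The sketch below computes both on made-up baskets; it only evaluates a given rule and does not perform the candidate generation that the Apriori algorithm adds on top.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Share of transactions with the antecedent that also contain the consequent."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0
    return support(set(antecedent) | set(consequent), transactions) / base

# Hypothetical point-of-sale baskets.
baskets = [
    {"diapers", "milk", "beer"},
    {"diapers", "milk", "beer", "bread"},
    {"diapers", "milk"},
    {"milk", "bread"},
    {"beer", "chips"},
]
rule_support = support({"diapers", "milk", "beer"}, baskets)
rule_confidence = confidence({"diapers", "milk"}, {"beer"}, baskets)
```

"Bought together by sufficiently many customers" corresponds to a minimum support threshold, and "very likely to buy" to a minimum confidence threshold.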
  36. Association Rule Discovery: Application 2
      Inventory Management
      - Goal: a consumer appliance repair company wants to anticipate the nature of repairs on its consumer products and keep the service vehicles equipped with the right parts, to reduce the number of visits to consumer households.
      - Approach: process the data on tools and parts required in previous repairs at different consumer locations and discover the co-occurrence patterns.
  37. Challenges of Data Mining
      - Scalability
      - Dimensionality
      - Complex and heterogeneous data
      - Data quality
      - Data ownership and distribution
      - Privacy preservation
      - Streaming data
  38. Learning Machines
      - Weka
      - RapidMiner (formerly YALE: Yet Another Learning Environment)
      - Hadoop Mahout
      [Diagram labels: Background infrastructure & framework; Set of APIs and libraries; Provides algorithms for data mining; Your Data; Your Application]
  39. Learning Machines Implementation with Apache Mahout
      - Classification
        - Naive Bayes
        - Markov Models
        - Logistic Regression
        - Random Forest
      - Clustering
        - K-Means and Fuzzy K-Means
        - Canopy
        - Spectral Clustering
      - Association
        - Latent Dirichlet Allocation
  40. Learning Machines Implementation
      Working with documents: Mahout – Lucene integration
  41. References (Search Domain)
      - Lucene: http://lucene.apache.org
      - Platform Solr/Lucene: http://www.lucidworks.com
      - Platform OpenSearch: http://www.open-search-server.com
      - Platform Katta: http://katta.sourceforge.net
      - Platform ElasticSearch: http://www.elasticsearch.com
      - Sphinx: http://sphinxsearch.com/
      - http://nutch.apache.org/
      - http://en.wikipedia.org/wiki/List_of_search_engines
      - http://www.thesearchenginelist.com/
      - www.cloudera.com
      - www.mapr.com
      - www.hortonworks.com
  42. References (DataMining Domain)
      - R: http://www.revolutionanalytics.com/
      - Weka: http://www.cs.waikato.ac.nz/ml/weka/
      - RapidMiner: http://rapidminer.com/
      - KNIME: http://www.knime.org/
      - Apache Mahout: http://mahout.apache.org/