How to build your own Google ...
artur.grzadziel@gmail.com
Data Wizards
Dec 2015
Artur Grządziel
A few words about me
email: artur.grzadziel@gmail.com
Currently: BigData and Machine Learning Leader
From Jan 2016: BigData Solution Architect at General Electric
PhD in progress at PAN (Polish Academy of Sciences) Systems Research Institute
Graduated from Warsaw University of Technology and Warsaw School of Economics
BigData & Machine Learning enthusiast focused on applying both
to real business cases
Privately, husband and father
pl.linkedin.com/in/ArturGrzadziel
Introduction
Data Wizards
Artur represents the "Data Wizards" group, an informal group of
BigData/Machine Learning/Data Science professionals located in
Poland who are interested in sharing knowledge and in addressing
business challenges with modern BigData and Machine Learning
methods.
Agenda
1. Cloudera search
2. How it works?
MySearch
very high level architecture
[Diagram: Data Source, Index]
Cloudera search
Apache Solr and Tika
[Diagram: 1. Other Sources]
Cloudera Search
Cloudera Search is one of Cloudera's near-real-time access products.
Cloudera Search enables non-technical users to search and explore data stored
in or ingested into Hadoop and HBase. Users do not need SQL or programming
skills to use Cloudera Search because it provides a simple, full-text interface for
searching.
Cloudera Search incorporates Apache Solr, which includes Apache Lucene,
SolrCloud, Apache Tika, and Solr Cell. Cloudera Search is tightly integrated
with Cloudera's Distribution Including Apache Hadoop (CDH). Cloudera Search
provides these key capabilities:
- Near-real-time indexing
- Batch indexing
- Simple, full-text data exploration and navigated drill down
http://www.cloudera.com/content/www/en-us/documentation/archive/search/1-3-0/Cloudera-Search-User-Guide/csug_introducing.html
Cloudera search
Tika
https://tika.apache.org/download.html
Cloudera search
Tika – image
Cloudera search
Tika – PDF file
Cloudera search
Tika – gazeta.pl
Cloudera search
Tika – formats
Supported Document Formats
• HyperText Markup Language
• XML and derived formats
• Microsoft Office document formats
• OpenDocument Format
• Portable Document Format
• Electronic Publication Format
• Rich Text Format
• Compression and packaging formats
• Text formats
• Audio formats
• Image formats
• Video formats
• Java class files and archives
• The mbox format
https://tika.apache.org/1.4/formats.html
Cloudera search
Solr – how to start it …
.\bin\solr start -e cloud -noprompt
http://lucene.apache.org/solr/
Cloudera Search
Administration
Cloudera Search
Data
id | cat | name | price | inStock | author | series_t | sequence_i | genre_s
553573403 | book | A Game of Thrones | 7.99 | TRUE | George R.R. Martin | A Song of Ice and Fire | 1 | fantasy
553579908 | book | A Clash of Kings | 7.99 | TRUE | George R.R. Martin | A Song of Ice and Fire | 2 | fantasy
055357342X | book | A Storm of Swords | 7.99 | TRUE | George R.R. Martin | A Song of Ice and Fire | 3 | fantasy
553293354 | book | Foundation | 7.99 | TRUE | Isaac Asimov | Foundation Novels | 1 | scifi
812521390 | book | The Black Company | 6.99 | FALSE | Glen Cook | The Chronicles of The Black Company | 1 | fantasy
812550706 | book | Ender's Game | 6.99 | TRUE | Orson Scott Card | Ender | 1 | scifi
441385532 | book | Jhereg | 7.95 | FALSE | Steven Brust | Vlad Taltos | 1 | fantasy
380014300 | book | Nine Princes In Amber | 6.99 | TRUE | Roger Zelazny | the Chronicles of Amber | 1 | fantasy
805080481 | book | The Book of Three | 5.99 | TRUE | Lloyd Alexander | The Chronicles of Prydain | 1 | fantasy
080508049X | book | The Black Cauldron | 5.99 | TRUE | Lloyd Alexander | The Chronicles of Prydain | 2 | fantasy
Cloudera Search
Output format
Cloudera Search
Simple query
Cloudera Search
Simple query
Cloudera Search
More advanced query
Cloudera Search
Query with facets
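The query slides above were screenshots. As a rough sketch of what such requests might look like, the snippet below only builds Solr select URLs for the book data; it does not contact a server. The host, port and collection name (localhost:8983, gettingstarted) are assumptions taken from the default cloud example in the Solr quickstart, and the helper solr_select_url is hypothetical. The parameters q, fl, wt, rows, facet and facet.field are standard Solr request parameters.

```python
from urllib.parse import urlencode

# Assumed location of a local Solr collection; adjust to your own setup.
SOLR_SELECT = "http://localhost:8983/solr/gettingstarted/select"


def solr_select_url(query, fields=None, facet_fields=(), rows=10):
    """Build a select URL (hypothetical helper; it does not contact Solr)."""
    params = [("q", query), ("wt", "json"), ("rows", str(rows))]
    if fields:
        # fl limits the fields returned for each matching document.
        params.append(("fl", ",".join(fields)))
    if facet_fields:
        # facet=true switches faceting on; one facet.field per field.
        params.append(("facet", "true"))
        params.extend(("facet.field", name) for name in facet_fields)
    return SOLR_SELECT + "?" + urlencode(params)


# Simple query: books whose name contains "game", three fields only.
simple = solr_select_url("name:game", fields=["id", "name", "price"])

# Facet query: match everything, return no rows, count values per field.
faceted = solr_select_url("*:*", facet_fields=["genre_s", "inStock"], rows=0)

print(simple)
print(faceted)
```

With faceting enabled, Solr returns a count per distinct value of each requested field alongside the normal results, which is what drives the "navigated drill down" in a search UI.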
Cloudera search
Solr – other features
The MoreLikeThis search component enables users to query for documents
similar to a document in their result list. It does this by using terms from the
original document to find similar documents in the index.
The SpellCheck component is designed to provide inline query suggestions
based on other, similar terms.
Highlighting in Solr allows fragments of documents that match the user's query
to be included with the query response.
Synonyms, stop words
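To make stop words, synonyms and highlighting more concrete, here is a toy, pure-Python sketch. It is not how Solr implements these features (Solr configures them through analyzers and search components); the word lists and the function names analyze, expand and highlight are made up for illustration. The only detail borrowed from Solr is the use of <em> tags to mark matches.

```python
import re

# Tiny illustrative word lists; real deployments configure much larger ones.
STOP_WORDS = {"a", "an", "the", "of", "in"}
SYNONYMS = {"scifi": {"sf", "science-fiction"}}


def analyze(text):
    """Lowercase, split into word tokens, and drop stop words."""
    tokens = re.findall(r"[\w-]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def expand(terms):
    """Add the configured synonyms of each query term."""
    expanded = set(terms)
    for term in terms:
        expanded |= SYNONYMS.get(term, set())
    return expanded


def highlight(text, query):
    """Wrap every word of `text` that matches a query term in <em> tags."""
    terms = expand(analyze(query))

    def mark(match):
        word = match.group(0)
        return f"<em>{word}</em>" if word.lower() in terms else word

    return re.sub(r"[\w-]+", mark, text)


print(analyze("A Game of Thrones"))
print(highlight("A Game of Thrones", "the game"))
```

The stop word "the" in the query is dropped before matching, so only "Game" is marked in the title.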
Cloudera search
Solr – other features – geospatial search
Solr has sophisticated geospatial support, including searching within a
specified distance range of a given location (or within a bounding box),
sorting by distance, or even boosting results by distance.
http://lucene.apache.org/solr/quickstart.html
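The idea behind "search within a distance and sort by distance" can be sketched in plain Python with the haversine formula. This only illustrates the concept; Solr does this with its own spatial field types and filters. The function names and the approximate city coordinates below are sample assumptions, not taken from the talk.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def within(places, center, max_km):
    """Return (name, distance) pairs within max_km of center, nearest first."""
    hits = []
    for name, (lat, lon) in places.items():
        d = haversine_km(center[0], center[1], lat, lon)
        if d <= max_km:
            hits.append((name, d))
    return sorted(hits, key=lambda hit: hit[1])


# Approximate city coordinates, used only as sample data.
PLACES = {
    "Warsaw": (52.23, 21.01),
    "Krakow": (50.06, 19.94),
    "Berlin": (52.52, 13.40),
}

# Everything within 300 km of Warsaw, sorted by distance.
nearby = within(PLACES, center=(52.23, 21.01), max_km=300)
for name, distance in nearby:
    print(f"{name}: {distance:.0f} km")
```

Boosting by distance would go one step further and mix the distance into the relevance score instead of using it only as a filter or sort key.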
Cloudera Search
Common Use Cases
Cloudera Search lets your entire business explore and analyze data quickly and
easily for a variety of critical use cases all within a single platform, including:
- Threat detection
- Customer 360-degree visibility
- Improved user experience
- Interactive market segmentation
- Accessible global knowledge base
https://www.cloudera.com/content/www/en-us/products/apache-hadoop/apache-solr.html
Cloudera Search
Other Use Cases
Instagram: Instagram (a Facebook company) uses Solr to power its
geosearch API.
WhiteHouse.gov: The Obama administration's website is built on Drupal and
Solr.
Netflix: Solr powers basic movie searching on this extremely busy site.
StubHub.com: This ticket reseller uses Solr to help visitors search for concerts
and sporting events.
https://www.safaribooksonline.com/library/view/scaling-apache-solr/9781783981748/ch01s05.html
How it works ... ?
How it works … ?
Data Source – documents …
Document | Content
1 | John has a cat
2 | John has a dog
3 | Eva has a cat
4 | George has a dog
How it works … ?
Data Source – documents … space of unique terms
Document | Content | Term numbers
1 | John has a cat | 1 2 3 4
2 | John has a dog | 1 2 3 5
3 | Eva has a cat | 6 2 3 4
4 | George has a dog | 7 2 3 5
List of unique words:
1. John
2. has
3. a
4. cat
5. dog
6. Eva
7. George
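A minimal sketch of how the list of unique words and the per-document term numbers could be derived. Numbering words in order of first appearance is an assumption that happens to reproduce the word list on this slide; the function name build_vocabulary is hypothetical.

```python
docs = {
    1: "John has a cat",
    2: "John has a dog",
    3: "Eva has a cat",
    4: "George has a dog",
}


def build_vocabulary(documents):
    """Number each distinct word from 1 in order of first appearance."""
    vocabulary = {}
    for text in documents.values():
        for word in text.split():
            # setdefault only assigns a number the first time a word is seen.
            vocabulary.setdefault(word, len(vocabulary) + 1)
    return vocabulary


vocabulary = build_vocabulary(docs)

# Replace every word of every document by its number in the vocabulary.
encoded = {
    doc_id: [vocabulary[word] for word in text.split()]
    for doc_id, text in docs.items()
}

for word, number in vocabulary.items():
    print(number, word)
for doc_id, numbers in encoded.items():
    print(doc_id, numbers)
```

Since "dog" is word 5, document 4 ("George has a dog") encodes as 7 2 3 5.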
How it works … ?
Data Source – Documents … boolean search with inverted index
Dictionary (Term, Tot. freq.) and Documents (Doc #):
Term | Tot. freq. | Doc #
John | 2 | 1, 2
has | 4 | 1, 2, 3, 4
a | 4 | 1, 2, 3, 4
cat | 2 | 1, 3
dog | 2 | 2, 4
Eva | 1 | 3
George | 1 | 4
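The inverted index and boolean search on this slide can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not Lucene's actual data structure: terms are lowercased (the slide keeps the original case), and the function names are hypothetical.

```python
from collections import defaultdict

docs = {
    1: "John has a cat",
    2: "John has a dog",
    3: "Eva has a cat",
    4: "George has a dog",
}


def build_inverted_index(documents):
    """Map each term to the sorted list of ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search_and(index, *terms):
    """Boolean AND: documents that contain every query term."""
    sets = [set(index.get(term.lower(), [])) for term in terms]
    return sorted(set.intersection(*sets)) if sets else []


def search_or(index, *terms):
    """Boolean OR: documents that contain at least one query term."""
    result = set()
    for term in terms:
        result |= set(index.get(term.lower(), []))
    return sorted(result)


index = build_inverted_index(docs)
print(index["john"])                      # documents containing "John"
print(search_and(index, "john", "dog"))   # John AND dog
print(search_or(index, "eva", "george"))  # Eva OR George
```

A query never scans the documents themselves: it looks up each term in the dictionary and combines the posting lists, which is why the lookup stays fast as the collection grows.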
How it works … ?
Data Source – documents as vectors
Documents
document 1 John has a cat
document 2 John has a dog
document 3 Eva has a cat
document 4 George has a dog
Space of unique terms    -> John  has  a  cat  dog  Eva  George
vector representing doc1 -> 1     1    1  1    0    0    0
vector representing doc2 -> 1     1    1  0    1    0    0
vector representing doc3 -> 0     1    1  1    0    1    0
vector representing doc4 -> 0     1    1  0    1    0    1
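The binary vectors above can be produced mechanically, and once documents are vectors a query can be turned into a vector too and compared with each document. The sketch below uses cosine similarity for that comparison; this is one common choice and an assumption here, not necessarily what the talk showed. The function names and the sample query are hypothetical.

```python
from math import sqrt

docs = [
    "John has a cat",
    "John has a dog",
    "Eva has a cat",
    "George has a dog",
]
terms = ["John", "has", "a", "cat", "dog", "Eva", "George"]


def to_vector(text, terms):
    """Binary vector: 1 if the term occurs in the text, else 0."""
    words = set(text.split())
    return [1 if term in words else 0 for term in terms]


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


vectors = [to_vector(text, terms) for text in docs]

# Rank documents against a sample query, most similar first.
query = to_vector("Eva cat", terms)
ranked = sorted(
    range(len(docs)),
    key=lambda i: cosine(query, vectors[i]),
    reverse=True,
)

for i in ranked:
    print(f"doc{i + 1}: {cosine(query, vectors[i]):.3f}")
```

Unlike boolean search, this gives a graded ranking: doc3 matches both query terms and ranks first, doc1 matches only "cat" and ranks second, and the two "dog" documents score zero.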
How it works … ?
Data Source – Documents … vectors
Summary
[Diagram: 1. Other Sources]
Thank you
Data Wizards
E-mail: artur.grzadziel@gmail.com
Links:
• Cloudera Search:
http://www.cloudera.com/content/www/en-us/documentation/archive/search/1-3-0/Cloudera-Search-User-Guide/csug_introducing.html
• Tika
https://tika.apache.org/
• Apache Solr
http://lucene.apache.org/solr/
https://www.cloudera.com/content/www/en-us/products/apache-hadoop/apache-solr.html
• Vectors, Inverted Index, Frequency Matrix, etc. ...
http://courses.ischool.berkeley.edu/i202/f05/LectureNotes/202-20051108.htm
