How to build your own google ...
artur.grzadziel@gmail.com
Data Wizards
Dec 2015
Artur Grządziel
A few words about me
email: artur.grzadziel@gmail.com
Currently: BigData and Machine Learning Leader
From Jan 2016: BigData Solution Architect at General Electric
PhD in progress at PAN (Polish Academy of Sciences) Systems Research Institute
Graduated from Warsaw University of Technology and Warsaw School of Economics
BigData & Machine Learning enthusiast focused on applying both
to real business cases
Privately, husband and father
pl.linkedin.com/in/ArturGrzadziel
Introduction
Data Wizards
Artur represents the "Data Wizards" group, an informal group of
BigData/Machine Learning/Data Science professionals located in
Poland who are interested in sharing knowledge and addressing business
challenges with modern BigData and Machine Learning
methods.
Agenda
1. Cloudera search
2. How it works?
MySearch
very high level architecture
[Diagram: Data Source, Index]
Cloudera search
Apache Solr and Tika
[Diagram: 1. Other Sources]
Cloudera Search
Cloudera Search is one of Cloudera's near-real-time access products.
Cloudera Search enables non-technical users to search and explore data stored
in or ingested into Hadoop and HBase. Users do not need SQL or programming
skills to use Cloudera Search because it provides a simple, full-text interface for
searching.
Cloudera Search incorporates Apache Solr, which includes Apache Lucene,
SolrCloud, Apache Tika, and Solr Cell. Cloudera Search is tightly integrated
with Cloudera's Distribution, including Apache Hadoop (CDH). Cloudera Search
provides these key capabilities:
- Near-real-time indexing
- Batch indexing
- Simple, full-text data exploration and navigated drill down
http://www.cloudera.com/content/www/en-us/documentation/archive/search/1-3-0/Cloudera-Search-User-Guide/csug_introducing.html
Cloudera search
Tika
https://tika.apache.org/download.html
Cloudera search
Tika – image
Cloudera search
Tika – PDF file
Cloudera search
Tika – gazeta.pl
Cloudera search
Tika – formats
Supported Document Formats
• HyperText Markup Language
• XML and derived formats
• Microsoft Office document formats
• OpenDocument Format
• Portable Document Format
• Electronic Publication Format
• Rich Text Format
• Compression and packaging formats
• Text formats
• Audio formats
• Image formats
• Video formats
• Java class files and archives
• The mbox format
https://tika.apache.org/1.4/formats.html
Cloudera search
Solr – how to start it …
bin/solr start -e cloud -noprompt
http://lucene.apache.org/solr/
Cloudera Search
Administration
Cloudera Search
Data
id | cat | name | price | inStock | author | series_t | sequence_i | genre_s
553573403 | book | A Game of Thrones | 7.99 | TRUE | George R.R. Martin | A Song of Ice and Fire | 1 | fantasy
553579908 | book | A Clash of Kings | 7.99 | TRUE | George R.R. Martin | A Song of Ice and Fire | 2 | fantasy
055357342X | book | A Storm of Swords | 7.99 | TRUE | George R.R. Martin | A Song of Ice and Fire | 3 | fantasy
553293354 | book | Foundation | 7.99 | TRUE | Isaac Asimov | Foundation Novels | 1 | scifi
812521390 | book | The Black Company | 6.99 | FALSE | Glen Cook | The Chronicles of The Black Company | 1 | fantasy
812550706 | book | Ender's Game | 6.99 | TRUE | Orson Scott Card | Ender | 1 | scifi
441385532 | book | Jhereg | 7.95 | FALSE | Steven Brust | Vlad Taltos | 1 | fantasy
380014300 | book | Nine Princes In Amber | 6.99 | TRUE | Roger Zelazny | the Chronicles of Amber | 1 | fantasy
805080481 | book | The Book of Three | 5.99 | TRUE | Lloyd Alexander | The Chronicles of Prydain | 1 | fantasy
080508049X | book | The Black Cauldron | 5.99 | TRUE | Lloyd Alexander | The Chronicles of Prydain | 2 | fantasy
Cloudera Search
Output format
Cloudera Search
Simple query
Cloudera Search
Simple query
Cloudera Search
More advanced query
Cloudera Search
Query with facets
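The query slides above can be approximated with plain HTTP requests to Solr's select handler. The sketch below only builds the request URLs; it is an illustration, not the exact queries shown in the talk. The host, port 8983 and the collection name "gettingstarted" are assumptions based on the defaults of the cloud example, and the helper name solr_query_url is hypothetical. The field names come from the data table above.

```python
from urllib.parse import urlencode

# Assumed endpoint: a local Solr started in cloud example mode usually listens
# on port 8983 and creates a collection named "gettingstarted".
SOLR_SELECT = "http://localhost:8983/solr/gettingstarted/select"


def solr_query_url(q, extra=None):
    """Build a URL for Solr's /select handler (hypothetical helper)."""
    params = {"q": q, "wt": "json"}
    params.update(extra or {})
    # doseq=True lets one parameter (e.g. facet.field) carry several values.
    return SOLR_SELECT + "?" + urlencode(params, doseq=True)


# Simple query: search a single field.
simple = solr_query_url("name:foundation")

# More advanced query: boolean operators, selected fields, sorting.
advanced = solr_query_url(
    "genre_s:fantasy AND inStock:true",
    {"fl": "id,name,price", "sort": "price asc"},
)

# Query with facets: count documents per genre and per author.
faceted = solr_query_url(
    "*:*",
    {"rows": 0, "facet": "true", "facet.field": ["genre_s", "author"]},
)

if __name__ == "__main__":
    for url in (simple, advanced, faceted):
        print(url)
```

Opening any of the printed URLs in a browser (or fetching it with an HTTP client) should return JSON, provided a Solr instance with this data is actually running at the assumed address.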
Cloudera search
Solr – other features
The MoreLikeThis search component enables users to query for documents
similar to a document in their result list. It does this by using terms from the
original document to find similar documents in the index.
The SpellCheck component is designed to provide inline query suggestions
based on other, similar terms.
Highlighting in Solr allows fragments of documents that match the user's query
to be included with the query response.
Synonyms, stop words
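To make the idea of suggestions "based on other, similar terms" concrete, here is a toy sketch that uses Python's difflib on a small vocabulary. It is not how Solr's SpellCheck component is implemented; the vocabulary (taken from the toy documents used later in the talk), the cutoff value and the function name suggest are assumptions for illustration.

```python
import difflib

# Toy index vocabulary (assumption: lower-cased terms from the example documents).
VOCABULARY = ["john", "has", "a", "cat", "dog", "eva", "george"]


def suggest(term, vocabulary=VOCABULARY, limit=3):
    """Return indexed terms that look similar to a possibly misspelled term."""
    # cutoff=0.6 is difflib's default similarity threshold, kept explicit here.
    return difflib.get_close_matches(term.lower(), vocabulary, n=limit, cutoff=0.6)


if __name__ == "__main__":
    print(suggest("dgo"))
    print(suggest("Gorge"))
```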
Cloudera search
Solr – other features – geospatial search
Solr has sophisticated geospatial support, including searching within a
specified distance range of a given location (or within a bounding box),
sorting by distance, or even boosting results by the distance.
http://lucene.apache.org/solr/quickstart.html
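The distance filtering and sorting described above can be sketched outside Solr with the haversine formula. This is only an illustration of the concept, not Solr's implementation; the place list, the coordinates and the function names are illustrative assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_distance(docs, lat, lon, max_km):
    """Names of documents within max_km of (lat, lon), nearest first."""
    hits = [(haversine_km(lat, lon, d["lat"], d["lon"]), d["name"]) for d in docs]
    hits = [(dist, name) for dist, name in hits if dist <= max_km]
    hits.sort(key=lambda pair: pair[0])
    return [name for _, name in hits]


# Illustrative sample documents (approximate city-centre coordinates).
PLACES = [
    {"name": "Krakow", "lat": 50.0647, "lon": 19.9450},
    {"name": "Gdansk", "lat": 54.3520, "lon": 18.6466},
    {"name": "Warsaw", "lat": 52.2297, "lon": 21.0122},
]

if __name__ == "__main__":
    # Search around Lodz (approximate coordinates).
    print(within_distance(PLACES, 51.7592, 19.4560, 150))
    print(within_distance(PLACES, 51.7592, 19.4560, 400))
```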
Cloudera Search
Common Use Cases
Cloudera Search lets your entire business explore and analyze data quickly and
easily for a variety of critical use cases all within a single platform, including:
- Threat detection
- Customer 360-degree visibility
- Improved user experience
- Interactive market segmentation
- Accessible global knowledge base
https://www.cloudera.com/content/www/en-us/products/apache-hadoop/apache-solr.html
Cloudera Search
Other Use Cases
Instagram: Instagram (a Facebook company) is a well-known site that
uses Solr to power its geosearch API
WhiteHouse.gov: The Obama administration's website is built on Drupal and
Solr
Netflix: Solr powers basic movie searching on this extremely busy site
StubHub.com: This ticket reseller uses Solr to help visitors search for concerts
and sporting events.
https://www.safaribooksonline.com/library/view/scaling-apache-solr/9781783981748/ch01s05.html
How it works ... ?
How it works … ?
Data Source – documents …
Document Content
1 John has a cat
2 John has a dog
3 Eva has a cat
4 George has a dog
How it works … ?
Data Source – documents … space of unique terms
Document Content
1 John has a cat
2 John has a dog
3 Eva has a cat
4 George has a dog
1 2 3 4
1 2 3 5
6 2 3 4
7 2 3 5
List of unique
words:
1. John
2. has
3. a
4. cat
5. dog
6. Eva
7. George
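The mapping from documents to term numbers shown on this slide can be reproduced with a few lines of Python. This is a minimal sketch that assumes whitespace tokenization and numbers terms in order of first appearance; the function names are hypothetical.

```python
# The four example documents from the slide.
DOCUMENTS = {
    1: "John has a cat",
    2: "John has a dog",
    3: "Eva has a cat",
    4: "George has a dog",
}


def build_vocabulary(documents):
    """Number each unique word, starting from 1, in order of first appearance."""
    vocabulary = {}
    for doc_id in sorted(documents):
        for word in documents[doc_id].split():
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary) + 1
    return vocabulary


def encode(documents, vocabulary):
    """Replace every word of every document with its term number."""
    return {
        doc_id: [vocabulary[word] for word in text.split()]
        for doc_id, text in documents.items()
    }


VOCABULARY = build_vocabulary(DOCUMENTS)
ENCODED = encode(DOCUMENTS, VOCABULARY)

if __name__ == "__main__":
    print(VOCABULARY)
    for doc_id, codes in ENCODED.items():
        print(doc_id, codes)
```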
How it works … ?
Data Source – Documents … boolean search with inverted
index
Dictionary | Documents
Term | Tot. freq. | Doc #
John | 2 | 1, 2
has | 4 | 1, 2, 3, 4
a | 4 | 1, 2, 3, 4
cat | 2 | 1, 3
dog | 2 | 2, 4
Eva | 1 | 3
George | 1 | 4
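A boolean search over an inverted index can be sketched as follows. It is a simplified illustration rather than Lucene's actual data structure: terms are lower-cased and split on whitespace, postings are plain sorted lists, and the function names are hypothetical. Because every term occurs at most once per example document, the length of a posting list equals the total frequency in the table.

```python
from collections import defaultdict

# The four example documents from the slide.
DOCUMENTS = {
    1: "John has a cat",
    2: "John has a dog",
    3: "Eva has a cat",
    4: "George has a dog",
}


def build_inverted_index(documents):
    """Map each term to the sorted list of documents that contain it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search_and(index, *terms):
    """Documents containing all of the given terms (boolean AND)."""
    sets = [set(index.get(term.lower(), [])) for term in terms]
    return sorted(set.intersection(*sets)) if sets else []


def search_or(index, *terms):
    """Documents containing at least one of the given terms (boolean OR)."""
    result = set()
    for term in terms:
        result.update(index.get(term.lower(), []))
    return sorted(result)


INDEX = build_inverted_index(DOCUMENTS)

if __name__ == "__main__":
    print(search_and(INDEX, "John", "cat"))
    print(search_or(INDEX, "Eva", "George"))
```

The point of the structure is that a query only touches the posting lists of its own terms, instead of scanning every document.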
How it works … ?
Data Source – documents as vectors
Documents
document 1 John has a cat
document 2 John has a dog
document 3 Eva has a cat
document 4 George has a dog
Space of unique terms -> John has a cat dog Eva George
vector representing doc1 -> 1 1 1 1 0 0 0
vector representing doc2 -> 1 1 1 0 1 0 0
vector representing doc3 -> 0 1 1 1 0 1 0
vector representing doc4 -> 0 1 1 0 1 0 1
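The vectors above can be computed directly from the term space. The sketch below also compares two documents with cosine similarity; that measure is an assumption added for illustration, since the slide text itself only shows the binary vectors.

```python
import math

# Term space and documents as listed on the slide.
TERMS = ["John", "has", "a", "cat", "dog", "Eva", "George"]
DOCUMENTS = {
    1: "John has a cat",
    2: "John has a dog",
    3: "Eva has a cat",
    4: "George has a dog",
}


def to_vector(text, terms=TERMS):
    """Binary vector: 1 if the term occurs in the text, otherwise 0."""
    words = set(text.split())
    return [1 if term in words else 0 for term in terms]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


VECTORS = {doc_id: to_vector(text) for doc_id, text in DOCUMENTS.items()}

if __name__ == "__main__":
    for doc_id, vector in VECTORS.items():
        print(doc_id, vector)
    # doc1 and doc2 share three of their four terms.
    print(cosine(VECTORS[1], VECTORS[2]))
```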
How it works … ?
Data Source – Documents … vectors
Summary
[Diagram: 1. Other Sources]
Thank you
Data Wizards
E-mail: artur.grzadziel@gmail.com
Links:
• Cloudera Search:
http://www.cloudera.com/content/www/en-us/documentation/archive/search/1-3-0/Cloudera-Search-User-Guide/csug_introducing.html
• Tika
https://tika.apache.org/
• Apache Solr
http://lucene.apache.org/solr/
https://www.cloudera.com/content/www/en-us/products/apache-hadoop/apache-solr.html
• Vectors, Inverted Index, Frequency Matrix, etc. ...
http://courses.ischool.berkeley.edu/i202/f05/LectureNotes/202-20051108.htm

How to build your own google

  • 1. How to build your own google ... artur.grzadziel@gmail.com Data Wizards Dec 2015
  • 2. Artur Grządziel few words about me email: artur.grzadziel@gmail.com Currently: BigData and Machine Learning Leader From Jan 2016: BigData Solution Architect at General Electric PhD in progress at PAN (Polish Academy of Sciences) Systems Research Institute Graduated from Warsaw University of Technology and Warsaw School of Economics BigData & Machine Learning enthusiast focused on leveraging Big Data and Machine Learning in real business cases Privately, husband and father pl.linkedin.com/in/ArturGrzadziel
  • 3. Introduction Data Wizards Artur represents the „Data Wizards” group, an informal group of BigData/Machine Learning/Data Science professionals located in Poland who are interested in sharing knowledge and addressing business challenges with modern BigData and Machine Learning methods.
  • 5. MySearch very high level architecture Data Source Index
  • 6. Cloudera search Apache Solr and Tika 1. Other Sources
  • 7. Cloudera Search Cloudera Search is one of Cloudera's near-real-time access products. Cloudera Search enables non-technical users to search and explore data stored in or ingested into Hadoop and HBase. Users do not need SQL or programming skills to use Cloudera Search because it provides a simple, full-text interface for searching. Cloudera Search incorporates Apache Solr, which includes Apache Lucene, SolrCloud, Apache Tika, and Solr Cell. Cloudera Search is tightly integrated with Cloudera's Distribution, including Apache Hadoop (CDH). Cloudera Search provides these key capabilities: - Near-real-time indexing - Batch indexing - Simple, full-text data exploration and navigated drill down http://www.cloudera.com/content/www/en-us/documentation/archive/search/1-3-0/Cloudera-Search-User-Guide/csug_introducing.html
  • 12. Cloudera search Tika – formats Supported Document Formats • HyperText Markup Language • XML and derived formats • Microsoft Office document formats • OpenDocument Format • Portable Document Format • Electronic Publication Format • Rich Text Format • Compression and packaging formats • Text formats • Audio formats • Image formats • Video formats • Java class files and archives • The mbox format https://tika.apache.org/1.4/formats.html
  • 13. Cloudera search Solr – how to start it … .\bin\solr start -e cloud -noprompt http://lucene.apache.org/solr/
  • 15. Cloudera Search Data
    id | cat | name | price | inStock | author | series_t | sequence_i | genre_s
    553573403 | book | A Game of Thrones | 7.99 | TRUE | George R.R. Martin | A Song of Ice and Fire | 1 | fantasy
    553579908 | book | A Clash of Kings | 7.99 | TRUE | George R.R. Martin | A Song of Ice and Fire | 2 | fantasy
    055357342X | book | A Storm of Swords | 7.99 | TRUE | George R.R. Martin | A Song of Ice and Fire | 3 | fantasy
    553293354 | book | Foundation | 7.99 | TRUE | Isaac Asimov | Foundation Novels | 1 | scifi
    812521390 | book | The Black Company | 6.99 | FALSE | Glen Cook | The Chronicles of The Black Company | 1 | fantasy
    812550706 | book | Ender's Game | 6.99 | TRUE | Orson Scott Card | Ender | 1 | scifi
    441385532 | book | Jhereg | 7.95 | FALSE | Steven Brust | Vlad Taltos | 1 | fantasy
    380014300 | book | Nine Princes In Amber | 6.99 | TRUE | Roger Zelazny | the Chronicles of Amber | 1 | fantasy
    805080481 | book | The Book of Three | 5.99 | TRUE | Lloyd Alexander | The Chronicles of Prydain | 1 | fantasy
    080508049X | book | The Black Cauldron | 5.99 | TRUE | Lloyd Alexander | The Chronicles of Prydain | 2 | fantasy
  • 21. Cloudera search Solr – other features The MoreLikeThis search component lets users query for documents similar to a document in their result list. It works by using terms from the original document to find similar documents in the index. The SpellCheck component provides inline query suggestions based on other, similar terms. Highlighting in Solr allows fragments of documents that match the user's query to be included with the query response. Other features include synonyms and stop words.
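The features listed above are switched on through request parameters. The following is a minimal sketch, not taken from the slides, that only builds a request URL and does not contact a server. The host, port, the collection name `gettingstarted`, and the field `name` are assumptions for illustration. `hl`, `hl.fl`, `mlt`, `mlt.fl`, `mlt.mintf`, `mlt.mindf`, and `spellcheck` are standard Solr parameters, but spell checking only returns suggestions if the request handler has a spellcheck component configured.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed endpoint of a local Solr instance; adjust to your own setup.
SOLR_SELECT = "http://localhost:8983/solr/gettingstarted/select"


def build_feature_query_url(user_query, field="name"):
    """Build a select URL that asks for highlighting, similar documents,
    and spelling suggestions in a single request."""
    params = {
        "q": user_query,
        "hl": "true",          # highlighting: return matching fragments
        "hl.fl": field,        # field to take the fragments from
        "mlt": "true",         # MoreLikeThis for each result
        "mlt.fl": field,       # field whose terms drive the similarity
        "mlt.mintf": 1,        # accept terms that occur once in the document
        "mlt.mindf": 1,        # accept terms that occur in a single document
        "spellcheck": "true",  # needs a configured spellcheck component
    }
    return SOLR_SELECT + "?" + urlencode(params)


url = build_feature_query_url("game of thrones")
decoded = parse_qs(urlsplit(url).query)
print(url)
```

The lowered `mlt.mintf` and `mlt.mindf` thresholds are a choice for a tiny demo index such as the book list; on a real index the defaults are usually more sensible.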
  • 22. Cloudera search Solr – other features – geospatial search Solr has sophisticated geospatial support, including searching within a specified distance range of a given location (or within a bounding box), sorting by distance, or even boosting results by the distance http://lucene.apache.org/solr/quickstart.html
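A distance-range search with sorting by distance can be expressed with Solr's `geofilt` filter and the `geodist()` function. The sketch below only assembles the request URL, so it runs without a server. The base URL, the spatial field name `store`, and the coordinates are assumptions for illustration; the parameters `fq={!geofilt}`, `sfield`, `pt`, `d` (distance in kilometres), and `sort=geodist() asc` are standard Solr spatial parameters.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_geo_query_url(base_url, lat, lon, radius_km, spatial_field="store"):
    """Build a select URL that keeps documents within radius_km of (lat, lon)
    and orders them from nearest to farthest."""
    params = {
        "q": "*:*",               # match everything, the filter narrows it down
        "fq": "{!geofilt}",       # circular distance filter
        "sfield": spatial_field,  # assumed name of a location-typed field
        "pt": f"{lat},{lon}",     # centre point as "lat,lon"
        "d": radius_km,           # radius in kilometres
        "sort": "geodist() asc",  # nearest results first
    }
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint and location, used only to show the resulting URL.
geo_url = build_geo_query_url(
    "http://localhost:8983/solr/gettingstarted/select", 45.15, -93.85, 5
)
geo_params = parse_qs(urlsplit(geo_url).query)
print(geo_url)
```

Replacing `{!geofilt}` with `{!bbox}` gives the bounding-box variant mentioned on the slide, using the same `sfield`, `pt`, and `d` parameters.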
  • 23. Cloudera Search Common Use Cases Cloudera Search lets your entire business explore and analyze data quickly and easily for a variety of critical use cases all within a single platform, including: - Threat detection - Customer 360-degree visibility - Improved user experience - Interactive market segmentation - Accessible global knowledge base https://www.cloudera.com/content/www/en-us/products/apache-hadoop/apache-solr.html
  • 24. Cloudera Search Other Use Cases Instagram: Instagram (a Facebook company) uses Solr to power its geosearch API. WhiteHouse.gov: The Obama administration's website is built on Drupal and Solr. Netflix: Solr powers basic movie searching on this extremely busy site. StubHub.com: This ticket reseller uses Solr to help visitors search for concerts and sporting events. https://www.safaribooksonline.com/library/view/scaling-apache-solr/9781783981748/ch01s05.html
  • 25. How it works ... ?
  • 26. How it works … ? Data Source – documents …
    Document | Content
    1 | John has a cat
    2 | John has a dog
    3 | Eva has a cat
    4 | George has a dog
  • 27. How it works … ? Data Source – documents … space of unique terms
    Document | Content | Term IDs
    1 | John has a cat | 1 2 3 4
    2 | John has a dog | 1 2 3 5
    3 | Eva has a cat | 6 2 3 4
    4 | George has a dog | 7 2 3 5
    List of unique words: 1. John 2. has 3. a 4. cat 5. dog 6. Eva 7. George
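The mapping from documents to term IDs can be reproduced in a few lines. This is a minimal sketch, assuming words are split on whitespace and numbered in order of first appearance, which matches the numbering on the slide; the function names are made up for illustration.

```python
# The four example documents from the slide, keyed by document number.
docs = {
    1: "John has a cat",
    2: "John has a dog",
    3: "Eva has a cat",
    4: "George has a dog",
}


def build_vocabulary(documents):
    """Give every unique word an ID, starting at 1, in order of first appearance."""
    vocab = {}
    for doc_id in sorted(documents):
        for word in documents[doc_id].split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab


def encode(documents, vocab):
    """Replace each word of each document with its term ID."""
    return {
        doc_id: [vocab[word] for word in text.split()]
        for doc_id, text in documents.items()
    }


vocab = build_vocabulary(docs)
encoded = encode(docs, vocab)

print(vocab)       # {'John': 1, 'has': 2, 'a': 3, 'cat': 4, 'dog': 5, 'Eva': 6, 'George': 7}
print(encoded[4])  # [7, 2, 3, 5]
```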
  • 28. How it works … ? Data Source – Documents … boolean search with inverted index
    Dictionary (Term | Tot. freq.) -> Documents (Doc #)
    John | 2 -> 1 2
    has | 4 -> 1 2 3 4
    a | 4 -> 1 2 3 4
    cat | 2 -> 1 3
    dog | 2 -> 2 4
    Eva | 1 -> 3
    George | 1 -> 4
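The dictionary and postings lists above can be built and queried as follows. This is a minimal sketch of a boolean AND search, not how Solr or Lucene implement it internally. Lowercasing the terms is an added assumption so that queries are case-insensitive, and the function names are made up for illustration.

```python
from collections import defaultdict

docs = {
    1: "John has a cat",
    2: "John has a dog",
    3: "Eva has a cat",
    4: "George has a dog",
}


def build_inverted_index(documents):
    """Map each term to the set of document numbers that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_and(index, query):
    """Return the documents that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return []
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))


index = build_inverted_index(docs)

print(sorted(index["cat"]))             # [1, 3]
print(search_and(index, "John dog"))    # [2]
print(search_and(index, "has a cat"))   # [1, 3]
print(search_and(index, "Eva dog"))     # []
```

Because each query term is looked up directly in the dictionary, the search touches only the postings lists of the query terms rather than scanning every document.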
  • 29. How it works … ? Data Source – documents as vectors
    Documents:
    document 1: John has a cat
    document 2: John has a dog
    document 3: Eva has a cat
    document 4: George has a dog
    Space of unique terms -> John has a cat dog Eva George
    vector representing doc1 -> 1 1 1 1 0 0 0
    vector representing doc2 -> 1 1 1 0 1 0 0
    vector representing doc3 -> 0 1 1 1 0 1 0
    vector representing doc4 -> 0 1 1 0 1 0 1
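The vectors above can be computed directly from the term space. The sketch below also compares documents with cosine similarity; that measure is an assumption added here as one common way to compare such vectors, and the slides do not state which measure is used.

```python
import math

# Term space in the order used on the slide.
vocabulary = ["John", "has", "a", "cat", "dog", "Eva", "George"]

docs = {
    1: "John has a cat",
    2: "John has a dog",
    3: "Eva has a cat",
    4: "George has a dog",
}


def to_vector(text, terms):
    """Binary vector: 1 if the term occurs in the text, otherwise 0."""
    words = set(text.split())
    return [1 if term in words else 0 for term in terms]


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


vectors = {doc_id: to_vector(text, vocabulary) for doc_id, text in docs.items()}

print(vectors[1])                      # [1, 1, 1, 1, 0, 0, 0]
print(vectors[4])                      # [0, 1, 1, 0, 1, 0, 1]
print(cosine(vectors[1], vectors[2]))  # 0.75 (three shared terms out of four)
print(cosine(vectors[1], vectors[4]))  # 0.5  (only "has" and "a" are shared)
```

In this toy example the shared words "has" and "a" dominate the similarity, which is one reason stop words are usually removed or down-weighted.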
  • 30. How it works … ? Data Source – Documents … vectors
  • 32. Thank you Data Wizards E-mail: artur.grzadziel@gmail.com Links: • Cloudera Search: http://www.cloudera.com/content/www/en-us/documentation/archive/search/1-3-0/Cloudera-Search-User-Guide/csug_introducing.html • Tika https://tika.apache.org/ • Apache Solr http://lucene.apache.org/solr/ https://www.cloudera.com/content/www/en-us/products/apache-hadoop/apache-solr.html • Vectors, Inverted Index, Frequency Matrix, etc. ... http://courses.ischool.berkeley.edu/i202/f05/LectureNotes/202-20051108.htm