SlideShare a Scribd company logo
1 of 29
Download to read offline
Adventures in Discoverability with C* and Solr

Patricia Gorla, Systems Engineer
@patriciagorla
@o19s

#CASSANDRAEU

CASSANDRASUMMITEU
About Me
• Solr
• Cassandra
• Information retrieval

Paul Hostetler - phostetler.com

#CASSANDRAEU

CASSANDRASUMMITEU
How Do I Find What I’m Looking For?
Simple

Complex

Aristotle’s birthplace?

All ancient Greek philosophers?

Coordinates of Stagira?

All cities within 100km of Stagira

#CASSANDRAEU

CASSANDRASUMMITEU
How Do I Find What I’m Looking For?
Simple

Complex

Aristotle’s birthplace?
select birthPlace
where name = “Aristotle”;

All ancient Greek philosophers?

Coordinates of Stagira?
select coord
where name = “Stagira”;

All cities within 100km of Stagira

#CASSANDRAEU

CASSANDRASUMMITEU
How Do I Find What I’m Looking For?
Simple
Aristotle’s birthplace?
select birthPlace
where name = “Aristotle”;

Complex
All ancient Greek philosophers?
create index on tag;
select *
where tag = “Greek philosophy”;

Coordinates of Stagira?
select coord
where name = “Stagira”;

#CASSANDRAEU

All cities within 100km of Stagira
???

CASSANDRASUMMITEU
How Do I Find What I’m Looking For?
Simple
Aristotle’s birthplace?
q=Aristotle&fl=birthPlace

All ancient Greek philosophers?
q=Greek philosophy

Coordinates of Stagira?
q=Stagira&fl=point

All cities within 100km of Stagira
q=*:*&fq={!geofilt pt=40.530,
23.752 sfield=point d=100}

#CASSANDRAEU

CASSANDRASUMMITEU
Approaches to Search
• Google Site Search
• MySQL ‘like’ statements

Seth Casteel - littlefriendsphoto.com

#CASSANDRAEU

CASSANDRASUMMITEU
Approaches to Search

• Full-text search
• Ranking (Scoring)
• Tokenization
• Stemming
• Faceting

#CASSANDRAEU

CASSANDRASUMMITEU
Approaches to Search

• Full-text search
• Ranking (Scoring)
• Tokenization
• Stemming
• Faceting

#CASSANDRAEU

and many more!

CASSANDRASUMMITEU
Inverted Index
[1] Pleasure in the job puts perfection in the work.
[2] Education is the best provision for the journey to old age.
[3] If some animals are good at hunting and others are suitable
for hunting, then the gods must clearly smile on hunting.
[4] It is the mark of an educated mind to be able to entertain a
thought without absorbing it.

Term

Freq

Documents

education

[2] [4]

hunting

3

[3]

perfection
#CASSANDRAEU

2
1

[1]
CASSANDRASUMMITEU
Defining Field Types
<fieldType name="text_general" class="solr.TextField" >
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"/>
<filter class="solr.StopFilterFactory” ignoreCase="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
#CASSANDRAEU

CASSANDRASUMMITEU
Index-side Analysis

Punctuation
Stop Words
Lowercase
Stemming

#CASSANDRAEU

Hope is a waking dream.
Hope is a waking dream
Hope
waking dream
hope
waking dream
hope
wake dream

CASSANDRASUMMITEU
Query-side Analysis

Punctuation
Stop Words
Lowercase
Stemming
Synonyms
#CASSANDRAEU

Hope is a waking dream.
Hope is a waking dream
Hope
waking dream
hope
waking dream
hope
wake dream
hope
wake dream
desire awake wish
CASSANDRASUMMITEU
Faceting
facet_fields: {
tags: [hunting: 1, education: 2, work: 2],
locations: [Stagira: 5, Chalcis: 3]
}

#CASSANDRAEU

CASSANDRASUMMITEU
Approaches to Distribution
• Pay Google more $$
• MySQL ‘like’ shards
• Master/Slave replication
• SolrCloud

#CASSANDRAEU

CASSANDRASUMMITEU
“Distributed search is hard.”

#CASSANDRAEU

CASSANDRASUMMITEU
Solr + Cassandra: Datastax Enterprise
• Full-text search
• Tokenization
• Stemming
• Date ranges
• Aggregation
• High Availability
• Distributed Nature

#CASSANDRAEU

CASSANDRASUMMITEU
Examining DBpedia.org Datasets
Person
<http://xmlns.com/foaf/0.1/name> "Aristotle"@en .
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> Person .
<http://purl.org/dc/elements/1.1/description> "Greek philosopher"@en .
<http://dbpedia.org/ontology/birthPlace> Stagira
<http://dbpedia.org/ontology/deathPlace> Chalcis .

Place
<http://dbpedia.org/resource/Stagira> <http://www.opengis.net/gml/_Feature> .
<http://dbpedia.org/resource/Stagira#lat> "40.5916667" .
<http://dbpedia.org/resource/Stagira#long> "23.7947222" .
<http://dbpedia.org/resource/Stagira#point> "40.591667 23.7947222"@en .

#CASSANDRAEU

CASSANDRASUMMITEU
“Love is a single soul
inhabiting two bodies.”
#CASSANDRAEU

CASSANDRASUMMITEU
Querying for data
curl http://localhost:8983/solr/solr.person/q=Aristotle

#CASSANDRAEU

CASSANDRASUMMITEU
Filtering by location
curl http://localhost:8983/solr/solr.location/select?q=*:*
&spatial=true&fq={!geofilt pt=40.53027,23.7525 sfield=point
d=100}

#CASSANDRAEU

CASSANDRASUMMITEU
“There is no great genius without a
mixture of madness.”
#CASSANDRAEU

CASSANDRASUMMITEU
Schema.xml
Unified Schema
<fields>
<field name="id" type="string" />
<field name="name" type="text" />
<dynamicField name="*Date" type="date" />
<dynamicField name="*Place" type="text" />
<dynamicField name="*Point" type="location" />
<dynamicField name="*_tag"

type="text" />

</fields>

#CASSANDRAEU

CASSANDRASUMMITEU
Upload to DSE, Create Core
curl http://localhost:8983/solr/resource/solr.location/schema.xml 
--data-binary @solr/location_schema.xml 
-H 'Content-type:text/xml; charset=utf-8 '

curl http://localhost:8983/solr/resource/solr.location/solrconfig.xml 
--data-binary @solr/location_solrconfig.xml 
-H 'Content-type:text/xml; charset=utf-8 '
curl http://localhost:8983/solr/admin/cores?action=CREATE&name=solr.location

#CASSANDRAEU

CASSANDRASUMMITEU
Cassandra Schema
Location Schema

gc_grace_seconds=864000 AND

cqlsh:solr> DESC COLUMNFAMILY location;

read_repair_chance=0.100000 AND

CREATE TABLE location (

replicate_on_write='true' AND

id text PRIMARY KEY,
"_docBoost" text,
"_dynFld" text,
location text,

populate_io_cache_on_flush='false' AND
compaction={'class':
'SizeTieredCompactionStrategy'} AND
compression={'sstable_compression':
'SnappyCompressor'};

name text,
solr_query text,
tags text
) WITH COMPACT STORAGE AND
bloom_filter_fp_chance=0.010000 AND
caching='KEYS_ONLY' AND
comment='' AND

CREATE INDEX
solr_location__docBoost_index ON location
("_docBoost");
...
CREATE INDEX
solr_location_solr_query_index ON
location (solr_query);

dclocal_read_repair_chance=0.000000 AND
#CASSANDRAEU

CASSANDRASUMMITEU
What Changes
Solr

Cassandra

• No multiValued fields

• No composite columns

• No JOIN*

• No counter columns

#CASSANDRAEU

CASSANDRASUMMITEU
Bringing it All Together
•

Fault tolerant, available
search

#CASSANDRAEU

CASSANDRASUMMITEU
“Thank you.”

#CASSANDRAEU

CASSANDRASUMMITEU
THANK YOU

pgorla@opensourceconnections.com
@patriciagorla
@o19s
All information, including slides, are on http://github.com/pgorla/million-books
#CASSANDRAEU

CASSANDRASUMMITEU

More Related Content

Similar to Adventures in Discovery with Cassandra and Solr

A noobs lesson on solr (configuration)
A noobs lesson on solr (configuration)A noobs lesson on solr (configuration)
A noobs lesson on solr (configuration)BTI360
 
Solr vs. Elasticsearch - Case by Case
Solr vs. Elasticsearch - Case by CaseSolr vs. Elasticsearch - Case by Case
Solr vs. Elasticsearch - Case by CaseAlexandre Rafalovitch
 
[제1회 루씬 한글분석기 기술세미나] solr로 나만의 검색엔진을 만들어보자
[제1회 루씬 한글분석기 기술세미나] solr로 나만의 검색엔진을 만들어보자[제1회 루씬 한글분석기 기술세미나] solr로 나만의 검색엔진을 만들어보자
[제1회 루씬 한글분석기 기술세미나] solr로 나만의 검색엔진을 만들어보자Donghyeok Kang
 
6 things about perl 6
6 things about perl 66 things about perl 6
6 things about perl 6brian d foy
 
6 more things about Perl 6
6 more things about Perl 66 more things about Perl 6
6 more things about Perl 6brian d foy
 
Introdução a buscas com solr
Introdução a buscas com solrIntrodução a buscas com solr
Introdução a buscas com solrClaudson Oliveira
 
Dev in Santos - Como NÃO fazer pesquisas usando LIKE
Dev in Santos - Como NÃO fazer pesquisas usando LIKEDev in Santos - Como NÃO fazer pesquisas usando LIKE
Dev in Santos - Como NÃO fazer pesquisas usando LIKEFabio Akita
 
Solr vs. Elasticsearch, Case by Case: Presented by Alexandre Rafalovitch, UN
Solr vs. Elasticsearch,  Case by Case: Presented by Alexandre Rafalovitch, UNSolr vs. Elasticsearch,  Case by Case: Presented by Alexandre Rafalovitch, UN
Solr vs. Elasticsearch, Case by Case: Presented by Alexandre Rafalovitch, UNLucidworks
 

Similar to Adventures in Discovery with Cassandra and Solr (10)

A noobs lesson on solr (configuration)
A noobs lesson on solr (configuration)A noobs lesson on solr (configuration)
A noobs lesson on solr (configuration)
 
Solr vs. Elasticsearch - Case by Case
Solr vs. Elasticsearch - Case by CaseSolr vs. Elasticsearch - Case by Case
Solr vs. Elasticsearch - Case by Case
 
[제1회 루씬 한글분석기 기술세미나] solr로 나만의 검색엔진을 만들어보자
[제1회 루씬 한글분석기 기술세미나] solr로 나만의 검색엔진을 만들어보자[제1회 루씬 한글분석기 기술세미나] solr로 나만의 검색엔진을 만들어보자
[제1회 루씬 한글분석기 기술세미나] solr로 나만의 검색엔진을 만들어보자
 
Solr Anti Patterns
Solr Anti PatternsSolr Anti Patterns
Solr Anti Patterns
 
6 things about perl 6
6 things about perl 66 things about perl 6
6 things about perl 6
 
6 more things about Perl 6
6 more things about Perl 66 more things about Perl 6
6 more things about Perl 6
 
Introdução a buscas com solr
Introdução a buscas com solrIntrodução a buscas com solr
Introdução a buscas com solr
 
Dev in Santos - Como NÃO fazer pesquisas usando LIKE
Dev in Santos - Como NÃO fazer pesquisas usando LIKEDev in Santos - Como NÃO fazer pesquisas usando LIKE
Dev in Santos - Como NÃO fazer pesquisas usando LIKE
 
Solr vs. Elasticsearch, Case by Case: Presented by Alexandre Rafalovitch, UN
Solr vs. Elasticsearch,  Case by Case: Presented by Alexandre Rafalovitch, UNSolr vs. Elasticsearch,  Case by Case: Presented by Alexandre Rafalovitch, UN
Solr vs. Elasticsearch, Case by Case: Presented by Alexandre Rafalovitch, UN
 
Fuzzing - A Tale of Two Cultures
Fuzzing - A Tale of Two CulturesFuzzing - A Tale of Two Cultures
Fuzzing - A Tale of Two Cultures
 

Recently uploaded

How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 

Recently uploaded (20)

How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 

Adventures in Discovery with Cassandra and Solr

  • 1. Adventures in Discoverability with C* and Solr Patricia Gorla, Systems Engineer @patriciagorla @o19s #CASSANDRAEU CASSANDRASUMMITEU
  • 2. About Me • Solr • Cassandra • Information retrieval Paul Hostetler - phostetler.com #CASSANDRAEU CASSANDRASUMMITEU
  • 3. How Do I Find What I’m Looking For? Simple Complex Aristotle’s birthplace? All ancient Greek philosophers? Coordinates of Stagira? All cities within 100km of Stagira #CASSANDRAEU CASSANDRASUMMITEU
  • 4. How Do I Find What I’m Looking For? Simple Complex Aristotle’s birthplace? select birthPlace where name = “Aristotle”; All ancient Greek philosophers? Coordinates of Stagira? select coord where name = “Stagira”; All cities within 100km of Stagira #CASSANDRAEU CASSANDRASUMMITEU
  • 5. How Do I Find What I’m Looking For? Simple Aristotle’s birthplace? select birthPlace where name = “Aristotle”; Complex All ancient Greek philosophers? create index on tag; select * where tag = “Greek philosophy”; Coordinates of Stagira? select coord where name = “Stagira”; #CASSANDRAEU All cities within 100km of Stagira ??? CASSANDRASUMMITEU
  • 6. How Do I Find What I’m Looking For? Simple Aristotle’s birthplace? q=Aristotle&fl=birthPlace All ancient Greek philosophers? q=Greek philosophy Coordinates of Stagira? q=Stagira&fl=point All cities within 100km of Stagira q=*:*&fq={!geofilt pt=40.530, 23.752 sfield=point d=100} #CASSANDRAEU CASSANDRASUMMITEU
  • 7. Approaches to Search • Google Site Search • MySQL ‘like’ statements Seth Casteel - littlefriendsphoto.com #CASSANDRAEU CASSANDRASUMMITEU
  • 8. Approaches to Search • Full-text search • Ranking (Scoring) • Tokenization • Stemming • Faceting #CASSANDRAEU CASSANDRASUMMITEU
  • 9. Approaches to Search • Full-text search • Ranking (Scoring) • Tokenization • Stemming • Faceting #CASSANDRAEU and many more! CASSANDRASUMMITEU
  • 10. Inverted Index [1] Pleasure in the job puts perfection in the work. [2] Education is the best provision for the journey to old age. [3] If some animals are good at hunting and others are suitable for hunting, then the gods must clearly smile on hunting. [4] It is the mark of an educated mind to be able to entertain a thought without absorbing it. Term Freq Documents education [2] [4] hunting 3 [3] perfection #CASSANDRAEU 2 1 [1] CASSANDRASUMMITEU
  • 11. Defining Field Types <fieldType name="text_general" class="solr.TextField" > <analyzer type="index"> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.StopFilterFactory" ignoreCase="true /> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.EnglishPossessiveFilterFactory"/> <filter class="solr.PorterStemFilterFactory"/> </analyzer> <analyzer type="query"> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"/> <filter class="solr.StopFilterFactory” ignoreCase="true"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.PorterStemFilterFactory"/> </analyzer> </fieldType> #CASSANDRAEU CASSANDRASUMMITEU
  • 12. Index-side Analysis Punctuation Stop Words Lowercase Stemming #CASSANDRAEU Hope is a waking dream. Hope is a waking dream Hope waking dream hope waking dream hope wake dream CASSANDRASUMMITEU
  • 13. Query-side Analysis Punctuation Stop Words Lowercase Stemming Synonyms #CASSANDRAEU Hope is a waking dream. Hope is a waking dream Hope waking dream hope waking dream hope wake dream hope wake dream desire awake wish CASSANDRASUMMITEU
  • 14. Faceting facet_fields: { tags: [hunting: 1, education: 2, work: 2], locations: [Stagira: 5, Chalcis: 3] } #CASSANDRAEU CASSANDRASUMMITEU
  • 15. Approaches to Distribution • Pay Google more $$ • MySQL ‘like’ shards • Master/Slave replication • SolrCloud #CASSANDRAEU CASSANDRASUMMITEU
  • 16. “Distributed search is hard.” #CASSANDRAEU CASSANDRASUMMITEU
  • 17. Solr + Cassandra: Datastax Enterprise • Full-text search • Tokenization • Stemming • Date ranges • Aggregation • High Availability • Distributed Nature #CASSANDRAEU CASSANDRASUMMITEU
  • 18. Examining DBpedia.org Datasets Person <http://xmlns.com/foaf/0.1/name> "Aristotle"@en . <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> Person . <http://purl.org/dc/elements/1.1/description> "Greek philosopher"@en . <http://dbpedia.org/ontology/birthPlace> Stagira <http://dbpedia.org/ontology/deathPlace> Chalcis . Place <http://dbpedia.org/resource/Stagira> <http://www.opengis.net/gml/_Feature> . <http://dbpedia.org/resource/Stagira#lat> "40.5916667" . <http://dbpedia.org/resource/Stagira#long> "23.7947222" . <http://dbpedia.org/resource/Stagira#point> "40.591667 23.7947222"@en . #CASSANDRAEU CASSANDRASUMMITEU
  • 19. “Love is a single soul inhabiting two bodies.” #CASSANDRAEU CASSANDRASUMMITEU
  • 20. Querying for data curl http://localhost:8983/solr/solr.person/q=Aristotle #CASSANDRAEU CASSANDRASUMMITEU
  • 21. Filtering by location curl http://localhost:8983/solr/solr.location/select?q=*:* &spatial=true&fq={!geofilt pt=40.53027,23.7525 sfield=point d=100} #CASSANDRAEU CASSANDRASUMMITEU
  • 22. “There is no great genius without a mixture of madness.” #CASSANDRAEU CASSANDRASUMMITEU
  • 23. Schema.xml Unified Schema <fields> <field name="id" type="string" /> <field name="name" type="text" /> <dynamicField name="*Date" type="date" /> <dynamicField name="*Place" type="text" /> <dynamicField name="*Point" type="location" /> <dynamicField name="*_tag" type="text" /> </fields> #CASSANDRAEU CASSANDRASUMMITEU
  • 24. Upload to DSE, Create Core curl http://localhost:8983/solr/resource/solr.location/schema.xml --data-binary @solr/location_schema.xml -H 'Content-type:text/xml; charset=utf-8 ' curl http://localhost:8983/solr/resource/solr.location/solrconfig.xml --data-binary @solr/location_solrconfig.xml -H 'Content-type:text/xml; charset=utf-8 ' curl http://localhost:8983/solr/admin/cores?action=CREATE&name=solr.location #CASSANDRAEU CASSANDRASUMMITEU
  • 25. Cassandra Schema Location Schema gc_grace_seconds=864000 AND cqlsh:solr> DESC COLUMNFAMILY location; read_repair_chance=0.100000 AND CREATE TABLE location ( replicate_on_write='true' AND id text PRIMARY KEY, "_docBoost" text, "_dynFld" text, location text, populate_io_cache_on_flush='false' AND compaction={'class': 'SizeTieredCompactionStrategy'} AND compression={'sstable_compression': 'SnappyCompressor'}; name text, solr_query text, tags text ) WITH COMPACT STORAGE AND bloom_filter_fp_chance=0.010000 AND caching='KEYS_ONLY' AND comment='' AND CREATE INDEX solr_location__docBoost_index ON location ("_docBoost"); ... CREATE INDEX solr_location_solr_query_index ON location (solr_query); dclocal_read_repair_chance=0.000000 AND #CASSANDRAEU CASSANDRASUMMITEU
  • 26. What Changes Solr Cassandra • No multiValued fields • No composite columns • No JOIN* • No counter columns #CASSANDRAEU CASSANDRASUMMITEU
  • 27. Bringing it All Together • Fault tolerant, available search #CASSANDRAEU CASSANDRASUMMITEU
  • 29. THANK YOU pgorla@opensourceconnections.com @patriciagorla @o19s All information, including slides, are on http://github.com/pgorla/million-books #CASSANDRAEU CASSANDRASUMMITEU