SlideShare a Scribd company logo
NoSQL: Apache SOLR

                                                Apache Hadoop
                       By Dmitry Kan for NerdCamp, April 23 2011
dmitry.kan@gmail.com
Dilbert: expert in NoSQL
•The acronym NoSQL was coined in 1998 (Carlo Strozzi): as the NoSQL
movement "departs from the relational model altogether; it should
therefore have been called more appropriately 'NoREL', or something to
that effect.“ (wikipedia)
•NoSQL = Not Only SQL
•Companies: Facebook, Twitter, Digg, Amazon, LinkedIn and Google


•Data storage: billion gigabytes (GB) of data
•Interconnected data: hyperlinks, blog pingbacks, social networks
•Complex Data structure: hierarchical nested data structures easily
(multiple relational tables in SQL)
•Performance: the more data in SQL, the likely it to degrade


•NoSQL is not:
    •… SQL and not relational
    •… replacement for SQL, but compliment
    •... There is no fixed schema and no joins
    •... Does not ”scale-up” (RDBMS, vertical scaling), but rather ”scales-
    out” (spreading the load over many commodity systems) – horizontal
    scaling
NoSQL Categories

•Key-value Stores: bigh hashtable with caching mechanisms
•Column Family Stores: keys point to multiple columns (Google’s BigTable)
•Document Databases: documents are collections of other key-value
collections
•Graph Databases: nodes, relationships between nodes and nodes props

Major NoSQL players
•Dynamo: Amazon.com, key-value, used in Amazon S3 (simple storage
service)
•Cassandra: open-sourced by Facebook, column oriented NoSQL DB
•BigTable: Google’s proprietary column oriented DB (App Engine)
•CouchDB: OS document oriented NoSQL DB (as well as MongoDB)
•Neo4j: OS graph DB

Querying NoSQL DB:
•Data model specific
•RESTful interfaces or query APIs
•SPARQL: declarative query specification for graph DBs
Simple Protocol And RDFQuery Language
(courtesy of about.com and IBM)
Example of retrieving the URL of a blogger

PREFIX foaf <http://xmlns.com/foaf/0.1/>
SELECT ?url
FROM <bloggers.rdf>
WHERE {
?contributor foaf:name "Jon Foobar" .
?contributor foaf:weblog ?url .
}




  stats!
Some stats from (Information Week) via
about.com (2010):
•44% biz IT professionals haven’t heard of NoSQL
•1%: NoSQL is strategic direction

•Some stats from NerdCamp (April 2011):
•10% heard and used the NoSQL
•Much more people know about cloud, which can
become more and more a driving platform behind
NoSQL


Does the world of NoSQL have enough mass to
appeal to IT now?
“Solr is the popular, blazing
                                                fast open source enterprise
                                                search platform from the
                                                Apache Lucene project.”

                                                Created by Yonik Seeley at
                                                CNET

                                                Features:
                                                •Full-text search
                                                •Hit highlighting
http://lucene.apache.org/solr/                  •Faceted search (Dynamic
http://lucene.apache.org/solr/tutorial.html     clustering)
http://lucene.apache.org/java/docs/index.html   •DB integration
                                                •Rich doc handling
Books                                           •Geospatial search
                                                •Distributed search
                                                •Replicataion
                                                •REST-like HTTP/XML & JSON
                                                APIS
drupal



Companies using SOLR
Curent version: Apache Solr 3.1 (March 31, 2011)   Operating system support
 License: ASL 2.0                                   All with a Java VM, including:
 Features:                                          Linux (all versions)
 •Faceted navigation                                Windows (all versions)
 •Hit highlighting                                  MacOS (all versions)
 •GEO search: filter and sort by distance           Unix variants
 •Spellcheck and auto suggest                       App-server support
 •Advanced ranking and sorting                      Apache Tomcat, Jetty, Resin,
 •Distributed and replicated search                 WebLogic™, WebSphere™,
 •Structured / unstructured search                  GlassFish, dmServer™, JBoss™
 •Rich plugin architecture, extensible              and many more
                                                    Java version requirement
                                                    Java JDK 1.5 or later
                                                    Client API support
                                                    Java, .NET, PHP, Python, Ruby
                                                    (on
                                                    Rails), C++, XML/HTTP,
Overview of current state                           JSON/HTTP ++


April 2011
Faceted search
•A technique for refining search results
•Concept composition:
    • Article + in English + about nerdcamp
    • Finnish rap + < 1 minute + released in 2001


•Types:
    • Standard facets (list of facets with values)
    • Hierarchical facet values (taxonomy of facet
      values)
    • Range / query facets: by date, by price, by
      alphabet, by interval
Spatial Search

Combines location data with text data
•Represent spatial data in the index
•Filter by some spatial concept such as a bounding box or other shape
•Sort by distance
•Score/boost by distance

•<field name="store">45.17614,-93.87341</field> <!-- Buffalo store -->
<field name="store">40.7143,-74.006</field> <!-- NYC store -->
<field name="store">37.7752,-122.4232</field> <!-- San Francisco store --
>

•bbox: bounding box filter (bbox is a range of lats and lons that
encompasses the circle of radius d)
•geodist: the distance function
Hit highlighting

Example from solr admin
Spellcheck and autosuggest

Spellcheck:
•Query suggestion for a missspelled query term
http://localhost:8983/solr/spell?q=hell
ultrashar&spellcheck=true&spellcheck.collate=true&spellcheck.build=tru
e
<lst name="spellcheck"> <lst name="suggestions"> <lst name="hell"> <int
name="numFound">1</int> <int name="startOffset">0</int> <int
name="endOffset">4</int> <arr name="suggestion"> <str>dell</str>
</arr> </lst> <lst name="ultrashar"> <int name="numFound">1</int>
<int name="startOffset">5</int> <int name="endOffset">14</int> <arr
name="suggestion"> <str>ultrasharp</str> </arr> </lst> <str
name="collation">dell ultrasharp</str> </lst> </lst>

Autosuggest:
Example with solr and jquery
Advanced sorting, ranking and searching

•sort=score+asc
•sort=Author+desc,score+desc
•boosting single documents

•Term Frequency—tf
•Inverse Document Frequency – idf
•Co-ordination Factor – coord (the greater the # of queried terms match,
the greater the score)
•Field Length – fieldNorm (the shorter the matching field is in number of
indexed terms, the greater the document’s score)

•AND, OR, NOT, NEAR, fuzzy search
•Smashing~0.7 yields more results than just Smashing
Distributed and replicated search




Before doing this:
•Consider vertical scaling (faster and better machine)
•Rethink the data model (what data goes to which solr index)
•Remove logging on updates (and / or searches)
•Redesign you index: make as many fields non-indexed and non-stored (use cases)
•Check your Internet connection
Extendability
Plugins:
•Query parser: extend LuceneQParserPlugin

public class NerdCampQParserPlugin extends LuceneQParserPlugin {
public QParser createParser(String qstr, SolrParams localParams,
                  SolrParams params, SolrQueryRequest req) {}

}
SOLR I/O
•Nutch (crawler)
•CSV, XML, DataImportHandlers, DB import, Apache Tika (rich document
import, like pdf), your format

•Output: xml, json, python, javabin, csv… , your format
SOLR Processing Pipeline
•On each step, a document gets transformed
•Stop words removal
•Stemming
•(smart) Tokenization
•Ngrams (letter level and word level)
•Regular expressions
•Low casing
•Reversed wildcard
•Duplicate removal
Solr on the cloud
Hadoop: MapReduce
ZooKeeper: at least 3 Zoo Keepers to have 1-2 managing your Zoo
Batch indexing, no realtime search yet




 Hadoop vital components: Core and API

 MapReduce -- computation model
 HDFS
 I/O
 ZooKeeper
 Pig (adds level of abstraction for processing
 large datasets)
Solr on the cloud
Does it shine? Yes, but not fully
References
[1] Tim Perdue: NoSQL: An Overview of NoSQL Databases, About.com Guide
Sarah Pidcock (2011-01-31). http://bit.ly/fFQOYI
[2] "Dynamo: Amazon’s Highly Available Key-value Store".
http://www.cs.uwaterloo.ca/:
WATERLOO. p. 2/22. Retrieved 2011-04-05.
"Dynamo: a highly available and scalable distributed data store"
[3] http://cassandra.apache.org/
[4] http://labs.google.com/papers/bigtable.html
[5] http://aws.amazon.com/ (look for SimpleDB)
[6] http://couchdb.apache.org/
[7] http://neo4j.org/
[8] Information Week: Surprise: 44% Of Business IT Pros Never Heard Of NoSQL
http://bit.ly/go5ios
[9] http://drupal.org/
[10] Mark Miller: Scaling Lucene and Solr // Lucid Imagination
[11] http://wiki.apache.org/solr/SpatialSearch
[12] http://dmitrykan.blogspot.com/2011/01/solr-speed-up-batch-posting.html
[13] http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
References
[14] Using Nutch with SOLR,
http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
[15] http://tika.apache.org/
[16] http://lucene.apache.org/solr/

More Related Content

What's hot

SearchHub - How to Spend Your Summer Keeping it Real: Presented by Grant Inge...
SearchHub - How to Spend Your Summer Keeping it Real: Presented by Grant Inge...SearchHub - How to Spend Your Summer Keeping it Real: Presented by Grant Inge...
SearchHub - How to Spend Your Summer Keeping it Real: Presented by Grant Inge...
Lucidworks
 
Solr Flair: Search User Interfaces Powered by Apache Solr (ApacheCon US 2009,...
Solr Flair: Search User Interfaces Powered by Apache Solr (ApacheCon US 2009,...Solr Flair: Search User Interfaces Powered by Apache Solr (ApacheCon US 2009,...
Solr Flair: Search User Interfaces Powered by Apache Solr (ApacheCon US 2009,...
Erik Hatcher
 
Discovery Interfaces
Discovery InterfacesDiscovery Interfaces
Discovery Interfaces
Jonathan-Andornot
 
Solr: 4 big features
Solr: 4 big featuresSolr: 4 big features
Solr: 4 big features
David Smiley
 
Introduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesIntroduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and Usecases
Rahul Jain
 
Solr Recipes Workshop
Solr Recipes WorkshopSolr Recipes Workshop
Solr Recipes WorkshopErik Hatcher
 
Not Just ORM: Powerful Hibernate ORM Features and Capabilities
Not Just ORM: Powerful Hibernate ORM Features and CapabilitiesNot Just ORM: Powerful Hibernate ORM Features and Capabilities
Not Just ORM: Powerful Hibernate ORM Features and Capabilities
Brett Meyer
 
Scalability andefficiencypres
Scalability andefficiencypresScalability andefficiencypres
Scalability andefficiencypresNekoGato
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash courseTommaso Teofili
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
Alexandre Rafalovitch
 
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabsSolr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
Lucidworks
 
Building a High Performance Environment for RDF Publishing
Building a High Performance Environment for RDF PublishingBuilding a High Performance Environment for RDF Publishing
Building a High Performance Environment for RDF Publishing
dr0i
 
eZ Find workshop: advanced insights & recipes
eZ Find workshop: advanced insights & recipeseZ Find workshop: advanced insights & recipes
eZ Find workshop: advanced insights & recipes
Paul Borgermans
 
DSpace 4.2 Basics & Configuration
DSpace 4.2 Basics & ConfigurationDSpace 4.2 Basics & Configuration
DSpace 4.2 Basics & ConfigurationDuraSpace
 
Building Intelligent Search Applications with Apache Solr and PHP5
Building Intelligent Search Applications with Apache Solr and PHP5Building Intelligent Search Applications with Apache Solr and PHP5
Building Intelligent Search Applications with Apache Solr and PHP5
israelekpo
 
Solr Flair: Search User Interfaces Powered by Apache Solr
Solr Flair: Search User Interfaces Powered by Apache SolrSolr Flair: Search User Interfaces Powered by Apache Solr
Solr Flair: Search User Interfaces Powered by Apache Solr
Erik Hatcher
 
Rebuilding Solr 6 Examples - Layer by Layer: Presented by Alexandre Rafalovit...
Rebuilding Solr 6 Examples - Layer by Layer: Presented by Alexandre Rafalovit...Rebuilding Solr 6 Examples - Layer by Layer: Presented by Alexandre Rafalovit...
Rebuilding Solr 6 Examples - Layer by Layer: Presented by Alexandre Rafalovit...
Lucidworks
 
Solr 4
Solr 4Solr 4
Solr 4
Erik Hatcher
 
How Solr Search Works
How Solr Search WorksHow Solr Search Works
How Solr Search Works
Atlogys Technical Consulting
 

What's hot (20)

SearchHub - How to Spend Your Summer Keeping it Real: Presented by Grant Inge...
SearchHub - How to Spend Your Summer Keeping it Real: Presented by Grant Inge...SearchHub - How to Spend Your Summer Keeping it Real: Presented by Grant Inge...
SearchHub - How to Spend Your Summer Keeping it Real: Presented by Grant Inge...
 
Solr Flair
Solr FlairSolr Flair
Solr Flair
 
Solr Flair: Search User Interfaces Powered by Apache Solr (ApacheCon US 2009,...
Solr Flair: Search User Interfaces Powered by Apache Solr (ApacheCon US 2009,...Solr Flair: Search User Interfaces Powered by Apache Solr (ApacheCon US 2009,...
Solr Flair: Search User Interfaces Powered by Apache Solr (ApacheCon US 2009,...
 
Discovery Interfaces
Discovery InterfacesDiscovery Interfaces
Discovery Interfaces
 
Solr: 4 big features
Solr: 4 big featuresSolr: 4 big features
Solr: 4 big features
 
Introduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesIntroduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and Usecases
 
Solr Recipes Workshop
Solr Recipes WorkshopSolr Recipes Workshop
Solr Recipes Workshop
 
Not Just ORM: Powerful Hibernate ORM Features and Capabilities
Not Just ORM: Powerful Hibernate ORM Features and CapabilitiesNot Just ORM: Powerful Hibernate ORM Features and Capabilities
Not Just ORM: Powerful Hibernate ORM Features and Capabilities
 
Scalability andefficiencypres
Scalability andefficiencypresScalability andefficiencypres
Scalability andefficiencypres
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash course
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabsSolr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
 
Building a High Performance Environment for RDF Publishing
Building a High Performance Environment for RDF PublishingBuilding a High Performance Environment for RDF Publishing
Building a High Performance Environment for RDF Publishing
 
eZ Find workshop: advanced insights & recipes
eZ Find workshop: advanced insights & recipeseZ Find workshop: advanced insights & recipes
eZ Find workshop: advanced insights & recipes
 
DSpace 4.2 Basics & Configuration
DSpace 4.2 Basics & ConfigurationDSpace 4.2 Basics & Configuration
DSpace 4.2 Basics & Configuration
 
Building Intelligent Search Applications with Apache Solr and PHP5
Building Intelligent Search Applications with Apache Solr and PHP5Building Intelligent Search Applications with Apache Solr and PHP5
Building Intelligent Search Applications with Apache Solr and PHP5
 
Solr Flair: Search User Interfaces Powered by Apache Solr
Solr Flair: Search User Interfaces Powered by Apache SolrSolr Flair: Search User Interfaces Powered by Apache Solr
Solr Flair: Search User Interfaces Powered by Apache Solr
 
Rebuilding Solr 6 Examples - Layer by Layer: Presented by Alexandre Rafalovit...
Rebuilding Solr 6 Examples - Layer by Layer: Presented by Alexandre Rafalovit...Rebuilding Solr 6 Examples - Layer by Layer: Presented by Alexandre Rafalovit...
Rebuilding Solr 6 Examples - Layer by Layer: Presented by Alexandre Rafalovit...
 
Solr 4
Solr 4Solr 4
Solr 4
 
How Solr Search Works
How Solr Search WorksHow Solr Search Works
How Solr Search Works
 

Viewers also liked

Presentation solr 10 Aout 2011 (french)
Presentation solr 10 Aout 2011 (french)Presentation solr 10 Aout 2011 (french)
Presentation solr 10 Aout 2011 (french)
Thibaud Vibes
 
Solr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for HadoopSolr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for Hadoop
gregchanan
 
Semantic feature machine translation system
Semantic feature machine translation systemSemantic feature machine translation system
Semantic feature machine translation systemDmitry Kan
 
Automatic Build Of Semantic Translational Dictionary
Automatic Build Of Semantic Translational DictionaryAutomatic Build Of Semantic Translational Dictionary
Automatic Build Of Semantic Translational DictionaryDmitry Kan
 
Machine translation course program (in English)
Machine translation course program (in English)Machine translation course program (in English)
Machine translation course program (in English)
Dmitry Kan
 
Lucene revolution eu 2013 dublin writeup
Lucene revolution eu 2013 dublin writeupLucene revolution eu 2013 dublin writeup
Lucene revolution eu 2013 dublin writeup
Dmitry Kan
 
Social spam detection by SemanticAnalyzer Group
Social spam detection by SemanticAnalyzer GroupSocial spam detection by SemanticAnalyzer Group
Social spam detection by SemanticAnalyzer GroupDmitry Kan
 
Introduction To Machine Translation 1
Introduction To Machine Translation 1Introduction To Machine Translation 1
Introduction To Machine Translation 1Dmitry Kan
 
Solr onfitnesse learningfromberlinbuzzwords
Solr onfitnesse learningfromberlinbuzzwordsSolr onfitnesse learningfromberlinbuzzwords
Solr onfitnesse learningfromberlinbuzzwordsDmitry Kan
 
Starget sentiment analyzer for English
Starget sentiment analyzer for EnglishStarget sentiment analyzer for English
Starget sentiment analyzer for English
Dmitry Kan
 
Linguistic component Sentiment Analyzer for the Russian language
Linguistic component Sentiment Analyzer for the Russian languageLinguistic component Sentiment Analyzer for the Russian language
Linguistic component Sentiment Analyzer for the Russian language
Dmitry Kan
 
Linguistic component Lemmatizer for the Russian language
Linguistic component Lemmatizer for the Russian languageLinguistic component Lemmatizer for the Russian language
Linguistic component Lemmatizer for the Russian language
Dmitry Kan
 
MTEngine: Semantic-level Crowdsourced Machine Translation
MTEngine: Semantic-level Crowdsourced Machine TranslationMTEngine: Semantic-level Crowdsourced Machine Translation
MTEngine: Semantic-level Crowdsourced Machine Translation
Dmitry Kan
 
Introduction To Machine Translation
Introduction To Machine TranslationIntroduction To Machine Translation
Introduction To Machine TranslationDmitry Kan
 
Big Data Computing Architecture
Big Data Computing ArchitectureBig Data Computing Architecture
Big Data Computing Architecture
Gang Tao
 
Rule based approach to sentiment analysis at ROMIP 2011
Rule based approach to sentiment analysis at ROMIP 2011Rule based approach to sentiment analysis at ROMIP 2011
Rule based approach to sentiment analysis at ROMIP 2011Dmitry Kan
 
Poster: Method for an automatic generation of a semantic-level contextual tra...
Poster: Method for an automatic generation of a semantic-level contextual tra...Poster: Method for an automatic generation of a semantic-level contextual tra...
Poster: Method for an automatic generation of a semantic-level contextual tra...Dmitry Kan
 
Rule based approach to sentiment analysis at romip’11 slides
Rule based approach to sentiment analysis at romip’11 slidesRule based approach to sentiment analysis at romip’11 slides
Rule based approach to sentiment analysis at romip’11 slides
Dmitry Kan
 
Linguistic component Tokenizer for the Russian language
Linguistic component Tokenizer for the Russian languageLinguistic component Tokenizer for the Russian language
Linguistic component Tokenizer for the Russian language
Dmitry Kan
 
Semantic Analysis: theory, applications and use cases
Semantic Analysis: theory, applications and use casesSemantic Analysis: theory, applications and use cases
Semantic Analysis: theory, applications and use cases
Dmitry Kan
 

Viewers also liked (20)

Presentation solr 10 Aout 2011 (french)
Presentation solr 10 Aout 2011 (french)Presentation solr 10 Aout 2011 (french)
Presentation solr 10 Aout 2011 (french)
 
Solr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for HadoopSolr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for Hadoop
 
Semantic feature machine translation system
Semantic feature machine translation systemSemantic feature machine translation system
Semantic feature machine translation system
 
Automatic Build Of Semantic Translational Dictionary
Automatic Build Of Semantic Translational DictionaryAutomatic Build Of Semantic Translational Dictionary
Automatic Build Of Semantic Translational Dictionary
 
Machine translation course program (in English)
Machine translation course program (in English)Machine translation course program (in English)
Machine translation course program (in English)
 
Lucene revolution eu 2013 dublin writeup
Lucene revolution eu 2013 dublin writeupLucene revolution eu 2013 dublin writeup
Lucene revolution eu 2013 dublin writeup
 
Social spam detection by SemanticAnalyzer Group
Social spam detection by SemanticAnalyzer GroupSocial spam detection by SemanticAnalyzer Group
Social spam detection by SemanticAnalyzer Group
 
Introduction To Machine Translation 1
Introduction To Machine Translation 1Introduction To Machine Translation 1
Introduction To Machine Translation 1
 
Solr onfitnesse learningfromberlinbuzzwords
Solr onfitnesse learningfromberlinbuzzwordsSolr onfitnesse learningfromberlinbuzzwords
Solr onfitnesse learningfromberlinbuzzwords
 
Starget sentiment analyzer for English
Starget sentiment analyzer for EnglishStarget sentiment analyzer for English
Starget sentiment analyzer for English
 
Linguistic component Sentiment Analyzer for the Russian language
Linguistic component Sentiment Analyzer for the Russian languageLinguistic component Sentiment Analyzer for the Russian language
Linguistic component Sentiment Analyzer for the Russian language
 
Linguistic component Lemmatizer for the Russian language
Linguistic component Lemmatizer for the Russian languageLinguistic component Lemmatizer for the Russian language
Linguistic component Lemmatizer for the Russian language
 
MTEngine: Semantic-level Crowdsourced Machine Translation
MTEngine: Semantic-level Crowdsourced Machine TranslationMTEngine: Semantic-level Crowdsourced Machine Translation
MTEngine: Semantic-level Crowdsourced Machine Translation
 
Introduction To Machine Translation
Introduction To Machine TranslationIntroduction To Machine Translation
Introduction To Machine Translation
 
Big Data Computing Architecture
Big Data Computing ArchitectureBig Data Computing Architecture
Big Data Computing Architecture
 
Rule based approach to sentiment analysis at ROMIP 2011
Rule based approach to sentiment analysis at ROMIP 2011Rule based approach to sentiment analysis at ROMIP 2011
Rule based approach to sentiment analysis at ROMIP 2011
 
Poster: Method for an automatic generation of a semantic-level contextual tra...
Poster: Method for an automatic generation of a semantic-level contextual tra...Poster: Method for an automatic generation of a semantic-level contextual tra...
Poster: Method for an automatic generation of a semantic-level contextual tra...
 
Rule based approach to sentiment analysis at romip’11 slides
Rule based approach to sentiment analysis at romip’11 slidesRule based approach to sentiment analysis at romip’11 slides
Rule based approach to sentiment analysis at romip’11 slides
 
Linguistic component Tokenizer for the Russian language
Linguistic component Tokenizer for the Russian languageLinguistic component Tokenizer for the Russian language
Linguistic component Tokenizer for the Russian language
 
Semantic Analysis: theory, applications and use cases
Semantic Analysis: theory, applications and use casesSemantic Analysis: theory, applications and use cases
Semantic Analysis: theory, applications and use cases
 

Similar to NoSQL, Apache SOLR and Apache Hadoop

An Introduction to Elastic Search.
An Introduction to Elastic Search.An Introduction to Elastic Search.
An Introduction to Elastic Search.
Jurriaan Persyn
 
Drupal and Apache Stanbol
Drupal and Apache StanbolDrupal and Apache Stanbol
Drupal and Apache Stanbol
Alkuvoima
 
New Persistence Features in Spring Roo 1.1
New Persistence Features in Spring Roo 1.1New Persistence Features in Spring Roo 1.1
New Persistence Features in Spring Roo 1.1
Stefan Schmidt
 
Elasticsearch Introduction at BigData meetup
Elasticsearch Introduction at BigData meetupElasticsearch Introduction at BigData meetup
Elasticsearch Introduction at BigData meetup
Eric Rodriguez (Hiring in Lex)
 
Introducing Hibernate OGM: porting JPA applications to NoSQL, Sanne Grinovero...
Introducing Hibernate OGM: porting JPA applications to NoSQL, Sanne Grinovero...Introducing Hibernate OGM: porting JPA applications to NoSQL, Sanne Grinovero...
Introducing Hibernate OGM: porting JPA applications to NoSQL, Sanne Grinovero...OpenBlend society
 
Practical Machine Learning for Smarter Search with Spark+Solr
Practical Machine Learning for Smarter Search with Spark+SolrPractical Machine Learning for Smarter Search with Spark+Solr
Practical Machine Learning for Smarter Search with Spark+SolrJake Mannix
 
Practical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and SparkPractical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and Spark
Jake Mannix
 
Scaling with swagger
Scaling with swaggerScaling with swagger
Scaling with swagger
Tony Tam
 
Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrRahul Jain
 
Search On Hadoop Frontier Meetup
Search On Hadoop Frontier MeetupSearch On Hadoop Frontier Meetup
Search On Hadoop Frontier Meetupgregchanan
 
Building APIs in an easy way using API Platform
Building APIs in an easy way using API PlatformBuilding APIs in an easy way using API Platform
Building APIs in an easy way using API Platform
Antonio Peric-Mazar
 
CosmosDB for DBAs & Developers
CosmosDB for DBAs & DevelopersCosmosDB for DBAs & Developers
CosmosDB for DBAs & Developers
Niko Neugebauer
 
Hadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache DrillHadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache Drill
MapR Technologies
 
Data Engineering with Solr and Spark
Data Engineering with Solr and SparkData Engineering with Solr and Spark
Data Engineering with Solr and Spark
Lucidworks
 
Apache Solr - Enterprise search platform
Apache Solr - Enterprise search platformApache Solr - Enterprise search platform
Apache Solr - Enterprise search platform
Tommaso Teofili
 
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
MLconf
 
Rapid API Development ArangoDB Foxx
Rapid API Development ArangoDB FoxxRapid API Development ArangoDB Foxx
Rapid API Development ArangoDB FoxxMichael Hackstein
 
Michael stack -the state of apache h base
Michael stack -the state of apache h baseMichael stack -the state of apache h base
Michael stack -the state of apache h base
hdhappy001
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDBJustin Smestad
 

Similar to NoSQL, Apache SOLR and Apache Hadoop (20)

An Introduction to Elastic Search.
An Introduction to Elastic Search.An Introduction to Elastic Search.
An Introduction to Elastic Search.
 
Solr
SolrSolr
Solr
 
Drupal and Apache Stanbol
Drupal and Apache StanbolDrupal and Apache Stanbol
Drupal and Apache Stanbol
 
New Persistence Features in Spring Roo 1.1
New Persistence Features in Spring Roo 1.1New Persistence Features in Spring Roo 1.1
New Persistence Features in Spring Roo 1.1
 
Elasticsearch Introduction at BigData meetup
Elasticsearch Introduction at BigData meetupElasticsearch Introduction at BigData meetup
Elasticsearch Introduction at BigData meetup
 
Introducing Hibernate OGM: porting JPA applications to NoSQL, Sanne Grinovero...
Introducing Hibernate OGM: porting JPA applications to NoSQL, Sanne Grinovero...Introducing Hibernate OGM: porting JPA applications to NoSQL, Sanne Grinovero...
Introducing Hibernate OGM: porting JPA applications to NoSQL, Sanne Grinovero...
 
Practical Machine Learning for Smarter Search with Spark+Solr
Practical Machine Learning for Smarter Search with Spark+SolrPractical Machine Learning for Smarter Search with Spark+Solr
Practical Machine Learning for Smarter Search with Spark+Solr
 
Practical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and SparkPractical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and Spark
 
Scaling with swagger
Scaling with swaggerScaling with swagger
Scaling with swagger
 
Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/Solr
 
Search On Hadoop Frontier Meetup
Search On Hadoop Frontier MeetupSearch On Hadoop Frontier Meetup
Search On Hadoop Frontier Meetup
 
Building APIs in an easy way using API Platform
Building APIs in an easy way using API PlatformBuilding APIs in an easy way using API Platform
Building APIs in an easy way using API Platform
 
CosmosDB for DBAs & Developers
CosmosDB for DBAs & DevelopersCosmosDB for DBAs & Developers
CosmosDB for DBAs & Developers
 
Hadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache DrillHadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache Drill
 
Data Engineering with Solr and Spark
Data Engineering with Solr and SparkData Engineering with Solr and Spark
Data Engineering with Solr and Spark
 
Apache Solr - Enterprise search platform
Apache Solr - Enterprise search platformApache Solr - Enterprise search platform
Apache Solr - Enterprise search platform
 
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
 
Rapid API Development ArangoDB Foxx
Rapid API Development ArangoDB FoxxRapid API Development ArangoDB Foxx
Rapid API Development ArangoDB Foxx
 
Michael stack -the state of apache h base
Michael stack -the state of apache h baseMichael stack -the state of apache h base
Michael stack -the state of apache h base
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 

More from Dmitry Kan

London IR Meetup - Players in Vector Search_ algorithms, software and use cases
London IR Meetup - Players in Vector Search_ algorithms, software and use casesLondon IR Meetup - Players in Vector Search_ algorithms, software and use cases
London IR Meetup - Players in Vector Search_ algorithms, software and use cases
Dmitry Kan
 
Vector databases and neural search
Vector databases and neural searchVector databases and neural search
Vector databases and neural search
Dmitry Kan
 
Haystack LIVE! - 5 ways to increase result diversity at web-scale - Dmitry Ka...
Haystack LIVE! - 5 ways to increase result diversity at web-scale - Dmitry Ka...Haystack LIVE! - 5 ways to increase result diversity at web-scale - Dmitry Ka...
Haystack LIVE! - 5 ways to increase result diversity at web-scale - Dmitry Ka...
Dmitry Kan
 
IR: Open source state
IR: Open source stateIR: Open source state
IR: Open source state
Dmitry Kan
 
SentiScan: система автоматической разметки тональности в social media
SentiScan: система автоматической разметки тональности в social mediaSentiScan: система автоматической разметки тональности в social media
SentiScan: система автоматической разметки тональности в social media
Dmitry Kan
 
Icsoft 2011 51_cr
Icsoft 2011 51_crIcsoft 2011 51_cr
Icsoft 2011 51_cr
Dmitry Kan
 
Computer Semantics And Machine Translation
Computer Semantics And Machine TranslationComputer Semantics And Machine Translation
Computer Semantics And Machine Translation
Dmitry Kan
 

More from Dmitry Kan (7)

London IR Meetup - Players in Vector Search_ algorithms, software and use cases
London IR Meetup - Players in Vector Search_ algorithms, software and use casesLondon IR Meetup - Players in Vector Search_ algorithms, software and use cases
London IR Meetup - Players in Vector Search_ algorithms, software and use cases
 
Vector databases and neural search
Vector databases and neural searchVector databases and neural search
Vector databases and neural search
 
Haystack LIVE! - 5 ways to increase result diversity at web-scale - Dmitry Ka...
Haystack LIVE! - 5 ways to increase result diversity at web-scale - Dmitry Ka...Haystack LIVE! - 5 ways to increase result diversity at web-scale - Dmitry Ka...
Haystack LIVE! - 5 ways to increase result diversity at web-scale - Dmitry Ka...
 
IR: Open source state
IR: Open source stateIR: Open source state
IR: Open source state
 
SentiScan: система автоматической разметки тональности в social media
SentiScan: система автоматической разметки тональности в social mediaSentiScan: система автоматической разметки тональности в social media
SentiScan: система автоматической разметки тональности в social media
 
Icsoft 2011 51_cr
Icsoft 2011 51_crIcsoft 2011 51_cr
Icsoft 2011 51_cr
 
Computer Semantics And Machine Translation
Computer Semantics And Machine TranslationComputer Semantics And Machine Translation
Computer Semantics And Machine Translation
 

Recently uploaded

FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
nkrafacyberclub
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
Vlad Stirbu
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
The Metaverse and AI: how can decision-makers harness the Metaverse for their...
The Metaverse and AI: how can decision-makers harness the Metaverse for their...The Metaverse and AI: how can decision-makers harness the Metaverse for their...
The Metaverse and AI: how can decision-makers harness the Metaverse for their...
Jen Stirrup
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
Pierluigi Pugliese
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 

Recently uploaded (20)

FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
The Metaverse and AI: how can decision-makers harness the Metaverse for their...
The Metaverse and AI: how can decision-makers harness the Metaverse for their...The Metaverse and AI: how can decision-makers harness the Metaverse for their...
The Metaverse and AI: how can decision-makers harness the Metaverse for their...
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 

NoSQL, Apache SOLR and Apache Hadoop

  • 1. NoSQL: Apache SOLR Apache Hadoop By Dmitry Kan for NerdCamp, April 23 2011 dmitry.kan@gmail.com
  • 3. •The acronym NoSQL was coined in 1998 (Carlo Strozzi): as the NoSQL movement "departs from the relational model altogether; it should therefore have been called more appropriately 'NoREL', or something to that effect.“ (wikipedia) •NoSQL = Not Only SQL •Companies: Facebook, Twitter, Digg, Amazon, LinkedIn and Google •Data storage: billion gigabytes (GB) of data •Interconnected data: hyperlinks, blog pingbacks, social networks •Complex Data structure: hierarchical nested data structures easily (multiple relational tables in SQL) •Performance: the more data in SQL, the likely it to degrade •NoSQL is not: •… SQL and not relational •… replacement for SQL, but compliment •... There is no fixed schema and no joins •... Does not ”scale-up” (RDBMS, vertical scaling), but rather ”scales- out” (spreading the load over many commodity systems) – horizontal scaling
  • 4. NoSQL Categories •Key-value Stores: bigh hashtable with caching mechanisms •Column Family Stores: keys point to multiple columns (Google’s BigTable) •Document Databases: documents are collections of other key-value collections •Graph Databases: nodes, relationships between nodes and nodes props Major NoSQL players •Dynamo: Amazon.com, key-value, used in Amazon S3 (simple storage service) •Cassandra: open-sourced by Facebook, column oriented NoSQL DB •BigTable: Google’s proprietary column oriented DB (App Engine) •CouchDB: OS document oriented NoSQL DB (as well as MongoDB) •Neo4j: OS graph DB Querying NoSQL DB: •Data model specific •RESTful interfaces or query APIs •SPARQL: declarative query specification for graph DBs
  • 5. Simple Protocol And RDFQuery Language (courtesy of about.com and IBM) Example of retrieving the URL of a blogger PREFIX foaf <http://xmlns.com/foaf/0.1/> SELECT ?url FROM <bloggers.rdf> WHERE { ?contributor foaf:name "Jon Foobar" . ?contributor foaf:weblog ?url . } stats!
  • 6. Some stats from (Information Week) via about.com (2010): •44% biz IT professionals haven’t heard of NoSQL •1%: NoSQL is strategic direction •Some stats from NerdCamp (April 2011): •10% heard and used the NoSQL •Much more people know about cloud, which can become more and more a driving platform behind NoSQL Does the world of NoSQL have enough mass to appeal to IT now?
  • 7. “Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project.” Created by Yonik Seeley at CNET Features: •Full-text search •Hit highlighting http://lucene.apache.org/solr/ •Faceted search (Dynamic http://lucene.apache.org/solr/tutorial.html clustering) http://lucene.apache.org/java/docs/index.html •DB integration •Rich doc handling Books •Geospatial search •Distributed search •Replicataion •REST-like HTTP/XML & JSON APIS
  • 9.
  • 10. Curent version: Apache Solr 3.1 (March 31, 2011) Operating system support License: ASL 2.0 All with a Java VM, including: Features: Linux (all versions) •Faceted navigation Windows (all versions) •Hit highlighting MacOS (all versions) •GEO search: filter and sort by distance Unix variants •Spellcheck and auto suggest App-server support •Advanced ranking and sorting Apache Tomcat, Jetty, Resin, •Distributed and replicated search WebLogic™, WebSphere™, •Structured / unstructured search GlassFish, dmServer™, JBoss™ •Rich plugin architecture, extensible and many more Java version requirement Java JDK 1.5 or later Client API support Java, .NET, PHP, Python, Ruby (on Rails), C++, XML/HTTP, Overview of current state JSON/HTTP ++ April 2011
  • 11. Faceted search •A technique for refining search results •Concept composition: • Article + in English + about nerdcamp • Finnish rap + < 1 minute + released in 2001 •Types: • Standard facets (list of facets with values) • Hierarchical facet values (taxonomy of facet values) • Range / query facets: by date, by price, by alphabet, by interval
  • 12. Spatial Search Combines location data with text data •Represent spatial data in the index •Filter by some spatial concept such as a bounding box or other shape •Sort by distance •Score/boost by distance •<field name="store">45.17614,-93.87341</field> <!-- Buffalo store --> <field name="store">40.7143,-74.006</field> <!-- NYC store --> <field name="store">37.7752,-122.4232</field> <!-- San Francisco store -- > •bbox: bounding box filter (bbox is a range of lats and lons that encompasses the circle of radius d) •geodist: the distance function
  • 14. Spellcheck and autosuggest Spellcheck: •Query suggestion for a missspelled query term http://localhost:8983/solr/spell?q=hell ultrashar&spellcheck=true&spellcheck.collate=true&spellcheck.build=tru e <lst name="spellcheck"> <lst name="suggestions"> <lst name="hell"> <int name="numFound">1</int> <int name="startOffset">0</int> <int name="endOffset">4</int> <arr name="suggestion"> <str>dell</str> </arr> </lst> <lst name="ultrashar"> <int name="numFound">1</int> <int name="startOffset">5</int> <int name="endOffset">14</int> <arr name="suggestion"> <str>ultrasharp</str> </arr> </lst> <str name="collation">dell ultrasharp</str> </lst> </lst> Autosuggest: Example with solr and jquery
  • 15. Advanced sorting, ranking and searching •sort=score+asc •sort=Author+desc,score+desc •boosting single documents •Term Frequency—tf •Inverse Document Frequency – idf •Co-ordination Factor – coord (the greater the # of queried terms match, the greater the score) •Field Length – fieldNorm (the shorter the matching field is in number of indexed terms, the greater the document’s score) •AND, OR, NOT, NEAR, fuzzy search •Smashing~0.7 yields more results than just Smashing
  • 16. Distributed and replicated search Before doing this: •Consider vertical scaling (faster and better machine) •Rethink the data model (what data goes to which solr index) •Remove logging on updates (and / or searches) •Redesign you index: make as many fields non-indexed and non-stored (use cases) •Check your Internet connection
  • 17. Extendability Plugins: •Query parser: extend LuceneQParserPlugin public class NerdCampQParserPlugin extends LuceneQParserPlugin { public QParser createParser(String qstr, SolrParams localParams, SolrParams params, SolrQueryRequest req) {} }
  • 18. SOLR I/O •Nutch (crawler) •CSV, XML, DataImportHandlers, DB import, Apache Tika (rich document import, like pdf), your format •Output: xml, json, python, javabin, csv… , your format
  • 19. SOLR Processing Pipeline •On each step, a document gets transformed •Stop words removal •Stemming •(smart) Tokenization •Ngrams (letter level and word level) •Regular expressions •Low casing •Reversed wildcard •Duplicate removal
  • 20. Solr on the cloud Hadoop: MapReduce ZooKeeper: at least 3 Zoo Keepers to have 1-2 managing your Zoo Batch indexing, no realtime search yet Hadoop vital components: Core and API MapReduce -- computation model HDFS I/O ZooKeeper Pig (adds level of abstraction for processing large datasets)
  • 21. Solr on the cloud Does it shine? Yes, but not fully
  • 22. References [1] Tim Perdue: NoSQL: An Overview of NoSQL Databases, About.com Guide Sarah Pidcock (2011-01-31). http://bit.ly/fFQOYI [2] "Dynamo: Amazon’s Highly Available Key-value Store". http://www.cs.uwaterloo.ca/: WATERLOO. p. 2/22. Retrieved 2011-04-05. "Dynamo: a highly available and scalable distributed data store" [3] http://cassandra.apache.org/ [4] http://labs.google.com/papers/bigtable.html [5] http://aws.amazon.com/ (look for SimpleDB) [6] http://couchdb.apache.org/ [7] http://neo4j.org/ [8] Information Week: Surprise: 44% Of Business IT Pros Never Heard Of NoSQL http://bit.ly/go5ios [9] http://drupal.org/ [10] Mark Miller: Scaling Lucene and Solr // Lucid Imagination [11] http://wiki.apache.org/solr/SpatialSearch [12] http://dmitrykan.blogspot.com/2011/01/solr-speed-up-batch-posting.html [13] http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
  • 23. References [14] Using Nutch with SOLR, http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ [15] http://tika.apache.org/ [16] http://lucene.apache.org/solr/