Grant Ingersoll, CTO Lucidworks April 15, 2015
Hadoop-scale Search
with Solr
Viva La Evolución
10M+ total
downloads
Solr is both established & growing
250,000+
monthly downloads
Largest community of developers.
2,500+ open Solr jobs.
Solr is the most widely used search
solution on the planet.
Lucidworks
Unmatched Solr expertise.
1/3
of the active
committers
70%
of the open source
code is committed
Lucene/Solr Revolution
world’s largest open source user
conference dedicated to Lucene/Solr.
Solr has tens of thousands
of applications in production.
You use
Solr every day.
Solr in a Nutshell
• Full text search (Info Retr.)
• Facets/Guided Nav galore!
• Lots of data types
• Spelling, auto-complete,
highlighting
• Cursors
• More Like This
• De-duplication
• Apache Lucene
• Grouping and Joins
• Stats, expressions, transformations
and more
• Lang. Detection
• Extensible
• Massive Scale/Fault tolerance
Solr Key Features
What’s old is new again!
• Build/Store indexes in HDFS
• https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS
• Block cache is your friend
• Deployment and Security support
• https://github.com/LucidWorks/yarn-proto
• Slider and Ambari support coming soon
• Authz, Authc and Doc Filtering coming in April/May
Hadoop Basics
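The "Build/Store indexes in HDFS" bullet comes down to a handful of JVM system properties at Solr startup. A minimal sketch follows, assuming the property names from the Running Solr on HDFS page linked above; the NameNode address and the helper name `solr_hdfs_jvm_flags` are hypothetical, and the block cache settings should be checked against your Solr version.

```python
def solr_hdfs_jvm_flags(hdfs_home, block_cache=True, slab_count=1):
    """Build JVM system properties that point Solr's index storage at HDFS.

    hdfs_home is an hdfs:// URI; the value used in the demo below is a placeholder.
    """
    flags = [
        "-Dsolr.directoryFactory=HdfsDirectoryFactory",  # store index files in HDFS
        "-Dsolr.lock.type=hdfs",                         # use HDFS-based index locking
        f"-Dsolr.hdfs.home={hdfs_home}",                 # root directory for index data
    ]
    if block_cache:
        # The block cache keeps hot HDFS blocks in off-heap memory.
        flags += [
            "-Dsolr.hdfs.blockcache.enabled=true",
            f"-Dsolr.hdfs.blockcache.slab.count={slab_count}",
        ]
    return flags


if __name__ == "__main__":
    # Hypothetical NameNode address, for illustration only.
    print(" ".join(solr_hdfs_jvm_flags("hdfs://namenode.example:8020/solr")))
```

The flags are returned as a list so they can be appended to whatever start command or service wrapper is already in use.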
Lucidworks + Hadoop
• Ingestion tools for various file formats, etc.
• Hive 2-way Load/Store support
• Pig Load/Store
• http://lucidworks.com/product/integrations/hadoop/
• (More on Spark in a bit)
Case 1: Compliance
• Monitoring and customer service search for large volume transactional data
• Initial Setup:
• 20 machines, 32 GB RAM, 800 GB SSD, 2 Solr nodes per machine
• Indexing from Kafka to Solr (Lucidworks Fusion)
• 14B+ docs indexed/searchable in POC (disk limited)
• Growth to 4B+ per day w/ 6 month life expectancy
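The numbers on this slide imply some quick sizing arithmetic. The sketch below is a back-of-the-envelope calculation only: it assumes "6 month life expectancy" means roughly 180 days of retention, and the function names are illustrative.

```python
def docs_per_solr_node(total_docs, machines, nodes_per_machine):
    """Average number of documents held by each Solr node."""
    return total_docs // (machines * nodes_per_machine)


def retained_docs(docs_per_day, retention_days):
    """Documents searchable at steady state for a given retention window."""
    return docs_per_day * retention_days


if __name__ == "__main__":
    # POC: 14B docs over 20 machines with 2 Solr nodes each.
    print(docs_per_solr_node(14_000_000_000, 20, 2))
    # Growth target: 4B docs/day kept for about 180 days (assumed).
    print(retained_docs(4_000_000_000, 180))
```

Under these assumptions the POC holds about 350M documents per Solr node, while the steady-state target is on the order of 720B searchable documents.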
Case 2: Web Analytics
• Large scale ad-hoc analytics over weblogs using Tableau as a
front end BI tool for Solr
• Initial setup:
• 4 machines, 128 GB of RAM, several Solr nodes per machine
• Data originally in Hive
• POC: 10s of Billions of events growing to 150B+ per week
Sample Architecture
• Fusion Logstash connector writes to the current_log_writer collection alias, up to 50K docs / sec
• current_log_writer points at a transient collection (logs_feb26_h24 in the diagram) split across Shards 1-4; multiple shards are needed to support 50K writes per second
• The current_log_writer alias rolls over to a new transient collection every two hours
• At the time bucket boundary, the shards in the transient collection are merged into the latest 2-hour shard, which is added to the daily collection
• Every daily collection (shown: logs_feb01, logs_feb25, logs_feb26) has 12 (or 24) shards (shown: h02, h22, h24), each covering 2-hour blocks of log messages
• Replicas can be added to support higher query volume & fault-tolerance
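The two-hour rollover can be sketched as two small steps: derive the name of the transient collection for the current time bucket, then repoint the write alias with the Collections API `CREATEALIAS` action. This is an illustration under assumptions, not Fusion's implementation: the naming scheme is inferred from the `logs_feb26_h24` label in the diagram (the suffix is taken to be the end hour of the bucket), and the Solr base URL is hypothetical.

```python
from datetime import datetime
from urllib.parse import urlencode

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]


def transient_collection_name(ts, bucket_hours=2):
    """Name the transient collection for the time bucket containing ts.

    Assumed scheme: logs_<mon><dd>_h<end hour of the bucket>.
    """
    end_hour = (ts.hour // bucket_hours + 1) * bucket_hours
    return f"logs_{MONTHS[ts.month - 1]}{ts.day:02d}_h{end_hour:02d}"


def create_alias_url(solr_base, alias, collections):
    """Build a Collections API URL that points an alias at the given collections."""
    params = urlencode({
        "action": "CREATEALIAS",
        "name": alias,
        "collections": ",".join(collections),
    })
    return f"{solr_base}/admin/collections?{params}"


if __name__ == "__main__":
    name = transient_collection_name(datetime(2015, 2, 26, 23, 15))
    # Hypothetical Solr address; issue the request with any HTTP client.
    print(create_alias_url("http://localhost:8983/solr", "current_log_writer", [name]))
```

Because the connector only ever writes to the alias, repointing it is the single step that moves ingest to the new collection.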
Sample Query Execution
• Fusion SiLK Dashboard queries go through collection aliases: recent_logs and todays_logs
• The aliases resolve to daily collections (shown: logs_feb01, logs_feb25, logs_feb26)
• The todays_logs collection alias rolls over to a new day automatically at the day boundary
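On the query side, the day-boundary rollover amounts to recomputing which daily collections each alias should cover. The sketch below makes two assumptions that the slide does not state: that recent_logs spans a trailing window of days including today, and that daily collections are named `logs_<mon><dd>` as in the diagram. The window length is a made-up parameter.

```python
from datetime import date, timedelta

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]


def daily_collection_name(day):
    """Name of the daily collection for a calendar day, e.g. logs_feb26."""
    return f"logs_{MONTHS[day.month - 1]}{day.day:02d}"


def alias_targets(today, recent_days):
    """Map each query alias to the daily collections it should cover.

    recent_logs is assumed to cover the last `recent_days` days, newest first.
    """
    recent = [daily_collection_name(today - timedelta(days=i))
              for i in range(recent_days)]
    return {
        "todays_logs": [daily_collection_name(today)],
        "recent_logs": recent,
    }


if __name__ == "__main__":
    for alias, collections in alias_targets(date(2015, 2, 26), 3).items():
        print(alias, "->", ",".join(collections))
```

Each mapping can then be applied with one `CREATEALIAS` call per alias, so dashboards keep querying the same alias names across the rollover.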
Case 3: Lots of Users, Lots of Data
• Search of consumer cloud storage
• Key challenges: not all users are equal. Users grow and change all the time
• Petabytes of data, millions of users, 1000’s of nodes
• 1000’s of collections while isolating access
• Learn more: https://www.youtube.com/watch?v=_Erkln5WWLw and http://www.slideshare.net/lucidworks/scaling-solrcloud-to-a-large-number-of-collections-shalin-shekhar-mangar
• Improve ZooKeeper interactions
and performance to handle
thousands of collections
• Deep paging
• Split shards on arbitrary hash
ranges
• Large scale testing
• Collection migration
Case 3: Key Solr Improvements
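Of the improvements listed, deep paging is the one most visible to client code. Solr exposes it through the `cursorMark` request parameter and the `nextCursorMark` response field. A minimal sketch of the client loop follows; the `fetch` callable is a hypothetical stand-in for whatever HTTP client sends the parameters to a `/select` handler and returns the parsed JSON response.

```python
def iterate_with_cursor(fetch, query, sort, rows=100):
    """Yield every matching document using Solr cursor-based deep paging.

    fetch(params) must return the parsed JSON response as a dict.
    sort must include the uniqueKey field as a tie-breaker (e.g. "id asc").
    """
    cursor = "*"  # "*" asks Solr to start a new cursor
    while True:
        response = fetch({
            "q": query,
            "sort": sort,
            "rows": rows,
            "cursorMark": cursor,
        })
        for doc in response["response"]["docs"]:
            yield doc
        next_cursor = response["nextCursorMark"]
        if next_cursor == cursor:
            # The same mark coming back means the result set is exhausted.
            break
        cursor = next_cursor
```

Unlike `start`/`rows` paging, the cost per page stays roughly constant no matter how deep the client reads, which matters when a single user's result set is very large.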
https://github.com/LucidWorks/solr-scale-tk
Testing: Solr Scale Toolkit
• Test automation scripts: Python w/ Fabric, Boto, etc.
• Support services: Kafka (MQ / data integration), Logstash (log aggregation / analysis), CollectD/SiLK (system / JMX monitoring), test results DB
• JMeter / client nodes (Client Node 1, Client Node 2) send indexing & query requests from tests
• Test data stored in Amazon S3
• Solr cluster (NxM nodes): Solr Node 1 (port 8983) through Solr Node M (port 898X), each hosting multiple cores, on a custom AMI
• ZK ensemble: ZooKeeper-1, ZooKeeper-2, ZooKeeper-3; SolrCloud traffic flows between all Solr nodes and ZK
• Monitoring: system monitoring of N machines, JMX notifications, ZK JMX stats, logs aggregated from NxM Solr nodes
• Easily deploy clusters of nodes using our custom AMIs
• Key point: each test will define the density of cores per node and number of Solr nodes per machine, as well as the instance type and number of machines
• Power user search and recommendations over news content and
engagement signals (shares, views, etc.) using Lucidworks Fusion
• Combines content and collaborative filtering approaches to
calculate search boosts and “people who did X also did Y” results
• Data: ~10M events (POC) growing to 3-4B per month
Case 4: Signals for Search and Discovery
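The "people who did X also did Y" idea can be illustrated with plain co-occurrence counting over engagement signals. This is a toy sketch of the collaborative-filtering side only, not Fusion's algorithm; the event format `(user_id, item_id)` and the function names are assumptions.

```python
from collections import Counter, defaultdict
from itertools import permutations


def co_occurrence(events):
    """Count, for each item, how many users also engaged with each other item.

    events is an iterable of (user_id, item_id) pairs.
    """
    items_by_user = defaultdict(set)
    for user, item in events:
        items_by_user[user].add(item)

    counts = defaultdict(Counter)
    for items in items_by_user.values():
        # Every ordered pair of distinct items seen by the same user.
        for a, b in permutations(sorted(items), 2):
            counts[a][b] += 1
    return counts


def also_did(counts, item, k=3):
    """Top-k items most often engaged with by users who engaged with item."""
    return [other for other, _ in counts.get(item, Counter()).most_common(k)]


if __name__ == "__main__":
    signals = [("u1", "a"), ("u1", "b"), ("u2", "a"),
               ("u2", "b"), ("u2", "c"), ("u3", "a"), ("u3", "d")]
    print(also_did(co_occurrence(signals), "a", k=1))
```

The resulting counts could feed either a recommendation list or a query-time boost; at billions of events per month this aggregation would be run as a distributed job rather than in memory.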
• Lucidworks Fusion 1.4 (May ’15) will ship Apache Spark as a
scheduled service for large scale aggregations, machine
learning and more
• We’ve already seen 3x speedup in some tests
• Will ship w/ ALS rec algo and Mahout algos
• Solr as Spark RDD
• https://github.com/LucidWorks/spark-solr
Next Level Signals
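Fusion Architecture
[Architecture diagram; labels recoverable from the slide:]
• Scale: Billions of Docs, Millions of Users
• Access: REST, Proxy, Security woven throughout
• Services: Recs, Pipes, Metrics, NLP, Sched., Blobs, Admin, Connectors, Signals
• Spark: Worker, Worker, Cluster Mgr.
• Solr: Shards, Shards
• HDFS
• ZooKeeper (ZK 1 … ZK N): Shared Config Mgmt, Leader Election, Load Balancing
• One component is labeled "Optional"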
• Native, pluggable Security in Solr (April/May)
• Numerous performance enhancements for replication in
shards
• Cross core ValueSources
• Many new extensions for facets and analytics
• Percentiles (t-digest)
• Facet combinations
• Dynamic expressions over result sets
Roadmap
Next steps
Download Fusion: http://www.lucidworks.com/products/fusion
Contact Lucidworks: http://lucidworks.com/company/contact/
Contact Me: grant@lucidworks.com @gsingers
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Hadoop-scale Search with Solr

  • 1. Hadoop-scale Search with Solr. Grant Ingersoll, CTO, Lucidworks. April 15, 2015
  • 3. Solr in a Nutshell
      • Solr is both established & growing: 10M+ total downloads, 250,000+ monthly downloads
      • Largest community of developers. 2,500+ open Solr jobs.
      • Solr is the most widely used search solution on the planet.
      • Lucidworks: unmatched Solr expertise. 1/3 of the active committers; 70% of the open source code is committed
      • Lucene/Solr Revolution: world’s largest open source user conference dedicated to Lucene/Solr.
      • Solr has tens of thousands of applications in production. You use Solr every day.
  • 4. Solr Key Features
      • Full text search (Info Retr.)
      • Facets/Guided Nav galore!
      • Lots of data types
      • Spelling, auto-complete, highlighting
      • Cursors
      • More Like This
      • De-duplication
      • Apache Lucene
      • Grouping and Joins
      • Stats, expressions, transformations and more
      • Lang. Detection
      • Extensible
      • Massive Scale/Fault tolerance
  • 5. Hadoop Basics
      What’s old is new again!
      • Build/Store indexes in HDFS
      • https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS
      • Block cache is your friend
      Deployment and Security support
      • https://github.com/LucidWorks/yarn-proto
      • Slider and Ambari support coming soon
      • Authz, Authc and Doc Filtering coming in April/May
  • 6. Lucidworks + Hadoop
      • Ingestion tools for various file formats, etc.
      • Hive 2-way Load/Store support
      • Pig Load/Store
      • http://lucidworks.com/product/integrations/hadoop/
      • (More on Spark in a bit)
  • 7. Case 1: Compliance
      • Monitoring and customer service search for large volume transactional data
      • Initial Setup:
      • 20 machines, 32 GB RAM, 800 GB SSD, 2 Solr nodes per machine
      • Indexing from Kafka to Solr (Lucidworks Fusion)
      • 14B+ docs indexed/searchable in POC (disk limited)
      • Growth to 4B+ per day w/ 6 month life expectancy
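A back-of-envelope sketch of what the Case 1 growth figures imply once ingest and expiry balance out. The 4B/day rate, the 6-month document lifetime, and the 20 machines with 2 Solr nodes each come from the slide; the 180-day approximation of six months and the function name are assumptions made for illustration.

```python
def steady_state_docs(docs_per_day: int, retention_days: int) -> int:
    """Documents held in the index once daily ingest equals daily expiry."""
    return docs_per_day * retention_days


DOCS_PER_DAY = 4_000_000_000   # "4B+ per day" (lower bound from the slide)
RETENTION_DAYS = 6 * 30        # assumption: 6 months approximated as 180 days
SOLR_NODES = 20 * 2            # 20 machines, 2 Solr nodes per machine

total = steady_state_docs(DOCS_PER_DAY, RETENTION_DAYS)
per_node = total // SOLR_NODES

print(f"{total:,} docs at steady state")
print(f"{per_node:,} docs per Solr node at the initial cluster size")
```

Under these assumptions the steady-state index is about 720 billion documents, roughly 50 times the 14B+ indexed in the POC.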
  • 8. Case 2: Web Analytics
      • Large scale ad-hoc analytics over weblogs using Tableau as a front end BI tool for Solr
      • Initial setup:
      • 4 machines, 128 GB of RAM, several Solr nodes per machine
      • Data originally in Hive
      • POC: 10s of Billions of events growing to 150B+ per week
  • 9. Sample Architecture
      • Connector writes to the collection alias, up to 50K docs / sec
      • Multiple shards needed to support 50K writes per second
      • current_log_writer collection alias rolls over to a new transient collection every two hours; the shards in the transient collection are merged into the 2-hour shard and added to the daily collection
      • Latest 2-hour shard gets built from merging shards at time bucket boundary
      • Every daily collection has 12 (or 24) shards, each covering 2-hour blocks of log messages
      • Diagram labels: Fusion Logstash connector; current_log_writer (collection alias); logs_feb26_h24 (transient collection) with Shard 1, Shard 2, Shard 3, Shard 4; logs_feb01, logs_feb25, logs_feb26 (daily collections) with shards h02, h22, h24
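A minimal sketch of how a log timestamp could map onto the collection and shard names shown in the Sample Architecture diagram. The names `logs_feb26`, `h02`, `h22`, `h24` and `logs_feb26_h24` are from the slide; the rule that a shard is labelled by the end hour of its 2-hour block (so `h24` covers 22:00 to 24:00) is an assumption, as are the function names.

```python
from datetime import datetime

# Hard-coded month abbreviations so the result does not depend on the locale.
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]


def daily_collection(ts: datetime) -> str:
    """Name of the daily collection for a timestamp, e.g. logs_feb26."""
    return f"logs_{MONTHS[ts.month - 1]}{ts.day:02d}"


def two_hour_shard(ts: datetime) -> str:
    """Shard label for the 2-hour block containing ts.

    Assumption: blocks are labelled by their end hour, so 00:00-01:59 is h02
    and 22:00-23:59 is h24.
    """
    end_hour = (ts.hour // 2 + 1) * 2
    return f"h{end_hour:02d}"


def transient_collection(ts: datetime) -> str:
    """Name of the transient collection receiving writes at ts."""
    return f"{daily_collection(ts)}_{two_hour_shard(ts)}"


ts = datetime(2015, 2, 26, 23, 15)
print(daily_collection(ts), two_hour_shard(ts), transient_collection(ts))
```

With this naming, twelve 2-hour blocks (`h02` through `h24`) cover one day.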
  • 10. Sample Query Execution
      • Every daily collection has 12 (or 24) shards, each covering 2-hour blocks of log messages
      • Can add replicas to support higher query volume & fault-tolerance
      • todays_logs collection alias rolls over to a new day automatically at day boundary
      • Diagram labels: Fusion SiLK Dashboard; recent_logs (collection alias); todays_logs (collection alias); logs_feb01, logs_feb25, logs_feb26 (daily collections), each with shards h02, h22, h24
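The alias rollover described on this slide can be expressed with Solr's Collections API, where `CREATEALIAS` on an existing alias name repoints it. The sketch below only builds the request URLs and sends nothing; the base URL is a hypothetical local install, and which daily collections sit behind `recent_logs` is an assumption, since the slide only shows the alias names.

```python
from urllib.parse import urlencode


def create_alias_url(solr_base: str, alias: str, collections: list[str]) -> str:
    """Build a Collections API URL that points `alias` at `collections`."""
    params = {
        "action": "CREATEALIAS",
        "name": alias,
        "collections": ",".join(collections),
    }
    return f"{solr_base}/admin/collections?{urlencode(params)}"


SOLR_BASE = "http://localhost:8983/solr"  # hypothetical host and port

# At the day boundary, repoint todays_logs at the new daily collection.
today_url = create_alias_url(SOLR_BASE, "todays_logs", ["logs_feb26"])

# Assumption: recent_logs spans the last couple of daily collections.
recent_url = create_alias_url(SOLR_BASE, "recent_logs", ["logs_feb25", "logs_feb26"])

print(today_url)
print(recent_url)
```

Because queries go to the alias rather than to a dated collection, the dashboard does not need to change when the day rolls over.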
  • 11. Case 3: Lots of Users, Lots of Data
      • Search of consumer cloud storage
      • Key challenges: not all users are equal. Users grow and change all the time
      • Petabytes of data, millions of users, 1000’s of nodes, 1000’s of collections while isolating access
      • Learn more: https://www.youtube.com/watch?v=_Erkln5WWLw and http://www.slideshare.net/lucidworks/scaling-solrcloud-to-a-large-number-of-collections-shalin-shekhar-mangar
  • 12. Case 3: Key Solr Improvements
      • Improve ZooKeeper interactions and performance to handle thousands of collections
      • Deep paging
      • Split shards on arbitrary hash ranges
      • Large scale testing
      • Collection migration
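As an illustration of the deep paging item, here is a sketch of the cursor-based loop that Solr's `cursorMark` parameter supports: start with `cursorMark=*`, sort on the unique key, and stop when the returned `nextCursorMark` equals the one that was sent. The `fetch` callable, the `id` unique key, and the in-memory `fake_fetch` stand-in are assumptions for illustration; real cursor marks are opaque strings, not offsets.

```python
from typing import Callable, Iterator


def iter_all_docs(fetch: Callable[[dict], dict], query: str,
                  rows: int = 100) -> Iterator[dict]:
    """Yield every matching document by following Solr cursor marks."""
    cursor = "*"  # Solr's starting cursor value
    while True:
        params = {"q": query, "rows": rows,
                  "sort": "id asc",  # assumption: uniqueKey field is "id"
                  "cursorMark": cursor}
        response = fetch(params)
        yield from response["response"]["docs"]
        next_cursor = response["nextCursorMark"]
        if next_cursor == cursor:  # unchanged cursor means no more results
            break
        cursor = next_cursor


def fake_fetch(params: dict) -> dict:
    """In-memory stand-in for an HTTP call to /select (illustration only)."""
    docs = [{"id": f"doc{i}"} for i in range(5)]
    start = 0 if params["cursorMark"] == "*" else int(params["cursorMark"])
    page = docs[start:start + params["rows"]]
    next_mark = str(start + len(page)) if page else params["cursorMark"]
    return {"response": {"docs": page}, "nextCursorMark": next_mark}


ids = [doc["id"] for doc in iter_all_docs(fake_fetch, "*:*", rows=2)]
print(ids)
```

Unlike `start`/`rows` paging, each request carries only the cursor, so the cost per page does not grow with how deep the client has paged.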
  • 13. Testing: Solr Scale Toolkit
      • https://github.com/LucidWorks/solr-scale-tk
      • Test automation scripts: Python w/ Fabric, Boto, etc.
      • Support services: Kafka (MQ / data integration), Logstash (log agg / analysis), CollectD/SiLK (system / JMX monitoring), test results DB
      • JMeter / client nodes: Client Node 1, Client Node 2; indexing & query requests from tests
      • Test data stored in Amazon S3
      • Solr cluster (NxM nodes): Node 1: custom AMI; Solr Node 1 (8983) through Solr Node M (898X), each with multiple cores
      • ZK ensemble: ZooKeeper-1, ZooKeeper-2, ZooKeeper-3; SolrCloud traffic between all Solr nodes and ZK
      • Monitoring: system monitoring of N machines, JMX notifications, ZK JMX stats, logs aggregated from NxM Solr nodes
      • Easily deploy clusters of nodes using our custom AMIs
      • Key point: each test will define the density of cores per node and number of Solr nodes per machine, as well as the instance type and number of machines
  • 14. Case 4: Signals for Search and Discovery
      • Power user search and recommendations over news content and engagement signals (shares, views, etc.) using Lucidworks Fusion
      • Combines content and collaborative filtering approaches to calculate search boosts and “people who did X also did Y” results
      • Data: ~10M events (POC) growing to 3-4B per month
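To make the “people who did X also did Y” idea concrete, here is a minimal co-occurrence sketch over (user, item) signals. It is plain counting, not Fusion's implementation; the event tuples, item names, and function name are made up for illustration.

```python
from collections import Counter, defaultdict


def also_did(events, target, top_n=3):
    """Rank items by how many users who touched `target` also touched them."""
    items_by_user = defaultdict(set)
    for user, item in events:
        items_by_user[user].add(item)

    counts = Counter()
    for items in items_by_user.values():
        if target in items:
            counts.update(items - {target})  # count co-occurring items once per user
    return counts.most_common(top_n)


# Hypothetical engagement signals: (user id, article id)
events = [
    ("u1", "a"), ("u1", "b"),
    ("u2", "a"), ("u2", "b"), ("u2", "c"),
    ("u3", "b"), ("u3", "c"),
]

print(also_did(events, "a"))
```

At the event volumes on this slide the same aggregation would run as a distributed job rather than in a single process.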
  • 15. Next Level Signals
      • Lucidworks Fusion 1.4 (May ’15) will ship Apache Spark as a scheduled service for large scale aggregations, machine learning and more
      • We’ve already seen 3x speedup in some tests
      • Will ship w/ ALS rec algo and Mahout algos
      • Solr as Spark RDD
      • https://github.com/LucidWorks/spark-solr
  • 16. Fusion Architecture (diagram; labels: Millions of Users; Billions of Docs; Optional; REST; Security woven throughout; Proxy; Recs; Pipes; Metrics; NLP; Sched.; Blobs; Admin; Connectors; Signals; Spark: Worker, Worker, Cluster Mgr.; Solr: Shards, Shards; HDFS; ZooKeeper, ZK 1 to ZK N: Shared Config Mgmt, Leader Election, Load Balancing)
  • 17. Roadmap
      • Native, pluggable Security in Solr (April/May)
      • Numerous performance enhancements for replication in shards
      • Cross core ValueSources
      • Many new extensions for facets and analytics
      • Percentiles (t-digest)
      • Facet combinations
      • Dynamic expressions over result sets
  • 18. Next steps
      • Download Fusion: http://www.lucidworks.com/products/fusion
      • Contact Lucidworks: http://lucidworks.com/company/contact/
      • Contact Me: grant@lucidworks.com, @gsingers