Solr & Fusion for Big Data
• Where does search fit in the big data landscape?
• Solr on HDFS
• Indexing strategies
• End-to-end security
• Lambda architecture
• Spark and how we use it in Fusion
The standard for enterprise search.
90% of Fortune 500 uses Solr.
Why search for big data?
• Speed at scale
• Basic analytics (facets, pivot facets, facets + stats) +
visualizations
• Query structured and unstructured data
• Ad hoc exploration is inherent in big data
• People grok search
• Context for aggregations (drill into the numbers)
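The analytics bullet above (facets, pivot facets, facets + stats) can be sketched as plain Solr request parameters. This is a minimal sketch, not the deck's own code: the field names (`level_s`, `host_s`, `app_s`, `response_time_d`) and the tag name `piv` are hypothetical, and the `{!stats=...}` local parameter assumes a Solr version that supports stats on pivot facets.

```python
from urllib.parse import urlencode


def facet_query_params(query, facet_field, pivot_fields, stats_field):
    """Build Solr parameters for field facets, pivot facets, and stats per pivot."""
    return [
        ("q", query),
        ("rows", "0"),                 # aggregations only, no documents
        ("facet", "true"),
        ("facet.field", facet_field),  # basic facet counts
        ("stats", "true"),
        # Tag the stats so the pivot facet can refer to them by name.
        ("stats.field", "{!tag=piv}" + stats_field),
        # Pivot facet that also computes the tagged stats in each bucket.
        ("facet.pivot", "{!stats=piv}" + ",".join(pivot_fields)),
    ]


# Hypothetical field names, for illustration only.
params = facet_query_params(
    "level_s:ERROR", "host_s", ["host_s", "app_s"], "response_time_d"
)
query_string = urlencode(params)
```

The resulting `query_string` could be appended to a collection's `/select` handler URL.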
Common use case: log analysis
• Time-ordered data
• Raw data stored in HDFS
• How much data? How fast?
• Access patterns?
• Schema design ~ no free lunch at scale
Time-based Partitioning Scheme
Fusion Log Analytics Dashboard
recent_logs (collection alias)
• logs_feb26 (daily collection)
• logs_feb25 (daily collection)
• …
• logs_feb01 (daily collection)
Shards: h00 … h22, h23
Add replicas to support higher query volume & fault-tolerance
Use a collection alias to make multiple collections look like a single collection; minimize exposure to partitioning strategy in client layer
Every daily collection has 24 shards (h00-h23), each covering 1-hour blocks of log messages
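The partitioning scheme above can be sketched as a small routing function plus the parameters for a Collections API `CREATEALIAS` call. This is a sketch under assumptions: the function names are hypothetical, and the naming pattern (`logs_<mon><dd>`, `hNN`) is taken from the collection and shard names shown on the slide.

```python
from datetime import datetime, timezone

MONTHS = ("jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec")


def route_log_message(ts):
    """Map a log timestamp to (daily collection, hourly shard)."""
    # The slide's names carry no year, so neither does this sketch.
    collection = "logs_%s%02d" % (MONTHS[ts.month - 1], ts.day)
    shard = "h%02d" % ts.hour  # one shard per 1-hour block
    return collection, shard


def create_alias_params(alias, collections):
    """Parameters for a Collections API CREATEALIAS request."""
    return {
        "action": "CREATEALIAS",
        "name": alias,
        "collections": ",".join(collections),
    }


# The year is arbitrary; the deck does not state one.
target = route_log_message(datetime(2016, 2, 26, 22, 15, tzinfo=timezone.utc))
alias = create_alias_params("recent_logs", ["logs_feb25", "logs_feb26"])
```

Clients then query `recent_logs` only; moving the alias to newer daily collections requires no client change.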
Solr on HDFS
• Maturing solution, but still some issues
• My test showed ~23-25% slower than local SSD
• Better ROI, operational efficiency, security
• Needed for YARN
• Enables auto add replicas
• Interesting features coming soon: ZooKeeper lock (SOLR-8169) and replicas share index (SOLR-6237)
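For orientation, running Solr on HDFS is typically switched on with a handful of JVM system properties. A minimal sketch that assembles them follows; the helper name and the example NameNode address are hypothetical, and the block cache flag is shown only because the block cache appears in the diagrams.

```python
def hdfs_startup_args(hdfs_home, block_cache=True):
    """Build the -D system properties that point Solr's index at HDFS."""
    props = {
        "solr.directoryFactory": "HdfsDirectoryFactory",  # store index files in HDFS
        "solr.lock.type": "hdfs",                         # HDFS-based index lock
        "solr.hdfs.home": hdfs_home,                      # root HDFS path for index data
        "solr.hdfs.blockcache.enabled": str(block_cache).lower(),
    }
    return ["-D%s=%s" % (key, value) for key, value in props.items()]


# Hypothetical NameNode address, for illustration only.
args = hdfs_startup_args("hdfs://namenode:8020/solr")
```

These arguments would be passed to the Solr start script or the JVM; tuning of the block cache itself is left out of the sketch.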
Solr on HDFS
[Diagram: two Solr nodes (shard1 / replica1 and shard1 / replica2), each with a block cache, with writes/reads to HDFS DataNodes A, B, and C. Labels: Solr replication between the replicas; HDFS block replication between the DataNodes.]
Auto Add Replica
[Diagram: Solr shard1 / replica1 and shard1 / replica2, each with a block cache, with writes/reads to HDFS DataNodes A, B, and C. Labels: Solr replication between the replicas; HDFS block replication between the DataNodes. Also shown: overseer, ZooKeeper (watches), and Solr shard1 / replica3 with its own writes/reads to HDFS.]
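Auto add replicas is requested per collection at creation time. The sketch below builds Collections API `CREATE` parameters for one daily collection with 24 hourly shards; it assumes the implicit router (so shards can be named h00-h23) and a Solr version that accepts `autoAddReplicas`. The function name and the config name `logs_conf` are hypothetical.

```python
def create_daily_collection_params(name, config_name, replication_factor=2):
    """Parameters for a Collections API CREATE request with named hourly shards."""
    shards = ["h%02d" % hour for hour in range(24)]  # h00 .. h23
    return {
        "action": "CREATE",
        "name": name,
        "collection.configName": config_name,
        "router.name": "implicit",   # explicit shard names instead of hash ranges
        "shards": ",".join(shards),
        "replicationFactor": str(replication_factor),
        "autoAddReplicas": "true",   # replace replicas lost with a failed node
    }


# Hypothetical config name, for illustration only.
create_params = create_daily_collection_params("logs_feb26", "logs_conf")
```

Because the index lives in HDFS, a replacement replica can be pointed at the existing index files rather than copying them from another replica.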
Indexing Strategies
• Many tools available!
• MapReduce indexer (Solr contrib)
• LWOutputFormat, Hive SerDe, Pig StoreFunc, HBase
• Storm to Solr or Fusion (github.com/LucidWorks/storm-solr)
• Spark to Solr or Fusion (github.com/LucidWorks/spark-solr)
• Lucidworks Fusion Connectors
Any Data. Any Source.
Fusion Indexing Pipelines in MapReduce
Map Task (or reducer if needed): reads docs from HDFS, runs the Fusion Pipeline, and sends them to Solr through CloudSolrClient
• Get collection metadata from ZooKeeper (e.g. shard leader URL)
• Send updates to shard leaders in parallel
• …N map tasks (1 per block)
30+ index stages
- Field mapping
- JavaScript
- Tika parsing
- NLP
- Regex
- JDBC lookup
Many common file formats supported: CSV, SequenceFile, grok, XML, warc
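The idea of chaining index stages can be illustrated with a toy pipeline. This is not Fusion's API: every function and field name below is hypothetical, and only two of the stage types listed above (field mapping and regex) are imitated.

```python
import re


def field_mapping_stage(mapping):
    """Return a stage that renames fields according to `mapping`."""
    def stage(doc):
        return {mapping.get(key, key): value for key, value in doc.items()}
    return stage


def regex_stage(source_field, pattern, target_field):
    """Return a stage that copies the first regex group into `target_field`."""
    compiled = re.compile(pattern)

    def stage(doc):
        match = compiled.search(str(doc.get(source_field, "")))
        if match:
            doc = dict(doc, **{target_field: match.group(1)})
        return doc
    return stage


def run_pipeline(stages, docs):
    """Apply each stage in order to every document."""
    for doc in docs:
        for stage in stages:
            doc = stage(doc)
        yield doc


stages = [
    field_mapping_stage({"msg": "message_t"}),
    regex_stage("message_t", r"status=(\d+)", "status_i"),
]
indexed = list(run_pipeline(stages, [{"id": "1", "msg": "GET /a status=404"}]))
```

In a map task, the documents yielded by the pipeline would be batched and handed to the Solr client instead of being collected in a list.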
Security
• End-to-end security is now a reality for Hadoop
• Kerberos authentication (ZK, Solr, HDFS, jobs)
• Pluggable authorization framework
• Collection and document-level access controls (via Fusion)
• SSL
• Apache Ranger (centralized admin, auditing, monitoring for
Hadoop)
Cluster Sizing Worksheet
• There is no formula, only guidelines!
• # of documents / avg. doc size / number of fields
• Updates per second / soft-commit frequency
• Storage type (local SSD vs. HDFS)
• Sharding scheme (time-based vs. hash-based)
• Peak QPS / 95th percentile response time / query complexity
• Must test your data on your servers ;-)
Lambda Architecture
• Search engine fits perfectly with lambda
• Use batch layer to build indexes instead of “views”
• Speed layer uses Spark Streaming to build near real-time index
• Aggregation collections for historical data
source: http://lambda-architecture.net/
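One way to picture an aggregation collection is a rollup of raw log documents into one summary document per hour and log level. The sketch below is an assumption about what such a rollup could look like, not the deck's implementation; the field names (`timestamp`, `level`, `hour`, `count`) are hypothetical.

```python
from collections import Counter


def rollup_hourly(log_docs):
    """Aggregate raw log docs into one summary doc per (hour, level)."""
    # An ISO-8601 timestamp truncated to 13 characters identifies the hour,
    # e.g. "2016-02-26T22".
    counts = Counter((doc["timestamp"][:13], doc["level"]) for doc in log_docs)
    return [
        {"id": "%s_%s" % (hour, level), "hour": hour, "level": level, "count": n}
        for (hour, level), n in sorted(counts.items())
    ]


raw = [
    {"timestamp": "2016-02-26T22:15:03Z", "level": "ERROR"},
    {"timestamp": "2016-02-26T22:47:10Z", "level": "ERROR"},
    {"timestamp": "2016-02-26T23:01:00Z", "level": "WARN"},
]
summary = rollup_hourly(raw)
```

The batch layer could write such summary documents to a separate collection so that historical queries scan far fewer documents than the raw daily collections hold.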
Spark
• Components: Spark Core, Spark SQL, Spark Streaming, MLlib (machine learning), GraphX (BSP)
• Engine: Execution Model, The Shuffle, Caching
• Cluster mgmt: Hadoop YARN, Mesos, Standalone
• Storage: HDFS, Tachyon (shared memory)
• Languages: Scala, Java, Python, R
The most relevant results
every single time.
Massive scale. Real-time.
Secure.
Any data. Any source.
Lucidworks Is Search
Any questions?
• Try Fusion http://lucidworks.com/products/fusion/download
• LinkedIn / Twitter / Solr JIRA: @thelabdude

More Related Content

What's hot

Your Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
Your Big Data Stack is Too Big!: Presented by Timothy Potter, LucidworksYour Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
Your Big Data Stack is Too Big!: Presented by Timothy Potter, LucidworksLucidworks
 
Search Analytics Component: Presented by Steven Bower, Bloomberg L.P.
Search Analytics Component: Presented by Steven Bower, Bloomberg L.P.Search Analytics Component: Presented by Steven Bower, Bloomberg L.P.
Search Analytics Component: Presented by Steven Bower, Bloomberg L.P.Lucidworks
 
Cascalog at May Bay Area Hadoop User Group
Cascalog at May Bay Area Hadoop User GroupCascalog at May Bay Area Hadoop User Group
Cascalog at May Bay Area Hadoop User Groupnathanmarz
 
This Ain't Your Parent's Search Engine
This Ain't Your Parent's Search EngineThis Ain't Your Parent's Search Engine
This Ain't Your Parent's Search EngineGrant Ingersoll
 
Data IO: Next Generation Search with Lucene and Solr 4
Data IO: Next Generation Search with Lucene and Solr 4Data IO: Next Generation Search with Lucene and Solr 4
Data IO: Next Generation Search with Lucene and Solr 4Grant Ingersoll
 
Real time analytics using Hadoop and Elasticsearch
Real time analytics using Hadoop and ElasticsearchReal time analytics using Hadoop and Elasticsearch
Real time analytics using Hadoop and ElasticsearchAbhishek Andhavarapu
 
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...Lucidworks (Archived)
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena EdelsonStreaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena EdelsonSpark Summit
 
Big Data Warehousing Meetup: Developing a super-charged NoSQL data mart using...
Big Data Warehousing Meetup: Developing a super-charged NoSQL data mart using...Big Data Warehousing Meetup: Developing a super-charged NoSQL data mart using...
Big Data Warehousing Meetup: Developing a super-charged NoSQL data mart using...Caserta
 
Securing Data in Hadoop at Uber
Securing Data in Hadoop at UberSecuring Data in Hadoop at Uber
Securing Data in Hadoop at UberDataWorks Summit
 
Data Science at Scale by Sarah Guido
Data Science at Scale by Sarah GuidoData Science at Scale by Sarah Guido
Data Science at Scale by Sarah GuidoSpark Summit
 
From R Script to Production Using rsparkling with Navdeep Gill
From R Script to Production Using rsparkling with Navdeep GillFrom R Script to Production Using rsparkling with Navdeep Gill
From R Script to Production Using rsparkling with Navdeep GillDatabricks
 
MongoDB Replication fundamentals - Desert Code Camp - October 2014
MongoDB Replication fundamentals - Desert Code Camp - October 2014MongoDB Replication fundamentals - Desert Code Camp - October 2014
MongoDB Replication fundamentals - Desert Code Camp - October 2014Avinash Ramineni
 
Journey of Implementing Solr at Target: Presented by Raja Ramachandran, Target
Journey of Implementing Solr at Target: Presented by Raja Ramachandran, TargetJourney of Implementing Solr at Target: Presented by Raja Ramachandran, Target
Journey of Implementing Solr at Target: Presented by Raja Ramachandran, TargetLucidworks
 
Thoth - Real-time Solr Monitor and Search Analysis Engine: Presented by Damia...
Thoth - Real-time Solr Monitor and Search Analysis Engine: Presented by Damia...Thoth - Real-time Solr Monitor and Search Analysis Engine: Presented by Damia...
Thoth - Real-time Solr Monitor and Search Analysis Engine: Presented by Damia...Lucidworks
 

What's hot (20)

Solr4 nosql search_server_2013
Solr4 nosql search_server_2013Solr4 nosql search_server_2013
Solr4 nosql search_server_2013
 
Solr for Data Science
Solr for Data ScienceSolr for Data Science
Solr for Data Science
 
Your Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
Your Big Data Stack is Too Big!: Presented by Timothy Potter, LucidworksYour Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
Your Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
 
Search Analytics Component: Presented by Steven Bower, Bloomberg L.P.
Search Analytics Component: Presented by Steven Bower, Bloomberg L.P.Search Analytics Component: Presented by Steven Bower, Bloomberg L.P.
Search Analytics Component: Presented by Steven Bower, Bloomberg L.P.
 
Apache Spark in Industry
Apache Spark in IndustryApache Spark in Industry
Apache Spark in Industry
 
Cascalog at May Bay Area Hadoop User Group
Cascalog at May Bay Area Hadoop User GroupCascalog at May Bay Area Hadoop User Group
Cascalog at May Bay Area Hadoop User Group
 
This Ain't Your Parent's Search Engine
This Ain't Your Parent's Search EngineThis Ain't Your Parent's Search Engine
This Ain't Your Parent's Search Engine
 
Data IO: Next Generation Search with Lucene and Solr 4
Data IO: Next Generation Search with Lucene and Solr 4Data IO: Next Generation Search with Lucene and Solr 4
Data IO: Next Generation Search with Lucene and Solr 4
 
Real time analytics using Hadoop and Elasticsearch
Real time analytics using Hadoop and ElasticsearchReal time analytics using Hadoop and Elasticsearch
Real time analytics using Hadoop and Elasticsearch
 
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena EdelsonStreaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
 
Big Data Warehousing Meetup: Developing a super-charged NoSQL data mart using...
Big Data Warehousing Meetup: Developing a super-charged NoSQL data mart using...Big Data Warehousing Meetup: Developing a super-charged NoSQL data mart using...
Big Data Warehousing Meetup: Developing a super-charged NoSQL data mart using...
 
Intro to Search
Intro to SearchIntro to Search
Intro to Search
 
Securing Data in Hadoop at Uber
Securing Data in Hadoop at UberSecuring Data in Hadoop at Uber
Securing Data in Hadoop at Uber
 
Continuous Analytics & Optimisation using Apache Spark (Big Data Analytics, L...
Continuous Analytics & Optimisation using Apache Spark (Big Data Analytics, L...Continuous Analytics & Optimisation using Apache Spark (Big Data Analytics, L...
Continuous Analytics & Optimisation using Apache Spark (Big Data Analytics, L...
 
Data Science at Scale by Sarah Guido
Data Science at Scale by Sarah GuidoData Science at Scale by Sarah Guido
Data Science at Scale by Sarah Guido
 
From R Script to Production Using rsparkling with Navdeep Gill
From R Script to Production Using rsparkling with Navdeep GillFrom R Script to Production Using rsparkling with Navdeep Gill
From R Script to Production Using rsparkling with Navdeep Gill
 
MongoDB Replication fundamentals - Desert Code Camp - October 2014
MongoDB Replication fundamentals - Desert Code Camp - October 2014MongoDB Replication fundamentals - Desert Code Camp - October 2014
MongoDB Replication fundamentals - Desert Code Camp - October 2014
 
Journey of Implementing Solr at Target: Presented by Raja Ramachandran, Target
Journey of Implementing Solr at Target: Presented by Raja Ramachandran, TargetJourney of Implementing Solr at Target: Presented by Raja Ramachandran, Target
Journey of Implementing Solr at Target: Presented by Raja Ramachandran, Target
 
Thoth - Real-time Solr Monitor and Search Analysis Engine: Presented by Damia...
Thoth - Real-time Solr Monitor and Search Analysis Engine: Presented by Damia...Thoth - Real-time Solr Monitor and Search Analysis Engine: Presented by Damia...
Thoth - Real-time Solr Monitor and Search Analysis Engine: Presented by Damia...
 

Viewers also liked

{{more}} Kibana4
{{more}} Kibana4{{more}} Kibana4
{{more}} Kibana4琛琳 饶
 
Solr on HDFS - Past, Present, and Future: Presented by Mark Miller, Cloudera
Solr on HDFS - Past, Present, and Future: Presented by Mark Miller, ClouderaSolr on HDFS - Past, Present, and Future: Presented by Mark Miller, Cloudera
Solr on HDFS - Past, Present, and Future: Presented by Mark Miller, ClouderaLucidworks
 
Whoscall 的 Realtime Monitoring 經驗分享
Whoscall 的 Realtime Monitoring 經驗分享Whoscall 的 Realtime Monitoring 經驗分享
Whoscall 的 Realtime Monitoring 經驗分享William Yeh
 
使用 Elasticsearch 及 Kibana 進行巨量資料搜尋及視覺化-曾書庭
使用 Elasticsearch 及 Kibana 進行巨量資料搜尋及視覺化-曾書庭使用 Elasticsearch 及 Kibana 進行巨量資料搜尋及視覺化-曾書庭
使用 Elasticsearch 及 Kibana 進行巨量資料搜尋及視覺化-曾書庭台灣資料科學年會
 
The First Class Integration of Solr with Hadoop
The First Class Integration of Solr with HadoopThe First Class Integration of Solr with Hadoop
The First Class Integration of Solr with Hadooplucenerevolution
 
Attack monitoring using ElasticSearch Logstash and Kibana
Attack monitoring using ElasticSearch Logstash and KibanaAttack monitoring using ElasticSearch Logstash and Kibana
Attack monitoring using ElasticSearch Logstash and KibanaPrajal Kulkarni
 

Viewers also liked (8)

More kibana
More kibanaMore kibana
More kibana
 
{{more}} Kibana4
{{more}} Kibana4{{more}} Kibana4
{{more}} Kibana4
 
Solr on HDFS - Past, Present, and Future: Presented by Mark Miller, Cloudera
Solr on HDFS - Past, Present, and Future: Presented by Mark Miller, ClouderaSolr on HDFS - Past, Present, and Future: Presented by Mark Miller, Cloudera
Solr on HDFS - Past, Present, and Future: Presented by Mark Miller, Cloudera
 
Whoscall 的 Realtime Monitoring 經驗分享
Whoscall 的 Realtime Monitoring 經驗分享Whoscall 的 Realtime Monitoring 經驗分享
Whoscall 的 Realtime Monitoring 經驗分享
 
Logstash
LogstashLogstash
Logstash
 
使用 Elasticsearch 及 Kibana 進行巨量資料搜尋及視覺化-曾書庭
使用 Elasticsearch 及 Kibana 進行巨量資料搜尋及視覺化-曾書庭使用 Elasticsearch 及 Kibana 進行巨量資料搜尋及視覺化-曾書庭
使用 Elasticsearch 及 Kibana 進行巨量資料搜尋及視覺化-曾書庭
 
The First Class Integration of Solr with Hadoop
The First Class Integration of Solr with HadoopThe First Class Integration of Solr with Hadoop
The First Class Integration of Solr with Hadoop
 
Attack monitoring using ElasticSearch Logstash and Kibana
Attack monitoring using ElasticSearch Logstash and KibanaAttack monitoring using ElasticSearch Logstash and Kibana
Attack monitoring using ElasticSearch Logstash and Kibana
 

Similar to Solr & Fusion for Big Data Search and Analytics

Search On Hadoop Frontier Meetup
Search On Hadoop Frontier MeetupSearch On Hadoop Frontier Meetup
Search On Hadoop Frontier Meetupgregchanan
 
Search onhadoopsfhug081413
Search onhadoopsfhug081413Search onhadoopsfhug081413
Search onhadoopsfhug081413gregchanan
 
Solr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for HadoopSolr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for Hadoopgregchanan
 
Adding Search to the Hadoop Ecosystem
Adding Search to the Hadoop EcosystemAdding Search to the Hadoop Ecosystem
Adding Search to the Hadoop EcosystemCloudera, Inc.
 
Integrating Hadoop & Solr
Integrating Hadoop & SolrIntegrating Hadoop & Solr
Integrating Hadoop & SolrLucidworks
 
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionTugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionCodemotion
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3tcloudcomputing-tw
 
Hadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache DrillHadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache DrillMapR Technologies
 
Ingesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedIngesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedwhoschek
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesCorley S.r.l.
 
Swiss Big Data User Group - Introduction to Apache Drill
Swiss Big Data User Group - Introduction to Apache DrillSwiss Big Data User Group - Introduction to Apache Drill
Swiss Big Data User Group - Introduction to Apache DrillMapR Technologies
 
Big Data Retrospective - STL Big Data IDEA Jan 2019
Big Data Retrospective - STL Big Data IDEA Jan 2019Big Data Retrospective - STL Big Data IDEA Jan 2019
Big Data Retrospective - STL Big Data IDEA Jan 2019Adam Doyle
 
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv larsgeorge
 
Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with HadoopCloudera, Inc.
 
Hadoop world overview trends and topics
Hadoop world overview trends and topicsHadoop world overview trends and topics
Hadoop world overview trends and topicsValentin Kropov
 

Similar to Solr & Fusion for Big Data Search and Analytics (20)

Search On Hadoop Frontier Meetup
Search On Hadoop Frontier MeetupSearch On Hadoop Frontier Meetup
Search On Hadoop Frontier Meetup
 
Search On Hadoop
Search On HadoopSearch On Hadoop
Search On Hadoop
 
Search onhadoopsfhug081413
Search onhadoopsfhug081413Search onhadoopsfhug081413
Search onhadoopsfhug081413
 
Solr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for HadoopSolr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for Hadoop
 
Cloudera search
Cloudera searchCloudera search
Cloudera search
 
Adding Search to the Hadoop Ecosystem
Adding Search to the Hadoop EcosystemAdding Search to the Hadoop Ecosystem
Adding Search to the Hadoop Ecosystem
 
Integrating Hadoop & Solr
Integrating Hadoop & SolrIntegrating Hadoop & Solr
Integrating Hadoop & Solr
 
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionTugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
 
Hadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache DrillHadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache Drill
 
Ingesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedIngesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmed
 
An intro to Azure Data Lake
An intro to Azure Data LakeAn intro to Azure Data Lake
An intro to Azure Data Lake
 
Integrating Hadoop & Solr
Integrating Hadoop & SolrIntegrating Hadoop & Solr
Integrating Hadoop & Solr
 
SolrCloud on Hadoop
SolrCloud on HadoopSolrCloud on Hadoop
SolrCloud on Hadoop
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting Languages
 
Swiss Big Data User Group - Introduction to Apache Drill
Swiss Big Data User Group - Introduction to Apache DrillSwiss Big Data User Group - Introduction to Apache Drill
Swiss Big Data User Group - Introduction to Apache Drill
 
Big Data Retrospective - STL Big Data IDEA Jan 2019
Big Data Retrospective - STL Big Data IDEA Jan 2019Big Data Retrospective - STL Big Data IDEA Jan 2019
Big Data Retrospective - STL Big Data IDEA Jan 2019
 
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
 
Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with Hadoop
 
Hadoop world overview trends and topics
Hadoop world overview trends and topicsHadoop world overview trends and topics
Hadoop world overview trends and topics
 

More from Lucidworks

Search is the Tip of the Spear for Your B2B eCommerce Strategy
Search is the Tip of the Spear for Your B2B eCommerce StrategySearch is the Tip of the Spear for Your B2B eCommerce Strategy
Search is the Tip of the Spear for Your B2B eCommerce StrategyLucidworks
 
Drive Agent Effectiveness in Salesforce
Drive Agent Effectiveness in SalesforceDrive Agent Effectiveness in Salesforce
Drive Agent Effectiveness in SalesforceLucidworks
 
How Crate & Barrel Connects Shoppers with Relevant Products
How Crate & Barrel Connects Shoppers with Relevant ProductsHow Crate & Barrel Connects Shoppers with Relevant Products
How Crate & Barrel Connects Shoppers with Relevant ProductsLucidworks
 
Lucidworks & IMRG Webinar – Best-In-Class Retail Product Discovery
Lucidworks & IMRG Webinar – Best-In-Class Retail Product DiscoveryLucidworks & IMRG Webinar – Best-In-Class Retail Product Discovery
Lucidworks & IMRG Webinar – Best-In-Class Retail Product DiscoveryLucidworks
 
Connected Experiences Are Personalized Experiences
Connected Experiences Are Personalized ExperiencesConnected Experiences Are Personalized Experiences
Connected Experiences Are Personalized ExperiencesLucidworks
 
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...Lucidworks
 
[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...
[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...
[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...Lucidworks
 
Preparing for Peak in Ecommerce | eTail Asia 2020
Preparing for Peak in Ecommerce | eTail Asia 2020Preparing for Peak in Ecommerce | eTail Asia 2020
Preparing for Peak in Ecommerce | eTail Asia 2020Lucidworks
 
Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...
Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...
Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...Lucidworks
 
AI-Powered Linguistics and Search with Fusion and Rosette
AI-Powered Linguistics and Search with Fusion and RosetteAI-Powered Linguistics and Search with Fusion and Rosette
AI-Powered Linguistics and Search with Fusion and RosetteLucidworks
 
The Service Industry After COVID-19: The Soul of Service in a Virtual Moment
The Service Industry After COVID-19: The Soul of Service in a Virtual MomentThe Service Industry After COVID-19: The Soul of Service in a Virtual Moment
The Service Industry After COVID-19: The Soul of Service in a Virtual MomentLucidworks
 
Webinar: Smart answers for employee and customer support after covid 19 - Europe
Webinar: Smart answers for employee and customer support after covid 19 - EuropeWebinar: Smart answers for employee and customer support after covid 19 - Europe
Webinar: Smart answers for employee and customer support after covid 19 - EuropeLucidworks
 
Smart Answers for Employee and Customer Support After COVID-19
Smart Answers for Employee and Customer Support After COVID-19Smart Answers for Employee and Customer Support After COVID-19
Smart Answers for Employee and Customer Support After COVID-19Lucidworks
 
Applying AI & Search in Europe - featuring 451 Research
Applying AI & Search in Europe - featuring 451 ResearchApplying AI & Search in Europe - featuring 451 Research
Applying AI & Search in Europe - featuring 451 ResearchLucidworks
 
Webinar: Accelerate Data Science with Fusion 5.1
Webinar: Accelerate Data Science with Fusion 5.1Webinar: Accelerate Data Science with Fusion 5.1
Webinar: Accelerate Data Science with Fusion 5.1Lucidworks
 
Webinar: 5 Must-Have Items You Need for Your 2020 Ecommerce Strategy
Webinar: 5 Must-Have Items You Need for Your 2020 Ecommerce StrategyWebinar: 5 Must-Have Items You Need for Your 2020 Ecommerce Strategy
Webinar: 5 Must-Have Items You Need for Your 2020 Ecommerce StrategyLucidworks
 
Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...
Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...
Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...Lucidworks
 
Apply Knowledge Graphs and Search for Real-World Decision Intelligence
Apply Knowledge Graphs and Search for Real-World Decision IntelligenceApply Knowledge Graphs and Search for Real-World Decision Intelligence
Apply Knowledge Graphs and Search for Real-World Decision IntelligenceLucidworks
 
Webinar: Building a Business Case for Enterprise Search
Webinar: Building a Business Case for Enterprise SearchWebinar: Building a Business Case for Enterprise Search
Webinar: Building a Business Case for Enterprise SearchLucidworks
 
Why Insight Engines Matter in 2020 and Beyond
Why Insight Engines Matter in 2020 and BeyondWhy Insight Engines Matter in 2020 and Beyond
Why Insight Engines Matter in 2020 and BeyondLucidworks
 

More from Lucidworks (20)

Search is the Tip of the Spear for Your B2B eCommerce Strategy
Search is the Tip of the Spear for Your B2B eCommerce StrategySearch is the Tip of the Spear for Your B2B eCommerce Strategy
Search is the Tip of the Spear for Your B2B eCommerce Strategy
 
Drive Agent Effectiveness in Salesforce
Drive Agent Effectiveness in SalesforceDrive Agent Effectiveness in Salesforce
Drive Agent Effectiveness in Salesforce
 
How Crate & Barrel Connects Shoppers with Relevant Products
How Crate & Barrel Connects Shoppers with Relevant ProductsHow Crate & Barrel Connects Shoppers with Relevant Products
How Crate & Barrel Connects Shoppers with Relevant Products
 
Lucidworks & IMRG Webinar – Best-In-Class Retail Product Discovery
Lucidworks & IMRG Webinar – Best-In-Class Retail Product DiscoveryLucidworks & IMRG Webinar – Best-In-Class Retail Product Discovery
Lucidworks & IMRG Webinar – Best-In-Class Retail Product Discovery
 
Connected Experiences Are Personalized Experiences
Connected Experiences Are Personalized ExperiencesConnected Experiences Are Personalized Experiences
Connected Experiences Are Personalized Experiences
 
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...
 
[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...
[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...
[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...
 
Preparing for Peak in Ecommerce | eTail Asia 2020
Preparing for Peak in Ecommerce | eTail Asia 2020Preparing for Peak in Ecommerce | eTail Asia 2020
Preparing for Peak in Ecommerce | eTail Asia 2020
 
Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...
Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...
Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...
 
AI-Powered Linguistics and Search with Fusion and Rosette
AI-Powered Linguistics and Search with Fusion and RosetteAI-Powered Linguistics and Search with Fusion and Rosette
AI-Powered Linguistics and Search with Fusion and Rosette
 
The Service Industry After COVID-19: The Soul of Service in a Virtual Moment
The Service Industry After COVID-19: The Soul of Service in a Virtual MomentThe Service Industry After COVID-19: The Soul of Service in a Virtual Moment
The Service Industry After COVID-19: The Soul of Service in a Virtual Moment
 
Webinar: Smart answers for employee and customer support after covid 19 - Europe
Webinar: Smart answers for employee and customer support after covid 19 - EuropeWebinar: Smart answers for employee and customer support after covid 19 - Europe
Webinar: Smart answers for employee and customer support after covid 19 - Europe
 
Smart Answers for Employee and Customer Support After COVID-19
Smart Answers for Employee and Customer Support After COVID-19Smart Answers for Employee and Customer Support After COVID-19
Smart Answers for Employee and Customer Support After COVID-19
 
Applying AI & Search in Europe - featuring 451 Research
Applying AI & Search in Europe - featuring 451 ResearchApplying AI & Search in Europe - featuring 451 Research
Applying AI & Search in Europe - featuring 451 Research
 
Webinar: Accelerate Data Science with Fusion 5.1
Webinar: Accelerate Data Science with Fusion 5.1Webinar: Accelerate Data Science with Fusion 5.1
Webinar: Accelerate Data Science with Fusion 5.1
 
Webinar: 5 Must-Have Items You Need for Your 2020 Ecommerce Strategy
Webinar: 5 Must-Have Items You Need for Your 2020 Ecommerce StrategyWebinar: 5 Must-Have Items You Need for Your 2020 Ecommerce Strategy
Webinar: 5 Must-Have Items You Need for Your 2020 Ecommerce Strategy
 
Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...
Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...
Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...
 
Apply Knowledge Graphs and Search for Real-World Decision Intelligence
Apply Knowledge Graphs and Search for Real-World Decision IntelligenceApply Knowledge Graphs and Search for Real-World Decision Intelligence
Apply Knowledge Graphs and Search for Real-World Decision Intelligence
 
Webinar: Building a Business Case for Enterprise Search
Webinar: Building a Business Case for Enterprise SearchWebinar: Building a Business Case for Enterprise Search
Webinar: Building a Business Case for Enterprise Search
 
Why Insight Engines Matter in 2020 and Beyond
Why Insight Engines Matter in 2020 and BeyondWhy Insight Engines Matter in 2020 and Beyond
Why Insight Engines Matter in 2020 and Beyond
 



Solr & Fusion for Big Data Search and Analytics

  • 1.
  • 2. Solr & Fusion for Big Data • Where search fits in the big data landscape? • Solr on HDFS • Indexing strategies • End-to-end security • Lambda architecture • Spark and how we use it in Fusion
  • 3. The standard for enterprise search. of Fortune 500 uses Solr. 90%
  • 4. Why search for big data? • Speed at scale • Basic analytics (facets, pivot facets, facets + stats) + visualizations • Query structured and unstructured data • Ad hoc exploration is inherent in big data • People grok search • Context for aggregations (drill into the numbers)
  • 5. Common use case: log analysis • Time-ordered data • Raw data stored in HDFS • How much data? How fast? • Access patterns? • Schema design ~ no free lunch at scale
  • 6. Time-based Partitioning Scheme • Fusion Log Analytics Dashboard queries recent_logs (collection alias) • Daily collections such as logs_feb26, logs_feb25, and logs_feb01 • Every daily collection has 24 shards (h00-h23), each covering 1-hour blocks of log messages • Add replicas to support higher query volume & fault-tolerance • Use a collection alias to make multiple collections look like a single collection; minimize exposure to partitioning strategy in client layer
  • 7. Solr on HDFS • Maturing solution, still some issues • My test showed ~23-25% slower than local SSD • Better ROI, operational efficiency, security • Needed for YARN • Enables auto add replicas • Interesting features coming soon: ZooKeeper lock (SOLR-8169) and replicas share index (SOLR-6237)
  • 8. Solr on HDFS Solr shard1 / replica1 block cache Solr shard1 / replica2 block cache writes reads HDFS DataNode C HDFS DataNode B HDFS DataNode A writes reads HDFS block replication Solr replication
  • 9. Auto Add Replica HDFS DataNode C block cache Solr shard1 / replica1 writes reads HDFS DataNode A HDFS block replication Solr shard1 / replica2 block cache HDFS DataNode B writes reads Solr replication overseer ZooKeeper watches Solr shard1 / replica3 writes reads
  • 10. Indexing Strategies • Many tools available! • MapReduce indexer (Solr contrib) • LWOutputFormat, Hive SerDe, Pig StoreFunc, HBase • Storm to Solr or Fusion (github.com/LucidWorks/storm-solr) • Spark to Solr or Fusion (github.com/LucidWorks/spark-solr) • Lucidworks Fusion Connectors
  • 11. Any Data. Any Source.
  • 12. Fusion Indexing Pipelines in MapReduce Solr Map Task (or reducer if needed) ZooKeeper CloudSolr Client HDFS Get collection metadata from ZooKeeper (e.g. shard leader URL) Send updates to shard leaders in parallel Fusion Pipeline docs …N map tasks (1 per block) 30+ index stages - Field mapping - JavaScript - Tika parsing - NLP - Regex - JDBC lookup Many common file formats supported: CSV, SequenceFile, grok, XML, warc
  • 13. Security • End-to-end security is now a reality for Hadoop • Kerberos authentication (ZK, Solr, HDFS, jobs) • Pluggable authorization framework • Collection and document-level access controls (via Fusion) • SSL • Apache Ranger (centralized admin, auditing, monitoring for Hadoop)
  • 14. Cluster Sizing Worksheet • There is no formula, only guidelines! • # of documents / avg. doc size / number of fields • Updates per second / soft-commit frequency • Storage type (local SSD vs. HDFS) • Sharding scheme (time-based vs. hash-based) • Peak QPS / 95th percentile response time / query complexity • Must test your data on your servers ;-)
  • 15. • Search engine fits perfectly with lambda • Use batch layer to build indexes instead of “views” • Speed layer uses Spark streaming to build near real-time index • Aggregation collections for historical data Lambda Architecture source: http://lambda-architecture.net/
  • 16. Spark Spark Core Spark SQL Spark Streaming MLlib (machine learning) GraphX (BSP) Hadoop YARN Mesos Standalone HDFS Execution Model The Shuffle Caching engine cluster mgmt Tachyon languages Scala Java Python R shared memory
  • 17. The most relevant results every single time. Massive scale. Real-time. Secure. Any data. Any source.
  • 19. Any questions? • Try Fusion http://lucidworks.com/products/fusion/download • LinkedIn / Twitter / Solr JIRA: @thelabdude
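
The time-based partitioning scheme on slide 6 can be sketched roughly as follows. This is a minimal illustration, not Fusion's implementation: the function names, the `MONTHS` table, and the base URL are assumptions made for the example. Only the naming pattern (`logs_feb26`, `h00`-`h23`, `recent_logs`) comes from the slide, and `CREATEALIAS` is the standard Solr Collections API action for defining an alias. The sketch only builds the request URL; it does not call Solr.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

# Fixed month abbreviations so names do not depend on the process locale.
MONTHS = ("jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec")


def collection_for(ts: datetime) -> str:
    """Daily collection name in the slide's style, e.g. logs_feb26.

    The slide's names carry no year, so neither does this sketch.
    """
    return "logs_%s%02d" % (MONTHS[ts.month - 1], ts.day)


def shard_for(ts: datetime) -> str:
    """Hourly shard name within a daily collection: h00 through h23."""
    return "h%02d" % ts.hour


def recent_collections(now: datetime, days: int) -> list:
    """Names of the most recent daily collections, newest first."""
    return [collection_for(now - timedelta(days=i)) for i in range(days)]


def create_alias_url(solr_base: str, alias: str, collections: list) -> str:
    """Build a Collections API CREATEALIAS request URL (not sent here)."""
    params = urlencode({
        "action": "CREATEALIAS",
        "name": alias,
        "collections": ",".join(collections),
    })
    return "%s/admin/collections?%s" % (solr_base, params)


if __name__ == "__main__":
    # Hypothetical log event timestamp and Solr base URL.
    event_time = datetime(2016, 2, 26, 22, 15, tzinfo=timezone.utc)
    print(collection_for(event_time), shard_for(event_time))
    print(create_alias_url("http://localhost:8983/solr", "recent_logs",
                           recent_collections(event_time, 2)))
```

Re-pointing the alias once a day keeps the client layer unaware of the daily collections behind it, which is the point the slide makes about minimizing exposure to the partitioning strategy.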

Editor's Notes

  1. I don’t have to tell you that big data is a popular topic of discussion in IT circles these days. What I want to talk about today is how Solr and Lucidworks Fusion fit into the big data landscape. Don’t worry, I’ll try to keep the hype and grandiose statements to a minimum. I will get technical in a few places because it’s important to understand the details. Key points: search is a critical component of any big data strategy; Fusion & Solr are first-class citizens in the Hadoop ecosystem; big data doesn’t have to be hard, and Fusion makes it easy.
  2. Search engines contain mission-critical data and are typically on the front-line, directly serving users
  3. Before I was a Solr committer, I was a Solr user, one of the first adopters of SolrCloud actually. I worked on a team that built and supported a big data framework built on Hadoop, Storm, Cassandra, Solr, and Postgres. Effectively, we computed performance metrics for brands by analyzing social media data. IT organizations are consolidating data infrastructure for improved ROI, efficiency, security, and governance. Solr is included as part of the Cloudera, Hortonworks, and MapR Hadoop distributions. Users get search because they see it every day; BI / dashboards / SQL are powerful, but not necessarily intuitive. A vast amount of exhaust is created by users interacting with searchable content. Oftentimes, it’s a small department in a larger organization that uses search to expose medium data to deliver business insights, and then the “search engine” evolves into an insights engine on larger and larger data sets. If you could plan for all the possible queries you need to serve, then traditional BI / data warehousing techniques will still serve you well. Search fills the void where users need fast, ad hoc query capabilities to do exploratory analysis.
  4. Let’s imagine we have time-ordered data, such as logs of user activity. You can insert any scale that fits your needs here. We work with customers that have billions of log events per day, up to tens of billions. Let’s work through a quick example to illustrate some of the questions that come up and how we tackle them at Lucidworks. We have a bunch of log data in HDFS and want to index it for ad hoc queries and basic visualizations, i.e. the kinds that you can power with simple analytical functions like faceting. The first thing we have to identify is: what data are we indexing? Where is it coming from? How much data is there? How quickly do we need it to be indexed? But wait … step back a sec … how are people going to search this data? Three important decisions emerge when designing your search solution: the data partitioning scheme (time-based: hourly, daily, 15-minute, etc.); field types (fields you need to sort and facet on should have doc values, and range queries need trie fields to be indexed); and which fields must be stored / indexed. What type of visualizations make sense for this data? What type of aggregations do you want to perform, and at what time granularity? So here we’re starting to see some of the same considerations as when designing a data warehouse, i.e. there’s no free lunch, esp. at scale.
  5. The key takeaway here is that you use your investment in Hadoop to scale complex document processing using Fusion pipelines by running a pipeline in each map or reduce task.