SlideShare a Scribd company logo
1 of 19
Solr & Fusion for Big Data
• Where search fits in the
big data landscape?
• Solr on HDFS
• Indexing strategies
• End-to-end security
• Lambda architecture
• Spark and how we use it
in Fusion
The standard
for enterprise
search.
of Fortune 500
uses Solr.
90%
Why search for big data?
• Speed at scale
• Basic analytics (facets, pivot facets, facets + stats) +
visualizations
• Query structured and unstructured data
• Ad hoc exploration is inherent in big data
• People grok search
• Context for aggregations (drill into the numbers)
Common use case:
log analysis
• Time-ordered data
• Raw data stored in
HDFS
• How much data? How
fast?
• Access patterns?
• Schema design ~ no free
lunch at scale
Time-based Partitioning Scheme
Fusion
Log Analytics
Dashboard
logs_feb26
(daily collection)
logs_feb25
(daily collection)
logs_feb01
(daily collection)
h00
(shard)
h22
(shard)
h23
(shard)
h00
(shard)
h22
(shard)
h23
(shard)
Add replicas
to support higher
query volume &
fault-tolerance
recent_logs
(colllection alias)
Use a collection
alias to make multiple
collections look like a
single collection; minimize
exposure to partitioning
strategy in client layer
Every daily collection has 24 shards (h00-h23), each covering 1-hour blocks of log messages
Solr on HDFS
• Maturing solution still some issues
• My test showed ~23-25% slower than local SSD
• Better ROI, operational efficiency, security
• Needed for YARN
• Enables auto add replicas
• Interesting features coming soon: ZooKeeper lock (SOLR-
8169) and replicas share index (SOLR-6237)
Solr on HDFS
Solr
shard1 / replica1
block cache
Solr
shard1 / replica2
block cache
writes
reads
HDFS
DataNode C
HDFS
DataNode B
HDFS
DataNode A writes
reads
HDFS block replication
Solr replication
Auto Add Replica
HDFS
DataNode C
block cache
Solr
shard1 / replica1
writes
reads
HDFS
DataNode A
HDFS block replication
Solr
shard1 / replica2
block cache
HDFS
DataNode Bwrites
reads
Solr replication
overseer
ZooKeeper
watches
Solr
shard1 / replica3
writes
reads
Indexing Strategies
• Many tools available!
• MapReduce indexer (Solr contrib)
• LWOutputFormat, Hive SerDe, Pig StoreFunc, HBase
• Storm to Solr or Fusion (github.com/LucidWorks/storm-solr)
• Spark to Solr or Fusion (github.com/LucidWorks/spark-solr)
• Lucidworks Fusion Connectors
Any Data. Any Source.
Fusion Indexing Pipelines in MapReduce
Solr
Map Task (or reducer if needed)
ZooKeeper
CloudSolr
Client
HDFS
Get collection metadata
from ZooKeeper
(e.g. shard leader URL)
Send updates to shard
leaders in parallel
Fusion Pipeline
docs
…N map tasks (1 per block)
30+ index stages
- Field mapping
- JavaScript
- Tika parsing
- NLP
- Regex
- JDBC lookup
Many common file formats supported:
CSV, SequenceFile, grok, XML, warc
Security
• End-to-end security is now a reality for Hadoop
• Kerberos authentication (ZK, Solr, HDFS, jobs)
• Pluggable authorization framework
• Collection and document-level access controls (via Fusion)
• SSL
• Apache Ranger (centralized admin, auditing, monitoring for
Hadoop)
Cluster Sizing Worksheet
• There is no formula, only guidelines!
• # of documents / avg. doc size / number of fields
• Updates per second / soft-commit frequency
• Storage type (local SSD vs. HDFS)
• Sharding scheme (time-based vs. hash-based)
• Peak QPS / 95th percentile response time / query complexity
• Must test your data on your servers ;-)
• Search engine fits
perfectly with lambda
• Use batch layer to build
indexes instead of
“views”
• Speed layer uses Spark
streaming to build near
real-time index
• Aggregation collections
for historical data
Lambda Architecture
source: http://lambda-architecture.net/
Spark
Spark Core
Spark
SQL
Spark
Streaming
MLlib
(machine
learning)
GraphX
(BSP)
Hadoop YARN Mesos Standalone
HDFS
Execution
Model
The Shuffle Caching
engine
cluster
mgmt
Tachyon
languages Scala Java Python R
shared
memory
The most relevant results
every single time.
Massive scale. Real-time.
Secure.
Any data. Any source.
Lucidworks Is Search
Any questions?
• Try Fusion http://lucidworks.com/products/fusion/download
• LinkedIn / Twitter / Solr JIRA: @thelabdude

More Related Content

What's hot

Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Lucidworks (Archived)
 

What's hot (20)

Solr4 nosql search_server_2013
Solr4 nosql search_server_2013Solr4 nosql search_server_2013
Solr4 nosql search_server_2013
 
Solr for Data Science
Solr for Data ScienceSolr for Data Science
Solr for Data Science
 
Your Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
Your Big Data Stack is Too Big!: Presented by Timothy Potter, LucidworksYour Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
Your Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
 
Search Analytics Component: Presented by Steven Bower, Bloomberg L.P.
Search Analytics Component: Presented by Steven Bower, Bloomberg L.P.Search Analytics Component: Presented by Steven Bower, Bloomberg L.P.
Search Analytics Component: Presented by Steven Bower, Bloomberg L.P.
 
Apache Spark in Industry
Apache Spark in IndustryApache Spark in Industry
Apache Spark in Industry
 
Cascalog at May Bay Area Hadoop User Group
Cascalog at May Bay Area Hadoop User GroupCascalog at May Bay Area Hadoop User Group
Cascalog at May Bay Area Hadoop User Group
 
This Ain't Your Parent's Search Engine
This Ain't Your Parent's Search EngineThis Ain't Your Parent's Search Engine
This Ain't Your Parent's Search Engine
 
Data IO: Next Generation Search with Lucene and Solr 4
Data IO: Next Generation Search with Lucene and Solr 4Data IO: Next Generation Search with Lucene and Solr 4
Data IO: Next Generation Search with Lucene and Solr 4
 
Real time analytics using Hadoop and Elasticsearch
Real time analytics using Hadoop and ElasticsearchReal time analytics using Hadoop and Elasticsearch
Real time analytics using Hadoop and Elasticsearch
 
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena EdelsonStreaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
 
Big Data Warehousing Meetup: Developing a super-charged NoSQL data mart using...
Big Data Warehousing Meetup: Developing a super-charged NoSQL data mart using...Big Data Warehousing Meetup: Developing a super-charged NoSQL data mart using...
Big Data Warehousing Meetup: Developing a super-charged NoSQL data mart using...
 
Intro to Search
Intro to SearchIntro to Search
Intro to Search
 
Securing Data in Hadoop at Uber
Securing Data in Hadoop at UberSecuring Data in Hadoop at Uber
Securing Data in Hadoop at Uber
 
Continuous Analytics & Optimisation using Apache Spark (Big Data Analytics, L...
Continuous Analytics & Optimisation using Apache Spark (Big Data Analytics, L...Continuous Analytics & Optimisation using Apache Spark (Big Data Analytics, L...
Continuous Analytics & Optimisation using Apache Spark (Big Data Analytics, L...
 
Data Science at Scale by Sarah Guido
Data Science at Scale by Sarah GuidoData Science at Scale by Sarah Guido
Data Science at Scale by Sarah Guido
 
From R Script to Production Using rsparkling with Navdeep Gill
From R Script to Production Using rsparkling with Navdeep GillFrom R Script to Production Using rsparkling with Navdeep Gill
From R Script to Production Using rsparkling with Navdeep Gill
 
MongoDB Replication fundamentals - Desert Code Camp - October 2014
MongoDB Replication fundamentals - Desert Code Camp - October 2014MongoDB Replication fundamentals - Desert Code Camp - October 2014
MongoDB Replication fundamentals - Desert Code Camp - October 2014
 
Journey of Implementing Solr at Target: Presented by Raja Ramachandran, Target
Journey of Implementing Solr at Target: Presented by Raja Ramachandran, TargetJourney of Implementing Solr at Target: Presented by Raja Ramachandran, Target
Journey of Implementing Solr at Target: Presented by Raja Ramachandran, Target
 
Thoth - Real-time Solr Monitor and Search Analysis Engine: Presented by Damia...
Thoth - Real-time Solr Monitor and Search Analysis Engine: Presented by Damia...Thoth - Real-time Solr Monitor and Search Analysis Engine: Presented by Damia...
Thoth - Real-time Solr Monitor and Search Analysis Engine: Presented by Damia...
 

Viewers also liked

Solr on HDFS - Past, Present, and Future: Presented by Mark Miller, Cloudera
Solr on HDFS - Past, Present, and Future: Presented by Mark Miller, ClouderaSolr on HDFS - Past, Present, and Future: Presented by Mark Miller, Cloudera
Solr on HDFS - Past, Present, and Future: Presented by Mark Miller, Cloudera
Lucidworks
 

Viewers also liked (8)

More kibana
More kibanaMore kibana
More kibana
 
{{more}} Kibana4
{{more}} Kibana4{{more}} Kibana4
{{more}} Kibana4
 
Solr on HDFS - Past, Present, and Future: Presented by Mark Miller, Cloudera
Solr on HDFS - Past, Present, and Future: Presented by Mark Miller, ClouderaSolr on HDFS - Past, Present, and Future: Presented by Mark Miller, Cloudera
Solr on HDFS - Past, Present, and Future: Presented by Mark Miller, Cloudera
 
Whoscall 的 Realtime Monitoring 經驗分享
Whoscall 的 Realtime Monitoring 經驗分享Whoscall 的 Realtime Monitoring 經驗分享
Whoscall 的 Realtime Monitoring 經驗分享
 
Logstash
LogstashLogstash
Logstash
 
使用 Elasticsearch 及 Kibana 進行巨量資料搜尋及視覺化-曾書庭
使用 Elasticsearch 及 Kibana 進行巨量資料搜尋及視覺化-曾書庭使用 Elasticsearch 及 Kibana 進行巨量資料搜尋及視覺化-曾書庭
使用 Elasticsearch 及 Kibana 進行巨量資料搜尋及視覺化-曾書庭
 
The First Class Integration of Solr with Hadoop
The First Class Integration of Solr with HadoopThe First Class Integration of Solr with Hadoop
The First Class Integration of Solr with Hadoop
 
Attack monitoring using ElasticSearch Logstash and Kibana
Attack monitoring using ElasticSearch Logstash and KibanaAttack monitoring using ElasticSearch Logstash and Kibana
Attack monitoring using ElasticSearch Logstash and Kibana
 

Similar to Webinar: Solr & Fusion for Big Data

Search On Hadoop Frontier Meetup
Search On Hadoop Frontier MeetupSearch On Hadoop Frontier Meetup
Search On Hadoop Frontier Meetup
gregchanan
 
Search onhadoopsfhug081413
Search onhadoopsfhug081413Search onhadoopsfhug081413
Search onhadoopsfhug081413
gregchanan
 
Adding Search to the Hadoop Ecosystem
Adding Search to the Hadoop EcosystemAdding Search to the Hadoop Ecosystem
Adding Search to the Hadoop Ecosystem
Cloudera, Inc.
 
Hadoop ppt on the basics and architecture
Hadoop ppt on the basics and architectureHadoop ppt on the basics and architecture
Hadoop ppt on the basics and architecture
saipriyacoool
 
Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with Hadoop
Cloudera, Inc.
 

Similar to Webinar: Solr & Fusion for Big Data (20)

Search On Hadoop Frontier Meetup
Search On Hadoop Frontier MeetupSearch On Hadoop Frontier Meetup
Search On Hadoop Frontier Meetup
 
Search On Hadoop
Search On HadoopSearch On Hadoop
Search On Hadoop
 
Search onhadoopsfhug081413
Search onhadoopsfhug081413Search onhadoopsfhug081413
Search onhadoopsfhug081413
 
Solr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for HadoopSolr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for Hadoop
 
Cloudera search
Cloudera searchCloudera search
Cloudera search
 
Adding Search to the Hadoop Ecosystem
Adding Search to the Hadoop EcosystemAdding Search to the Hadoop Ecosystem
Adding Search to the Hadoop Ecosystem
 
Integrating Hadoop & Solr
Integrating Hadoop & SolrIntegrating Hadoop & Solr
Integrating Hadoop & Solr
 
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionTugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
 
Hadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache DrillHadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache Drill
 
Ingesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedIngesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmed
 
An intro to Azure Data Lake
An intro to Azure Data LakeAn intro to Azure Data Lake
An intro to Azure Data Lake
 
Integrating Hadoop & Solr
Integrating Hadoop & SolrIntegrating Hadoop & Solr
Integrating Hadoop & Solr
 
Hadoop ppt on the basics and architecture
Hadoop ppt on the basics and architectureHadoop ppt on the basics and architecture
Hadoop ppt on the basics and architecture
 
SolrCloud on Hadoop
SolrCloud on HadoopSolrCloud on Hadoop
SolrCloud on Hadoop
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting Languages
 
Swiss Big Data User Group - Introduction to Apache Drill
Swiss Big Data User Group - Introduction to Apache DrillSwiss Big Data User Group - Introduction to Apache Drill
Swiss Big Data User Group - Introduction to Apache Drill
 
Big Data Retrospective - STL Big Data IDEA Jan 2019
Big Data Retrospective - STL Big Data IDEA Jan 2019Big Data Retrospective - STL Big Data IDEA Jan 2019
Big Data Retrospective - STL Big Data IDEA Jan 2019
 
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
 
Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with Hadoop
 

More from Lucidworks

Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...
Lucidworks
 

More from Lucidworks (20)

Search is the Tip of the Spear for Your B2B eCommerce Strategy
Search is the Tip of the Spear for Your B2B eCommerce StrategySearch is the Tip of the Spear for Your B2B eCommerce Strategy
Search is the Tip of the Spear for Your B2B eCommerce Strategy
 
Drive Agent Effectiveness in Salesforce
Drive Agent Effectiveness in SalesforceDrive Agent Effectiveness in Salesforce
Drive Agent Effectiveness in Salesforce
 
How Crate & Barrel Connects Shoppers with Relevant Products
How Crate & Barrel Connects Shoppers with Relevant ProductsHow Crate & Barrel Connects Shoppers with Relevant Products
How Crate & Barrel Connects Shoppers with Relevant Products
 
Lucidworks & IMRG Webinar – Best-In-Class Retail Product Discovery
Lucidworks & IMRG Webinar – Best-In-Class Retail Product DiscoveryLucidworks & IMRG Webinar – Best-In-Class Retail Product Discovery
Lucidworks & IMRG Webinar – Best-In-Class Retail Product Discovery
 
Connected Experiences Are Personalized Experiences
Connected Experiences Are Personalized ExperiencesConnected Experiences Are Personalized Experiences
Connected Experiences Are Personalized Experiences
 
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...
 
[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...
[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...
[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...
 
Preparing for Peak in Ecommerce | eTail Asia 2020
Preparing for Peak in Ecommerce | eTail Asia 2020Preparing for Peak in Ecommerce | eTail Asia 2020
Preparing for Peak in Ecommerce | eTail Asia 2020
 
Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...
Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...
Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...
 
AI-Powered Linguistics and Search with Fusion and Rosette
AI-Powered Linguistics and Search with Fusion and RosetteAI-Powered Linguistics and Search with Fusion and Rosette
AI-Powered Linguistics and Search with Fusion and Rosette
 
The Service Industry After COVID-19: The Soul of Service in a Virtual Moment
The Service Industry After COVID-19: The Soul of Service in a Virtual MomentThe Service Industry After COVID-19: The Soul of Service in a Virtual Moment
The Service Industry After COVID-19: The Soul of Service in a Virtual Moment
 
Webinar: Smart answers for employee and customer support after covid 19 - Europe
Webinar: Smart answers for employee and customer support after covid 19 - EuropeWebinar: Smart answers for employee and customer support after covid 19 - Europe
Webinar: Smart answers for employee and customer support after covid 19 - Europe
 
Smart Answers for Employee and Customer Support After COVID-19
Smart Answers for Employee and Customer Support After COVID-19Smart Answers for Employee and Customer Support After COVID-19
Smart Answers for Employee and Customer Support After COVID-19
 
Applying AI & Search in Europe - featuring 451 Research
Applying AI & Search in Europe - featuring 451 ResearchApplying AI & Search in Europe - featuring 451 Research
Applying AI & Search in Europe - featuring 451 Research
 
Webinar: Accelerate Data Science with Fusion 5.1
Webinar: Accelerate Data Science with Fusion 5.1Webinar: Accelerate Data Science with Fusion 5.1
Webinar: Accelerate Data Science with Fusion 5.1
 
Webinar: 5 Must-Have Items You Need for Your 2020 Ecommerce Strategy
Webinar: 5 Must-Have Items You Need for Your 2020 Ecommerce StrategyWebinar: 5 Must-Have Items You Need for Your 2020 Ecommerce Strategy
Webinar: 5 Must-Have Items You Need for Your 2020 Ecommerce Strategy
 
Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...
Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...
Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...
 
Apply Knowledge Graphs and Search for Real-World Decision Intelligence
Apply Knowledge Graphs and Search for Real-World Decision IntelligenceApply Knowledge Graphs and Search for Real-World Decision Intelligence
Apply Knowledge Graphs and Search for Real-World Decision Intelligence
 
Webinar: Building a Business Case for Enterprise Search
Webinar: Building a Business Case for Enterprise SearchWebinar: Building a Business Case for Enterprise Search
Webinar: Building a Business Case for Enterprise Search
 
Why Insight Engines Matter in 2020 and Beyond
Why Insight Engines Matter in 2020 and BeyondWhy Insight Engines Matter in 2020 and Beyond
Why Insight Engines Matter in 2020 and Beyond
 

Recently uploaded

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Recently uploaded (20)

Modernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using BallerinaModernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using Ballerina
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Choreo: Empowering the Future of Enterprise Software Engineering
Choreo: Empowering the Future of Enterprise Software EngineeringChoreo: Empowering the Future of Enterprise Software Engineering
Choreo: Empowering the Future of Enterprise Software Engineering
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
JohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptx
 
API Governance and Monetization - The evolution of API governance
API Governance and Monetization -  The evolution of API governanceAPI Governance and Monetization -  The evolution of API governance
API Governance and Monetization - The evolution of API governance
 
Stronger Together: Developing an Organizational Strategy for Accessible Desig...
Stronger Together: Developing an Organizational Strategy for Accessible Desig...Stronger Together: Developing an Organizational Strategy for Accessible Desig...
Stronger Together: Developing an Organizational Strategy for Accessible Desig...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Decarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational PerformanceDecarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational Performance
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Navigating Identity and Access Management in the Modern Enterprise
Navigating Identity and Access Management in the Modern EnterpriseNavigating Identity and Access Management in the Modern Enterprise
Navigating Identity and Access Management in the Modern Enterprise
 
Less Is More: Utilizing Ballerina to Architect a Cloud Data Platform
Less Is More: Utilizing Ballerina to Architect a Cloud Data PlatformLess Is More: Utilizing Ballerina to Architect a Cloud Data Platform
Less Is More: Utilizing Ballerina to Architect a Cloud Data Platform
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 

Webinar: Solr & Fusion for Big Data

  • 1.
  • 2. Solr & Fusion for Big Data • Where search fits in the big data landscape? • Solr on HDFS • Indexing strategies • End-to-end security • Lambda architecture • Spark and how we use it in Fusion
  • 3. The standard for enterprise search. of Fortune 500 uses Solr. 90%
  • 4. Why search for big data? • Speed at scale • Basic analytics (facets, pivot facets, facets + stats) + visualizations • Query structured and unstructured data • Ad hoc exploration is inherent in big data • People grok search • Context for aggregations (drill into the numbers)
  • 5. Common use case: log analysis • Time-ordered data • Raw data stored in HDFS • How much data? How fast? • Access patterns? • Schema design ~ no free lunch at scale
  • 6. Time-based Partitioning Scheme Fusion Log Analytics Dashboard logs_feb26 (daily collection) logs_feb25 (daily collection) logs_feb01 (daily collection) h00 (shard) h22 (shard) h23 (shard) h00 (shard) h22 (shard) h23 (shard) Add replicas to support higher query volume & fault-tolerance recent_logs (colllection alias) Use a collection alias to make multiple collections look like a single collection; minimize exposure to partitioning strategy in client layer Every daily collection has 24 shards (h00-h23), each covering 1-hour blocks of log messages
  • 7. Solr on HDFS • Maturing solution still some issues • My test showed ~23-25% slower than local SSD • Better ROI, operational efficiency, security • Needed for YARN • Enables auto add replicas • Interesting features coming soon: ZooKeeper lock (SOLR- 8169) and replicas share index (SOLR-6237)
  • 8. Solr on HDFS Solr shard1 / replica1 block cache Solr shard1 / replica2 block cache writes reads HDFS DataNode C HDFS DataNode B HDFS DataNode A writes reads HDFS block replication Solr replication
  • 9. Auto Add Replica HDFS DataNode C block cache Solr shard1 / replica1 writes reads HDFS DataNode A HDFS block replication Solr shard1 / replica2 block cache HDFS DataNode Bwrites reads Solr replication overseer ZooKeeper watches Solr shard1 / replica3 writes reads
  • 10. Indexing Strategies • Many tools available! • MapReduce indexer (Solr contrib) • LWOutputFormat, Hive SerDe, Pig StoreFunc, HBase • Storm to Solr or Fusion (github.com/LucidWorks/storm-solr) • Spark to Solr or Fusion (github.com/LucidWorks/spark-solr) • Lucidworks Fusion Connectors
  • 11. Any Data. Any Source.
  • 12. Fusion Indexing Pipelines in MapReduce Solr Map Task (or reducer if needed) ZooKeeper CloudSolr Client HDFS Get collection metadata from ZooKeeper (e.g. shard leader URL) Send updates to shard leaders in parallel Fusion Pipeline docs …N map tasks (1 per block) 30+ index stages - Field mapping - JavaScript - Tika parsing - NLP - Regex - JDBC lookup Many common file formats supported: CSV, SequenceFile, grok, XML, warc
  • 13. Security • End-to-end security is now a reality for Hadoop • Kerberos authentication (ZK, Solr, HDFS, jobs) • Pluggable authorization framework • Collection and document-level access controls (via Fusion) • SSL • Apache Ranger (centralized admin, auditing, monitoring for Hadoop)
  • 14. Cluster Sizing Worksheet • There is no formula, only guidelines! • # of documents / avg. doc size / number of fields • Updates per second / soft-commit frequency • Storage type (local SSD vs. HDFS) • Sharding scheme (time-based vs. hash-based) • Peak QPS / 95th percentile response time / query complexity • Must test your data on your servers ;-)
  • 15. • Search engine fits perfectly with lambda • Use batch layer to build indexes instead of “views” • Speed layer uses Spark streaming to build near real-time index • Aggregation collections for historical data Lambda Architecture source: http://lambda-architecture.net/
  • 16. Spark Spark Core Spark SQL Spark Streaming MLlib (machine learning) GraphX (BSP) Hadoop YARN Mesos Standalone HDFS Execution Model The Shuffle Caching engine cluster mgmt Tachyon languages Scala Java Python R shared memory
  • 17. The most relevant results every single time. Massive scale. Real-time. Secure. Any data. Any source.
  • 19. Any questions? • Try Fusion http://lucidworks.com/products/fusion/download • LinkedIn / Twitter / Solr JIRA: @thelabdude

Editor's Notes

  1. I don’t have to tell you that big data is a popular topic of discussion in IT circles these days. What I want to talk about today is how Solr and Lucidworks Fusion fit into the big data landscape. Don’t worry, I’ll try to keep the hype and grandiose statements to a minimum. I will get technical in a few places because it’s important to understand the details. Search is a critical component of any big data strategy Fusion & Solr are first-class citizens in the Hadoop ecosystem Big data doesn’t have to be hard – Fusion makes it easy
  2. Search engines contain mission-critical data and are typically on the front-line, directly serving users
  3. Before I was a Solr committer, I was a Solr user, one of the first adopters of SolrCloud actually. I worked on a team that built and supported a big data framework built on Hadoop, Storm, Cassandra, Solr, and Postgres. Effectively, we computed performance metrics for brands by analyzing social media data IT organizations are consolidating data infrastructure for improved ROI, efficiency, security, and governance. Solr is included as part of Cloudera, Hortonworks, and MapR Hadoop distributions Users get search because they see it everyday; BI / dashboards / SQL are powerful, but not necessarily intuitive Vast amount of exhaust created by users interacting with searchable content Often times, it’s a small department in a larger organization that uses search to expose medium data to deliver business insights and then the “search engine” evolves into an insights engine on larger and larger data sets. If you could plan for all the possible queries you need to serve, then traditional BI / data warehousing techniques will still serve you well. Search fills the void where users need fast, ad hoc query capabilities to do exploratory analysis.
  4. Let’s imagine we have time-ordered data, such as logs of user activity. You can insert any scale that fits your needs here. We work with customers that have a billions log events per day up to 10’s of billions. Let’s work through a quick example to illustrate some of the questions that come up and how we tackle them at Lucidworks Bunch of log data HDFS, want to index it for ad hoc queries and basic visualizations, i.e. the kinds that you can power with simple analytical functions like faceting First thing we have to identify is what data are we indexing? where is it coming from? how much data is there? how quickly do we need it to be indexed? But wait … step back a sec … how are people going to search this data? There are three important decisions emerge when designing your search solution: data partitioning scheme (time-based: hourly, daily, 15-minutes, etc) doc values: fields you need to sort and facet on should have doc values range queries need trie-fields to be indexed what fields must be stored / indexed What type of visualizations make sense for this data? What type of aggregations do you want to perform and at what time granularity? So here we’re starting to see some of the same considerations when designing data warehouse, i.e. there’s no free lunch, esp. at scale
  5. The key-takeaway here is that you use your investment in Hadoop to scale complex document processing using Fusion pipelines by running a pipeline in each map or reduce task