SlideShare a Scribd company logo
1 of 36
1
Real-time searching
of big data with
Solr and Hadoop
Rod Cope, CTO
2
• Introduction
• The problem
• The solution
• Top 10 lessons
• Final thoughts
• Q&A
3
Rod Cope, CTO
Rogue Wave Software
4
• “Big data”
– All the world’s open source software
– Metadata, code, indexes
– Individual tables contain many terabytes
– Relational databases aren’t scale-free
• Growing every day
• Need real-time random access to all data
• Long-running and complex analysis jobs
5
• Hadoop, HBase, and Solr
– Hadoop – distributed file system, map/reduce
– HBase – “NoSQL” data store – column-oriented
– Solr – search server based on Lucene
– All are scalable, flexible, fast, well-supported, and used in
production environments
• And a supporting cast of thousands…
– Stargate, MySQL, Rails, Redis, Resque,
– Nginx, Unicorn, HAProxy, Memcached,
– Ruby, JRuby, CentOS, …
6
Internet Application LAN Data LAN *Caching and load balancing not shown
7
• HBase  NoSQL
– Think hash table, not relational database
• How do find my data if primary key won’t cut it?
• Solr to the rescue
– Very fast, highly scalable search server with built-in sharding
and replication – based on Lucene
– Dynamic schema, powerful query language, faceted search,
accessible via simple REST-like web API w/XML, JSON, Ruby,
and other data formats
8
• Sharding
– Query any server – it executes the same query against all other servers in
the group
– Returns aggregated result to original caller
• Async replication (slaves poll their masters)
– Can use repeaters if replicating across data centers
• OpenLogic
– Solr farm, sharded, cross-replicated, fronted with HAProxy
• Load balanced writes across masters, reads across masters and slaves
• Be careful not to over-commit
– Billions of lines of code in HBase, all indexed in Solr for real-time search in
multiple ways
– Over 20 Solr fields indexed per source file
9
10
11
12
13
14
15
• Experiment with different Solr merge factors
– During huge loads, it can help to use a higher factor for load
performance
• Minimize index manipulation gymnastics
• Start with something like 25
– When you’re done with the massive initial load/import, turn it
back down for search performance
• Minimize number of queries
• Start with something like 5
• Note that a small merge factor will hurt indexing performance if you
need to do massive loads on a frequent basis or continuous
indexing
16
• Test your write-focused load balancing
– Look for large skews in Solr index size
– Note: you may have to commit, optimize, write again, and commit
before you can really tell
• Make sure your replication slaves are keeping up
– Using identical hardware helps
– If index directories don’t look the same, something is wrong
17
• Don’t commit to Solr too frequently
– It’s easy to auto-commit or commit after every record
– Doing this 100’s of times per second will take Solr down,
especially if you have serious warm up queries configured
• Avoid putting large values in HBase (> 5MB)
– Works, but may cause instability and/or performance issues
– Rows and columns are cheap, so use more of them instead
18
• Don’t use a single machine to load the cluster
– You might not live long enough to see it finish
• At OpenLogic, we spread raw source data across many machines and
hard drives via NFS
– Be very careful with NFS configuration – can hang machines
• Load data into HBase via Hadoop map/reduce jobs
– Turn off WAL for much better performance
• put.setWriteToWAL(false)
– Index in Solr as you go
• Good way to test your load balancing write schemes and replication set up
– This will find your weak spots!
19
• Writing data loading jobs can be tedious
• Scripting is faster and easier than writing Java
• Great for system administration tasks, testing
• Standard HBase shell is based on JRuby
• Very easy Map/Reduce jobs with J/Ruby and Wukong
• Used heavily at OpenLogic
– Productivity of Ruby
– Power of Java Virtual Machine
– Ruby on Rails, Hadoop integration, GUI clients
20
21
JRuby
list = ["Rod", "Neeta", "Eric", "Missy"]
shorts = list.find_all { |name| name.size <= 4 }
puts shorts.size
shorts.each { |name| puts name }
-> 2
-> Rod
Eric
Groovy
list = ["Rod", "Neeta", "Eric", "Missy"]
shorts = list.findAll { name -> name.size() <= 4 }
println shorts.size
shorts.each { name -> println name }
-> 2
-> Rod
Eric
22
• Hadoop
– SPOF around Namenode, append functionality
• HBase
– Backup, replication, and indexing solutions in flux
• Solr
– Several competing solutions around cloud-like scalability and
fault-tolerance, including ZooKeeper and Hadoop integration
– No clear winner, none quite ready for production
23
• Many moving parts
– It’s easy to let typos slip through
– Consider automated configuration via Chef, Puppet, or similar
• Pay attention to the details
– Operating system – max open files, sockets, and other limits
– Hadoop and HBase configuration
• http://hbase.apache.org/book.html#trouble
– Solr merge factor and norms
• Don’t starve HBase or Solr for memory
– Swapping will cripple your system
24
• “Commodity hardware” != 3 year old desktop
• Dual quad-core, 32GB RAM, 4+ disks
• Don’t bother with RAID on Hadoop data disks
– Be wary of non-enterprise drives
• Expect ugly hardware issues at some point
25
• Environment
– 100+ CPU cores
– 100+ Terabytes of disk
– Machines don’t have identity
– Add capacity by plugging in new machines
• Why not Amazon EC2?
– Great for computational bursts
– Expensive for long-term storage of big data
– Not yet consistent enough for mission-critical usage of HBase
26
• Dual quad-core and dual hex-core
• Dell boxes
• 32-64GB RAM
– ECC (highly recommended by Google)
• 6 x 2TB enterprise hard drives
• RAID 1 on two of the drives
– OS, Hadoop, HBase, Solr, NFS mounts (be careful!), job code, etc.
– Key “source” data backups
• Hadoop datanode gets remaining drives
• Redundant enterprise switches
• Dual- and quad-gigabit NIC’s
27
• Amazon EC2
– EBS Storage
• 100TB * $0.10/GB/month = $120k/year
– Double Extra Large instances
• 13 EC2 compute units, 34.2GB RAM
• 20 instances * $1.00/hr * 8,760 hrs/yr = $175k/year
• 3 year reserved instances
– 20 * 4k = $80k up front to reserve
– (20 * $0.34/hr * 8,760 hrs/yr * 3 yrs) / 3 = $86k/year to operate
– Totals for 20 virtual machines
• 1st year cost: $120k + $80k + $86k = $286k
• 2nd & 3rd year costs: $120k + $86k = $206k
• Average: ($286k + $206k + $206k) / 3 = $232k/year
28
• Buy your own
– 20 * Dell servers w/12 CPU cores, 32GB RAM, 5 TB disk = $160k
• Over 33 EC2 compute units each
– Total: $53k/year (amortized over 3 years)
29
• Amazon EC2
– 20 instances * 13 EC2 compute units = 260 EC2 compute units
– Cost: $232k/year
• Buy your own
– 20 machines * 33 EC2 compute units = 660 EC2 compute units
– Cost: $53k/year
– Does not include hosting and maintenance costs
• Don’t think system administration goes away
– You still “own” all the instances – monitoring, debugging, support
30
• Hardware
– Power supplies, hard drives
• Operating System
– Kernel panics, zombie processes, dropped packets
• Software Servers
– Hadoop datanodes, HBase regionservers, Stargate servers, Solr
servers
• Your Code and Data
– Stray map/reduce jobs, strange corner cases in your data leading
to program failures
31
32
• Hadoop, HBase, Solr
• Apache, Tomcat, ZooKeeper,
HAProxy
• Stargate, JRuby, Lucene, Jetty,
HSQLDB, Geronimo
• Apache Commons, JUnit
• CentOS
• Dozens more
• Too expensive to build or buy
everything
33
• You can host big data in your own private cloud
– Tools are available today that didn’t exist a few years ago
– Fast to prototype – production readiness takes time
– Expect to invest in training and support
• Hbase and Solr are fast
– 100+ random queries/sec per instance
– Give them memory and stand back
• Hbase Scales, Solr scales (to a point)
– Don’t worry about out-growing a few machines
– Do worry about out-growing a rack of Solr instances
• Look for ways to partition your data other than “automatic” sharding
34
Q&A
35
Rod Cope
rod.cope@roguewave.com
See us in action:
www.roguewave.com
36

More Related Content

What's hot

Intro to big data choco devday - 23-01-2014
Intro to big data   choco devday - 23-01-2014Intro to big data   choco devday - 23-01-2014
Intro to big data choco devday - 23-01-2014
Hassan Islamov
 

What's hot (20)

Improving Hadoop Cluster Performance via Linux Configuration
Improving Hadoop Cluster Performance via Linux ConfigurationImproving Hadoop Cluster Performance via Linux Configuration
Improving Hadoop Cluster Performance via Linux Configuration
 
HBase lon meetup
HBase lon meetupHBase lon meetup
HBase lon meetup
 
Shard-Query, an MPP database for the cloud using the LAMP stack
Shard-Query, an MPP database for the cloud using the LAMP stackShard-Query, an MPP database for the cloud using the LAMP stack
Shard-Query, an MPP database for the cloud using the LAMP stack
 
Intro to big data choco devday - 23-01-2014
Intro to big data   choco devday - 23-01-2014Intro to big data   choco devday - 23-01-2014
Intro to big data choco devday - 23-01-2014
 
8a. How To Setup HBase with Docker
8a. How To Setup HBase with Docker8a. How To Setup HBase with Docker
8a. How To Setup HBase with Docker
 
Optimizing your Infrastrucure and Operating System for Hadoop
Optimizing your Infrastrucure and Operating System for HadoopOptimizing your Infrastrucure and Operating System for Hadoop
Optimizing your Infrastrucure and Operating System for Hadoop
 
Apache kudu
Apache kuduApache kudu
Apache kudu
 
March 2011 HUG: Scaling Hadoop
March 2011 HUG: Scaling HadoopMarch 2011 HUG: Scaling Hadoop
March 2011 HUG: Scaling Hadoop
 
HBase: Extreme Makeover
HBase: Extreme MakeoverHBase: Extreme Makeover
HBase: Extreme Makeover
 
Let's Talk Operations! (Hadoop Summit 2014)
Let's Talk Operations! (Hadoop Summit 2014)Let's Talk Operations! (Hadoop Summit 2014)
Let's Talk Operations! (Hadoop Summit 2014)
 
Accumulo Summit 2014: Benchmarking Accumulo: How Fast Is Fast?
Accumulo Summit 2014: Benchmarking Accumulo: How Fast Is Fast?Accumulo Summit 2014: Benchmarking Accumulo: How Fast Is Fast?
Accumulo Summit 2014: Benchmarking Accumulo: How Fast Is Fast?
 
Hadoop Performance at LinkedIn
Hadoop Performance at LinkedInHadoop Performance at LinkedIn
Hadoop Performance at LinkedIn
 
HBase and Accumulo | Washington DC Hadoop User Group
HBase and Accumulo | Washington DC Hadoop User GroupHBase and Accumulo | Washington DC Hadoop User Group
HBase and Accumulo | Washington DC Hadoop User Group
 
Divide and conquer in the cloud
Divide and conquer in the cloudDivide and conquer in the cloud
Divide and conquer in the cloud
 
Conquering "big data": An introduction to shard query
Conquering "big data": An introduction to shard queryConquering "big data": An introduction to shard query
Conquering "big data": An introduction to shard query
 
Simple Works Best
 Simple Works Best Simple Works Best
Simple Works Best
 
02 Hadoop deployment and configuration
02 Hadoop deployment and configuration02 Hadoop deployment and configuration
02 Hadoop deployment and configuration
 
Deployment and Management of Hadoop Clusters
Deployment and Management of Hadoop ClustersDeployment and Management of Hadoop Clusters
Deployment and Management of Hadoop Clusters
 
Elastic HBase on Mesos - HBaseCon 2015
Elastic HBase on Mesos - HBaseCon 2015Elastic HBase on Mesos - HBaseCon 2015
Elastic HBase on Mesos - HBaseCon 2015
 
HBaseCon 2013: Apache HBase Table Snapshots
HBaseCon 2013: Apache HBase Table SnapshotsHBaseCon 2013: Apache HBase Table Snapshots
HBaseCon 2013: Apache HBase Table Snapshots
 

Viewers also liked

Slash n: Tech Talk Track 2 – Distributed Transactions in SOA - Yogi Kulkarni,...
Slash n: Tech Talk Track 2 – Distributed Transactions in SOA - Yogi Kulkarni,...Slash n: Tech Talk Track 2 – Distributed Transactions in SOA - Yogi Kulkarni,...
Slash n: Tech Talk Track 2 – Distributed Transactions in SOA - Yogi Kulkarni,...
slashn
 
Solr on HDFS - Past, Present, and Future: Presented by Mark Miller, Cloudera
Solr on HDFS - Past, Present, and Future: Presented by Mark Miller, ClouderaSolr on HDFS - Past, Present, and Future: Presented by Mark Miller, Cloudera
Solr on HDFS - Past, Present, and Future: Presented by Mark Miller, Cloudera
Lucidworks
 

Viewers also liked (13)

Slash n: Tech Talk Track 2 – Distributed Transactions in SOA - Yogi Kulkarni,...
Slash n: Tech Talk Track 2 – Distributed Transactions in SOA - Yogi Kulkarni,...Slash n: Tech Talk Track 2 – Distributed Transactions in SOA - Yogi Kulkarni,...
Slash n: Tech Talk Track 2 – Distributed Transactions in SOA - Yogi Kulkarni,...
 
Five ways to protect your software supply chain from hacks, quacks, and wrecks
Five ways to protect your software supply chain from hacks, quacks, and wrecksFive ways to protect your software supply chain from hacks, quacks, and wrecks
Five ways to protect your software supply chain from hacks, quacks, and wrecks
 
Shifting the conversation from active interception to proactive neutralization
Shifting the conversation from active interception to proactive neutralization Shifting the conversation from active interception to proactive neutralization
Shifting the conversation from active interception to proactive neutralization
 
Top 5 best practice for delivering secure in-vehicle software
Top 5 best practice for delivering secure in-vehicle softwareTop 5 best practice for delivering secure in-vehicle software
Top 5 best practice for delivering secure in-vehicle software
 
Enterprise Search: An Information Architect's Perspective
Enterprise Search: An Information Architect's PerspectiveEnterprise Search: An Information Architect's Perspective
Enterprise Search: An Information Architect's Perspective
 
Legal and Practical Concerns with Software Development
Legal and Practical Concerns with Software DevelopmentLegal and Practical Concerns with Software Development
Legal and Practical Concerns with Software Development
 
TriHUG: Lucene Solr Hadoop
TriHUG: Lucene Solr HadoopTriHUG: Lucene Solr Hadoop
TriHUG: Lucene Solr Hadoop
 
Integrating Hadoop & Solr
Integrating Hadoop & SolrIntegrating Hadoop & Solr
Integrating Hadoop & Solr
 
Enterprise Search Summit Keynote: A Big Data Architecture for Search
Enterprise Search Summit Keynote: A Big Data Architecture for SearchEnterprise Search Summit Keynote: A Big Data Architecture for Search
Enterprise Search Summit Keynote: A Big Data Architecture for Search
 
The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...
The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...
The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...
 
Solr on HDFS - Past, Present, and Future: Presented by Mark Miller, Cloudera
Solr on HDFS - Past, Present, and Future: Presented by Mark Miller, ClouderaSolr on HDFS - Past, Present, and Future: Presented by Mark Miller, Cloudera
Solr on HDFS - Past, Present, and Future: Presented by Mark Miller, Cloudera
 
Dlvr.it 使用說明
Dlvr.it 使用說明Dlvr.it 使用說明
Dlvr.it 使用說明
 
Faster, Cheaper, Better - Replacing Oracle with Hadoop & Solr
Faster, Cheaper, Better - Replacing Oracle with Hadoop & SolrFaster, Cheaper, Better - Replacing Oracle with Hadoop & Solr
Faster, Cheaper, Better - Replacing Oracle with Hadoop & Solr
 

Similar to Real-time searching of big data with Solr and Hadoop

Hadoop ppt on the basics and architecture
Hadoop ppt on the basics and architectureHadoop ppt on the basics and architecture
Hadoop ppt on the basics and architecture
saipriyacoool
 
Storage Systems For Scalable systems
Storage Systems For Scalable systemsStorage Systems For Scalable systems
Storage Systems For Scalable systems
elliando dias
 
Infrastructure Around Hadoop
Infrastructure Around HadoopInfrastructure Around Hadoop
Infrastructure Around Hadoop
DataWorks Summit
 
4. hadoop גיא לבנברג
4. hadoop  גיא לבנברג4. hadoop  גיא לבנברג
4. hadoop גיא לבנברג
Taldor Group
 
Data Wrangling and Oracle Connectors for Hadoop
Data Wrangling and Oracle Connectors for HadoopData Wrangling and Oracle Connectors for Hadoop
Data Wrangling and Oracle Connectors for Hadoop
Gwen (Chen) Shapira
 
Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with Hadoop
Cloudera, Inc.
 

Similar to Real-time searching of big data with Solr and Hadoop (20)

The Evolution of the Oracle Database - Then, Now and Later (Fontys Hogeschool...
The Evolution of the Oracle Database - Then, Now and Later (Fontys Hogeschool...The Evolution of the Oracle Database - Then, Now and Later (Fontys Hogeschool...
The Evolution of the Oracle Database - Then, Now and Later (Fontys Hogeschool...
 
Hadoop ppt on the basics and architecture
Hadoop ppt on the basics and architectureHadoop ppt on the basics and architecture
Hadoop ppt on the basics and architecture
 
Bigdata workshop february 2015
Bigdata workshop  february 2015 Bigdata workshop  february 2015
Bigdata workshop february 2015
 
Storage Systems For Scalable systems
Storage Systems For Scalable systemsStorage Systems For Scalable systems
Storage Systems For Scalable systems
 
Infrastructure Around Hadoop
Infrastructure Around HadoopInfrastructure Around Hadoop
Infrastructure Around Hadoop
 
Hadoop for the Absolute Beginner
Hadoop for the Absolute BeginnerHadoop for the Absolute Beginner
Hadoop for the Absolute Beginner
 
Managing growth in Production Hadoop Deployments
Managing growth in Production Hadoop DeploymentsManaging growth in Production Hadoop Deployments
Managing growth in Production Hadoop Deployments
 
Redshift deep dive
Redshift deep diveRedshift deep dive
Redshift deep dive
 
Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016
 
4. hadoop גיא לבנברג
4. hadoop  גיא לבנברג4. hadoop  גיא לבנברג
4. hadoop גיא לבנברג
 
Data Wrangling and Oracle Connectors for Hadoop
Data Wrangling and Oracle Connectors for HadoopData Wrangling and Oracle Connectors for Hadoop
Data Wrangling and Oracle Connectors for Hadoop
 
Teradata Partners Conference Oct 2014 Big Data Anti-Patterns
Teradata Partners Conference Oct 2014   Big Data Anti-PatternsTeradata Partners Conference Oct 2014   Big Data Anti-Patterns
Teradata Partners Conference Oct 2014 Big Data Anti-Patterns
 
Make Life Suck Less (Building Scalable Systems)
Make Life Suck Less (Building Scalable Systems)Make Life Suck Less (Building Scalable Systems)
Make Life Suck Less (Building Scalable Systems)
 
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introduction
 
Large-scale Web Apps @ Pinterest
Large-scale Web Apps @ PinterestLarge-scale Web Apps @ Pinterest
Large-scale Web Apps @ Pinterest
 
Scaling the Web: Databases & NoSQL
Scaling the Web: Databases & NoSQLScaling the Web: Databases & NoSQL
Scaling the Web: Databases & NoSQL
 
Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with Hadoop
 
Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)
Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)
Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)
 
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQLCompressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
 
Redis - The Universal NoSQL Tool
Redis - The Universal NoSQL ToolRedis - The Universal NoSQL Tool
Redis - The Universal NoSQL Tool
 

More from Rogue Wave Software

More from Rogue Wave Software (20)

The Global Influence of Open Banking, API Security, and an Open Data Perspective
The Global Influence of Open Banking, API Security, and an Open Data PerspectiveThe Global Influence of Open Banking, API Security, and an Open Data Perspective
The Global Influence of Open Banking, API Security, and an Open Data Perspective
 
No liftoff, touchdown, or heartbeat shall miss because of a software failure
No liftoff, touchdown, or heartbeat shall miss because of a software failureNo liftoff, touchdown, or heartbeat shall miss because of a software failure
No liftoff, touchdown, or heartbeat shall miss because of a software failure
 
Disrupt or be disrupted – Using secure APIs to drive digital transformation
Disrupt or be disrupted – Using secure APIs to drive digital transformationDisrupt or be disrupted – Using secure APIs to drive digital transformation
Disrupt or be disrupted – Using secure APIs to drive digital transformation
 
Leveraging open banking specifications for rigorous API security – What’s in...
Leveraging open banking specifications for rigorous API security –  What’s in...Leveraging open banking specifications for rigorous API security –  What’s in...
Leveraging open banking specifications for rigorous API security – What’s in...
 
Adding layers of security to an API in real-time
Adding layers of security to an API in real-timeAdding layers of security to an API in real-time
Adding layers of security to an API in real-time
 
Getting the most from your API management platform: A case study
Getting the most from your API management platform: A case studyGetting the most from your API management platform: A case study
Getting the most from your API management platform: A case study
 
Advanced technologies and techniques for debugging HPC applications
Advanced technologies and techniques for debugging HPC applicationsAdvanced technologies and techniques for debugging HPC applications
Advanced technologies and techniques for debugging HPC applications
 
The forgotten route: Making Apache Camel work for you
The forgotten route: Making Apache Camel work for youThe forgotten route: Making Apache Camel work for you
The forgotten route: Making Apache Camel work for you
 
Are open source and embedded software development on a collision course?
Are open source and embedded software development on a  collision course?Are open source and embedded software development on a  collision course?
Are open source and embedded software development on a collision course?
 
Three big mistakes with APIs and microservices
Three big mistakes with APIs and microservices Three big mistakes with APIs and microservices
Three big mistakes with APIs and microservices
 
5 strategies for enterprise cloud infrastructure success
5 strategies for enterprise cloud infrastructure success5 strategies for enterprise cloud infrastructure success
5 strategies for enterprise cloud infrastructure success
 
PSD2 & Open Banking: How to go from standards to implementation and compliance
PSD2 & Open Banking: How to go from standards to implementation and compliancePSD2 & Open Banking: How to go from standards to implementation and compliance
PSD2 & Open Banking: How to go from standards to implementation and compliance
 
Java 10 and beyond: Keeping up with the language and planning for the future
Java 10 and beyond: Keeping up with the language and planning for the futureJava 10 and beyond: Keeping up with the language and planning for the future
Java 10 and beyond: Keeping up with the language and planning for the future
 
How to keep developers happy and lawyers calm (Presented at ESC Boston)
How to keep developers happy and lawyers calm (Presented at ESC Boston)How to keep developers happy and lawyers calm (Presented at ESC Boston)
How to keep developers happy and lawyers calm (Presented at ESC Boston)
 
Open source applied - Real world use cases (Presented at Open Source 101)
Open source applied - Real world use cases (Presented at Open Source 101)Open source applied - Real world use cases (Presented at Open Source 101)
Open source applied - Real world use cases (Presented at Open Source 101)
 
How to migrate SourcePro apps from Solaris to Linux
How to migrate SourcePro apps from Solaris to LinuxHow to migrate SourcePro apps from Solaris to Linux
How to migrate SourcePro apps from Solaris to Linux
 
Approaches to debugging mixed-language HPC apps
Approaches to debugging mixed-language HPC appsApproaches to debugging mixed-language HPC apps
Approaches to debugging mixed-language HPC apps
 
Enterprise Linux: Justify your migration from Red Hat to CentOS
Enterprise Linux: Justify your migration from Red Hat to CentOSEnterprise Linux: Justify your migration from Red Hat to CentOS
Enterprise Linux: Justify your migration from Red Hat to CentOS
 
Walk through an enterprise Linux migration
Walk through an enterprise Linux migrationWalk through an enterprise Linux migration
Walk through an enterprise Linux migration
 
How to keep developers happy and lawyers calm
How to keep developers happy and lawyers calmHow to keep developers happy and lawyers calm
How to keep developers happy and lawyers calm
 

Recently uploaded

%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
masabamasaba
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is inside
shinachiaurasa2
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
masabamasaba
 

Recently uploaded (20)

%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
 
tonesoftg
tonesoftgtonesoftg
tonesoftg
 
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the past
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Harnessing ChatGPT - Elevating Productivity in Today's Agile Environment
Harnessing ChatGPT  - Elevating Productivity in Today's Agile EnvironmentHarnessing ChatGPT  - Elevating Productivity in Today's Agile Environment
Harnessing ChatGPT - Elevating Productivity in Today's Agile Environment
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
 
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionIntroducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
 
WSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
WSO2Con2024 - Enabling Transactional System's Exponential Growth With SimplicityWSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
WSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
 
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
 
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is inside
 
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
 

Real-time searching of big data with Solr and Hadoop

  • 1. 1 Real-time searching of big data with Solr and Hadoop Rod Cope, CTO
  • 2. 2 • Introduction • The problem • The solution • Top 10 lessons • Final thoughts • Q&A
  • 3. 3 Rod Cope, CTO Rogue Wave Software
  • 4. 4 • “Big data” – All the world’s open source software – Metadata, code, indexes – Individual tables contain many terabytes – Relational databases aren’t scale-free • Growing every day • Need real-time random access to all data • Long-running and complex analysis jobs
  • 5. 5 • Hadoop, HBase, and Solr – Hadoop – distributed file system, map/reduce – HBase – “NoSQL” data store – column-oriented – Solr – search server based on Lucene – All are scalable, flexible, fast, well-supported, and used in production environments • And a supporting cast of thousands… – Stargate, MySQL, Rails, Redis, Resque, – Nginx, Unicorn, HAProxy, Memcached, – Ruby, JRuby, CentOS, …
  • 6. 6 Internet Application LAN Data LAN *Caching and load balancing not shown
  • 7. 7 • HBase  NoSQL – Think hash table, not relational database • How do find my data if primary key won’t cut it? • Solr to the rescue – Very fast, highly scalable search server with built-in sharding and replication – based on Lucene – Dynamic schema, powerful query language, faceted search, accessible via simple REST-like web API w/XML, JSON, Ruby, and other data formats
  • 8. 8 • Sharding – Query any server – it executes the same query against all other servers in the group – Returns aggregated result to original caller • Async replication (slaves poll their masters) – Can use repeaters if replicating across data centers • OpenLogic – Solr farm, sharded, cross-replicated, fronted with HAProxy • Load balanced writes across masters, reads across masters and slaves • Be careful not to over-commit – Billions of lines of code in HBase, all indexed in Solr for real-time search in multiple ways – Over 20 Solr fields indexed per source file
  • 9. 9
  • 10. 10
  • 11. 11
  • 12. 12
  • 13. 13
  • 14. 14
  • 15. 15 • Experiment with different Solr merge factors – During huge loads, it can help to use a higher factor for load performance • Minimize index manipulation gymnastics • Start with something like 25 – When you’re done with the massive initial load/import, turn it back down for search performance • Minimize number of queries • Start with something like 5 • Note that a small merge factor will hurt indexing performance if you need to do massive loads on a frequent basis or continuous indexing
  • 16. 16 • Test your write-focused load balancing – Look for large skews in Solr index size – Note: you may have to commit, optimize, write again, and commit before you can really tell • Make sure your replication slaves are keeping up – Using identical hardware helps – If index directories don’t look the same, something is wrong
  • 17. 17 • Don’t commit to Solr too frequently – It’s easy to auto-commit or commit after every record – Doing this 100’s of times per second will take Solr down, especially if you have serious warm up queries configured • Avoid putting large values in HBase (> 5MB) – Works, but may cause instability and/or performance issues – Rows and columns are cheap, so use more of them instead
  • 18. 18 • Don’t use a single machine to load the cluster – You might not live long enough to see it finish • At OpenLogic, we spread raw source data across many machines and hard drives via NFS – Be very careful with NFS configuration – can hang machines • Load data into HBase via Hadoop map/reduce jobs – Turn off WAL for much better performance • put.setWriteToWAL(false) – Index in Solr as you go • Good way to test your load balancing write schemes and replication set up – This will find your weak spots!
  • 19. 19 • Writing data loading jobs can be tedious • Scripting is faster and easier than writing Java • Great for system administration tasks, testing • Standard HBase shell is based on JRuby • Very easy Map/Reduce jobs with J/Ruby and Wukong • Used heavily at OpenLogic – Productivity of Ruby – Power of Java Virtual Machine – Ruby on Rails, Hadoop integration, GUI clients
  • 20. 20
  • 21. 21 JRuby list = ["Rod", "Neeta", "Eric", "Missy"] shorts = list.find_all { |name| name.size <= 4 } puts shorts.size shorts.each { |name| puts name } -> 2 -> Rod Eric Groovy list = ["Rod", "Neeta", "Eric", "Missy"] shorts = list.findAll { name -> name.size() <= 4 } println shorts.size shorts.each { name -> println name } -> 2 -> Rod Eric
  • 22. 22 • Hadoop – SPOF around Namenode, append functionality • HBase – Backup, replication, and indexing solutions in flux • Solr – Several competing solutions around cloud-like scalability and fault-tolerance, including ZooKeeper and Hadoop integration – No clear winner, none quite ready for production
  • 23. 23 • Many moving parts – It’s easy to let typos slip through – Consider automated configuration via Chef, Puppet, or similar • Pay attention to the details – Operating system – max open files, sockets, and other limits – Hadoop and HBase configuration • http://hbase.apache.org/book.html#trouble – Solr merge factor and norms • Don’t starve HBase or Solr for memory – Swapping will cripple your system
  • 24. 24 • “Commodity hardware” != 3 year old desktop • Dual quad-core, 32GB RAM, 4+ disks • Don’t bother with RAID on Hadoop data disks – Be wary of non-enterprise drives • Expect ugly hardware issues at some point
  • 25. 25 • Environment – 100+ CPU cores – 100+ Terabytes of disk – Machines don’t have identity – Add capacity by plugging in new machines • Why not Amazon EC2? – Great for computational bursts – Expensive for long-term storage of big data – Not yet consistent enough for mission-critical usage of HBase
  • 26. 26 • Dual quad-core and dual hex-core • Dell boxes • 32-64GB RAM – ECC (highly recommended by Google) • 6 x 2TB enterprise hard drives • RAID 1 on two of the drives – OS, Hadoop, HBase, Solr, NFS mounts (be careful!), job code, etc. – Key “source” data backups • Hadoop datanode gets remaining drives • Redundant enterprise switches • Dual- and quad-gigabit NIC’s
  • 27. 27 • Amazon EC2 – EBS Storage • 100TB * $0.10/GB/month = $120k/year – Double Extra Large instances • 13 EC2 compute units, 34.2GB RAM • 20 instances * $1.00/hr * 8,760 hrs/yr = $175k/year • 3 year reserved instances – 20 * 4k = $80k up front to reserve – (20 * $0.34/hr * 8,760 hrs/yr * 3 yrs) / 3 = $86k/year to operate – Totals for 20 virtual machines • 1st year cost: $120k + $80k + $86k = $286k • 2nd & 3rd year costs: $120k + $86k = $206k • Average: ($286k + $206k + $206k) / 3 = $232k/year
  • 28. 28 • Buy your own – 20 * Dell servers w/12 CPU cores, 32GB RAM, 5 TB disk = $160k • Over 33 EC2 compute units each – Total: $53k/year (amortized over 3 years)
  • 29. 29 • Amazon EC2 – 20 instances * 13 EC2 compute units = 260 EC2 compute units – Cost: $232k/year • Buy your own – 20 machines * 33 EC2 compute units = 660 EC2 compute units – Cost: $53k/year – Does not include hosting and maintenance costs • Don’t think system administration goes away – You still “own” all the instances – monitoring, debugging, support
  • 30. 30 • Hardware – Power supplies, hard drives • Operating System – Kernel panics, zombie processes, dropped packets • Software Servers – Hadoop datanodes, HBase regionservers, Stargate servers, Solr servers • Your Code and Data – Stray map/reduce jobs, strange corner cases in your data leading to program failures
  • 31. 31
  • 32. 32 • Hadoop, HBase, Solr • Apache, Tomcat, ZooKeeper, HAProxy • Stargate, JRuby, Lucene, Jetty, HSQLDB, Geronimo • Apache Commons, JUnit • CentOS • Dozens more • Too expensive to build or buy everything
  • 33. 33 • You can host big data in your own private cloud – Tools are available today that didn’t exist a few years ago – Fast to prototype – production readiness takes time – Expect to invest in training and support • Hbase and Solr are fast – 100+ random queries/sec per instance – Give them memory and stand back • Hbase Scales, Solr scales (to a point) – Don’t worry about out-growing a few machines – Do worry about out-growing a rack of Solr instances • Look for ways to partition your data other than “automatic” sharding
  • 35. 35 Rod Cope rod.cope@roguewave.com See us in action: www.roguewave.com
  • 36. 36