
Running Solr in the Cloud at Memory Speed with Alluxio

In this talk, I introduce Alluxio, the fastest growing open source project in the big data ecosystem, and show how to leverage it for optimizing Solr performance. I'll begin with a brief introduction about how Alluxio works and why it's interesting for the Solr community. Next, I describe how to run Solr on Alluxio and cover basic integration scenarios. Lastly, I provide some performance comparisons between running Solr on Alluxio vs. a local FS and HDFS. Attendees will come away with a new toolset to help them use Solr to tackle a wide array of big data problems.

Published in: Data & Analytics

  1. Running Solr at Memory Speed with Alluxio. Timothy Potter, Lucidworks
  2. Agenda
     • Overview of Alluxio
     • Running Solr on Alluxio
     • Interesting Use Cases
     • Futures
     • Questions?
  3. Cool things I’ve learned about Alluxio …
     • Fastest growing open source project in big data space
     • Baidu reported having an Alluxio cluster with 1000 workers and 50TB of RAM … in Feb 2016!
     • Brings cloud-storage into the compute layer; data access at memory speed
     • No need to move / migrate data into Alluxio; just mount the under storage!
     • Apache 2.0 licensed but also has a commercial offering with support if needed
  4. Alluxio Basics
     • Hadoop FileSystem API: alluxio://…
     • Supports single node up to massive clusters
     • Uses ZK for HA stuff; master/worker model
     • Supports many popular storage systems: HDFS, S3, Azure Blob store, GCS, GlusterFS …
     • Alluxio FUSE to mount as FS on Linux
     memory-centric virtual distributed storage system
  5. Configure Solr to use Alluxio
     • mkdir or mount Solr root dir in Alluxio
         bin/alluxio fs mkdir /solr
     • Set start-up options in bin/solr.in.sh:
         solr.directoryFactory=HdfsDirectoryFactory
         solr.lock.type=hdfs
         solr.hdfs.home=alluxio://master:19998/solr
         solr.hdfs.confdir=/path/hadoop-conf
     • Add a core-site.xml to set:
         fs.AbstractFileSystem.alluxio.impl=alluxio.hadoop.AlluxioFileSystem
         fs.alluxio.impl.disable.cache=true
         alluxio.user.file.writetype.default=CACHE_THROUGH
     • Add alluxio client JAR to Solr classpath
         Copy alluxio-core-client-runtime-1.5.0-jar-with-dependencies.jar to server/solr-webapp/webapp/WEB-INF/lib/
     • Upconfig alluxio configset to ZK
         bin/solr zk upconfig -n alluxio -d server/solr/configsets/alluxio/conf
       see: http://bit.ly/2y33wQs
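The core-site.xml bullet on this slide names three properties but does not show the file. A minimal sketch of what it might look like, assuming the usual Hadoop `<configuration>/<property>` layout; the property names and values come from the slide, while the helper name `build_core_site` is hypothetical and the step of saving the file under the directory named by `solr.hdfs.confdir` is left out:

```python
import xml.etree.ElementTree as ET

# Property names and values are copied from the slide; nothing else is assumed.
ALLUXIO_PROPS = {
    "fs.AbstractFileSystem.alluxio.impl": "alluxio.hadoop.AlluxioFileSystem",
    "fs.alluxio.impl.disable.cache": "true",
    "alluxio.user.file.writetype.default": "CACHE_THROUGH",
}


def build_core_site(props):
    """Return a Hadoop-style <configuration> document as a string.

    Hypothetical helper for illustration only; it is not part of Solr or Alluxio.
    """
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


core_site_xml = build_core_site(ALLUXIO_PROPS)
print(core_site_xml)
```

Generating the file from a dictionary keeps the three settings in one place, which may help when the same values have to be rolled out to every Solr node.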
  6. Solr on Alluxio Tips & Tricks
     • Run an Alluxio worker on each Solr node
     • Write mode should be CACHE_THROUGH to ensure Solr files get persisted to the under storage, e.g. S3
     • Admin can “pin” an index directory to ensure it stays cached in memory
     • Set TTL on index directories that can be freed from memory after a given timeframe
     • Load command moves data from the under storage into Alluxio, such as after restoring an index from backup
  7. Use Case 1: Replace the OS cache with Local under FS
     • Index performance
         ~ 5M docs, ~4K docs/sec, <1% diff than local FS, 8GB index on disk
     • Query performance (9gb index, 5M docs, r4.xlarge)

     Metrics         Alluxio    MMap/SSD   HDFS
     QPS             36         42         20
     Max QTime       2212 ms    1789 ms    5612 ms
     Stddev QTime    335 ms     353 ms     609 ms
     Median QTime    70 ms      9 ms       187 ms
     75%             372 ms     383 ms     754 ms
     95%             972 ms     996 ms     1723 ms
     99%             1426 ms    1349 ms    2599 ms

     * NOTE: ymmv! Utterly un-scientific experiments to get a feel for the technology
  8. Use Case 2: Use cloud storage as under FS (S3, GCS, Azure)
     • Indexing rate: ~3,650 docs/sec to S3 vs. 4,000 on local
     • As expected, query perf metrics nearly identical
     • Mount the cloud storage system to a directory in Alluxio
         bin/alluxio fs mount alluxio://ec2-34-196-176-70.compute-1.amazonaws.com:19998/s3 s3a://sstk-dev/alluxio
     • Deploy cloud instances with lots of memory, e.g. r4’s in EC2
     • Use tiered storage to take advantage of the ephemeral disks (fast SSDs)
     • “pin” specific indexes for better performance guarantees
     [Diagram labels: S3 or GCS; Alluxio (memory); 10 to 100 Gbps; 100 Mbps to 10 Gbps]
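The indexing rates on this slide are given as raw numbers without the relative gap. A small worked example of that arithmetic, using the two approximate rates from the slide; the corpus size of one million documents is hypothetical and only serves to make the difference concrete:

```python
def relative_slowdown(baseline_rate, observed_rate):
    """Percent drop in throughput of observed_rate relative to baseline_rate."""
    return (baseline_rate - observed_rate) / baseline_rate * 100.0


def indexing_seconds(doc_count, docs_per_sec):
    """Wall-clock seconds to index doc_count documents at a steady rate."""
    return doc_count / docs_per_sec


LOCAL_DOCS_PER_SEC = 4000.0  # approximate local rate from the slide
S3_DOCS_PER_SEC = 3650.0     # approximate rate with S3 as the under FS

slowdown_pct = relative_slowdown(LOCAL_DOCS_PER_SEC, S3_DOCS_PER_SEC)

# Hypothetical corpus size, not taken from the slides.
EXAMPLE_DOCS = 1_000_000
extra_seconds = (indexing_seconds(EXAMPLE_DOCS, S3_DOCS_PER_SEC)
                 - indexing_seconds(EXAMPLE_DOCS, LOCAL_DOCS_PER_SEC))

print(f"S3-backed indexing is about {slowdown_pct:.2f}% slower than local")
print(f"Extra time for {EXAMPLE_DOCS} docs: about {extra_seconds:.0f} s")
```

With these inputs the gap works out to 350 / 4000, or 8.75%, which is consistent with the slide's informal framing of the numbers as a rough feel rather than a benchmark.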
  9. Use Case 3: Time-based Partitioning
     • Fits nicely with write-once indexes: signals, logs
     • Use Alluxio’s TTL feature to “free” indexes on aged out partitions
     • Tiered storage also allows you to have hot (memory), warm (SSD), cool (HDD), and cold (S3) partitions
     • Allocators and evictors to re-arrange blocks between tiers; easy to plug-in advanced strategies
     [Diagram labels: Solr Partition 9-15; Solr Partition 9-14; Solr Partition 9-13; Alluxio (memory); Alluxio (SSD); S3 or GCS]
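The hot / warm / cool / cold idea on this slide can be sketched as a simple age-based lookup. This is only an illustration of the partitioning logic, not how Alluxio's allocators and evictors decide placement; the age thresholds, the example dates, and both helper names are hypothetical, and the TTL helper assumes the TTL is expressed in milliseconds:

```python
from datetime import date

# Hypothetical age thresholds in days; the slides do not give any.
TIER_MAX_AGE_DAYS = [("hot (memory)", 1), ("warm (SSD)", 7), ("cool (HDD)", 30)]
COLDEST_TIER = "cold (S3)"


def tier_for_partition(partition_day, today):
    """Pick a tier name for a daily partition based on its age in days."""
    age_days = (today - partition_day).days
    for tier, max_age in TIER_MAX_AGE_DAYS:
        if age_days <= max_age:
            return tier
    return COLDEST_TIER


def ttl_millis(days):
    """Convert a retention period in days to milliseconds."""
    return days * 24 * 60 * 60 * 1000


# Illustrative dates only.
today = date(2017, 9, 15)
for day in (date(2017, 9, 15), date(2017, 9, 10), date(2017, 8, 1)):
    print(day.isoformat(), "->", tier_for_partition(day, today))
```

A value such as `ttl_millis(7)` could then be used as the TTL for a partition directory that should be freed from memory after a week.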
  10. Use Case 4: Cloud-based Recovery
     • Solr auto-add replica (have to use the HdfsUpdateLog)
         <updateLog class="solr.HdfsUpdateLog"> …
     • Alluxio will pull the files from memory on another worker if they’re available or go back to under FS storage
     • Wise to have some auto-warming queries / caches configured so that replicas don’t get marked as active in the cluster until they are warmed up … thanks Shalin! SOLR-6086
     [Diagram labels: S3 or GCS; Solr Replica; Alluxio (memory); Node 1 (us-east-1d); Node 2 (us-east-1c); Solr overseer; Add Replica]
  11. Synergy with Analytics & Machine Learning
     • Solr streaming expressions power analytics jobs that may require massive result sets at once
     • Hybrid solutions that mix Solr with compute frameworks like Spark and Flink
     • Alluxio speeds up SparkSQL and ML jobs
     • Fusion SQL ~ Keeping expensive views in Alluxio for analytics dashboards (complex queries against data loaded from Solr)
  12. Work in progress …
     • ALLUXIO-2995: Perf issue (fixed in 1.6.0)
         Work-around is: alluxio.user.file.cache.partially.read.block=false
     • Orphaned write.lock prevents core initialization after crash, SOLR-8335 and SOLR-8169
         bin/alluxio fs rm /solr/alluxio1/core_node1/data/index/write.lock
     • SOLR-11335: Closing FileSystem object retrieved from get()
         fs.alluxio.impl.disable.cache = true (in core-site.xml)
     • SOLR-6237: Shared replicas
     • SOLR-9515: Couldn’t get Solr running with s3a w/o Alluxio; classpath issues
     • Test ASYNC_THROUGH write mode with Solr
  13. FAQ
     • Does Alluxio support running in HA mode?
     • How does data locality work with Solr & Alluxio?
     • What block size do you recommend for Solr?
     • What’s the overhead of CACHE_THROUGH during indexing?
     • What about Solr’s block cache?
     • Does Alluxio work with Solr 7?
  14. Thank You
