
Real-time Search on Terabytes of Data Per Day: Lessons Learned

The challenge with operating modern data centers is figuring out how to collect, store, and search all of the log and event data generated by complex infrastructure. To address these challenges at Rocana, we're applying big data technologies that are designed to handle this massive scale. In particular, we need to simultaneously offer full-text and faceted search queries against large volumes of historical data as well as perform near-real-time search against events collected in the last minute. In this talk we describe the challenges we've experienced performing search against petabytes of historical data while scaling to terabytes of new data ingest per day. We'll also share the key lessons we've learned and detail our architecture for handling search at these massive volumes. We detail how we use Apache Kafka to stream event data in near real-time to Apache Hadoop HDFS for deep storage and how we leverage Apache Solr and Apache Lucene to scale our search infrastructure to terabytes per day of ingest. We present our solution in the context of an overall event data management solution and show real-world scalability results.


  1. Joey Echeverria, Platform Technical Lead (@fwiffo). Data Day Seattle 2016. Real-time Search on Terabytes of Data Per Day: Lessons Learned
  2. Joey
      • Where I work: Rocana – Platform Technical Lead
      • Where I used to work: Cloudera (’11-’15), NSA
      • Distributed systems, security, data processing, big data
  3. (image-only slide, no text)
  4. Context
      • We built a system for large-scale, real-time collection, processing, and analysis of event-oriented machine data
      • On-prem or in the cloud, but not SaaS
      • Supportability is a big deal for us
        • Predictability of performance under load and failures
        • Ease of configuration and operation
        • Behavior in wacky environments
      • All of our decisions are informed by this - YMMV
  5. What I mean by “scale”
      • Typical: 10s of TB of new data per day
      • Average event size ~200-500 bytes
      • 20TB per day
        • @200 bytes = 1.2M events / second, ~109.9B events / day, 40.1T events / year
        • @500 bytes = 509K events / second, ~43.9B events / day, 16T events / year
      • Retaining years of data online for query
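The per-day and per-second figures above line up if “20TB” is read as 20 TiB (1024-based units); a quick back-of-the-envelope check:

    // Sanity check of the ingest-rate figures, assuming 20 TiB of new data per day.
    public class IngestRateCheck {
        public static void main(String[] args) {
            double bytesPerDay = 20.0 * 1024 * 1024 * 1024 * 1024; // 20 TiB
            long secondsPerDay = 24 * 60 * 60;
            for (int eventSize : new int[] {200, 500}) {
                double eventsPerDay = bytesPerDay / eventSize;
                System.out.printf("@%d bytes: %,.0f events/s, %.1fB events/day, %.1fT events/year%n",
                        eventSize,
                        eventsPerDay / secondsPerDay,
                        eventsPerDay / 1e9,
                        eventsPerDay * 365 / 1e12);
            }
        }
    }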
  6. General purpose search – the good parts
      • We originally built against SolrCloud (but most of this goes for Elasticsearch too)
      • Amazing feature set for general purpose search
      • Good support for moderate scale
      • Excellent at:
        • Content search – news sites, document repositories
        • Finite-size datasets – product catalogs, job postings, things you prune
        • Low(er) cardinality datasets that (mostly) fit in memory
  7. Problems with general purpose search systems
      • Fixed shard allocation models – always N partitions
      • Multi-level and semantic partitioning is painful without building your own macro query planner
      • All shards open all the time; poor resource control for high retention
      • APIs are record-at-a-time focused for NRT indexing; poor ingest performance (aka: please stop making everything REST!)
      • Ingest concurrency is wonky
      • High write amplification on data we know won’t change
      • Other smaller stuff…
  8. “Well actually…”
      Plenty of ways to push general purpose systems (we tried many of them):
      • Using multiple collections as partitions, macro query planning
      • Running multiple JVMs per node for better utilization
      • Pushing historical searches into another system
      • Building weirdo caches of things
      At some point the cost of hacking outweighed the cost of building
  9. Warning!
      • This is not a condemnation of general purpose search systems!
      • Unless the sky is falling, use one of those systems
  10. We built a thing: Rocana Search
      A high-cardinality, low-latency, parallel search system for time-oriented events
  11. Key Goals for Rocana Search
      • Higher indexing throughput per node than Solr for time-oriented event data
      • Scale horizontally better than Solr
      • Support an arbitrary number of dynamically created partitions
      • Arbitrarily large amounts of indexed data on disk
        • All data queryable without wasting resources for infrequently used data
      • Ability to add/remove search nodes dynamically without any manual restarts or rebalances
  12. Some Key Features of Rocana Search
      • Fully parallelized ingest and query, built for large clusters
      • Every node is an indexer
      (diagram: four Hadoop nodes, each running Rocana Search, all consuming from Kafka)
  13. Some Key Features of Rocana Search
      • Every node is a query coordinator and executor
      (diagram: a query client fanning out to four Rocana Search nodes, each containing a coordinator and an executor)
  14. Architecture (a single node)
      (diagram: a single Rocana Search node with metadata, index management, a coordinator, an executor, and Lucene indexes; connected to HDFS, ZooKeeper, Kafka fed by data producers, and a query client)
  15. Sharding model: datasets, partitions, and slices
      • A search dataset is split into partitions by a partition strategy
        • Think: “By year, month, day”
      • Partitioning is invisible to queries (e.g. `time:[x TO y] AND host:z` works normally)
      • Partitions are divided into slices to support lock-free parallel writes
        • Think: “This day has 20 slices, each of which is independent for write”
      • Number of slices == number of Kafka partitions
  16. Datasets, partitions, and slices
      (diagram: dataset “events” containing partitions “2016/01/01” and “2016/01/02”, each divided into slices 0, 1, 2, … N)
  17. From events to partitions to slices
      (diagram: Kafka topic “events” with partitions KP 0 and KP 1 feeding Rocana Search; events 1 and 2, timestamped 2016/01/01, land in slices of partition 2016/01/01, while event 3, timestamped 2016/01/02, lands in partition 2016/01/02)
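A minimal sketch of how that event-to-partition-to-slice routing could look. The PartitionStrategy interface, DayPartitionStrategy class, and SliceKey type here are hypothetical illustrations, not Rocana's actual API:

    import java.time.Instant;
    import java.time.ZoneOffset;
    import java.time.format.DateTimeFormatter;

    // Hypothetical: maps an event timestamp to a partition id such as "2016/01/01".
    interface PartitionStrategy {
        String partitionFor(long eventTimeMillis);
    }

    // The "by year, month, day" strategy from the slide.
    class DayPartitionStrategy implements PartitionStrategy {
        private static final DateTimeFormatter DAY =
                DateTimeFormatter.ofPattern("yyyy/MM/dd").withZone(ZoneOffset.UTC);

        @Override
        public String partitionFor(long eventTimeMillis) {
            return DAY.format(Instant.ofEpochMilli(eventTimeMillis));
        }
    }

    // A slice is a partition plus the Kafka partition the event arrived on, so
    // writers for different slices never contend with each other.
    record SliceKey(String partition, int kafkaPartition) {}

Under this sketch, an event stamped 2016-01-01 that arrives on Kafka partition 0 lands in slice ("2016/01/01", 0), matching the diagram above.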
  18. Assigning slices to nodes
      (diagram: Kafka topic “events” with partitions KP 0-3; Node 1 consumes KP 0 and KP 2 and owns slices 0 and 2 of partitions 2016/01/01 through 2016/01/04; Node 2 consumes KP 1 and KP 3 and owns slices 1 and 3)
  19. The write path
      • One of the search nodes is the exclusive owner of KP 0 and KP 1
      • Consume a batch of events
      • Use the partition strategy to figure out which RS partition each event belongs to
      • Kafka messages carry the partition, so we know the slice
      • Each event is written to the proper partition/slice
      • Eventually the indexes are committed
      • If the partition or slice is new, the metadata service is informed
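A condensed sketch of that write path, reusing the hypothetical PartitionStrategy and SliceKey types from above and assuming one Lucene IndexWriter per owned slice; batching, commit scheduling, error handling, and the metadata service call are elided:

    import java.time.Duration;
    import java.util.Map;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;

    class SliceIndexer {
        private final KafkaConsumer<byte[], byte[]> consumer;  // assigned only the Kafka partitions this node owns
        private final PartitionStrategy strategy;
        private final Map<SliceKey, IndexWriter> sliceWriters; // one Lucene writer per owned slice

        SliceIndexer(KafkaConsumer<byte[], byte[]> consumer, PartitionStrategy strategy,
                     Map<SliceKey, IndexWriter> sliceWriters) {
            this.consumer = consumer;
            this.strategy = strategy;
            this.sliceWriters = sliceWriters;
        }

        void run() throws Exception {
            while (true) {
                // Consume a batch of events from this node's Kafka partitions.
                ConsumerRecords<byte[], byte[]> batch = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<byte[], byte[]> record : batch) {
                    long eventTime = extractEventTime(record.value());   // hypothetical event decoder
                    String partition = strategy.partitionFor(eventTime); // e.g. "2016/01/01"
                    SliceKey slice = new SliceKey(partition, record.partition());
                    sliceWriters.get(slice).addDocument(toLuceneDocument(record.value()));
                }
                // Periodically: commit the Lucene indexes and tell the metadata
                // service about any newly created partitions or slices (elided).
            }
        }

        // Placeholders for the event schema handling, which the deck doesn't describe.
        private long extractEventTime(byte[] event) { throw new UnsupportedOperationException(); }
        private Document toLuceneDocument(byte[] event) { throw new UnsupportedOperationException(); }
    }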
  20. Query
      • Queries are submitted to a coordinator via RPC
      • The coordinator parses the query and aggressively prunes the partitions to search by analyzing predicates
      • The coordinator schedules and monitors fragments, merges results, and responds to the client
      • Fragments are submitted to executors for processing
      • Executors search exactly what they’re told and stream results to the coordinator
      • A fragment is generated for every slice that may contain data
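A rough sketch of the partition-pruning step for a time-range predicate, again using the hypothetical types above; the real coordinator presumably handles arbitrary predicates, scheduling, retries, and result merging:

    import java.util.ArrayList;
    import java.util.List;

    // A fragment is one unit of work: search a single slice for a given query.
    record Fragment(SliceKey slice, String query) {}

    class FragmentPlanner {
        // Keep only the slices whose partition (day) overlaps the query's time range.
        List<Fragment> plan(String query, long startMillis, long endMillis,
                            PartitionStrategy strategy, List<SliceKey> allSlices) {
            String first = strategy.partitionFor(startMillis); // e.g. "2016/01/01"
            String last = strategy.partitionFor(endMillis);    // e.g. "2016/01/02"
            List<Fragment> fragments = new ArrayList<>();
            for (SliceKey slice : allSlices) {
                // "yyyy/MM/dd" partition ids sort lexicographically in time order,
                // so a string comparison is enough for this pruning test.
                if (slice.partition().compareTo(first) >= 0
                        && slice.partition().compareTo(last) <= 0) {
                    fragments.add(new Fragment(slice, query));
                }
            }
            return fragments; // handed to executors; results stream back to the coordinator
        }
    }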
  21. Some benefits of the design
      • Search processes are on the same nodes as the HDFS DataNode
      • The first replica of any event received by search from Kafka is written locally
      • Unless nodes fail, all reads are local (HDFS short-circuit reads)
        • The Linux kernel page cache is useful here
        • HDFS caching could also be used (not yet doing this)
      • Search uses an off-heap block cache as well
      • In case of failure, any search node can read any index
      • HDFS overhead winds up being very little, and we still get the advantages
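Short-circuit reads are an HDFS client/DataNode feature that has to be switched on explicitly. A minimal sketch of the relevant settings using the standard Hadoop property names (the socket path is illustrative, and the rest of the cluster configuration is omitted):

    import org.apache.hadoop.conf.Configuration;

    // Enable HDFS short-circuit local reads so executors read Lucene index blocks
    // straight from local disk instead of going through the DataNode transfer path.
    Configuration conf = new Configuration();
    conf.setBoolean("dfs.client.read.shortcircuit", true);
    // Unix domain socket shared by the DFS client and the DataNode.
    conf.set("dfs.domain.socket.path", "/var/run/hdfs-sockets/dn");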
  22. Benchmarks
  23. Disclaimer
      • Yes, we are a vendor
      • No, you shouldn’t take our word
      • Ask me about POCs
      • Yes, I’ll show you results anyway
  24. Event ingest and indexing
      • Most recent data we have for Rocana Search vs. Solr
      • G1GC garbage collector
      • CDH 5.5.1
      • AWS (d2.2xlarge) – 8 CPUs, 60 GB RAM
      • 4 data nodes, 8 Kafka partitions
      • 32 Solr shards
      • 4 day run, ~1.4 TiB indexed on disk
  25. Event ingest and indexing
      • Day 1 (325 byte events): Solr 12,500 eps (3.9 MiB/s) vs. Rocana Search 46,800 eps (14.5 MiB/s); RS 3.7x faster
      • Day 2 (685 byte events): Solr 9,000 eps (5.9 MiB/s) vs. Rocana Search 43,800 eps (28.6 MiB/s); RS 4.9x faster
      • Day 3 (2.2 KiB events): Solr 2,500 eps (5.4 MiB/s) vs. Rocana Search 15,000 eps (32 MiB/s); RS 6.0x faster
      • Day 4 (685 byte events): Solr 5,500 eps (3.6 MiB/s) vs. Rocana Search 43,800 eps (28.6 MiB/s); RS 8.0x faster
  26. Query
      • 6 hour query, facet by 4 fields (host, service, location, event_type_id)
      • 2 scenarios:
        • Query under no active ingest
        • Query while ingesting events for the time period being queried
      • Queries repeated ~105 times against each system
        • Takes advantage of the OS page cache and the Solr/RS block cache
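For reference, the shape of that benchmark query on the Solr side could look roughly like this with SolrJ; the facet fields come from the slide, while the time bounds and everything else are illustrative rather than the actual benchmark harness:

    import org.apache.solr.client.solrj.SolrQuery;

    // A 6-hour time-range query, faceting on the four fields named above.
    SolrQuery query = new SolrQuery("*:*");
    query.addFilterQuery("time:[2016-01-01T00:00:00Z TO 2016-01-01T06:00:00Z]");
    query.setFacet(true);
    query.addFacetField("host", "service", "location", "event_type_id");
    query.setRows(0); // facet counts only, no documents returned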
  27. Query (No Ingest)
  28. Query (No Ingest)
      6 hour query, 325 byte events, no active ingest:
      • 50th percentile: Solr 1.8 s vs. Rocana Search 2.95 s; Solr 1.6x faster
      • 90th percentile: Solr 1.9 s vs. Rocana Search 3.12 s; Solr 1.6x faster
      • 95th percentile: Solr 2.1 s vs. Rocana Search 3.25 s; Solr 1.5x faster
  29. Query (Simultaneous Ingest)
  30. Query (Simultaneous Ingest)
      6 hour query, 325 byte events, simultaneous ingest:
      • 50th percentile: Solr 5.7 s vs. Rocana Search 10.0 s; Solr 1.75x faster
      • 75th percentile: Solr 9.8 s vs. Rocana Search 11.5 s; Solr 1.17x faster
      • 90th percentile: Solr 24.2 s vs. Rocana Search 12.2 s; RS 2x faster
      • 95th percentile: Solr 34.3 s vs. Rocana Search 13.0 s; RS 2.6x faster
  31. What we’ve really shown
      In the context of search, scale means:
      • High cardinality: billions of unique events per day
      • High-speed ingest: hundreds of thousands of events per second
      • Not having to age data out of the dataset
      • Handling large, concurrent queries while ingesting data
      • Fully utilizing modern hardware
      These things are very possible
  32. Thank you! Questions?
      joey@rocana.com – @fwiffo
      The Rocana Search Team:
      • Michael Peterson - @quux00
      • Mark Tozzi - @not_napoleon
      • Brad Cupit - @bradcupit
      • Brett Hoerner - @bretthoerner
      • Joey Echeverria - @fwiffo
      • Eric Sammer - @esammer
      • Marvin Anderson
