Integrate Solr with real-time stream processing applications

Presented by Timothy Potter, Founder, Text Centrix

Storm is a real-time distributed computation system used to process massive streams of data. Many organizations are turning to technologies like Storm to complement batch-oriented big data technologies, such as Hadoop, to deliver time-sensitive analytics at scale. This talk introduces an emerging architectural pattern of integrating Solr and Storm to process big data in real time. There are a number of natural integration points between Solr and Storm, such as populating a Solr index or supplying data to Storm using Solr’s real-time get support. In this session, Timothy will cover the basic concepts of Storm, such as spouts and bolts. He’ll then provide examples of how to integrate Solr into Storm to perform large-scale indexing in near real-time. In addition, he’ll show how to embed Solr in a Storm bolt to match incoming tuples against pre-configured queries, a technique commonly known as a percolator. Attendees will come away from this presentation with a good introduction to stream processing technologies and several real-world use cases of how to integrate Solr with Storm.

  • 2. whoami
      • independent consultant: search / big data projects
      • soon to be joining engineering team @LucidWorks
      • co-author of Solr In Action
      • previously big data architect at Dachis Group
  • 3. my storm story re-designed a complex batch-oriented indexing pipeline based on Hadoop (Oozie, Pig, Hive, Sqoop) to real-time storm topology
  • 4. agenda
      • walk through how to develop a storm topology
      • common integration points with Solr (near real-time indexing, percolator, real-time get)
  • 5. example listen to click events from URL shortener ( to determine trending US government sites stream of click events: ->
  • 6. beyond word count
      • tackle real challenges you’ll encounter when developing a storm topology
      • and what about ... unit testing, dependency injection, measuring runtime behavior of your components, separation of concerns, reducing boilerplate, hiding complexity ...
  • 7. storm
      • open source distributed computation system
      • scalability, fault-tolerance, guaranteed message processing (optional)
  • 8. storm primitives
      • tuple: ordered list of values
      • stream: unbounded sequence of tuples
      • spout: emits a stream of tuples (source)
      • bolt: performs some operation on each tuple
      • topology: DAG of spouts and bolts
  • 9. solution requirements
      • receive click events from stream
      • count frequency of pages in a time window
      • rank top N sites per time window
      • extract title, body text, image for each link
      • persist rankings and metadata for visualization
  • 10. trending snapshot (sept 12, 2013)
  • 11. [topology diagram] components: Spout, EnrichLink Bolt (calls out to API), Solr Indexing Bolt, Rolling Count Bolt, Intermediate Rankings Bolt, Total Rankings Bolt, Persist Rankings Bolt; data stores: Solr, Metrics DB; stream groupings shown: field grouping (hash), field grouping (obj), global grouping; legend: spout, bolt, stream, grouping, data store, provided in the storm-starter project
  • 12. stream grouping
      • shuffle: random distribution of tuples to all instances of a bolt
      • field(s): group tuples by one or more fields in common
      • global: reduce down to one
      • all: replicate stream to all instances of a bolt
      source:
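To make the four groupings concrete, here is a minimal stdlib-only Java sketch that models each grouping as a function from a tuple to the indexes of the bolt instances that receive it. It is not Storm's code: the class `GroupingSketch` and its method names are hypothetical, and the hash-modulo rule used for field grouping is an assumption about the general idea rather than a statement about Storm internals.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical model of stream groupings; NOT Storm's implementation.
class GroupingSketch {

    // shuffle: any one instance, chosen at random
    static List<Integer> shuffle(int numTasks, Random rnd) {
        return Collections.singletonList(rnd.nextInt(numTasks));
    }

    // field(s): equal grouping values always land on the same instance
    static List<Integer> fields(int numTasks, Object... groupingValues) {
        int hash = Arrays.hashCode(groupingValues);
        // floorMod keeps the index non-negative even for negative hashes
        return Collections.singletonList(Math.floorMod(hash, numTasks));
    }

    // global: the whole stream is reduced down to one instance (modeled as index 0)
    static List<Integer> global(int numTasks) {
        return Collections.singletonList(0);
    }

    // all: every instance receives a copy of the tuple
    static List<Integer> all(int numTasks) {
        List<Integer> targets = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            targets.add(i);
        }
        return targets;
    }
}
```

The property that matters for the rest of the talk is the one `fields` shows: because the same hash value always maps to the same instance, per-key state such as counters or caches can live inside a single bolt instance.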
  • 13. useful storm concepts
      • bolts can receive input from many spouts
      • tuples in a stream can be grouped together
      • streams can be split and joined
      • bolts can inject new tuples into the stream
      • components can be distributed across a cluster at a configurable parallelism level
      • optionally, storm keeps track of each tuple emitted by a spout (ack or fail)
  • 14. tools
      • Spring framework – dependency injection, configuration, unit testing, mature, etc.
      • Groovy – keeps your code tidy and elegant
      • Mockito – ignore stuff your test doesn’t care about
      • Netty – fast & powerful NIO networking library
      • Coda Hale metrics – get visibility into how your bolts and spouts are performing (at a very low level)
  • 15. spout easy! just produce a stream of tuples ... and ... avoid blocking when waiting for more data, ease off throttle if topology is not processing fast enough, deal with failed tuples, choose if it should use message Ids for each tuple emitted, data model / schema, etc ...
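Two of the concerns listed here, not blocking while waiting for data and dealing with failed tuples, come down to bookkeeping around message ids. The following stdlib-only Java sketch models that bookkeeping under stated assumptions: `ReplayBufferSketch` and all of its methods are hypothetical names, and the "replay on fail" policy is one possible choice, not necessarily what the talk's SpringSpout does.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical bookkeeping for a reliable spout; models the ack/fail contract without Storm.
class ReplayBufferSketch {
    private final Queue<String> ready = new ArrayDeque<>();      // messages waiting to be emitted
    private final Map<Long, String> pending = new HashMap<>();   // emitted but not yet acked
    private long nextId = 0;

    // Called by the feed handler when a new message arrives.
    void offer(String msg) {
        ready.add(msg);
    }

    // Non-blocking: returns the id of the emitted message, or -1 if nothing is ready.
    long emitNext() {
        String msg = ready.poll();
        if (msg == null) {
            return -1;
        }
        long id = nextId++;
        pending.put(id, msg);
        return id;
    }

    // The topology fully processed the tuple: forget it.
    void ack(long id) {
        pending.remove(id);
    }

    // The tuple failed or timed out: queue the message for another attempt.
    void fail(long id) {
        String msg = pending.remove(id);
        if (msg != null) {
            ready.add(msg);
        }
    }

    int pendingCount() { return pending.size(); }
    int readyCount() { return ready.size(); }
}
```

A spout that emits without message ids can skip the `pending` map entirely, at the cost of losing tuples that fail downstream.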
  • 16. Hide complexity of implementing Storm contract
      [diagram: SpringSpout delegates to a StreamingDataProvider (POJO) and SpringBolt delegates to a StreamingDataAction (POJO); both POJOs are wired by Spring Dependency Injection inside a Spring container (1 per topology per JVM) and can use resources such as JDBC or a WebService]
      • developer focuses on business logic
  • 17. streaming data provider
      class OneUsaGovStreamingDataProvider implements StreamingDataProvider, MessageHandler {

          // Spring Dependency Injection
          MessageStream messageStream

          ...

          void open(Map stormConf) { messageStream.receive(this) }

          boolean next(NamedValues nv) {
              // non-blocking call to get the next message
              String msg = queue.poll()
              if (msg) {
                  // use Jackson JSON parser to create an object from the raw incoming data
                  OneUsaGovRequest req = objectMapper.readValue(msg, OneUsaGovRequest)
                  if (req != null && req.globalBitlyHash != null) {
                      nv.set(OneUsaGovTopology.GLOBAL_BITLY_HASH, req.globalBitlyHash)
                      nv.set(OneUsaGovTopology.JSON_PAYLOAD, req)
                      return true
                  }
              }
              return false
          }

          void handleMessage(String msg) { queue.offer(msg) }
      }
  • 18. jackson json to java
      @JsonIgnoreProperties(ignoreUnknown = true)
      class OneUsaGovRequest implements Serializable {

          @JsonProperty("a")
          String userAgent;

          @JsonProperty("c")
          String countryCode;

          @JsonProperty("nk")
          int knownUser;

          @JsonProperty("g")
          String globalBitlyHash;

          @JsonProperty("h")
          String encodingUserBitlyHash;

          @JsonProperty("l")
          String encodingUserLogin;

          ...
      }

      Spring converts json to java object for you:

      <bean id="restTemplate"
            class="org.springframework.web.client.RestTemplate">
        <property name="messageConverters">
          <list>
            <bean id="messageConverter"
                  class="...json.MappingJackson2HttpMessageConverter">
            </bean>
          </list>
        </property>
      </bean>
  • 19. spout data provider spring-managed bean
      <bean id="oneUsaGovStreamingDataProvider"
            class="com.bigdatajumpstart.storm.OneUsaGovStreamingDataProvider">
        <property name="messageStream">
          <bean class="com.bigdatajumpstart.netty.HttpClient">
            <constructor-arg index="0" value="${streamUrl}"/>
          </bean>
        </property>
      </bean>

      Note: when building the StormTopology to submit to Storm, you do:

      builder.setSpout("-spout",
          new SpringSpout("oneUsaGovStreamingDataProvider", spoutFields), 1)
  • 20. spout data provider unit test
      class OneUsaGovStreamingDataProviderTest extends StreamingDataProviderTestBase {

          @Test
          void testDataProvider() {

              // mock json to simulate data from feed
              String jsonStr = '''{
                  "a": "user-agent", "c": "US",
                  "nk": 0, "tz": "America/Los_Angeles",
                  "gr": "OR", "g": "2BktiW",
                  "h": "12Me4B2", "l": "usairforce",
                  "al": "en-us", "hh": "",
                  "r": "",
                  ...
              }'''

              OneUsaGovStreamingDataProvider dataProvider = new OneUsaGovStreamingDataProvider()
              // use Mockito to satisfy dependencies not needed for this test
              dataProvider.setMessageStream(mock(MessageStream))
              // Config setup in base class
              dataProvider.handleMessage(jsonStr)

              NamedValues record = new NamedValues(OneUsaGovTopology.spoutFields)
              // asserts to verify data provider works correctly
              assertTrue
              ...
          }
      }
  • 21. rolling count bolt
      • counts frequency of links in a sliding time window
      • emits topN in current window every M seconds
      • uses tick tuple trick provided by Storm to emit every M seconds (configurable)
      • provided with the storm-starter project
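The sliding-window counting described above can be sketched with a fixed number of slots per key: increments go into the current slot, and each tick reports the sum over all slots and then expires the oldest one. The Java below is a simplified stand-in written for illustration, not the storm-starter code; the class name `SlidingWindowCounterSketch` and its methods are hypothetical. The window length is assumed to be the number of slots times the tick interval.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical slot-based sliding window counter; a simplified stand-in, not storm-starter's code.
class SlidingWindowCounterSketch {
    private final int numSlots;
    private final Map<String, long[]> slotsByKey = new HashMap<>();
    private int head = 0; // slot currently receiving increments

    SlidingWindowCounterSketch(int numSlots) {
        if (numSlots < 2) {
            throw new IllegalArgumentException("need at least 2 slots");
        }
        this.numSlots = numSlots;
    }

    // Called for every click tuple.
    void increment(String key) {
        long[] slots = slotsByKey.get(key);
        if (slots == null) {
            slots = new long[numSlots];
            slotsByKey.put(key, slots);
        }
        slots[head]++;
    }

    // Called on every tick: return per-key totals for the window, then slide it by one slot.
    Map<String, Long> countsThenAdvance() {
        Map<String, Long> totals = new HashMap<>();
        for (Map.Entry<String, long[]> e : slotsByKey.entrySet()) {
            long sum = 0;
            for (long c : e.getValue()) {
                sum += c;
            }
            totals.put(e.getKey(), sum);
        }
        head = (head + 1) % numSlots;
        Iterator<Map.Entry<String, long[]>> it = slotsByKey.entrySet().iterator();
        while (it.hasNext()) {
            long[] slots = it.next().getValue();
            slots[head] = 0; // expire the oldest slot so it can be reused
            boolean empty = true;
            for (long c : slots) {
                if (c != 0) {
                    empty = false;
                    break;
                }
            }
            if (empty) {
                it.remove(); // drop keys with no counts left in the window
            }
        }
        return totals;
    }
}
```

Ranking the top N is then a matter of sorting the returned totals, which in the deck's topology is handled by the separate rankings bolts.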
  • 22. enrich link metadata bolt
      • calls out to API
      • caches results locally in the bolt instance
      • relies on field grouping (incoming tuples)
      • outputs data to be indexed in Solr
      • benefits from parallelism to enrich more links concurrently (watch those rate limits)
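Local caching works here because field grouping sends every tuple with the same hash to the same bolt instance, so a repeated link can be answered without another API call. A minimal Java sketch of such a per-instance cache follows. All names (`LinkMetadataCacheSketch`, `metadataFor`, `remoteLookup`) are hypothetical, the metadata is simplified to a string, and the LRU eviction policy is an assumption; the talk does not say how its cache is bounded. The sketch also assumes single-threaded access to each bolt instance.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical per-bolt-instance cache; names are illustrative, not from the talk's code.
class LinkMetadataCacheSketch {
    private final Map<String, String> cache;
    private final Function<String, String> remoteLookup; // stands in for the rate-limited API call
    int remoteCalls = 0;

    LinkMetadataCacheSketch(final int maxEntries, Function<String, String> remoteLookup) {
        this.remoteLookup = remoteLookup;
        // An access-ordered LinkedHashMap evicts the least recently used entry when full.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // Returns cached metadata when available, otherwise calls the remote service once.
    String metadataFor(String linkHash) {
        String cached = cache.get(linkHash);
        if (cached != null) {
            return cached;
        }
        remoteCalls++;
        String fetched = remoteLookup.apply(linkHash);
        if (fetched != null) {
            cache.put(linkHash, fetched);
        }
        return fetched;
    }
}
```

With shuffle grouping instead of field grouping, the same link could reach any instance and each one would pay for its own API call, which matters when the service enforces rate limits.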
  • 23. service
      class EmbedlyService {

          @Autowired
          RestTemplate restTemplate

          String apiKey

          // integrate coda hale metrics
          private Timer apiTimer = MetricsSupport.timer(EmbedlyService, "apiCall")

          Embedly getLinkMetadata(String link) {
              String urlEncoded = URLEncoder.encode(link, "UTF-8")
              URI uri = new URI("${apiKey}&url=${urlEncoded}")

              Embedly embedly = null
              // simple closure to time our requests to the Web service
              MetricsSupport.withTimer(apiTimer, {
                  embedly = restTemplate.getForObject(uri, Embedly)
              })
              return embedly
          }
      }
  • 24. metrics
      • capture runtime behavior of the components in your topology
      • Coda Hale metrics -
      • output metrics every N minutes
      • report metrics to JMX, ganglia, graphite, etc
  • 25. [sample metrics output]
      -- Meters ----------------------------------------------------------------------
      EnrichLinkBoltLogic.solrQueries
                   count = 97
               mean rate = 0.81 events/second
           1-minute rate = 0.89 events/second
           5-minute rate = 1.62 events/second
          15-minute rate = 1.86 events/second
      SolrBoltLogic.linksIndexed
                   count = 60
               mean rate = 0.50 events/second
           1-minute rate = 0.41 events/second
           5-minute rate = 0.16 events/second
          15-minute rate = 0.06 events/second

      -- Timers ----------------------------------------------------------------------
      EmbedlyService.apiCall
                   count = 60
               mean rate = 0.50 calls/second
           1-minute rate = 0.40 calls/second
           5-minute rate = 0.16 calls/second
          15-minute rate = 0.06 calls/second
                     min = 138.70 milliseconds
                     max = 7642.92 milliseconds
                    mean = 1148.29 milliseconds
                  stddev = 1281.40 milliseconds
                  median = 652.83 milliseconds
                    75% <= 1620.96 milliseconds
                    ...
  • 26. storm cluster concepts
      • nimbus: master node (~job tracker in Hadoop)
      • zookeeper: cluster management / coordination
      • supervisor: one per node in the cluster to manage worker processes
      • worker: one or more per supervisor (JVM process)
      • executor: thread in worker
      • task: work performed by a spout or bolt
  • 27. [cluster diagram] Topology JAR, Nimbus, Zookeeper; Node 1 ... M nodes, each with a Supervisor (1 per node) and Worker 1 (port 6701) ... N workers; each worker is a JVM process running executors (threads)
      • Each component (spout or bolt) is distributed across a cluster of workers based on a configurable parallelism
  • 28. [building the topology]
      @Override
      StormTopology build(StreamingApp app) throws Exception {
          ...
          TopologyBuilder builder = new TopologyBuilder()

          // the last argument is a parallelism hint to the framework (can be rebalanced)
          builder.setSpout("-spout",
              new SpringSpout("oneUsaGovStreamingDataProvider", spoutFields), 1)

          builder.setBolt("enrich-link-bolt",
              new SpringBolt("enrichLinkAction", enrichedLinkFields), 3)
                  .fieldsGrouping("-spout", globalBitlyHashGrouping)
          ...
  • 29. solr integration points
      • real-time get
      • near real-time indexing (NRT)
      • percolate (match incoming docs to pre-existing queries)
  • 30. real-time get
      use Solr for fast lookups by document ID

      class SolrClient {

          @Autowired
          SolrServer solrServer

          SolrDocument get(String docId, String... fields) {
              SolrQuery q = new SolrQuery()
              // send the request to the “get” request handler
              q.setRequestHandler("/get")
              q.set("id", docId)
              q.setFields(fields)
              QueryRequest req = new QueryRequest(q)
              req.setResponseParser(new BinaryResponseParser())
              QueryResponse rsp = req.process(solrServer)
              return (SolrDocument)rsp.getResponse().get("doc")
          }
      }
  • 31. near real-time indexing
      • If possible, use CloudSolrServer to route documents directly to the correct shard leaders (SOLR-4816)
      • Use <openSearcher>false</openSearcher> for auto “hard” commits
      • Use auto soft commits as needed
      • Use parallelism of Storm bolt to distribute indexing work to N nodes
  • 32. percolate
      • match incoming documents to pre-configured queries (inverted search)
          – example: Is this tweet related to campaign Y for brand X?
      • use storm’s distributed computation support to evaluate M pre-configured queries per doc
  • 33. two possible approaches
      • Lucene-only solution using MemoryIndex
          – See presentation by Charlie Hull and Alan Woodward
      • EmbeddedSolrServer
          – Full solrconfig.xml / schema.xml
          – RAMDirectory
          – Relies on Storm to scale up documents / second
          – Easy solution for up to a few thousand queries
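Neither approach is reproduced here, but the inverted-search idea itself can be shown without Lucene or Solr. In the toy Java sketch below, each registered query is just a set of terms that must all appear in the incoming document; a real percolator built on MemoryIndex or EmbeddedSolrServer would support full query syntax and analysis. The class `PercolatorSketch`, its methods, and the sample query id are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy percolator: each registered query is a set of terms that must ALL appear in the document.
// A stand-in for MemoryIndex or EmbeddedSolrServer, which handle real query syntax.
class PercolatorSketch {
    private final Map<String, Set<String>> queries = new LinkedHashMap<>();

    // Registers (or replaces) a pre-configured query under the given id.
    void register(String queryId, String... requiredTerms) {
        Set<String> terms = new HashSet<>();
        for (String t : requiredTerms) {
            terms.add(t.toLowerCase(Locale.ROOT));
        }
        queries.put(queryId, terms);
    }

    // Returns the ids of all registered queries matched by the document text.
    List<String> percolate(String docText) {
        // Crude tokenization: lowercase, split on non-word characters.
        Set<String> docTerms = new HashSet<>(
            Arrays.asList(docText.toLowerCase(Locale.ROOT).split("\\W+")));
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, Set<String>> q : queries.entrySet()) {
            if (docTerms.containsAll(q.getValue())) {
                matches.add(q.getKey());
            }
        }
        return matches;
    }
}
```

Each incoming document is evaluated against all M queries, so the cost per document grows with M; spreading documents across many bolt instances raises throughput but does not reduce that per-document cost.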
  • 34. [percolator diagram] Twitter Spout feeds PercolatorBolt 1 ... PercolatorBolt N via random stream grouping (could be 100’s of these bolts); each PercolatorBolt holds an EmbeddedSolrServer; pre-configured queries are stored in a database; ZeroMQ pub/sub pushes query changes to the percolator
  • 35. tick tuples
      • send a special kind of tuple to a bolt every N seconds

          if (TupleHelpers.isTickTuple(input)) {
              // do special work
          }

      • used in percolator to delete accumulated documents every minute or so ...
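The pattern can be modeled without Storm on the classpath. In the Java sketch below, `SimpleTuple` and `BufferingBolt` are hypothetical stand-ins for Storm's tuple and bolt types, and `isTickTuple` mirrors what a helper like `TupleHelpers.isTickTuple` is assumed to check: that the tuple came from the system component on the tick stream.

```java
import java.util.ArrayList;
import java.util.List;

// Stdlib-only model of the tick tuple pattern; SimpleTuple and BufferingBolt are hypothetical.
class TickTupleSketch {
    // Ids Storm uses for its system component and tick stream.
    static final String SYSTEM_COMPONENT_ID = "__system";
    static final String SYSTEM_TICK_STREAM_ID = "__tick";

    static class SimpleTuple {
        final String sourceComponent;
        final String streamId;
        final String value;

        SimpleTuple(String sourceComponent, String streamId, String value) {
            this.sourceComponent = sourceComponent;
            this.streamId = streamId;
            this.value = value;
        }
    }

    static boolean isTickTuple(SimpleTuple t) {
        return SYSTEM_COMPONENT_ID.equals(t.sourceComponent)
            && SYSTEM_TICK_STREAM_ID.equals(t.streamId);
    }

    // Accumulates documents and clears them whenever a tick arrives.
    static class BufferingBolt {
        final List<String> buffered = new ArrayList<>();
        int flushes = 0;

        void execute(SimpleTuple input) {
            if (isTickTuple(input)) {
                buffered.clear(); // the "special work": drop accumulated documents
                flushes++;
            } else {
                buffered.add(input.value);
            }
        }
    }
}
```

In a real topology the tick interval is requested per component through the `topology.tick.tuple.freq.secs` setting, and the bolt must take care not to treat tick tuples as data.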
  • 37. Q&A
      • get the code:
      • Manning coupon codes for conference related books: http://