Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Searching Billions of Product Logs in Real Time (Use Case)


Published on

Published in: Technology
  • Be the first to comment

Searching Billions of Product Logs in Real Time (Use Case)

  1. 1. SEARCHING BILLIONSOF PRODUCT LOGS INREALTIMERyanTabora -Think Big AnalyticsNoSQL Search Roadshow - June 6, 2013
  2. 2. WHO AM I?RyanTaboraThink Big Analytics - Senior Data EngineerLover of dachshunds, bass, and zombies
  3. 3. OVERVIEWPrimersWhat are product logs?How do they apply to big data?Real use caseReal issues and designsConclusion
  4. 4. PRODUCT LOGS?• Device data• IT, Energy, Healthcare, Manufacturing,Telecom ...• These devices are pushing data back home (pull works too!)• As more devices are sold/installed, more and more datacomes back to ‘home base’
  5. 5. • RealtimeVisualization• Realtime Response• Ad Hoc Analysis• Full Historical Capture• Blended Data SetsPOWER OF DEVICE DATADEVICEDATA
  6. 6. TRADITIONAL APPROACHES• SQL: PostGres, MySQL, Oracle, Microsoft• SQL provides many of the search features required for typicalsearch applications• Joins, regex, group by, sorting, etc• But the these technologies can only scale so far...
  7. 7. • Hadoop• HBase/Cassandra/Accumulo• Search features are very limited• HBase row scans, primary key index• Cassandra limited secondary indexingNEWTECHNIQUESSTORING DATA
  8. 8. • What is an index?• Lucene• Paralleling Index Creation• MapReduce/Flume/Storm• RealTime Search• Searching before it hits diskNEWTECHNIQUESINDEXING DATA
  9. 9. • Solr/ElasticSearch• Both build on top of Lucene• Search servers• RESTful HTTP APIs• Easy to administer• Add powerful text/numerical search capabilitiesNEWTECHNIQUESSEARCHING DATA
  10. 10. BASIC SEARCH FEATURES• Boolean logic (AND, OR + -)• Sorting and Group By• Range queries• Phrase/Prefix/Fuzzy queries
  11. 11. ADVANCED SEARCHFEATURES• Custom ranking/scoring• More like this• Auto suggest• Faceting/Highlighting• Geo-spacial search
  12. 12. SCALING SEARCH• ElasticSearch and SolrCloud both have distributed featuresbuilt in• Auto-sharding• Replication• Query routing• Transaction log
  13. 13. USE CASEProblemSample SolutionCore Design IssuesOther Solutions
  14. 14. THE PROBLEMHomeBaseNetAppFilerNetAppFilerDeviceClient ANetAppFilerNetAppFilerDeviceClient BNetAppFilerNetAppFilerDeviceClient CLogLogLogLogsRESTAPIFull SQLAccessFlat FileAccessLatestAllApplicationsEngineers& Analysts
  15. 15. SEARCH APPLICATIONFEATURES• Find last three days of raw logs from an entire cluster• Group capacity available grouped by machine serial numberand show the largest capacities first• Search all device header lines for “FAILURE”• View all hard disk objects that have product number 2341AB• Find all motherboards with an associated customer ticket
  16. 16. SAMPLE SOLUTIONLogsIngestionParsing/LoadingCustomRESTfulSearchAPIQueriesIndexingHDFSMapReduce
  17. 17. INGESTION
  18. 18. PARSING, LOADING,ANDINDEXINGLoad HBase with parsed objectsStore HBase ROW_IDStore pointer to raw file in HDFSIndex a number of desired fields/ingestion/sequencefile1/ingestion/sequencefile2/ingestion/sequencefile31534 4562 5323 72324601 5105000 1492 2987 4767 5987
  19. 19. INSIDE OF HBASE...... ............ ......... .........object5…...object4...object3object2...object1rowkey
  20. 20. THE SOLR DOCUMENT2343sfOffset/ingest/file2sequenceFile1333-2241-3411cluster_id42ADFF-BZMM...configs.log...2013/05/12WARNING: DISK DEAD...headercontentsfile_namedate_sentsystem_idrowkeySolr Document
  21. 21. SEARCH APPLICATIONSearch ApplicationQueryData locationsStored ObjectUserQuery Results21345867Raw DataLocationRaw DataRow_id
  22. 22. CORE DESIGN ISSUES• Changing the Solr schema (manual reindex)• Elastic shard scaling (manual reindex)• No distributed joining (denormalizing the data)• Replication*• Manually managing Solr partitioning/sharding*• Write durability*
  23. 23. SOLRCLOUD• Automatic shard creation, routing• Replication• Limited to a fixed number of shards defined on initial creation• ZooKeeper for coordination• Large community
  24. 24. ELASTICSEARCH• Similar feature set to Solr• Purpose built for easily managing a distributed index• Rapidly growing community• Custom built coordination mechanism• JSON based API
  25. 25. • Integrates Cassandra and Solr• Automatic indexing in Solr/storing in Cassandra• Automatic partitioning• Automatic reindexing• Not limited to fixed number of shards• Proprietary and costs moneyDATASTAX ENTERPRISE+
  26. 26. • Collecting and analyzing device data/product logs can be avery difficult challenge• You can use NoSQL and search technologies like Solr orElasticSearch in unison...• ...but it is not always easy to integrate search with NoSQLCONCLUSION
  27. 27. QUESTIONS?• Feel free to reach out if you have any questions or need helpwith big data/search!••• @ryantabora•
  28. 28. BONUS SLIDES
  29. 29. HBASE AND SOLR• Automatic partitioning/reindexing• Automatic index updates on HBase inserts/deletes• Mapping HBase cells to a Solr schema• No perfect commercial/open source solution yet• Many many many more...
  30. 30. HBASE + SOLRAUTOMATIC INDEXING• HBase coprocessors are like storedprocs/triggers• New, powerful, and dangerous• Triggers on HBase puts/deletes• Mapping data to a schema?
  31. 31. HBASE + SOLRWRITE DURABILITYSolrShard 1SolrShard 3SolrShard 2HBase Table - SOLR_QUEUEMapReduce Indexing ApplicationSolr QueueReaderCreate SolrDocument from raw ASUP1Get oldest SolrDocument from HBaseQueue Table3Use custom hash algorithm to determinewhich shard to add SolrDocument to4Query Solr, if SolrDocumentwas added, then remove itfrom the SOLR_QUEUE5SolrDocumentSolrDocumentAdd SolrDocument to HBase for durability2
  32. 32. HBASE + SOLRELASTIC SHARDING• HBase’s distributing mechanism uses the concept of regions tosplit data across many nodes• Region splitting can be automatic or manual (performancedegradation as regions split)• Piggybacking Solr sharding on HBase Region splitting