Analytics in olap with lucene & hadoop

1,866 views

Published on

Published in: Economy & Finance, Business
0 Comments
4 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
1,866
On SlideShare
0
From Embeds
0
Number of Embeds
375
Actions
Shares
0
Downloads
42
Comments
0
Likes
4
Embeds 0
No embeds

No notes for slide

Analytics in olap with lucene & hadoop

  1. 1. 1Berlin | 07/02/2012 | zanox | Presentation title
  2. 2. TABLE OF CONTENTS1. ABOUT ZANOX AND ME2. REPORTING IN GENERAL3. ANALYTICS4. QUESTIONS2Berlin | 08/05/2013 | zanox | Zanox Reporting Systems
  3. 3. THE ZANOX NETWORK3Berlin | 08/05/2013 | zanox | Zanox Reporting Systems
  4. 4. WHO AM I?4● Senior Architect at Zanox● 7+ years experience in building distributed search systems based on LuceneLucid Certified Apache Solr/Lucene Developer● 5+ years of using Hadoop and 2+ years of using HBase for data miningCloudera Certified Developer for Apache Hadoop and HBase● I have applied different machine-learning techniques mainly to optimiseresource usage while performing distributed search during my PhD● See my book: “Beyond Centralised Search EnginesAn Agent-Based Filtering Framework”● How can you reach me?● dragan.milosevic@zanox.com● http://www.linkedin.com/in/draganmilosevicBerlin | 08/05/2013 | zanox | Zanox Reporting Systems
  5. 5. 1. ABOUT ZANOX AND ME2. REPORTING IN GENERAL3. ANALYTICS4. QUESTIONS5Berlin | 08/05/2013 | zanox | Zanox Reporting Systems
  6. 6. REPORTING IN GENERAL6TrackingCustom Map Reduce Distributed ReportingMSSSJBossSSSSSSSSSMSSSSSSSSSSSSMSSSSSSSSSSSSCCCCCCCDDLGYDDLFMTTTTTTTTOOODDDDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIDDDDDDDD DDD DD DQQQQQQRRRRRRD DD DDDDDDDDDDDDD DD DD DD DD DD DD DOKOK00:0003:0006:0009:0012:0015:0018:0021:0000:0001:0002:0003:0004:0005:0006:00D DD DD DD DD DD DD DD DD D00:00:00.00000:00:00.00500:00:00.01000:00:00.01500:00:00.02000:00:00.02500:00:00.03000:00:00.03500:00:00.04000:00:00.04500:00:00.05000:00:00.05500:00:00.06000:00:00.06500:00:00.07000:00:00.07500:00:00.08000:00:00.08500:00:00.09000:00:00.09500:00:00.10000:00:00.20000:00:00.30000:00:00.40000:00:00.50000:00:00.60000:00:00.65000:00:00.70000:00:00.75000:00:00.80000:00:00.85000:00:00.90000:00:00.91000:00:00.92000:00:00.93000:00:00.94000:00:00.95000:00:0000:02:0000:04:0000:06:0000:08:0000:10:0000:10:2000:10:4000:11:0000:11:2000:11:4000:12:0000:14:0000:16:0000:18:0000:20:0000:22:0000:24:0000:26:0000:28:00Berlin | 08/05/2013 | zanox | Zanox Reporting SystemsName SearcherRole Knows how to build reports for queriesAble to load assigned indexes from HDFSDesign RPC ServerQuery optimiserLucene searcher with MMap directoriesReport aggregation componentName MergerRole Knows how to build sub-queries for searchersAble to assign indexes to searchersAware of available searchersDesign RPC ServerReport aggregation componentSorting and master-data managementName ObserverRole Checks the health of mergers and searchersAutomatic deployment of clustersRepair unhealthy mergers and searchersProvide information about observed clustersDesign RPC ServerName ClientRole Keep track of available mergersRerouting of failed queriesTake care of load when assigning queriesDesign JBoss applicationRPC client for communication with mergersName Tracking ServerRole Tracks events such as sales, clicks, views …Writes tracked events to HDFS…Design Proprietary applicationName Database serverFTP serverLog serverRole Provides data needed for reporting…Design …Name GoogleYahooMivaRole Provides search engine cost data…Design …Data-Import Data producers are pushing data during a whole dayMR Jobs Reporting jobs are started at midnightIt takes around 6 hours to process data from yesterdayIndex loading normally lasts no more that 30 minutesReporting There are multiple search clusters for redundancyEvery cluster has one merger that coordinates searchersThe health of every cluster is ensured by observerSearchers are grouped based on their capabilitiesWithin 100ms queries are propagated to best searchersCapable to provide reports in sub-second timeANALYTICS
  7. 7. 1. ABOUT ZANOX AND ME2. REPORTING IN GENERAL3. ANALYTICS4. QUESTIONS7Berlin | 08/05/2013 | zanox | Zanox Reporting Systems
  8. 8. Searching2012-09-032012-09-032012-09-032012-09-032012-09-042012-09-042012-09-042012-09-04ANALYTICS8Berlin | 08/05/2013 | zanox | Zanox Reporting SystemsDate PublisherAdvertiser Keyword CommissionJimHUK Auto insurance company 23.56 €TomDEVK Health insurance 9.24 €BobDEVK Health insurance 43.81 €BillHUK Cheap auto insurance 7.22 €AlHUK Auto insurance company 43.07 €MikeHUK Auto insurance company 29.98 €NickDEVK Health insurance 12.03 €PhilDEVK Cheap auto insurance 9.72 €Amount345.06 €81.35 €512.19 €83.76 €892.12 €98.17 €108.19 €83.21 €OptimizationShardingCommission | Amount23.56 | 345.06 €9.24 | 81.35 €43.81 | 512.19 €7.22 | 83.76 €43.07 | 892.12 €29.98 | 98.17 €12.03 | 108.19 €9.72 | 83.21 €
  9. 9. 2012-09-032012-09-032012-09-032012-09-042012-09-042012-09-04ANALYTICS9Berlin | 08/05/2013 | zanox | Zanox Reporting SystemsDate Advertiser Keyword Publisher – Commission | AmountHUK Auto insurance company Jim – 23.56 | 345.06 €DEVK Health insuranceTom – 9.24 | 81.35 €Bob – 43.81 | 512.19 €HUK Cheap auto insurance Bill – 7.22 | 83.76 €HUK Auto insurance companyAl – 43.07 | 892.12 €Mike – 29.98 | 98.17 €DEVK Health insurance Nick – 12.03 | 108.19 €DEVK Cheap auto insurance Phil – 9.72 | 83.21 €Searching OptimizationSharding
  10. 10. SearchingANALYTICS10Berlin | 08/05/2013 | zanox | Zanox Reporting Systems2012-09-032012-09-032012-09-032012-09-042012-09-042012-09-04Date Advertiser Publisher – Commission | AmountHUK Jim – 23.56 | 345.06 €DEVKTom – 9.24 | 81.35 €Bob – 43.81 | 512.19 €HUK Bill – 7.22 | 83.76 €HUKAl – 43.07 | 892.12 €Mike – 29.98 | 98.17 €DEVK Nick – 12.03 | 108.19 €DEVKKeywordAuto insurance companyHealth insuranceCheap auto insuranceAuto insurance companyHealth insuranceCheap auto insurance Phil – 9.72 | 83.21 €QFILTERGROUP BYJimPUBLISHERHUKADVERTISERHUKADVERTISERRVALUES103,83 €COMMISSION1419,11 €AMOUNTSEARCHINGAGGREGATIONLOADINGWhat does eachsearcher whenquery comes?OptimizationShardingCan we furtherimprove loadingof documents?Yes we can!Find out whichdimensions areimportant filtersand use themwhile indexingdocuments
  11. 11. SearchingANALYTICS11Berlin | 08/05/2013 | zanox | Zanox Reporting Systems2012-09-032012-09-032012-09-032012-09-042012-09-042012-09-04Date Advertiser Publisher – Commission | AmountHUK Jim – 23.56 | 345.06 €DEVKTom – 9.24 | 81.35 €Bob – 43.81 | 512.19 €HUK Bill – 7.22 | 83.76 €HUKAl – 43.07 | 892.12 €Mike – 29.98 | 98.17 €DEVK Nick – 12.03 | 108.19 €DEVKKeywordAuto insurance companyHealth insuranceCheap auto insuranceAuto insurance companyHealth insuranceCheap auto insurance Phil – 9.72 | 83.21 €QFILTERGROUP BYJimPUBLISHERHUKADVERTISERHUKADVERTISEROptimizationShardingRVALUES103,83 €COMMISSION1419,11 €AMOUNT
  12. 12. SearchingANALYTICS12Berlin | 08/05/2013 | zanox | Zanox Reporting Systems● There are 9 dimensions in total, where several● Have huge cardinality (>1 million different values) and● At the same time are rarely used as filter, meaning● They are excellent candidates for being put together with measures.● There are more than 50 measures that● Are freely combined in user-defined formulas, but at the same time● Form only few groups based on usage patterns, meaning● They contribute to better compression because fields become bigger.● Ordering in which documents are inserted in index is driven by● Dimensions that are known to be the most important filters, meaning● Faster loading as needed documents are closer to each other.OptimizationSharding
  13. 13. OptimizationSearching ShardingSharding based on hashingANALYTICS13Berlin | 08/05/2013 | zanox | Zanox Reporting SystemsSEARCHINGAGGREGATIONLOADINGSEARCHINGAGGREGATIONLOADINGSEARCHINGAGGREGATIONLOADINGSEARCHINGAGGREGATIONLOADINGSEARCHINGAGGREGATIONLOADINGDistributed ReportingMSSSSSSSSSSSSMSSSSSSSSSSSS OMSSSSSSSSSSSS OORVALUES16,55 €COMMISSION523,21 €AMOUNTRVALUES103,83 €COMMISSION1419,11 €AMOUNTRVALUES78,72 €COMMISSION879,34 €AMOUNTRVALUES134,23 €COMMISSION1622,56 €AMOUNTRVALUES336,16 €COMMISSION4464,09 €AMOUNTR R R RQFILTERGROUP BYJimPUBLISHERHUKADVERTISERHUKADVERTISERQQQQQRVALUES2,83 €COMMISSION19,87 €AMOUNTRSharding based on hashing Query is sent to each and every searcherSearchers work hard to produce reportsHuge amount of data sent over networkGoals for alternative sharding Think very well while distributing dataNetwork is your friend and precious resourceProduce same report with less searchers
  14. 14. Sharding based on selected filtersOptimizationShardingSearchingANALYTICS14Berlin | 08/05/2013 | zanox | Zanox Reporting SystemsSEARCHINGAGGREGATIONLOADINGSEARCHINGAGGREGATIONLOADINGMS S S S S RVALUES336,16 €COMMISSION4464,09 €AMOUNTQFILTERGROUP BYJimPUBLISHERHUKADVERTISERHUKADVERTISERQQVALUES133,33 €COMMISSION2454,22 €AMOUNTRRRVALUES202,83 €COMMISSION2009,87 €AMOUNTR
  15. 15. OptimizationShardingSearchingANALYTICS15Berlin | 08/05/2013 | zanox | Zanox Reporting Systems● Sharding based on hashing randomly distributes data● But that is not needed because no relevance will be computed.● Queries have to be send to each and every searcher, being suboptimalbecause much more data have to be transmitted over the network.● The slowest searcher determine final response time due to completeness.● How we partition based on filter fields?● Order dimensions based on their usage frequency as filter in particular group.● Greater usage frequency means more important for sharding.● Examples of dimensions that are found to be excellent for sharding are:time related dimensions and advertiser name dimension.● Merger uses knowledge about which data each shard has while routing.● Final result is more stable system and much better QPS.
  16. 16. ANALYTICS16Berlin | 08/05/2013 | zanox | Zanox Reporting Systems3rdGroup2ndGroup1stGroupPublisherAdvertiserKeywordPublisherKeywordPublisherAdvertiserDistributed ReportingMSSSSSSSSSSSSMSSSSSSSSSSSS OMSSSSSSSSSSSS OOOptimizationShardingSearching
  17. 17. We have created separatesearcher group for everypossible combination ofdimensions. Therefore weare well prepared for eachquery that can hit us.But we are attacked wherewe haven’t thought that itis possible. One new fieldhave been requested andthat has spoiled everything17Berlin | 07/02/2012 | zanox | Presentation title3rdGroup2ndGroup1stGroupPublisherAdvertiserKeywordPublisherKeywordPublisherAdvertiserM4thGroupPublisher5thGroupAdvertiser6thGroupAdvertiserKeyword7thGroupKeywordQFILTERGROUP BYJimPUBLISHERKeywordHUKADVERTISERInsuranceKEYWORDWebsite8thGroupAdvertiserWebsiteKeyword9thGroupWebsiteAdvertiserPublisher10thGroupWebsite11thGroupWebsiteKeyword12thGroupWebsiteAdvertiser13thGroupWebsitePublisherPublisherThere are somany groupsbut no onehas fieldsWebsitePublisherKeywordCreating a separate searcher group for each and every possiblecombination of dimensions in each query is not feasible becausenumber of searcher-groups grows extremely fast (exponentially)Number of dimensions Number of searcher groups3 23 – 1 = 74 24 – 1 = 15… …9 29 – 1 = 511Solution is to analyze already processed queries to discard lowperforming groups and to create new ones for emerging queriesANALYTICSOptimizationShardingSearching
  18. 18. OptimizationShardingSearching18Berlin | 07/02/2012 | zanox | Presentation title3rdGroup2ndGroup1stGroupPublisherAdvertiserKeywordPublisherKeywordPublisherAdvertiser4thGroupPublisher5thGroupAdvertiser6thGroupAdvertiserKeyword7thGroupKeyword8thGroupAdvertiserWebsiteKeyword9thGroupWebsiteAdvertiserPublisher10thGroupWebsite11thGroupWebsiteKeyword12thGroupWebsiteAdvertiser13thGroupWebsitePublisherqueries451236response45.54 msqueries9response15.54 msqueries221128response82.87 msqueries2876136response23.54 msqueries236response92.33 msqueries2response147.82 msqueries228732response61.03 msqueries89response215.04 msqueries15response105.02 msqueries5644response29.76 msqueries4response563.04 msqueries32response87.03 msqueries543765response35.87 ms14thGroupWebsitePublisherKeywordMQFILTERGROUP BYJimPUBLISHERKeywordHUKADVERTISERInsuranceKEYWORDWebsitePublisherThere are somany groupsbut no onehas fieldsWebsitePublisherKeywordANALYTICS
  19. 19. 19Berlin | 07/02/2012 | zanox | Presentation title3rdGroup9thGroup1stGroupPublisherAdvertiserKeywordWebsitePublisherPublisherAdvertiser4thGroupPublisher11thGroupWebsiteKeywordqueries20236response45.54 msAdvertiserqueries91234response15.54 msqueries221128response82.87 msqueries286136response23.54 msqueries22336response92.33 ms14thGroupWebsitePublisherKeywordqueries32546response727.23 msMQFILTERGROUP BYJimPUBLISHERHUKADVERTISERInsuranceKEYWORDWebsitePublisher15thGroupKeywordPublisherFieldsWebsitePublisherKeywordqueries9874response129.14 msPublisherKeywordqueries22678response989.28 msFields=We optimized systemby adding specializedgroup that is excellentfor found slow queriesThere is nogroup that isoptimizedfor querieshaving fieldsPublisherKeywordANALYTICSOptimizationShardingSearching
  20. 20. ANALYTICS20Berlin | 08/05/2013 | zanox | Zanox Reporting Systems● Hadoop makes possible to easily build new indexes:● Optimize utilization of Hadoop resources because● Better indexes can be always build by low priority jobs.● Built OLAP continuously adapts itself to queries and● There are no more expensive queries in long run.● How can we decide which group is best of query at hand?● That’s query routing, which we solved either by● Asking searchers only to return number of hits andthen to generate report if being selected, or simpler● By using heuristic based on dimension cardinality.● But query routing is not a topic of this talk.OptimizationShardingSearching
  21. 21. 1. ABOUT ZANOX AND ME2. REPORTING IN GENERAL3. QUERY ROUTING4. QUESTIONS21Berlin | 08/05/2013 | zanox | Zanox Reporting Systems
  22. 22. 22Berlin | 08/05/2013 | zanox | Zanox Reporting Systems

×