Using lucene solr to build advertising systems


Presented by Hideharu Hatayama, Rakuten, Inc.

I want to talk about architecture patterns for Solr-centered ad systems and the practical knowledge we gained by operating such a system with high availability for years. These topics are also applicable to other systems, such as e-commerce sites or restaurant recommendation sites. Through the presentation, I aim to give beginners hints on how to design a high-performance system architecture using Solr, and how to manage and operate such systems while avoiding downtime.

Published in: Education, Technology

  1. 1. Using Lucene/Solr to build Advertising Systems. Hide (Hatayama Hideharu), Big Data Department, Targeting Section, Advertising Group, Rakuten, Inc. May 2nd 2013
  2. 2. Intro: Agenda | www.lucenerevolution.org http://www.lucenerevolution.org/2013/agenda
  3. 3. Intro: Agenda | www.lucenerevolution.org http://www.lucenerevolution.org/2013/agenda 35 min... orz. My talk is NOT about (m(_ _)m): NRT; SolrCloud; complicated queries; or other hot Solr topics. My talk is just about: an overview of Solr and its most common features; our empirical knowledge about Solr
  4. 4. Agenda: 1 Introduction of Me & Rakuten; 2 Solr centered Advertising Systems; 3 Solr performance; 4 Solr plug-in; 5 (Solr with Japanese language)
  10. 10. Who am I? Hatayama Hideharu (call me Hide). M.Eng, Tokyo Institute of Technology, Japan. Worked on advertising systems at Rakuten for 3 years: ad management system development; ad distribution system development; system architecture design; increasing the performance of systems; increasing the profitability of ad services. User of Solr, not implementer. http://6109.hidepiy.com/
  11. 11. Who are we? Rakuten, Inc.: Internet services company. Founded: Feb. 7th 1997, Tokyo, Japan. The first service: Rakuten Ichiba (shopping mall)
  12. 12. Who are we?
  13. 13. Rakuten in Japan
  14. 14. Rakuten Ichiba. Ichiba: the largest online shopping mall in Japan. (Screenshot labels: user info, campaign, other services, item search, category navigation, personalized item, item history, sale event, shop history, bookmarked item, service tab)
  15. 15. Rakuten’s Global Expansion (world map; legend: E-Commerce, eBook, Travel, Other services & businesses, Development center)
  17. 17. Types of advertisements on Rakuten Ichiba [1/3]. Listing Ad (search word related ad). (Screenshot labels: item search, searched ads, searched items)
  18. 18. Types of advertisements on Rakuten Ichiba [2/3]. Display Ad (placement related ad): where, when … Targeting Ad (user related ad): sex, age, browsing history …
  19. 19. Types of advertisements on Rakuten Ichiba [3/3]. ... Ad? 120 ads on 1 page ...orz
  20. 20. ad system function landscape (diagram labels: ad system; Rakuten Owned Media (Web/Email); Owned Ad Network; External ADNW, AdEx; Rakuten staff; Merchants; Other staff; Tool; User; Media; Tenancy Ad (fixed placement/fee/term); P4P Ad (CPM/CPC/CPA etc.); ad placement def.; sales mgmt.; creative mgmt.; campaign mgmt.; budget mgmt.; bidding; delivery mgmt.; reporting; merchant tool; targeting/media reporting; ad management; ad distribution; ad server; beacon server; redirect server; log processing; targeting (placement, keyword, behavioral, demographic, etc.); additional functions: big data analysis (attribution, behavior, optimization), advanced targeting, creative optimization, connect to affiliate network, programmatic media buying; devices: PC, mobile, smartphone, tablet)
  21. 21. ad distribution system [1/2] (diagram labels: JSON, HTML, JavaScript; ad searching, ad filtering, ad sorting, logging ...; ???; parameter: placement, keyword, ad type, ...; cookie)
  22. 22. ad distribution system [2/2]. Needs high performance and high availability, e.g., more than 7,000 req / sec for 1 server with 100.00% avail. Collect & analyze logs, then improve profitability. The basic architecture is the same for our variety of ads. Using ... Kyoto Tycoon
  23. 23. system design: few years ago [1/5] (diagram: web svr and app svr clusters, master and slave servers, with multipliers x4 and x2; legend: 1 physical server, SLB, 1 server cluster)
  24. 24. system design: few years ago [2/5] (same diagram). Cluster: web server x 4, app server x 5
  25. 25. system design: few years ago [3/5] (same diagram). SLB connects app <-> Solr
  26. 26. system design: few years ago [4/5] (same diagram). High availability, robust. Simplified task for each server: the web server only runs Apache, the Solr server only searches ... Make full use of resources, on-demand provisioning: e.g., add 1 front cluster; e.g., swap a broken Apache server; e.g., tune up performance and decrease app servers 5 -> 3
  27. 27. system design: few years ago [5/5] (same diagram). So many servers, so many configurations: we didn’t have automatic deploy or operation tools. So much external networking: Apache <-> Tomcat, app <-> Solr ... Apache, Tomcat, Solr, and Redis never died, but performance was our biggest issue.
  28. 28. system design: little bit changed [1/4] (diagram: server clusters with multipliers x4 and x2, master and slave; legend: 1 physical server, SLB, 1 server cluster)
  29. 29. system design: little bit changed [2/4] (same diagram). Merged web & app server: 1 physical server contains both Apache & Tomcat
  30. 30. system design: little bit changed [3/4] (same diagram). Easy to understand the whole system network; easy to operate; easy to deploy or change configurations
  31. 31. system design: little bit changed [4/4] (same diagram). Solr is still far from the apps
  32. 32. system design: current [1/4] (diagram: app servers and master, with multipliers x2; legend: 1 physical server, SLB)
  33. 33. system design: current [2/4] (same diagram). The Solr slave is included in the app server
  34. 34. system design: current [3/4] (same diagram). SLB connects master <-> slave
  35. 35. system design: current [4/4] (same diagram). No SPOF (Solr master is redundant); easy to understand the whole system process; easy to operate; easy to deploy or change configurations; easy to scale out; good performance (7000 req / sec by 1 server). But we can’t make full use of server resources, e.g., we want 0.7 Solr instance for 1 app instance...
  36. 36. system design: in the near future. Server instance: physical on-premise, private cloud, public cloud; PaaS; Apache or Nginx?; shared cache; master <-> slave or SolrCloud?; Solr or Elasticsearch?; abolish servlet & Tomcat style?; collaborate more with Hadoop family members
  37. 37. system design: in the near future. Server instance: physical on-premise, private cloud, public cloud; PaaS; Apache or Nginx?; shared cache; Solr or Elasticsearch?; abolish servlet & Tomcat style; collaborate more with Hadoop family members. m(_ _)m UNDER CONSTRUCTION
  38. 38. operation e.g. Solr schema update [1/8] (diagram: three app server groups and master, with multipliers x2; legend: 1 physical server, SLB)
  39. 39. operation e.g. Solr schema update [2/8] (same diagram). Stop replication of Solr & Redis
  40. 40. operation e.g. Solr schema update [3/8] (same diagram). Separated from the net. Service IN, Service IN, Service OUT
  41. 41. operation e.g. Solr schema update [4/8] (same diagram). Update schema & app. Service IN, Service IN, Service OUT
  42. 42. operation e.g. Solr schema update [5/8] (same diagram). Update schema. Service IN, Service IN, Service OUT
  43. 43. operation e.g. Solr schema update [6/8] (same diagram). Restart replication. Service IN, Service IN, Service OUT
  44. 44. operation e.g. Solr schema update [7/8] (same diagram). Test app functions with reverse proxy. Service IN, Service IN, Service OUT
  45. 45. operation e.g. Solr schema update [8/8] (same diagram). Connected to the net. Service IN, Service IN, Service IN
  47. 47. Solr cache. About the various kinds of Lucene/Solr cache: fieldCache (Lucene level); fieldValueCache; documentCache; filterCache; queryResultCache; HTTP cache and user defined cache
  48. 48. filter cache. We’re using it for caching the results of filter queries.
      <!-- default in solrconfig.xml -->
      <filterCache class="solr.FastLRUCache"
                   size="512"
                   initialSize="512"
                   autowarmCount="0"/>
  49. 49. query result cache. We used to activate it to avoid useless searching.
      <!-- default in solrconfig.xml -->
      <queryResultCache class="solr.LRUCache"
                        size="512"
                        initialSize="512"
                        autowarmCount="0"/>
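Both cache definitions above bound the cache by a size and evict on a least-recently-used basis. To make that eviction policy concrete, here is a minimal sketch in plain Java. It is not Solr's implementation of solr.LRUCache or solr.FastLRUCache; the class name LruCacheSketch is hypothetical, and it only illustrates what a size-bounded LRU policy does.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative size-bounded LRU cache (hypothetical class, not Solr code). */
final class LruCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private final int maxSize;

    LruCacheSketch(int maxSize) {
        // accessOrder = true: iteration order runs from least to most recently used
        super(16, 0.75f, true);
        this.maxSize = maxSize;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each insertion; evict the least recently used entry when over capacity
        return size() > maxSize;
    }

    public static void main(String[] args) {
        LruCacheSketch<String, String> cache = new LruCacheSketch<>(2);
        cache.put("fq=status:active", "docset-1");
        cache.put("fq=placement:footer", "docset-2");
        cache.get("fq=status:active");              // touch: now the most recently used
        cache.put("fq=placement:header", "docset-3"); // evicts fq=placement:footer
        System.out.println(cache.keySet());
    }
}
```

The keys in main are made-up filter query strings; in Solr the cached values would be document sets or result lists rather than strings.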
  50. 50. application cache. About caching on the app side: processing time without searching is 0 – 1 msec -> converting from doc to DTO is relatively wasteful -> SolrJ with javabin works well, but...
  51. 51. sizing & memory usage. Monitoring -> tuning configuration, memory allocation. Server: traffic, load, cpu, memory, page, swap. Apache: busy, rps, bps, cpu, state, processing time. Tomcat: thread, rps, bps, eps, memory, jmx. Solr: index size, doc num, memory, cache hit ratio (admin page, admin/Luke, replication?command=details ...). Tools: server mon; GrowthForecast; Solr admin, command, Luke
  52. 52. avoid Full GC. Full GC: if we allocate 2GB for a Tomcat heap -> “Stop the World” would be more than 1 sec. Concurrent GC (we’re still struggling with tuning), e.g.:
      HEAP_OPTS="-Xmx2g -Xms2g -Xss512k"
      GC_LOG_OPTS="-verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails"
      FULL_GC_OPTS="-XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+UseParNewGC -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=32 -XX:TargetSurvivorRatio=90"
      JMX_OPTS="-Dcom.sun.management.config.file=${CATALINA_HOME}/conf/management.properties"
      CATALINA_OPTS="-server ${HEAP_OPTS} ${GC_LOG_OPTS} ${FULL_GC_OPTS} ${JMX_OPTS}"
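Besides the GC log options above, collector activity can also be read from inside the JVM through the standard management beans, which are the same beans exposed over JMX. The following is a small sketch using only java.lang.management; the class name GcStatsSketch is hypothetical, and the reported figure is the accumulated collection time per collector, which is not necessarily identical to stop-the-world pause time.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Illustrative reader for the JVM's garbage collector statistics (hypothetical class). */
final class GcStatsSketch {

    /** Sums the accumulated collection time in milliseconds over all collectors. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 when the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: count=%d, time=%d ms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
        System.out.println("total GC time: " + totalGcMillis() + " ms");
    }
}
```

Sampling these counters periodically and graphing the deltas is one way to see whether a heap or collector change actually reduced GC time.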
  54. 54. Solr plugin: RequestHandler, SearchHandler; SearchComponent, QueryComponent; QParserPlugin, PostFilter; QueryResponseWriter -> implemented these classes for our own use
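  55. 55. RequestHandler & SearchHandler: for logging; for health check; like /admin/ calls AdminHandlers
      import org.apache.solr.common.util.NamedList;
      import org.apache.solr.handler.RequestHandlerBase;
      import org.apache.solr.request.SolrQueryRequest;
      import org.apache.solr.response.SolrQueryResponse;
      import org.slf4j.Logger;
      import org.slf4j.LoggerFactory;

      public class OurRequestHandler extends RequestHandlerBase {
        /** Logger */
        private static Logger log = LoggerFactory.getLogger(OurRequestHandler.class);

        @Override
        public void init(NamedList args) { super.init(args); }

        @Override
        public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) throws Exception {
          log.info(req.toString());   // log every incoming request
          rsp.setHttpCaching(false);  // do not let this response be HTTP-cached
          ...
        }
      }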
  56. 56. Solr index situation [1/2]. Solr’s indexing has a huge cost, we thought (just thought...) -> so we separated the data into these two: basic stable data; additional unstable data
  57. 57. Solr index situation [2/2]. Solr index: for searching; keyword, placement data (Japan, Ichiba, footer...); a few GB. Redis data (previously MySQL): for filtering or sorting; ad status (active or not); ad price, ad rank (based on CTR, CVR...); and ad contents data (image path, link, text...); 100MB – 10GB (depends on advertisement types)
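The split above means a request first gets candidate ads from the search index and then filters and sorts them using the fast-changing data. Here is a minimal sketch of that second step in plain Java, with an in-memory Map standing in for the key-value store. The names AdRankingSketch, AdState and filterAndSort are hypothetical, and the real system's data model is not shown in the slides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Illustrative filter-and-sort step over search candidates (hypothetical names throughout). */
final class AdRankingSketch {

    /** Fast-changing per-ad data that would live in the key-value store. */
    static final class AdState {
        final boolean active;
        final double rank;

        AdState(boolean active, double rank) {
            this.active = active;
            this.rank = rank;
        }
    }

    /** Keeps only active ads known to the store, orders them by rank (highest first), returns at most rows. */
    static List<String> filterAndSort(List<String> candidateIds, Map<String, AdState> store, int rows) {
        return candidateIds.stream()
                .filter(id -> {
                    AdState state = store.get(id);
                    return state != null && state.active;
                })
                .sorted(Comparator.comparingDouble((String id) -> store.get(id).rank).reversed())
                .limit(rows)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, AdState> store = new HashMap<>();
        store.put("ad-1", new AdState(true, 0.2));
        store.put("ad-2", new AdState(false, 0.9)); // inactive: filtered out
        store.put("ad-3", new AdState(true, 0.5));

        // Candidate ids as they might come back from a keyword/placement search
        List<String> candidates = new ArrayList<>(Arrays.asList("ad-1", "ad-2", "ad-3", "ad-4"));
        System.out.println(filterAndSort(candidates, store, 10));
    }
}
```

Keeping status and rank outside the index means they can change without re-indexing, at the cost of one extra lookup per candidate.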
  58. 58. searching: handle ads in app [1/2] (diagram labels: handle req, search, filter, sort ...)
  59. 59. searching: handle ads in Solr [2/2] (diagram labels: handle req, search ...)
  60. 60. Solr with Redis data handling [1/2]. ResponseWriter -> unsuitable for searching or filtering. SearchComponent -> easy to implement and configure -> basic processing is already handled in QueryComponent
  61. 61. Solr with Redis data handling [2/2]. Modify QueryComponent -> good position in terms of functionality -> base for default searching -> relatively big component. ConstantScoreQuery with our own Filter?
  62. 62. QueryParserPlugin & PostFilter [1/2] e.g.:
      <!-- solrconfig.xml -->
      <!-- put jar file here -->
      <lib dir=".../orochi_search" />
      <!-- define implemented class -->
      <queryParser name="redis" class="...orochi.search.ExtendedQParserPlugin" />

      public class ExtendedQParserPlugin extends QParserPlugin {
        public void init(NamedList args) { /* NOOP */ }

        @Override
        public QParser createParser(String qstr, SolrParams localParams, SolrParams params, SolrQueryRequest req) {
          return new QParser(qstr, localParams, params, req) {
            ...
            @Override
            public Query parse() throws ParseException {
              // the parsed "query" is the post filter itself
              return new RedisPostFilter(rows, preview, currentTimeMillis);
            }
          };
        }
      }
  63. 63. QueryParserPlugin & PostFilter [2/2]
      public class RedisPostFilter extends ExtendedQueryBase implements PostFilter {
        public RedisPostFilter(int rows, long preview, long currentTimeMillis) {
          setCache(false);
          ...
        }

        public boolean isValid(int docId, IndexSearcher indexSearcher) {
          // return whether the document is valid or not.
          document = indexSearcher.doc(docId, fieldSelector);
          ...
        }

        public DelegatingCollector getFilterCollector(final IndexSearcher indexSearcher) {
          return new DelegatingCollector() {
            @Override
            public void collect(int docId) throws IOException {
              if (isValid(docId, indexSearcher)) {
                super.collect(docId);
                ...
              }
            }
          };
        }

        @Override
        public int getCost() { return Math.max(super.getCost(), 100); }
        ...
      }
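The core idea of the post filter above is delegation: each matching document is passed on to the next collector only if a validity check accepts it. The sketch below shows that pattern in plain Java without any Solr or Lucene dependency. The Collector interface and the filtering method are hypothetical stand-ins, not the Solr DelegatingCollector API, and a Set stands in for the Redis lookup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.IntPredicate;

/** Illustrative delegating-collector pattern (hypothetical types, not Solr classes). */
final class PostFilterSketch {

    /** Minimal stand-in for a collector that receives matching document ids. */
    interface Collector {
        void collect(int docId);
    }

    /** Wraps a delegate so that only documents accepted by isValid reach it. */
    static Collector filtering(IntPredicate isValid, Collector delegate) {
        return docId -> {
            if (isValid.test(docId)) {
                delegate.collect(docId);
            }
        };
    }

    public static void main(String[] args) {
        // Assumption: ids of ads that are currently active, as a lookup in Redis might report
        Set<Integer> activeAds = new HashSet<>(Arrays.asList(2, 5));
        List<Integer> collected = new ArrayList<>();

        Collector chain = filtering(docId -> activeAds.contains(docId), docId -> collected.add(docId));
        for (int docId = 0; docId < 8; docId++) {
            chain.collect(docId); // the search would call this for each matching document
        }
        System.out.println(collected);
    }
}
```

Because the check runs per matching document, it pays to keep it after the cheaper filters, which is what the high cost value (at least 100) in the slide's getCost() is for.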
  64. 64. Merge Solr & Redis (diagram labels: handle req, search ...)
  66. 66. Japanese linguistics. すもももももももも (pronunciation) sumomomomomomomomo. すもも も もも も もも (words) sumomo mo momo mo momo. 李も桃も桃 (meaning) Plums and peaches are both part of peaches
  67. 67. Japanese linguistics. 最中を食べている最中です m(_ _)m (pronunciation) monakawotabeteirusaichudesu (meaning) I’m eating monaka. (excuse me) How to separate this sentence into tokens for indexing?
  68. 68. Tokenize approach: N-gram. 最中を食べている最中ですm(_ _)m
      unigram: 最 中 を 食 べ て い る 最 中 で す m ( _ _ ) m
      bigram: 最中 中を を食 食べ べて てい いる る最 最中 中で です すm m( (_ _ _ _) )m
      trigram: 最中を 中を食 を食べ 食べて べてい ている いる最 る最中 最中で 中です ですm すm( m(_ (_ _ _ _) _)m
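The N-gram lists above come from sliding a window of n characters over the text one character at a time. Here is a minimal sketch in plain Java; the class and method names are hypothetical, and this is not how Solr's N-gram tokenizers are implemented, only the same windowing idea.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative character N-gram splitter (hypothetical class, not a Solr tokenizer). */
final class NGramSketch {

    /** Returns all overlapping n-grams of the text, counted in Unicode code points. */
    static List<String> ngrams(String text, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        int[] codePoints = text.codePoints().toArray();
        List<String> grams = new ArrayList<>();
        for (int start = 0; start + n <= codePoints.length; start++) {
            grams.add(new String(codePoints, start, n));
        }
        return grams;
    }

    public static void main(String[] args) {
        String text = "最中を食べている";
        System.out.println(ngrams(text, 1)); // unigrams
        System.out.println(ngrams(text, 2)); // bigrams
        System.out.println(ngrams(text, 3)); // trigrams
    }
}
```

A text of length L yields L - n + 1 grams, which is why an N-gram index grows quickly compared with a dictionary-based one.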
  69. 69. Tokenize approach: Morphological Analysis [1/2]. 最中を食べている最中ですm(_ _)m, using dictionary: 最中 を 食べ て いる 最中 です m(_ _)m
      text        partOfSpeech            pronunciation
      最中        noun-common             monaka
      を          particle-case-misc      o
      食べ        verb-main               tabe
      て          particle-conjunctive    te
      いる        verb-auxiliary          iru
      最中        noun-adverbial          saichu
      です        auxiliary-verb          desu
      m(_ _)m     -                       -
  70. 70. Tokenize approach: Morphological Analysis [2/2]. 最中を食べている最中ですm(_ _)m
  71. 71. Tokenize approach: compare 2 ways (each row: N-gram / Morphological Analysis)
      index size: big / small
      preparation: not needed / make & maintain word dictionary
      implementation: very easy / hard (NLP, ML, statistics)
      new word: no problem / update dictionary, re-index
      search relevancy: without omission, contains trivial / with omission, human like
      processing time: ... / ...
  72. 72. Solr with Morphological Analysis. ver. -3.5: set up component & dictionary manually (Sen, Lucene gosen...). ver. 3.6-: field type text_ja works well; “kuromoji” is inside
  73. 73. issues of kuromoji. Some adjustments are needed for migration: the supported dictionaries may differ between the previous engine & kuromoji. Half width & full width characters: Windows8 <-> Windows8; AKB48 <-> AKB48
  74. 74. Japanese Analyzer: JapaneseTokenizer; JapaneseBaseFormFilter; JapanesePartOfSpeechStopFilter; CJKWidthFilter; StopFilter; JapaneseKatakanaStemFilter; LowerCaseFilter
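The half width versus full width issue can be tried outside Solr with Unicode compatibility normalization. The sketch below uses java.text.Normalizer with NFKC as a stand-in; it is an assumption for illustration, not what CJKWidthFilter does internally, and NFKC folds more than just character width. The class name WidthNormalizeSketch is hypothetical.

```java
import java.text.Normalizer;

/** Illustrative width folding via NFKC normalization (hypothetical class, not a Solr filter). */
final class WidthNormalizeSketch {

    /** Folds full-width Latin letters and digits (among other compatibility forms) to their basic forms. */
    static String normalize(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Full-width "Windows8", written with escapes so the code points are explicit
        String fullWidth = "\uFF37\uFF49\uFF4E\uFF44\uFF4F\uFF57\uFF53\uFF18";
        String halfWidth = "Windows8";

        System.out.println(fullWidth.equals(halfWidth));                       // raw strings differ
        System.out.println(normalize(fullWidth).equals(normalize(halfWidth))); // equal after folding
    }
}
```

Applying the same folding at both index time and query time is what lets the two spellings match each other.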
  76. 76. Thank you, San Diego. Any questions? Any comments? Any advice? If you have some, let’s talk later (not now...?)
  77. 77. Hide (Hatayama Hideharu), Big Data Department, Targeting Section, Advertising Group, Rakuten Inc. blog: http://6109.hidepiy.com facebook: http://www.facebook.com/hatayama.hideharu twitter: ... I don’t remember
