Developing Scalable Search for User Generated Content at PlayStation
Alvin Peng
Sr. Software Engineer
Sony Interactive Entertainment
  
User Generated Content (UGC) in PlayStation
•  PlayStation users can easily share awesome media
•  Media types
  –  Broadcasts
  –  Screenshots
  –  Videos
•  Media is posted to third-party networks
  –  Facebook
  –  Twitter
  –  YouTube
  –  Dailymotion
  –  Twitch
  –  Ustream
  –  Niconico
  
However…
•  There was no central place to show or search for all this awesome content
•  It only showed up in users' Activity Feed and Profile
•  It was only sent to friends
•  Basically not visible to the majority of our millions of users
  
Difficulties of UGC System
•  Searchable
•  Scalability
•  Performance
•  Dynamic content
•  A lot of reads
•  A lot of writes
•  Various search requirements
  
Solution
•  SolrCloud-based scalable search system for public UGC by users of PlayStation
  
	
  
Why Solr?
•  Widely used open source search platform
•  Scalable
•  Stable
•  Feature rich
•  Not just a search platform
•  Great Solr community, both individuals and companies
  
	
  
Developers of UGC backend system
•  Alvin Peng
•  David Herrera Rosales
  
	
  
Live From PlayStation and More
System Architecture
	
  
UGC SolrCloud System Design
•  Solr 5.2.1
•  SolrJ CloudSolrClient
•  Single collection
•  3 clusters in production environment
  –  Broadcasts
  –  Screenshots
  –  Videos
•  5 ZooKeeper nodes
•  Single shard
•  16 Solr nodes per cluster
  
Solr Schema
•  Field types
  –  Class
    •  StrField
    •  TextField
    •  TrieLongField
    •  TrieDateField
    •  etc.
  –  Analyzer
    •  Char filter
      –  MappingCharFilterFactory
      –  HTMLStripCharFilterFactory
      –  PatternReplaceCharFilterFactory
      –  etc.
    •  Tokenizer
      –  StandardTokenizerFactory
      –  NGramTokenizerFactory
      –  KeywordTokenizerFactory
      –  etc.
    •  Filter
      –  LowerCaseFilterFactory
      –  PorterStemFilterFactory
      –  StopFilterFactory
      –  etc.
  –  Index Analyzer and Query Analyzer
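
To show how these pieces fit together, here is a minimal sketch of one field type with separate index and query analyzers. The type name `text_ugc` and the `stopwords.txt` file are assumptions for illustration and are not taken from the UGC schema. Within an analyzer, char filters run first, then the tokenizer, then the token filters in the order listed.

```xml
<!-- Hypothetical field type: the name "text_ugc" and the stopword file are assumptions. -->
<fieldType name="text_ugc" class="solr.TextField" positionIncrementGap="100">
  <!-- Applied to document text when it is indexed. -->
  <analyzer type="index">
    <!-- Char filter: strips HTML markup before tokenizing. -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <!-- Applied to the user's query text at search time. -->
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```

Keeping the token filters the same on both sides means a stemmed, lowercased query term can match the stemmed, lowercased term stored in the index.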
  
	
  
Solr Schema
•  Fields
  –  Number of fields
  –  Field type
  –  Indexed
  –  Stored
  –  etc.
•  copyField
  –  <copyField source="*_t" dest="anything" maxChars="25000" />
•  dynamicField
  –  <dynamicField name="*_t" type="text" indexed="true" stored="true"/>
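
A short sketch of how explicit fields, a dynamic field and a copy field might sit together in one schema. The field names `id` and `created_at` and the type names `string` and `tdate` are assumptions (taken to map to `solr.StrField` and `solr.TrieDateField`); only the `*_t` and `anything` rules come from the slide.

```xml
<!-- Explicit fields; these names are hypothetical. -->
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="created_at" type="tdate" indexed="true" stored="true"/>

<!-- The copyField destination must itself be declared. It is multiValued
     because several *_t source fields can be copied into it. -->
<field name="anything" type="text" indexed="true" stored="false" multiValued="true"/>

<!-- Any field whose name ends in _t is indexed and stored as text. -->
<dynamicField name="*_t" type="text" indexed="true" stored="true"/>

<!-- Copy up to 25000 characters of every *_t field into the catch-all field. -->
<copyField source="*_t" dest="anything" maxChars="25000"/>
```

With this layout a single query against `anything` searches all text fields at once, while the individual `*_t` fields remain available for field-specific queries.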
  
UGC Multilingual Support
•  Supports about 20 languages
  –  English
  –  Spanish
  –  Japanese
  –  etc.
•  Different field types for different languages
•  Different tokenizers and filters
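
As an illustration of "different field types for different languages", the sketch below defines one field type each for English, Spanish and Japanese. The type names and the `*_t_en` style dynamic field suffixes are assumptions, not the actual UGC schema; the tokenizer and filter classes are ones that ship with Solr.

```xml
<!-- English: standard tokenizer plus Porter stemming. -->
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Spanish: same tokenizer, language-specific light stemmer. -->
<fieldType name="text_es" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SpanishLightStemFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Japanese: morphological tokenizer, since words are not separated by spaces. -->
<fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
    <filter class="solr.JapaneseBaseFormFilterFactory"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical naming convention: route text to a type by field suffix. -->
<dynamicField name="*_t_en" type="text_en" indexed="true" stored="true"/>
<dynamicField name="*_t_es" type="text_es" indexed="true" stored="true"/>
<dynamicField name="*_t_ja" type="text_ja" indexed="true" stored="true"/>
```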
  
	
  
UGC Solr Configuration
•  Hard commit: 15 minutes
  –  Hard commits are about durability
•  Soft commit: 1 minute
  –  Soft commits are about visibility
  –  Less expensive, but not free
  –  Use the longest soft commit interval that's acceptable for best performance
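
The intervals above could be expressed in `solrconfig.xml` roughly as follows. This is a sketch, not the actual UGC configuration: `maxTime` is in milliseconds (15 minutes = 900000, 1 minute = 60000), and `openSearcher=false` is an assumption, chosen here so that hard commits only flush to disk and visibility is left to soft commits.

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Transaction log, required for SolrCloud recovery. -->
  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
  </updateLog>

  <!-- Hard commit every 15 minutes: durability. -->
  <autoCommit>
    <maxTime>900000</maxTime>
    <!-- Assumption: do not open a new searcher on hard commit. -->
    <openSearcher>false</openSearcher>
  </autoCommit>

  <!-- Soft commit every 1 minute: makes new documents visible to searches. -->
  <autoSoftCommit>
    <maxTime>60000</maxTime>
  </autoSoftCommit>
</updateHandler>
```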
  
UGC Stats
•  Online since last Sept.
•  Number of documents
  –  Broadcasts: 26K
  –  Screenshots: 5M
  –  Videos: 20M
•  Average request RPS
  –  Total UGC query requests per day > 1B
  –  Average Solr query RPS:
    •  Broadcasts: 1600
    •  Screenshots: 250
    •  Videos: 250
  –  Average Solr update RPS:
    •  Broadcasts: 500
    •  Screenshots: 250
    •  Videos: 500
•  Average query latency
  –  Average Solr query latency:
    •  Broadcasts: 4ms (16ms for leader)
    •  Screenshots: 14ms (16ms for leader)
    •  Videos: 60ms (210ms for leader)
  –  Average Solr update latency:
    •  Broadcasts: 8ms (60ms for leader)
    •  Screenshots: 1ms (10ms for leader)
    •  Videos: 2ms (24ms for leader)
  
	
  
Finally
•  Happy searching with Solr!
  
Q/A
	
  

Downtown SF Lucene/Solr Meetup: Developing Scalable Search for User Generated Content at PlayStation
