Let's make Nico Nico Douga searchable
Indexing 2.5 billion documents with Elasticsearch

Transcript

  • 1. Let's make the Nico Nico Douga dataset searchable @PENGUINANA_
  • 2. whoami • @PENGUINANA_ / 兼山元太 • engineer at *.cookpad.com/* • search infrastructure and service development
  • 3. JSON all around us • tweet • 140-character message
  • 4. JSON all around us • tweet • 140-character message • user_name • datetime • location • reply or not / contains link or not / retweet count / reply count ...
  • 5. JSON all around us • access log • ip address • requested content • status code • response time • referrer
  • 6. JSON all around us • event log • user_id • event name • params (hash) • datetime • user agent
  • 7. JSON all around us • dictionary edit request • keyword • operation type • requester • status (applied or not)
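    The slides above list the fields of these everyday documents; as a minimal sketch (all field values here are invented for illustration), the event-log record from slide 6 serializes to one line of JSON, ready to index:

    ```python
    import json

    # A hypothetical event-log document with the fields named on the slide.
    # Every value below is made up for illustration.
    event_log = {
        "user_id": 12345,
        "event_name": "recipe_view",
        "params": {"recipe_id": 678, "referrer": "search"},  # params (hash)
        "datetime": "2013-09-01T12:34:56+09:00",
        "user_agent": "Mozilla/5.0",
    }

    # One JSON object per line is exactly the shape a search engine ingests.
    line = json.dumps(event_log, ensure_ascii=False)
    ```
    
    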
  • 8. kibana • http://demo.kibana.org/ • http://www.elasticsearch.org/blog/kibana-whats-cooking/
  • 9. kibana@cookpad • log dashboard for internal APIs • explore logs • capacity planning • performance checks • slow queries
  • 10. dashboard for each application
  • 11. Theme • if we could search and analyze JSON data flexibly, no matter how large it gets, everyday work would be easier • how do we do that? is it hard?
  • 12. Just try it • take the Nico Nico Douga dataset • make it searchable and analyzable
  • 13. Dataset • the official Nico Nico Douga dataset • metadata for 8 million videos • 2.5 billion comments • JSON format (compressed: 60 GB, uncompressed: 300 GB) http://goo.gl/FYtO5T
  • 15. http://goo.gl/FYtO5T
  • 17. Results • done in 4 hours with Elasticsearch on AWS • s3 -> unzip -> Elasticsearch (173k docs/s) • ¥550
  • 18. Demo • a date facet over 2.5 billion comments
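    A sketch of the search body such a demo would POST to the cluster — this uses the 0.90-era date-histogram facet syntax, and the field name `date` and index/type `nico2/comment` are taken from the later slides:

    ```python
    import json

    # Count comments per month across the whole index, returning no hits.
    query = {
        "size": 0,  # facet counts only
        "facets": {
            "comments_per_month": {
                "date_histogram": {"field": "date", "interval": "month"}
            }
        },
    }

    # Request body for POST http://localhost:9200/nico2/comment/_search
    body = json.dumps(query)
    ```
    
    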
  • 19. install • wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.3.noarch.rpm • sudo rpm -i elasticsearch-0.90.3.noarch.rpm
  • 20. install plugins • sudo bin/plugin • .. -install elasticsearch/elasticsearch-cloud-aws • .. -install mobz/elasticsearch-head • .. -install lukas-vlcek/bigdesk • .. -install elasticsearch/elasticsearch-analysis-kuromoji
  • 21. elasticsearch-cloud-aws • cluster node discovery in AWS • add config to elasticsearch.yml:
        cloud:
          aws:
            access_key: AKI...........
            secret_key: mR.............
        discovery:
          type: ec2
        discovery.ec2.groups: es_test  # security group
  • 22. elasticsearch-head
  • 23. bigdesk
  • 24. elasticsearch-analysis-kuromoji • Japanese analyzer
  • 25. config
        # Set a custom allowed content length:
        http.max_content_length: 1000m
        # Heap Size (defaults to 256m min, 1g max)
        ES_HEAP_SIZE=3g
        # ElasticSearch data directory
        DATA_DIR=/media/ephemeral1/es,/media/ephemeral2/es,/media/ephemeral3/es
  • 26. make AMI • elasticsearch machine image
  • 27. launch ES instances • c1.xlarge x 20 • CPU: Xeon 8 cores (2,300 MHz) • Memory: 7 GB • Disk: 420 GB x 4 • $0.07/hour (spot instance)
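    This fleet is consistent with the ¥550 figure on slide 17 — a quick check, assuming roughly 100 JPY/USD at the time (the exchange rate is my assumption):

    ```python
    # 20 c1.xlarge spot instances at $0.07 per node-hour, for the ~4-hour run.
    nodes = 20
    spot_price_usd = 0.07
    hours = 4

    cost_usd = nodes * spot_price_usd * hours   # about $5.60 total
    cost_jpy = cost_usd * 100                   # assumed ~100 JPY/USD
    ```
    
    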
  • 28. deploy data • download from s3 to the nodes • use s3cmd (a few minutes with GNU Parallel) • unzip (60 GB -> 300 GB)
  • 29. bulk import
        { "index" : { "_id" : "sm14784868 1", "parent": "sm14784868" } }
        {"date":"2011-06-18T20:15:30+09:00","no":1,"vpos":63,"comment":"1","command":"184"}
        ...
        { "index" : { "_id" : "sm14784868 2", "parent": "sm14784868" } }
        {"date":"2011-07-24T02:22:58+09:00","no":2,"vpos":4651,"comment":"2 get","command":"184"}
  • 30. bulk import
        ls request_file* | parallel -j N curl -X POST -s -D - 'http://localhost:9200/nico2/comment/_bulk' -o /dev/null --data-binary @{}
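    The request files fed to `parallel` above hold the two-lines-per-document bulk format from slide 29 — an action line with `_id` and the parent video, then the source line. A sketch of a generator for them (the helper name is mine; the fields follow the slide's example):

    ```python
    import json

    def bulk_lines(video_id, comments):
        """Build Elasticsearch bulk-API lines for one video's comments:
        an action line carrying _id and the parent video, then the source."""
        out = []
        for c in comments:
            out.append(json.dumps({"index": {"_id": "%s %d" % (video_id, c["no"]),
                                             "parent": video_id}}))
            out.append(json.dumps(c, ensure_ascii=False))
        return "\n".join(out) + "\n"  # a bulk body must end with a newline

    body = bulk_lines("sm14784868", [
        {"date": "2011-06-18T20:15:30+09:00", "no": 1, "vpos": 63,
         "comment": "1", "command": "184"},
    ])
    ```
    
    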
  • 31. wc -l on the request files > 4.8 billion lines
  • 32. import... import... import... • every node can handle indexing requests • curl bulk import on each node (x20) • I/O spread across 3 disks • takes 4 hours
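    The deck's numbers are internally consistent: 2.5 billion comments over a 4-hour run gives the ~173k docs/s quoted on slide 17, and at two bulk lines per document the line count lands in the same ballpark as the >4.8 billion lines counted above:

    ```python
    comments = 2_500_000_000
    seconds = 4 * 3600                 # the 4-hour import window

    docs_per_sec = comments / seconds  # ~173,611 docs/s, matching "173k doc/s"
    bulk_lines_total = comments * 2    # action line + source line per document
    ```
    
    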
  • 33. efficiency
  • 34. efficiency
        "mappings": {
          "video": {
            "properties": {
              "video_id":      { "type": "string", "index": "no" },
              "title":         { "type": "string", "index": "analyzed" },
              "description":   { "type": "string", "index": "analyzed" },
              "thumbnail_url": { "type": "string", "index": "no", "store": "yes" },
              "upload_time":   { "type": "date", "format": "YYYY-MM-dd'T'HH:mm:ss'+09:00'" },
              "movie_type":    { "type": "string", "index": "not_analyzed" },
              "last_res_body": { "type": "string", "index": "analyzed" },
              "tags": {
                "properties": {
                  "tag": { "type": "string", "index": "not_analyzed" }
                }
              }
            }
          }
        }
  • 35. efficiency
        "mappings": {
          "comment": {
            "_parent": { "type": "video" },
            "properties": {
              "date":     { "type": "date", "format": "YYYY-MM-dd'T'HH:mm:ss'+09:00'" },
              "no":       { "type": "integer" },
              "vpos":     { "type": "integer" },
              "comment":  { "type": "string" },
              "command":  { "type": "string" },
              "video_id": { "type": "string", "index": "not_analyzed" }
            }
          }
        }
  • 36. efficiency • curl -X POST 'http://localhost:9200/nico2' -d @mapping.json
  • 37. shrink
        curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
          "commands" : [
            { "move" : { "index" : "nico2", "shard" : 33, "from_node" : "nodeA", "to_node" : "nodeB" } }
          ]
        }'
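    Shrinking a 20-node cluster means draining many shards, so the reroute body is easier to build programmatically. A sketch using the command shape from the slide (the helper name and the shard numbers are mine):

    ```python
    import json

    def move_command(index, shard, from_node, to_node):
        """One 'move' command in the shape POST /_cluster/reroute accepts."""
        return {"move": {"index": index, "shard": shard,
                         "from_node": from_node, "to_node": to_node}}

    # Move several nico2 shards off nodeA in a single reroute request.
    body = json.dumps({"commands": [move_command("nico2", s, "nodeA", "nodeB")
                                    for s in (33, 34)]})
    ```
    
    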
  • 38. shrink
        curl -XPUT localhost:9200/_cluster/settings -d '{ "persistent": { "indices.recovery.concurrent_streams": 3 } }'
        curl -XPUT localhost:9200/_cluster/settings -d '{ "persistent": { "indices.recovery.max_bytes_per_sec": "1000mb" } }'
  • 39. Why Elasticsearch? • proven, scalable search engine • super flexible config with sensible defaults • great API • growing developer and user base
  • 40. not covered • mapping • query DSL • search performance • cluster operation • healthcheck / cluster statistics • etc...
  • 41. questions?