Let's Make the Niconico Douga Dataset Searchable
@PENGUINANA_
whoami
• @PENGUINANA_ / 兼山元太
• Engineer at *.cookpad.com/*
• Search infrastructure and service development
JSON all around us
• tweet
• 140 character message
JSON all around us
• tweet
• 140 character message
• user_name
• datetime
• location
• reply or not / contains link or not / retweet count / reply count ...
JSON all around us
• access log
• ip address
• requested content
• status code
• response time
• referrer
JSON all around us
• event log
• user_id
• event name
• params(hash)
• datetime
• user agent
JSON all around us
• dictionary edit request
• keyword
• operation type
• requester
• status(applied or not)
kibana
• http://demo.kibana.org/
• http://www.elasticsearch.org/blog/kibana-whats-cooking/
kibana@cookpad
• log dashboard for internal API
• explore log
• capacity planning
• performance check
• slowquery
dashboard for each application
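Theme
• If we could flexibly search and analyze JSON data regardless of its size, everyday work would get easier
• How can we do that? Is it hard?
Just try it
• Niconico Douga dataset
• Make it searchable and analyzable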
Dataset
• Official Niconico Douga dataset
• Metadata for 8 million videos
• 2.5 billion comments
• JSON format (compressed: 60 GB, uncompressed: 300 GB)
http://goo.gl/FYtO5T
Results
• Done in 4 hours with Elasticsearch on AWS
• s3 -> unzip -> Elasticsearch (173k docs/s)
• 550 yen
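
The figures on the results slide can be sanity-checked with a little arithmetic. The sketch below uses the comment count, import time, instance count and spot price given elsewhere in the deck; the exchange rate is an assumption of mine and is not stated anywhere in the slides.

```python
# Back-of-the-envelope check of the reported throughput and cost.
TOTAL_COMMENTS = 2_500_000_000   # 2.5 billion comments (dataset slide)
IMPORT_HOURS = 4                 # reported import time
INSTANCES = 20                   # c1.xlarge x 20
SPOT_PRICE_USD = 0.07            # per instance-hour (spot instance)
JPY_PER_USD = 98.0               # ASSUMPTION: rough exchange rate, not from the slides

docs_per_second = TOTAL_COMMENTS / (IMPORT_HOURS * 3600)
cost_usd = INSTANCES * SPOT_PRICE_USD * IMPORT_HOURS
cost_jpy = cost_usd * JPY_PER_USD

print(f"{docs_per_second:,.0f} docs/s")
print(f"${cost_usd:.2f} (about {cost_jpy:.0f} yen at the assumed rate)")
```

The throughput comes out at roughly 173.6k docs/s and the instance cost at $5.60, which is in line with the numbers on the slide. Storage and transfer charges are not included.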
Demo
• date facet over 2.5 billion comments
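
The demo query itself is not shown in the slides. As a minimal sketch, a date facet in the pre-1.0 facets API (the deck installs Elasticsearch 0.90.3) might be built as below. The facet name `comments_over_time` and the `month` interval are assumptions; `date` is the comment field from the mapping slide.

```python
import json

def date_facet_body(field="date", interval="month"):
    """Build a search body with a date_histogram facet (pre-1.0 facets API)."""
    return {
        "size": 0,                      # only the facet counts are wanted, not hits
        "query": {"match_all": {}},
        "facets": {
            "comments_over_time": {     # hypothetical facet name
                "date_histogram": {"field": field, "interval": interval}
            }
        },
    }

body = json.dumps(date_facet_body())
print(body)
```

The resulting JSON would be sent as the body of a search request against the `nico2` index, for example to `/nico2/comment/_search`.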
install
• wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.3.noarch.rpm
• sudo rpm -i elasticsearch-0.90.3.noarch.rpm
install plugins
• sudo bin/plugin
• .. -install elasticsearch/elasticsearch-cloud-aws
• .. -install mobz/elasticsearch-head
• .. -install lukas-vlcek/bigdesk
• .. -install elasticsearch/elasticsearch-analysis-kuromoji
elasticsearch-cloud-aws
• cluster node discovery in AWS
• add config to elasticsearch.yml
cloud:
  aws:
    access_key: AKI...........
    secret_key: mR.............
discovery:
  type: ec2
discovery.ec2.groups: es_test  # security group
elasticsearch-head
bigdesk
elasticsearch-analysis-kuromoji
• Japanese analyzer
config
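• # Set a custom allowed content length:
• http.max_content_length: 1000m
• # Heap Size (defaults to 256m min, 1g max)
• ES_HEAP_SIZE=3g
• # ElasticSearch data directory
• DATA_DIR=/media/ephemeral1/es,/media/ephemeral2/es,/media/ephemeral3/es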
make AMI
• elasticsearch machine image
launch ES Instances
• c1.xlarge x 20
• CPU Xeon 8core(2,300MHz)
• Memory 7G
• Disk 420G x4
• $0.07/hour(spot instance)
deploy data
• download from s3 to nodes
• use s3cmd (few minutes with GNU Parallel)
• unzip (60GB -> 300GB)
bulk import
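{ "index" : { "_id" : "sm14784868 1", "parent": "sm14784868" } }
{"date":"2011-06-18T20:15:30+09:00","no":1,"vpos":63,"comment":"1","command":"184"}
...
{ "index" : { "_id" : "sm14784868 2", "parent": "sm14784868" } }
{"date":"2011-07-24T02:22:58+09:00","no":2,"vpos":4651,"comment":"2 get","command":"184"}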
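
The bulk format above alternates an action line and a source line for each comment. The slides do not show how these files were generated, so the following is only a sketch: the helper name `bulk_lines` is hypothetical, and the `_id` scheme (video id, a space, then the comment number) is inferred from the sample on the slide.

```python
import json

def bulk_lines(video_id, comments):
    """Yield an action line and a source line per comment for the _bulk API."""
    for c in comments:
        # The parent id routes the comment to the same shard as its video.
        action = {"index": {"_id": f"{video_id} {c['no']}", "parent": video_id}}
        yield json.dumps(action)
        yield json.dumps(c, ensure_ascii=False)  # keep Japanese text readable

comments = [
    {"date": "2011-06-18T20:15:30+09:00", "no": 1, "vpos": 63, "comment": "1", "command": "184"},
    {"date": "2011-07-24T02:22:58+09:00", "no": 2, "vpos": 4651, "comment": "2 get", "command": "184"},
]

# The _bulk endpoint expects newline-delimited JSON ending with a newline.
payload = "\n".join(bulk_lines("sm14784868", comments)) + "\n"
print(payload, end="")
```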
bulk import
• ls request_file* | parallel -j N curl -X POST -s -D - 'http://localhost:9200/nico2/comment/_bulk' -o /dev/null --data-binary @{}
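
The command above posts many `request_file*` files in parallel, but the slides do not say how those files were cut. One plausible approach, shown here purely as an assumption, is to split the bulk lines into chunks that never separate an action line from its source line, with each chunk small enough to stay under `http.max_content_length`. The helper name `chunk_pairs` is hypothetical.

```python
from itertools import islice

def chunk_pairs(lines, docs_per_file):
    """Group bulk lines into chunks that never split an action/source pair."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, docs_per_file * 2))  # two lines per document
        if not chunk:
            return
        if len(chunk) % 2:
            raise ValueError("odd number of lines: dangling action line")
        yield chunk

lines = [f"line{i}" for i in range(10)]  # stand-in for 5 action/source pairs
chunks = list(chunk_pairs(lines, docs_per_file=2))
print([len(c) for c in chunks])
```

Each chunk would then be written to its own request file and posted with `--data-binary`, as in the slide.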
wc -l requests
> 4.8billion
import... import... import...
• every node can handle indexing requests
• curl bulk import on each node (x20)
• I/O into 3 disks
• takes 4 hours
efficiency
efficiency
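"mappings": {
  "video": {
    "properties": {
      "video_id": { "type": "string", "index": "no" },
      "title": { "type": "string", "index": "analyzed" },
      "description": { "type": "string", "index": "analyzed" },
      "thumbnail_url": { "type": "string", "index": "no", "store": "yes" },
      "upload_time": { "type": "date", "format": "YYYY-MM-dd'T'HH:mm:ss'+09:00'" },
      "movie_type": { "type": "string", "index": "not_analyzed" },
      "last_res_body": { "type": "string", "index": "analyzed" },
      "tags": { "properties": { "tag": { "type": "string", "index": "not_analyzed" } } }
    }
  }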
efficiency
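"mappings": {
  "comment": {
    "_parent": { "type": "video" },
    "properties": {
      "date": { "type": "date", "format": "YYYY-MM-dd'T'HH:mm:ss'+09:00'" },
      "no": { "type": "integer" },
      "vpos": { "type": "integer" },
      "comment": { "type": "string" },
      "command": { "type": "string" },
      "video_id": { "type": "string", "index": "not_analyzed" }
    }
  }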
efficiency
• curl -X POST 'http://localhost:9200/nico2' -d @mapping.json
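
The slides show the mappings only as fragments, so the exact layout of `mapping.json` is not known. As a hedged sketch, the comment type could be assembled into a create-index body like this. The field definitions are copied from the comment mapping slide; wrapping them in a top-level `"mappings"` object is my assumption about the file's structure.

```python
import json

DATE_FORMAT = "YYYY-MM-dd'T'HH:mm:ss'+09:00'"  # format string from the slides

comment_mapping = {
    "_parent": {"type": "video"},  # comments are children of videos
    "properties": {
        "date": {"type": "date", "format": DATE_FORMAT},
        "no": {"type": "integer"},
        "vpos": {"type": "integer"},
        "comment": {"type": "string"},
        "command": {"type": "string"},
        "video_id": {"type": "string", "index": "not_analyzed"},
    },
}

# ASSUMPTION: the video type would sit next to "comment" in the same object.
index_body = {"mappings": {"comment": comment_mapping}}

mapping_json = json.dumps(index_body, indent=2)
print(mapping_json)
```

Generating the file from code rather than by hand makes it easy to confirm that the JSON parses before it is posted to the cluster.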
shrink
curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
  "commands" : [ {
    "move" : {
      "index" : "nico2", "shard" : 33,
      "from_node" : "nodeA", "to_node" : "nodeB"
    }
  } ]
}'
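
Shrinking the cluster means issuing one `move` command per shard on each node that is going away. The slides show a single command; the sketch below generates a batch of them, spreading shards round-robin over the remaining nodes. The helper name `drain_commands`, the extra shard numbers and `nodeC` are hypothetical, and round-robin placement is my choice rather than anything the slides describe.

```python
import json
from itertools import cycle

def drain_commands(index, shards_on_node, from_node, target_nodes):
    """Build reroute 'move' commands that spread one node's shards over targets."""
    targets = cycle(target_nodes)
    return {
        "commands": [
            {"move": {"index": index, "shard": shard,
                      "from_node": from_node, "to_node": next(targets)}}
            for shard in shards_on_node
        ]
    }

body = drain_commands("nico2", [33, 34, 35], "nodeA", ["nodeB", "nodeC"])
print(json.dumps(body, indent=2))
```

The printed JSON has the same shape as the body passed to `_cluster/reroute` in the slide.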
shrink
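curl -XPUT localhost:9200/_cluster/settings -d '{
  "persistent": {
    "indices.recovery.concurrent_streams": 3
  }
}'
curl -XPUT localhost:9200/_cluster/settings -d '{
  "persistent": {
    "indices.recovery.max_bytes_per_sec": "1000mb"
  }
}'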
Why Elasticsearch?
• proven scalable search engine
• super flexible config with nice default conf
• Great API
• growing developer and user base
not covered
• mapping
• query DSL
• search performance
• cluster operation
• healthcheck / cluster statistics
• etc...
questions?
Indexing 2.5 billion comments with Elasticsearch
Let's Make Niconico Douga Searchable

  1. Let's Make the Niconico Douga Dataset Searchable @PENGUINANA_
  2. whoami • @PENGUINANA_ / 兼山元太 • Engineer at *.cookpad.com/* • Search infrastructure and service development
  3. JSON all around us • tweet • 140 character message
  4. JSON all around us • tweet • 140 character message • user_name • datetime • location • reply or not / contains link or not / retweet count / reply count ...
  5. JSON all around us • access log • ip address • requested content • status code • response time • referrer
  6. JSON all around us • event log • user_id • event name • params(hash) • datetime • user agent
  7. JSON all around us • dictionary edit request • keyword • operation type • requester • status(applied or not)
  8. kibana • http://demo.kibana.org/ • http://www.elasticsearch.org/blog/kibana-whats-cooking/
  9. kibana@cookpad • log dashboard for internal API • explore log • capacity planning • performance check • slowquery
  10. dashboard for each application
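  11. Theme • If we could flexibly search and analyze JSON data regardless of its size, everyday work would get easier • How can we do that? Is it hard?
  12. Just try it • Niconico Douga dataset • Make it searchable and analyzable
  13. Dataset • Official Niconico Douga dataset • Metadata for 8 million videos • 2.5 billion comments • JSON format (compressed: 60 GB, uncompressed: 300 GB) http://goo.gl/FYtO5T
  14. Dataset • Official Niconico Douga dataset • Metadata for 8 million videos • 2.5 billion comments • JSON format (compressed: 60 GB, uncompressed: 300 GB) http://goo.gl/FYtO5T
  15. http://goo.gl/FYtO5T
  16. http://goo.gl/FYtO5T
  17. Results • Done in 4 hours with Elasticsearch on AWS • s3 -> unzip -> Elasticsearch (173k docs/s) • 550 yen
  18. Demo • date facet over 2.5 billion comments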
  19. install • wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.3.noarch.rpm • sudo rpm -i elasticsearch-0.90.3.noarch.rpm
  20. install plugins • sudo bin/plugin • .. -install elasticsearch/elasticsearch-cloud-aws • .. -install mobz/elasticsearch-head • .. -install lukas-vlcek/bigdesk • .. -install elasticsearch/elasticsearch-analysis-kuromoji
  21. elasticsearch-cloud-aws • cluster node discovery in AWS • add config to elasticsearch.yml cloud: aws: access_key: AKI........... secret_key: mR............. discovery: type: ec2 discovery.ec2.groups: es_test (security_group)
  22. elasticsearch-head
  23. bigdesk
  24. elasticsearch-analysis-kuromoji • Japanese analyzer
  25. config • # Set a custom allowed content length: • http.max_content_length: 1000m • # Heap Size (defaults to 256m min, 1g max) • ES_HEAP_SIZE=3g • # ElasticSearch data directory • DATA_DIR=/media/ephemeral1/es,/media/ephemeral2/es,/media/ephemeral3/es
  26. make AMI • elasticsearch machine image
  27. launch ES Instances • c1.xlarge x 20 • CPU Xeon 8core(2,300MHz) • Memory 7G • Disk 420G x4 • $0.07/hour(spot instance)
  28. deploy data • download from s3 to nodes • use s3cmd (few minutes with GNU Parallel) • unzip (60GB -> 300GB)
  29. bulk import { "index" : { "_id" : "sm14784868 1", "parent": "sm14784868" } } {"date":"2011-06-18T20:15:30+09:00","no":1,"vpos":63,"comment":"1","command":"184"} ... { "index" : { "_id" : "sm14784868 2", "parent": "sm14784868" } } {"date":"2011-07-24T02:22:58+09:00","no":2,"vpos":4651,"comment":"2 get","command":"184"}
  30. bulk import • ls request_file* | parallel -j N curl -X POST -s -D - 'http://localhost:9200/nico2/comment/_bulk' -o /dev/null --data-binary @{}
  31. wc -l requests > 4.8billion
  32. import... import... import... • every node can handle indexing requests • curl bulk import on each node (x20) • I/O into 3 disks • takes 4 hours
  33. efficiency
  34. efficiency "mappings": { "video": { "properties": { "video_id": { "type": "string", "index": "no" }, "title": { "type": "string", "index": "analyzed" }, "description": { "type": "string", "index": "analyzed" }, "thumbnail_url": { "type": "string", "index": "no", "store": "yes" }, "upload_time": { "type": "date", "format": "YYYY-MM-dd'T'HH:mm:ss'+09:00'" }, "movie_type": { "type": "string", "index": "not_analyzed" }, "last_res_body": { "type": "string", "index": "analyzed" }, "tags": { "properties": { "tag": { "type": "string", "index": "not_analyzed" } } } } }
  35. efficiency "mappings": { "comment": { "_parent": { "type": "video" }, "properties": { "date": { "type": "date", "format": "YYYY-MM-dd'T'HH:mm:ss'+09:00'" }, "no": { "type": "integer" }, "vpos": { "type": "integer" }, "comment": { "type": "string" }, "command": { "type": "string" }, "video_id": { "type": "string", "index": "not_analyzed" } } }
  36. efficiency • curl -X POST 'http://localhost:9200/nico2' -d @mapping.json
  37. shrink curl -XPOST 'localhost:9200/_cluster/reroute' -d '{ "commands" : [ { "move" : { "index" : "nico2", "shard" : 33, "from_node" : "nodeA", "to_node" : "nodeB" } } ]}'
  38. shrink curl -XPUT localhost:9200/_cluster/settings -d '{ "persistent": { "indices.recovery.concurrent_streams": 3 }}' curl -XPUT localhost:9200/_cluster/settings -d '{ "persistent": { "indices.recovery.max_bytes_per_sec": "1000mb" }}'
  39. Why Elasticsearch? • proven scalable search engine • super flexible config with nice default conf • Great API • growing developer and user base
  40. not covered • mapping • query DSL • search performance • cluster operation • healthcheck / cluster statistics • etc...
  41. questions?