Let's Make Nico Nico Douga Searchable

  • 24,735 views

Indexing 2.5 billion documents with Elasticsearch

Transcript

  • 1. Let's make the Nico Nico Douga dataset searchable @PENGUINANA_
  • 2. whoami • @PENGUINANA_ / 兼山元太 • engineer at *.cookpad.com/* • search infrastructure and service development
  • 3. JSON around us • tweet • a 140-character message
  • 4. JSON around us • tweet • 140-character message • user_name • datetime • location • reply or not / contains link or not / retweet count / reply count ...
  • 5. JSON around us • access log • IP address • requested content • status code • response time • referrer
  • 6. JSON around us • event log • user_id • event name • params (hash) • datetime • user agent
  • 7. JSON around us • dictionary edit request • keyword • operation type • requester • status (applied or not)
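The four slides above share one idea: everyday data is small, flat-ish JSON documents, and that is exactly the unit Elasticsearch indexes. A minimal sketch of two such documents (field names follow the bullets above; all values are made up for illustration):

```python
import json

# Illustrative documents shaped like the slides' examples
# (field names from the bullets above; values are invented).
tweet = {
    "user_name": "penguinana",
    "datetime": "2013-09-01T12:00:00+09:00",
    "location": "Tokyo",
    "text": "a 140 character message",
    "retweet_count": 3,
}
access_log = {
    "ip": "192.0.2.1",
    "path": "/recipes/123",
    "status": 200,
    "response_time_ms": 42,
    "referrer": "https://example.com/",
}

# Each document serializes to a single JSON line -- the unit
# that gets indexed and later searched or faceted.
for doc in (tweet, access_log):
    print(json.dumps(doc, sort_keys=True))
```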
  • 8. kibana • http://demo.kibana.org/ • http://www.elasticsearch.org/blog/kibana- whats-cooking/
  • 9. kibana@cookpad • log dashboard for internal APIs • explore logs • capacity planning • performance checks • slow queries
  • 10. dashboard for each application
  • 11. Theme • If we can flexibly search and analyze JSON data no matter how large it gets, everyday work gets easier • How do we do that? Is it hard?
  • 12. Just try it • take the Nico Nico Douga dataset • make it searchable and analyzable
  • 13. Dataset • the official Nico Nico Douga dataset • metadata for 8 million videos • 2.5 billion comments • JSON format (compressed: 60 GB, uncompressed: 300 GB) http://goo.gl/FYtO5T
  • 14. Dataset • the official Nico Nico Douga dataset • metadata for 8 million videos • 2.5 billion comments • JSON format (compressed: 60 GB, uncompressed: 300 GB) http://goo.gl/FYtO5T
  • 15. http://goo.gl/FYtO5T
  • 16. http://goo.gl/FYtO5T
  • 17. Result • done in 4 hours with Elasticsearch on AWS • s3 -> unzip -> Elasticsearch (173k docs/s) • ¥550 total
  • 18. Demo • a date facet over 2.5 billion comments
  • 19. install • wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.3.noarch.rpm • sudo rpm -i elasticsearch-0.90.3.noarch.rpm
  • 20. install plugins • sudo bin/plugin • .. -install elasticsearch/elasticsearch-cloud-aws • .. -install mobz/elasticsearch-head • .. -install lukas-vlcek/bigdesk • .. -install elasticsearch/elasticsearch-analysis-kuromoji
  • 21. elasticsearch-cloud-aws • cluster node discovery in AWS • add config to elasticsearch.yml cloud: aws: access_key:AKI........... secret_key: mR............. discovery: type: ec2 discovery.ec2.groups: es_test (security_group)
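Flattened onto one line above, the elasticsearch.yml fragment for EC2 discovery reads like this (same keys as the slide; the credentials are elided there and stay elided here):

```yaml
# elasticsearch.yml -- EC2 node discovery via the cloud-aws plugin
cloud:
  aws:
    access_key: AKI...........   # elided on the slide
    secret_key: mR.............  # elided on the slide
discovery:
  type: ec2
discovery.ec2.groups: es_test    # EC2 security group shared by the cluster nodes
```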
  • 22. elasticsearch-head
  • 23. bigdesk
  • 24. elasticsearch-analysis-kuromoji • Japanese analyzer
  • 25. config • # Set a custom allowed content length: • http.max_content_length: 1000m • # Heap Size (defaults to 256m min, 1g max) • ES_HEAP_SIZE=3g • # Elasticsearch data directory • DATA_DIR=/media/ephemeral1/es,/media/ephemeral2/es,/media/ephemeral3/es
  • 26. make AMI • elasticsearch machine image
  • 27. launch ES instances • c1.xlarge x 20 • CPU: Xeon, 8 cores (2,300 MHz) • Memory: 7 GB • Disk: 420 GB x 4 • $0.07/hour (spot instance)
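The ¥550 total on slide 17 is consistent with this fleet. A rough check (the ~¥98/USD exchange rate is an assumption for mid-2013; only the node count, spot price, and 4-hour runtime come from the slides):

```python
nodes = 20                # c1.xlarge spot instances (slide 27)
cents_per_hour = 7        # $0.07/hour spot price, kept in cents to avoid float noise
hours = 4                 # total indexing time (slide 17)

usd = nodes * cents_per_hour * hours / 100
print(usd)                # 5.6 USD for the whole run

jpy_per_usd = 98          # assumed mid-2013 exchange rate
print(round(usd * jpy_per_usd))  # ~549 yen, matching the ~550 yen on slide 17
```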
  • 28. deploy data • download from s3 to each node • use s3cmd (a few minutes with GNU Parallel) • unzip (60 GB -> 300 GB)
  • 29. bulk import { "index" : { "_id" : "sm14784868 1", "parent": "sm14784868" } } {"date":"2011-06-18T20:15:30+09:00","no":1,"vpos":63,"comment":"1","command":"184"} ... { "index" : { "_id" : "sm14784868 2", "parent": "sm14784868" } } {"date":"2011-07-24T02:22:58+09:00","no":2,"vpos":4651,"comment":"2 get","command":"184"}
  • 30. bulk import • ls request_file* | parallel -j N curl -X POST -s -D - 'http://localhost:9200/nico2/comment/_bulk' -o /dev/null --data-binary @{}
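A sketch of how the bulk bodies on slide 29 can be generated (a hypothetical helper, not the author's actual script): each comment becomes an action line carrying `_id` and `parent`, followed by the document itself on the next line.

```python
import json

def bulk_lines(video_id, comments):
    """Yield Elasticsearch bulk-API lines: one action line per document,
    then the document itself, matching the format on slide 29."""
    for comment in comments:
        action = {"index": {"_id": f"{video_id} {comment['no']}",
                            "parent": video_id}}
        yield json.dumps(action)
        yield json.dumps(comment)

comments = [
    {"date": "2011-06-18T20:15:30+09:00", "no": 1, "vpos": 63,
     "comment": "1", "command": "184"},
]
# The bulk endpoint expects newline-delimited JSON with a trailing newline.
body = "\n".join(bulk_lines("sm14784868", comments)) + "\n"
print(body)
```

Files full of such pairs are what the `parallel`-driven curl loop on slide 30 POSTs to `/nico2/comment/_bulk`.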
  • 31. wc -l requests > 4.8 billion lines
  • 32. import... import... import... • every node can handle indexing requests • curl bulk import on each node (x20) • I/O spread across 3 disks • takes 4 hours
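Those figures are self-consistent: the 173k docs/s throughput from slide 17 sustained over 4 hours comes out to roughly the 2.5 billion comments in the dataset.

```python
docs_per_second = 173_000        # aggregate indexing rate (slide 17)
seconds = 4 * 3600               # 4 hours of importing (slide 32)

total = docs_per_second * seconds
print(total)                     # 2491200000, i.e. ~2.5 billion documents
```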
  • 33. efficiency
  • 34. efficiency "mappings": { "video": { "properties": { "video_id": { "type": "string", "index": "no" }, "title": { "type": "string", "index": "analyzed" }, "description": { "type": "string", "index": "analyzed" }, "thumbnail_url": { "type": "string", "index": "no", "store": "yes" }, "upload_time": { "type": "date", "format": "YYYY-MM-dd'T'HH:mm:ss'+09:00'" }, "movie_type": { "type": "string", "index": "not_analyzed" }, "last_res_body": { "type": "string", "index": "analyzed" }, "tags": { "properties": { "tag": { "type": "string", "index": "not_analyzed" } } } } } }
  • 35. efficiency "mappings": { "comment": { "_parent": { "type": "video" }, "properties": { "date": { "type": "date", "format": "YYYY-MM-dd'T'HH:mm:ss'+09:00'" }, "no": { "type": "integer" }, "vpos": { "type": "integer" }, "comment": { "type": "string" }, "command": { "type": "string" }, "video_id": { "type": "string", "index": "not_analyzed" } } } }
  • 36. efficiency • curl -X POST 'http://localhost:9200/nico2' -d @mapping.json
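The efficiency tricks in slides 34-35 boil down to: disable indexing for display-only fields ("index": "no"), skip analysis for exact-match fields ("not_analyzed"), and tie each comment to its video with `_parent`. A sketch building the comment mapping as a plain dict (same keys and values as slide 35) before POSTing it as on slide 36:

```python
import json

comment_mapping = {
    "mappings": {
        "comment": {
            # Parent/child link: comments are routed to their video's shard.
            "_parent": {"type": "video"},
            "properties": {
                "date": {"type": "date",
                         "format": "YYYY-MM-dd'T'HH:mm:ss'+09:00'"},
                "no": {"type": "integer"},
                "vpos": {"type": "integer"},
                "comment": {"type": "string"},  # analyzed full text
                "command": {"type": "string"},
                # Exact-match key: keep it out of the analyzer.
                "video_id": {"type": "string", "index": "not_analyzed"},
            },
        }
    }
}

# Serialized, this is the mapping.json body for slide 36's
# curl -X POST 'http://localhost:9200/nico2' -d @mapping.json
print(json.dumps(comment_mapping, indent=2))
```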
  • 37. shrink curl -XPOST 'localhost:9200/_cluster/reroute' -d '{ "commands": [ { "move": { "index": "nico2", "shard": 33, "from_node": "nodeA", "to_node": "nodeB" } } ] }'
  • 38. shrink curl -XPUT localhost:9200/_cluster/settings -d '{ "persistent": { "indices.recovery.concurrent_streams": 3 }}' curl -XPUT localhost:9200/_cluster/settings -d '{ "persistent": { "indices.recovery.max_bytes_per_sec": "1000mb" }}'
  • 39. Why Elasticsearch? • a proven, scalable search engine • super flexible config with sensible defaults • great API • growing developer and user base
  • 40. not covered • mapping • query DSL • search performance • cluster operation • healthcheck / cluster statistics • etc...
  • 41. questions?