ニコニコ動画を検索可能にしてみよう (Making Nico Nico Douga Searchable)
Indexing 2.5 billion documents with Elasticsearch

Presentation Transcript

    • Making the Nico Nico Douga dataset searchable @PENGUINANA_
    • whoami • @PENGUINANA_ / 兼山元太 • engineer at *.cookpad.com/* • search infrastructure and service development
    • JSON around us • tweet • 140-character message • user_name • datetime • location • reply or not / contains link or not / retweet count / reply count ...
    • JSON around us • access log • IP address • requested content • status code • response time • referrer
    • JSON around us • event log • user_id • event name • params (hash) • datetime • user agent
    • JSON around us • dictionary edit request • keyword • operation type • requester • status (applied or not)
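All of the documents above are small, mostly flat JSON records. As a concrete illustration, a hypothetical event-log entry with the field names from the slide (the values are invented) could look like:

```python
import json

# Hypothetical event-log document; field names follow the slide,
# concrete values are made up for illustration only.
event = {
    "user_id": 12345,
    "event_name": "recipe_view",
    "params": {"recipe_id": 678, "referrer": "search"},
    "datetime": "2013-09-01T12:34:56+09:00",
    "user_agent": "Mozilla/5.0",
}

# One JSON document per line is exactly the shape Elasticsearch ingests.
line = json.dumps(event, sort_keys=True)
print(line)
```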
    • kibana • http://demo.kibana.org/ • http://www.elasticsearch.org/blog/kibana-whats-cooking/
    • kibana@cookpad • log dashboard for internal APIs • explore logs • capacity planning • performance checks • slow queries
    • dashboard for each application
    • Theme • being able to flexibly search and analyze JSON data, no matter how large it gets, makes day-to-day work easier • how do we get there? is it hard?
    • Just try it • take the Nico Nico Douga dataset • make it searchable/analyzable
    • Dataset • the official Nico Nico Douga dataset • metadata for 8 million videos • 2.5 billion comments • JSON format (compressed: 60 GB, uncompressed: 300 GB) http://goo.gl/FYtO5T
    • http://goo.gl/FYtO5T
    • Results • done in 4 hours with Elasticsearch on AWS • s3 -> unzip -> Elasticsearch (173k docs/s) • 550 yen
    • Demo • a date facet over 2.5 billion comments
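In the 0.90-era facets API, a date facet like this is a `date_histogram` facet. A minimal sketch of the request body (the index/type/field names `nico2`, `comment`, and `date` come from later slides; the facet name and interval are assumptions):

```python
import json

# date_histogram facet over the comments' "date" field, in the
# Elasticsearch 0.90 facets API. "size": 0 suppresses the hit list
# so only the facet counts come back.
body = {
    "query": {"match_all": {}},
    "size": 0,
    "facets": {
        "comments_over_time": {
            "date_histogram": {"field": "date", "interval": "month"}
        }
    },
}
print(json.dumps(body))
# POST this to http://localhost:9200/nico2/comment/_search
```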
    • install
      wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.3.noarch.rpm
      sudo rpm -i elasticsearch-0.90.3.noarch.rpm
    • install plugins • sudo bin/plugin • .. -install elasticsearch/elasticsearch-cloud-aws • .. -install mobz/elasticsearch-head • .. -install lukas-vlcek/bigdesk • .. -install elasticsearch/elasticsearch-analysis-kuromoji
    • elasticsearch-cloud-aws • cluster node discovery on AWS • add config to elasticsearch.yml
      cloud:
        aws:
          access_key: AKI...........
          secret_key: mR.............
      discovery:
        type: ec2
      discovery.ec2.groups: es_test (security_group)
    • elasticsearch-head
    • bigdesk
    • elasticsearch-analysis-kuromoji • Japanese analyzer
    • config
      # Set a custom allowed content length:
      http.max_content_length: 1000m
      # Heap Size (defaults to 256m min, 1g max)
      ES_HEAP_SIZE=3g
      # ElasticSearch data directory
      DATA_DIR=/media/ephemeral1/es,/media/ephemeral2/es,/media/ephemeral3/es
    • make AMI • elasticsearch machine image
    • launch ES instances • c1.xlarge x 20 • CPU: Xeon 8 cores (2,300 MHz) • Memory: 7 GB • Disk: 420 GB x 4 • $0.07/hour (spot instance)
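The ~550 yen cost on the results slide is consistent with back-of-the-envelope arithmetic over these instance numbers; the dollar-to-yen rate below is an assumption (roughly the 2013 rate), not from the slides:

```python
# 20 c1.xlarge spot instances at $0.07/hour, running for 4 hours.
nodes = 20
price_per_hour = 0.07   # USD, spot price from the slide
hours = 4

usd = nodes * price_per_hour * hours  # total spot cost in USD
jpy = usd * 98                        # assumed ~2013 exchange rate (¥98/$)
print(f"${usd:.2f} ~= {jpy:.0f} yen")
```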
    • deploy data • download from s3 to the nodes • use s3cmd (a few minutes with GNU Parallel) • unzip (60 GB -> 300 GB)
    • bulk import
      { "index" : { "_id" : "sm14784868 1", "parent": "sm14784868" } }
      {"date":"2011-06-18T20:15:30+09:00","no":1,"vpos":63,"comment":"1","command":"184"}
      ...
      { "index" : { "_id" : "sm14784868 2", "parent": "sm14784868" } }
      {"date":"2011-07-24T02:22:58+09:00","no":2,"vpos":4651,"comment":"2 get","command":"184"}
    • bulk import
      ls request_file* | parallel -j N curl -X POST -s -D - 'http://localhost:9200/nico2/comment/_bulk' -o /dev/null --data-binary @{}
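The bulk bodies above interleave one action line and one source line per document. A sketch of generating that NDJSON for comment documents (the `_id` "video_id + comment number" and `parent` routing scheme follow the slide; the comment data is from the example above):

```python
import json

# Build an Elasticsearch _bulk body: an action line ({"index": ...})
# followed by the document source line, for every comment.
comments = [
    {"date": "2011-06-18T20:15:30+09:00", "no": 1, "vpos": 63,
     "comment": "1", "command": "184"},
    {"date": "2011-07-24T02:22:58+09:00", "no": 2, "vpos": 4651,
     "comment": "2 get", "command": "184"},
]

video_id = "sm14784868"
lines = []
for doc in comments:
    action = {"index": {"_id": f"{video_id} {doc['no']}", "parent": video_id}}
    lines.append(json.dumps(action))
    lines.append(json.dumps(doc, ensure_ascii=False))

bulk_body = "\n".join(lines) + "\n"  # the bulk API requires a trailing newline
print(bulk_body)
```

Writing many such files and POSTing them with `parallel`, as in the slide, keeps all 20 nodes busy indexing.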
    • wc -l requests > 4.8 billion
    • import... import... import... • every node can handle indexing requests • curl bulk import on each node (x20) • I/O spread across 3 disks • takes 4 hours
    • efficiency
    • efficiency
      "mappings": {
        "video": {
          "properties": {
            "video_id": { "type": "string", "index": "no" },
            "title": { "type": "string", "index": "analyzed" },
            "description": { "type": "string", "index": "analyzed" },
            "thumbnail_url": { "type": "string", "index": "no", "store": "yes" },
            "upload_time": { "type": "date", "format": "YYYY-MM-dd'T'HH:mm:ss'+09:00'" },
            "movie_type": { "type": "string", "index": "not_analyzed" },
            "last_res_body": { "type": "string", "index": "analyzed" },
            "tags": {
              "properties": {
                "tag": { "type": "string", "index": "not_analyzed" }
              }
            }
          }
        }
      }
    • efficiency
      "mappings": {
        "comment": {
          "_parent": { "type": "video" },
          "properties": {
            "date": { "type": "date", "format": "YYYY-MM-dd'T'HH:mm:ss'+09:00'" },
            "no": { "type": "integer" },
            "vpos": { "type": "integer" },
            "comment": { "type": "string" },
            "command": { "type": "string" },
            "video_id": { "type": "string", "index": "not_analyzed" }
          }
        }
      }
    • efficiency • curl -X POST 'http://localhost:9200/nico2' -d @mapping.json
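A sketch of what a trimmed `mapping.json` payload for that call could contain, combining both mappings from the slides in one index-creation request (the shard/replica counts are assumptions, not stated on the slides):

```python
import json

# Trimmed index-creation payload: settings plus both type mappings.
# NOTE: number_of_shards/replicas are assumed values for illustration;
# the slides do not state them.
payload = {
    "settings": {"number_of_shards": 40, "number_of_replicas": 0},
    "mappings": {
        "video": {
            "properties": {
                "title": {"type": "string", "index": "analyzed"},
            }
        },
        "comment": {
            # _parent routes each comment to its parent video's shard
            "_parent": {"type": "video"},
            "properties": {
                "date": {"type": "date",
                         "format": "YYYY-MM-dd'T'HH:mm:ss'+09:00'"},
                "comment": {"type": "string"},
            },
        },
    },
}
print(json.dumps(payload))
# save as mapping.json, then:
# curl -X POST 'http://localhost:9200/nico2' -d @mapping.json
```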
    • shrink
      curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
        "commands" : [
          { "move" : { "index" : "nico2", "shard" : 33,
                       "from_node" : "nodeA", "to_node" : "nodeB" } }
        ]
      }'
    • shrink
      curl -XPUT localhost:9200/_cluster/settings -d '{
        "persistent": { "indices.recovery.concurrent_streams": 3 } }'
      curl -XPUT localhost:9200/_cluster/settings -d '{
        "persistent": { "indices.recovery.max_bytes_per_sec": "1000mb" } }'
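Shrinking the cluster means issuing one `move` command per shard on each node being retired. A sketch that generates such a `_cluster/reroute` body for a list of shards (node names and shard numbers here are hypothetical):

```python
import json

# Build a _cluster/reroute body that moves the given shards of an
# index off a node being retired. Node/shard values are hypothetical.
def reroute_moves(index, shards, from_node, to_node):
    return {
        "commands": [
            {"move": {"index": index, "shard": s,
                      "from_node": from_node, "to_node": to_node}}
            for s in shards
        ]
    }

body = reroute_moves("nico2", [33, 34], "nodeA", "nodeB")
print(json.dumps(body))
# curl -XPOST 'localhost:9200/_cluster/reroute' -d "$BODY"
```

Raising the recovery settings as on the slide lets the relocations use more concurrent streams and bandwidth, so the drain finishes faster.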
    • Why Elasticsearch? • proven, scalable search engine • super flexible config with sensible defaults • great API • growing developer and user base
    • not covered • mapping • query DSL • search performance • cluster operation • healthcheck / cluster statistics • etc...
    • questions?