Making Niconico Douga Searchable

Let's Make the Niconico Douga Dataset Searchable
@PENGUINANA_

whoami
• @PENGUINANA_ / 兼山元太
• Engineer at *.cookpad.com/*
• Search infrastructure and service development

JSON all around us
• tweet
• 140 character message

JSON all around us
• tweet
• 140-character message
• user_name
• datetime
• location
• reply or not / contains link or not / retweeted count / reply count ...

JSON all around us
• access log
• ip address
• requested content
• status code
• response time
• referrer

JSON all around us
• event log
• user_id
• event name
• params(hash)
• datetime
• user agent

JSON all around us
• dictionary edit request
• keyword
• operation type
• requester
• status(applied or not)

kibana
• http://demo.kibana.org/
• http://www.elasticsearch.org/blog/kibana-whats-cooking/

kibana@cookpad
• log dashboard for internal API
• explore log
• capacity planning
• performance check
• slowquery

dashboard for each application

Theme
• If we could search and analyze JSON data flexibly, no matter how large it is, everyday work would get easier
• How can we do that? Is it hard?

Just try it
• Niconico Douga dataset
• Make it searchable and analyzable

Dataset
• Official Niconico Douga dataset
• Metadata for 8 million videos
• 2.5 billion comments
• JSON format (compressed: 60 GB, uncompressed: 300 GB)
http://goo.gl/FYtO5T

Result
• Done in 4 hours with Elasticsearch on AWS
• s3 -> unzip -> Elasticsearch (173k doc/s)
• Cost: 550 yen
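As a rough sanity check on the 173k doc/s figure, the sketch below divides the comment count by the wall-clock time. It assumes the full 4 hours were spent indexing the 2.5 billion comments and that the load was spread evenly over the 20 nodes described later in the deck; the slides do not state either point explicitly.

```python
# Back-of-the-envelope check of the indexing rate quoted on the slide.
# Assumption: all 4 hours were spent indexing the 2.5 billion comments.
TOTAL_COMMENTS = 2_500_000_000
ELAPSED_HOURS = 4
NODES = 20  # cluster size from the "launch ES Instances" slide


def docs_per_second(total_docs: int, hours: float) -> float:
    """Average indexing throughput over the whole run."""
    return total_docs / (hours * 3600)


rate = docs_per_second(TOTAL_COMMENTS, ELAPSED_HOURS)
per_node = rate / NODES  # assumes an even spread across nodes

print(f"cluster: {rate / 1000:.1f}k docs/s")
print(f"per node: {per_node / 1000:.1f}k docs/s")
```

Under these assumptions the cluster-wide average comes out at roughly 174k docs/s, in line with the figure on the slide.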

Demo
• Date facet over 2.5 billion comments
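The request behind such a demo is not shown in the slides. A minimal sketch of what it might look like, assuming the `date_histogram` facet of the Elasticsearch 0.90 facets API, the `date` field of the comment documents, and a monthly interval; the facet name `comments_over_time` is made up for illustration.

```python
import json


def date_facet_query(field: str = "date", interval: str = "month") -> dict:
    """Build a search body that counts documents per time bucket."""
    return {
        "size": 0,  # only the facet counts are needed, not the hits
        "query": {"match_all": {}},
        "facets": {
            "comments_over_time": {  # hypothetical facet name
                "date_histogram": {"field": field, "interval": interval}
            }
        },
    }


body = json.dumps(date_facet_query())
print(body)
# The body could then be posted to the search endpoint, for example:
#   curl -XPOST 'http://localhost:9200/nico2/comment/_search' -d '<body>'
```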

install
• wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.3.noarch.rpm
• sudo rpm -i elasticsearch-0.90.3.noarch.rpm

install plugins
• sudo bin/plugin -install elasticsearch/elasticsearch-cloud-aws
• sudo bin/plugin -install mobz/elasticsearch-head
• sudo bin/plugin -install lukas-vlcek/bigdesk
• sudo bin/plugin -install elasticsearch/elasticsearch-analysis-kuromoji

elasticsearch-cloud-aws
• cluster node discovery in AWS
• add config to elasticsearch.yml
cloud:
  aws:
    access_key: AKI...........
    secret_key: mR.............
discovery:
  type: ec2
discovery.ec2.groups: es_test  # security group

elasticsearch-analysis-kuromoji
• Japanese analyzer

config
• # Set a custom allowed content length:
• http.max_content_length: 1000m
• # Heap Size (defaults to 256m min, 1g max)
• ES_HEAP_SIZE=3g
• # ElasticSearch data directory
• DATA_DIR=/media/ephemeral1/es,/media/ephemeral2/es,/media/ephemeral3/es

make AMI
• elasticsearch machine image

launch ES Instances
• c1.xlarge x 20
• CPU: Xeon, 8 cores (2,300 MHz)
• Memory: 7 GB
• Disk: 420 GB x 4
• $0.07/hour (spot instance)
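The 550 yen total reported earlier is consistent with these spot prices. A rough check, assuming all 20 instances were billed for about 4 hours and that nothing else (storage, data transfer) was charged; the exchange rate below is derived from the deck's own numbers, not looked up.

```python
INSTANCES = 20
SPOT_PRICE_USD_PER_HOUR = 0.07
HOURS = 4                 # assumption: billed time equals the 4-hour import
REPORTED_COST_YEN = 550   # total quoted on the "Result" slide

cost_usd = INSTANCES * SPOT_PRICE_USD_PER_HOUR * HOURS
implied_rate = REPORTED_COST_YEN / cost_usd

print(f"compute cost: ${cost_usd:.2f}")
print(f"implied exchange rate: {implied_rate:.0f} yen per USD")
```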

deploy data
• download from s3 to nodes
• use s3cmd (a few minutes with GNU Parallel)
• unzip (60GB -> 300GB)

bulk import
{ "index" : { "_id" : "sm14784868 1", "parent": "sm14784868" } }
{"date":"2011-06-18T20:15:30+09:00","no":1,"vpos":63,"comment":"1","command":"184"}
...
{ "index" : { "_id" : "sm14784868 2", "parent": "sm14784868" } }
{"date":"2011-07-24T02:22:58+09:00","no":2,"vpos":4651,"comment":"2 get","command":"184"}
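The slides do not show how these request bodies were generated. One way to produce them is sketched below, assuming the id scheme seen in the sample (video id, a space, then the comment number) and that each comment is already available as a dict; the function name `bulk_lines` is hypothetical.

```python
import json


def bulk_lines(video_id: str, comments: list) -> str:
    """Render comments as newline-delimited bulk index actions.

    Each comment becomes two lines: an action line carrying the document id
    and its parent video, followed by the comment document itself.
    """
    lines = []
    for c in comments:
        # Id scheme inferred from the sample on the slide: "<video_id> <no>".
        action = {"index": {"_id": f"{video_id} {c['no']}", "parent": video_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(c, ensure_ascii=False))
    # The bulk API expects every line, including the last, to end with a newline.
    return "\n".join(lines) + "\n"


comments = [
    {"date": "2011-06-18T20:15:30+09:00", "no": 1, "vpos": 63,
     "comment": "1", "command": "184"},
    {"date": "2011-07-24T02:22:58+09:00", "no": 2, "vpos": 4651,
     "comment": "2 get", "command": "184"},
]
payload = bulk_lines("sm14784868", comments)
print(payload, end="")
```

Setting `ensure_ascii=False` keeps Japanese comment text readable in the generated file instead of escaping it.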

bulk import
• ls request_file* | parallel -j N curl -X POST -s -D - 'http://localhost:9200/nico2/comment/_bulk' -o /dev/null --data-binary @{}
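The command above iterates over pre-split request files, but the deck does not say how they were cut. A minimal sketch of one approach, assuming each document occupies exactly two lines (action plus source) and that every file must stay under the configured `http.max_content_length`; the chunk size and the `request_file_NNNN` naming are illustrative choices.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def chunk_bulk_lines(lines: Iterable[str], docs_per_file: int) -> Iterator[List[str]]:
    """Yield groups of bulk lines holding at most `docs_per_file` documents.

    Chunks are cut on even line counts so that an action line and its
    source line always land in the same request file.
    """
    it = iter(lines)
    while True:
        chunk = list(islice(it, docs_per_file * 2))
        if not chunk:
            return
        yield chunk


# Example: 5 tiny documents split into files of 2 documents each.
sample = []
for n in range(1, 6):
    sample.append('{"index": {"_id": "sm0 %d", "parent": "sm0"}}' % n)
    sample.append('{"no": %d}' % n)

chunks = list(chunk_bulk_lines(sample, docs_per_file=2))
for i, chunk in enumerate(chunks):
    # A real run would write each chunk to its own file instead of printing.
    print(f"request_file_{i:04d}: {len(chunk) // 2} docs")
```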

import... import... import...
• every node can handle indexing requests
• curl bulk import on each node (x20)
• I/O spread across 3 disks
• takes 4 hours

efficiency
"mappings": {
  "video": {
    "properties": {
      "video_id": { "type": "string", "index": "no" },
      "title": { "type": "string", "index": "analyzed" },
      "description": { "type": "string", "index": "analyzed" },
      "thumbnail_url": { "type": "string", "index": "no", "store": "yes" },
      "upload_time": { "type": "date", "format": "YYYY-MM-dd'T'HH:mm:ss'+09:00'" },
      "movie_type": { "type": "string", "index": "not_analyzed" },
      "last_res_body": { "type": "string", "index": "analyzed" },
      "tags": {
        "properties": {
          "tag": { "type": "string", "index": "not_analyzed" }
        }
      }
    }
  }

efficiency
"mappings": {
  "comment": {
    "_parent": { "type": "video" },
    "properties": {
      "date": { "type": "date", "format": "YYYY-MM-dd'T'HH:mm:ss'+09:00'" },
      "no": { "type": "integer" },
      "vpos": { "type": "integer" },
      "comment": { "type": "string" },
      "command": { "type": "string" },
      "video_id": { "type": "string", "index": "not_analyzed" }
    }
  }

efficiency
• curl -X POST 'http://localhost:9200/nico2' -d @mapping.json
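The two mapping excerpts appear to be parts of a single `mapping.json`. A sketch of how they might be combined into the body that this request posts, assuming both types sit under one top-level `mappings` key; the slides do not show the outer braces, so that structure is an assumption.

```python
import json

DATE_FORMAT = "YYYY-MM-dd'T'HH:mm:ss'+09:00'"

video = {
    "properties": {
        "video_id": {"type": "string", "index": "no"},
        "title": {"type": "string", "index": "analyzed"},
        "description": {"type": "string", "index": "analyzed"},
        "thumbnail_url": {"type": "string", "index": "no", "store": "yes"},
        "upload_time": {"type": "date", "format": DATE_FORMAT},
        "movie_type": {"type": "string", "index": "not_analyzed"},
        "last_res_body": {"type": "string", "index": "analyzed"},
        "tags": {
            "properties": {"tag": {"type": "string", "index": "not_analyzed"}}
        },
    }
}

comment = {
    "_parent": {"type": "video"},  # each comment is a child of a video
    "properties": {
        "date": {"type": "date", "format": DATE_FORMAT},
        "no": {"type": "integer"},
        "vpos": {"type": "integer"},
        "comment": {"type": "string"},
        "command": {"type": "string"},
        "video_id": {"type": "string", "index": "not_analyzed"},
    },
}

# Assumed layout: both types under a single "mappings" object.
mapping = {"mappings": {"video": video, "comment": comment}}
mapping_json = json.dumps(mapping, indent=2)
print(mapping_json)
```

Building the file from code also guarantees balanced braces, which is easy to get wrong when editing a large mapping by hand.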

shrink
curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
  "commands": [
    {
      "move": {
        "index": "nico2", "shard": 33,
        "from_node": "nodeA", "to_node": "nodeB"
      }
    }
  ]
}'
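Shrinking the cluster means repeating this kind of move for every shard on the nodes being retired. A sketch of generating the reroute body in bulk, assuming the current shard-to-node allocation is already known and that targets are picked round-robin; the allocation and node names below are made up, and the sketch does not check whether a target node already holds a copy of the same shard.

```python
import json
from itertools import cycle


def drain_commands(index, shard_locations, nodes_to_remove, nodes_to_keep):
    """Build `move` commands that relocate shards off the nodes being removed.

    shard_locations maps shard number -> node currently holding it.
    Target nodes are chosen round-robin from nodes_to_keep.
    """
    targets = cycle(nodes_to_keep)
    commands = []
    for shard, node in sorted(shard_locations.items()):
        if node in nodes_to_remove:
            commands.append({
                "move": {
                    "index": index, "shard": shard,
                    "from_node": node, "to_node": next(targets),
                }
            })
    return {"commands": commands}


# Hypothetical allocation: shards 33 and 34 live on nodeA, which is going away.
locations = {32: "nodeB", 33: "nodeA", 34: "nodeA"}
body = drain_commands("nico2", locations, {"nodeA"}, ["nodeB", "nodeC"])
print(json.dumps(body, indent=2))
```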

shrink
curl -XPUT localhost:9200/_cluster/settings -d '{
  "persistent": {
    "indices.recovery.concurrent_streams": 3
  }
}'
curl -XPUT localhost:9200/_cluster/settings -d '{
  "persistent": {
    "indices.recovery.max_bytes_per_sec": "1000mb"
  }
}'

Why Elasticsearch?
• proven, scalable search engine
• super flexible config with nice defaults
• great API
• growing developer and user base

not covered
• mapping
• query DSL
• search performance
• cluster operation
• healthcheck / cluster statistics
• etc...
