You know, for search
Querying 24,000,000,000 records in 900 ms




@jodok
source: www.searchmetrics.com
First Iteration
The anatomy of a tweet
                         http://www.readwriteweb.com/archives/what_a_tweet_can_tell_you.php
c1.xlarge: bash scripts using find, zcat and curl (sketch below)

5 x m2.2xlarge with EBS: Elasticsearch as document store
- 5 instances
- weekly indexes
- 2 replicas
- EBS volume
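A minimal sketch of what that first import loop could have looked like, assuming newline-delimited JSON crawl dumps and a weekly index naming scheme; host, paths and index names here are illustrative, not the original scripts:

#!/bin/bash
# hypothetical first-iteration importer: walk the gzipped crawl dumps,
# wrap each JSON line in a bulk "index" action and push it into the
# current weekly index over HTTP
ES="http://localhost:9200"
WEEK=$(date +%Gw%V)              # e.g. 2012w21 -> index tweets-2012w21

find /data/crawls -name '*.json.gz' | while read -r f; do
    # a real run would chunk the stream (e.g. with split) instead of
    # sending one huge bulk request per file
    zcat "$f" \
      | awk '{ print "{\"index\":{}}"; print }' \
      | curl -s -XPOST "$ES/tweets-$WEEK/tweet/_bulk" --data-binary @-
done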
• Map/Reduce to push to Elasticsearch
• via NFS to HDFS storage
• no dedicated nodes

(diagram: HDFS -> MAPRED -> ES)
- Disk IO
- concatenated gzip files (see the note below)
- compression
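A detail worth spelling out: gzip members can simply be appended to one file and still decompress as a single stream, which is presumably what the concatenated dump files rely on:

# two gzip members in one file; zcat (gzip -d) reads both of them
echo foo | gzip >  day.gz
echo bar | gzip >> day.gz
zcat day.gz                      # prints foo and bar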
Hadoop Storage - Index “Driver”

Node 1: Namenode, Datanode, 2 Tasktrackers, 6x 500 TB HDFS
Node 2: Jobtracker, Secondary NN, Datanode, 2 Tasktrackers, 6x 500 TB HDFS
Node 3: Hive, Datanode, 2 Tasktrackers, 6x 500 TB HDFS
Nodes 4-6: Datanode, 4 Tasktrackers, 6x 500 TB HDFS each
plus additional Tasktrackers on Spot Instances
Adding S3 / External Tables to Hive

-- external table over a file listing of (size, path) pairs stored on S3
create external table $tmp_table_name
        (size bigint, path string)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
        stored as
         INPUTFORMAT "org.apache.hadoop.mapred.lib.NLineInputFormat"
         OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
        LOCATION 's3n://...';

SET ...

-- stream each (size, path) pair through the importer's transform step
-- and load the result into a daily partition of the crawls table
from (
     select transform (size, path)
        using './current.tar.gz/bin/importer transform ${max_lines}' as
        (crawl_ts int, screen_name string, ... num_tweets int)
         ROW FORMAT DELIMITED
             FIELDS TERMINATED BY '\001'
             COLLECTION ITEMS TERMINATED BY '\002'
             MAP KEYS TERMINATED BY '\003'
             LINES TERMINATED BY '\n'
                from $tmp_table_name
        ) f
INSERT overwrite TABLE crawls PARTITION (crawl_day='${day}')
     select
            crawl_ts, ... user_json, tweets,
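The ${day}, ${max_lines} and $tmp_table_name placeholders look like template variables filled in by an outer driver; a hypothetical way to render and run the job (file names and values are made up):

# render the template (tmp_table_name would be substituted the same way)
day=2012-05-21
sed -e "s/\${day}/$day/g" \
    -e "s/\${max_lines}/1000/g" \
    import_crawls.hql.tpl > /tmp/import_crawls.hql

# execute the rendered script with the Hive CLI
hive -f /tmp/import_crawls.hql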
https://launchpad.net/ubuntu/+source/cloud-init
http://www.netfort.gr.jp/~dancer/software/dsh.html.en
packages:
 - puppet

# Send pre-generated ssh private keys to the server
ssh_keys:
 rsa_private: |
  ${SSH_RSA_PRIVATE_KEY}
 rsa_public: ${SSH_RSA_PUBLIC_KEY}
 dsa_private: |
  ${SSH_DSA_PRIVATE_KEY}
 dsa_public: ${SSH_DSA_PUBLIC_KEY}

# set up mount points
# remove default mount points
mounts:
 - [ swap, null ]
 - [ ephemeral0, null ]

# Additional YUM Repositories
repo_additions:
- source: "lovely-public"
  name: "Lovely Systems, Public Repository for RHEL 6 compatible Distributions"
  filename: lovely-public.repo
  enabled: 1
  gpgcheck: 1
  key: "file:///etc/pki/rpm-gpg/RPM-GPG-KEY-lovely"
  baseurl: "https://yum.lovelysystems.com/public/release"

runcmd:
 - [ hostname, "${HOST}" ]
 - [ sed, -i, -e, "s/^HOSTNAME=.*/HOSTNAME=${HOST}/", /etc/sysconfig/network ]
 - [ wget, "http://169.254.169.254/latest/meta-data/local-ipv4", -O, /tmp/local-ipv4 ]
 - [ sh, -c, echo "$(/bin/cat /tmp/local-ipv4)    ${HOST}    ${HOST_NAME}" >> /etc/hosts ]

 - [ rpm, --import, "https://yum.lovelysystems.com/public/RPM-GPG-KEY-lovely"]

 - [ mkdir, -p, /var/lib/puppet/ssl/private_keys ]
 - [ mkdir, -p, /var/lib/puppet/ssl/public_keys ]
 - [ mkdir, -p, /var/lib/puppet/ssl/certs ]
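 # the ${PUPPET_*} lines below are template placeholders; they are expected
 # to expand to commands that write the corresponding PEM files to /tmp
 # before the mv commands run (an assumption about the surrounding template)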
${PUPPET_PRIVATE_KEY}
 - [ mv, /tmp/puppet_private_key.pem, /var/lib/puppet/ssl/private_keys/${HOST}.pem ]
${PUPPET_PUBLIC_KEY}
 - [ mv, /tmp/puppet_public_key.pem, /var/lib/puppet/ssl/public_keys/${HOST}.pem ]
${PUPPET_CERT}
 - [ mv, /tmp/puppet_cert.pem, /var/lib/puppet/ssl/certs/${HOST}.pem ]

 - [ sh, -c, echo "    server = ${PUPPET_MASTER}" >> /etc/puppet/puppet.conf ]
 - [ sh, -c, echo "    certname = ${HOST}" >> /etc/puppet/puppet.conf ]
 - [ /etc/init.d/puppet, start ]
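For completeness, one way this user-data could have been handed to a new instance with the EC2 API tools of that era; AMI, key and file names are placeholders and the exact flags are from memory, so treat this as an assumption:

# boot an instance with the cloud-init user-data above
ec2-run-instances ami-xxxxxxxx \
  --instance-type m1.large \
  --key deploy-key \
  --user-data-file user-data.yaml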
- IO
- ES Memory
- ES Backup
- ES Replicas
- Load while indexing
- AWS Limits
EBS performance
source: http://blog.dt.org
• Shard allocation
• Avoid rebalancing (discovery timeout); see the config sketch after this list
• Uncached Facets
  https://github.com/lovelysystems/elasticsearch-ls-plugins
• LUCENE-2205
  Rework of the TermInfosReader class to remove the
  Terms[], TermInfos[], and the index pointer long[] and create
  a more memory efficient data structure.
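A rough sketch of the node settings the first two bullets point at, for an Elasticsearch of that era; the keys and values are assumptions from memory, not the original cluster config:

# append hypothetical allocation / discovery settings to elasticsearch.yml
cat >> /etc/elasticsearch/elasticsearch.yml <<'EOF'
# tag every node with its availability zone and make shard allocation
# zone-aware, so replicas end up spread across zones
node.zone: us-east-1a
cluster.routing.allocation.awareness.attributes: zone

# with 6 dedicated master-eligible nodes, require a quorum of 4 for
# master election to avoid split brains during EC2 network hiccups
discovery.zen.minimum_master_nodes: 4

# be patient during discovery so a slow node does not trigger a full
# rebalance (exact setting name differs between ES versions; assumption)
discovery.zen.ping.timeout: 30s
EOF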
3 AP server / MC (c1.xlarge)
6 ES master nodes (c1.xlarge)
6 node Hadoop cluster + Spot Instances
40 ES nodes per zone (m1.large, 8 EBS volumes)
Everything fine?
Cutting the cost
• Reduce the amount of data:
  use a Hadoop/MapRed transform to
  eliminate spam, irrelevant languages, ...
• no more time-based indices
• Dedicated hardware
• SSD disks
• Share hardware for ES and Hadoop
Jenkins for Workflows
distcp

S3 -> HDFS -> transform

https://github.com/lovelysystems/ls-hive
https://github.com/lovelysystems/ls-thrift-py-hadoop
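The S3-to-HDFS copy in the diagram maps to a plain distcp call; bucket and target paths here are placeholders:

# copy one day's crawl dump from S3 into HDFS before the transform runs
hadoop distcp \
  s3n://crawl-bucket/2012-05-21/ \
  hdfs:///data/crawls/2012-05-21/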
That's thirty
   minutes away.
I'll be there in ten.

      @jodok
