You know, for search. Querying 24 Billion Documents in 900ms
 


Who doesn't love building highly available, scalable systems holding multiple terabytes of data? Recently we had the pleasure of cracking some tough nuts along the way, and we'd love to share our findings from designing, building, and operating a 120-node, 6 TB Elasticsearch (and Hadoop) cluster with the community.

Speaker Notes

  • How do I work? Agile leader: I say what I do, and I do what I say. Hands on. Quality over speed; responsibility to the team. Attract specialists. Not trying to sell something, but to DO IT and DELIVER.
  • How I feel: I undertake things ("ich unternehme Dinge").

You know, for search. Querying 24 Billion Documents in 900ms (Presentation Transcript)

  • You know, for search. Querying 24 000 000 000 records in 900ms. @jodok
  • @jodok
  • source: www.searchmetrics.com
  • First Iteration
  • The anatomy of a tweet http://www.readwriteweb.com/archives/what_a_tweet_can_tell_you.php
  • First setup: one c1.xlarge crawl box (bash, find, zcat, curl) feeding 5 x m2.2xlarge instances on EBS volumes, with Elasticsearch as the document store: 5 instances, weekly indexes, 2 replicas.
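A minimal sketch of what that first setup looked like from the crawl box. The index name, document type, and file layout are purely illustrative (the slides don't name them); the index-creation and _bulk endpoints are standard Elasticsearch REST APIs.

    # Create this week's index with 2 replicas (weekly indexes, as above).
    curl -XPUT 'http://localhost:9200/tweets-2012-26' -d '{
      "settings": { "number_of_shards": 5, "number_of_replicas": 2 }
    }'

    # Push crawled tweets, already prepared as bulk payloads
    # (alternating action and source lines), out of gzipped files.
    zcat tweets-*.json.gz | curl -s -XPOST 'http://localhost:9200/_bulk' --data-binary @-

Weekly indexes keep retention cheap: old data is dropped by deleting a whole index rather than deleting individual documents.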
  • Map/Reduce to push to Elasticsearch; data reaches HDFS storage via NFS; no dedicated nodes (HDFS, Elasticsearch and MapReduce share the same machines).
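The real loading ran as Map/Reduce jobs; the one-liner below is only a single-process stand-in for the same idea, streaming bulk-formatted job output from HDFS into Elasticsearch. Paths and host are placeholders.

    # Stand-in for the Map/Reduce pushers: stream bulk payloads from HDFS into ES.
    hadoop fs -cat /crawls/2012-06-25/part-* | curl -s -XPOST 'http://localhost:9200/_bulk' --data-binary @-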
  • Disk I/O, concatenated gzip files, compression.
  • Hadoop storage / index “Driver” (cluster diagram): Namenode, Jobtracker, Hive and Secondary NN; Datanodes running 2-4 Tasktrackers with 6 x 500 GB of HDFS each; additional Tasktrackers on spot instances.
  • Adding S3 / external tables to Hive (query reformatted from the slide; the "..." elisions are on the original):

    CREATE EXTERNAL TABLE $tmp_table_name (size BIGINT, path STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      STORED AS
        INPUTFORMAT "org.apache.hadoop.mapred.lib.NLineInputFormat"
        OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
      LOCATION 's3n://...';

    SET ...;

    FROM (
      SELECT TRANSFORM (size, path)
        USING './current.tar.gz/bin/importer transform ${max_lines}'
        AS (crawl_ts INT, screen_name STRING, ... num_tweets INT)
        ROW FORMAT DELIMITED
          FIELDS TERMINATED BY '\001'
          COLLECTION ITEMS TERMINATED BY '\002'
          MAP KEYS TERMINATED BY '\003'
          LINES TERMINATED BY '\n'
      FROM $tmp_table_name
    ) f
    INSERT OVERWRITE TABLE crawls PARTITION (crawl_day=${day})
      SELECT crawl_ts, ... user_json, tweets,
  • cloud-init: https://launchpad.net/ubuntu/+source/cloud-init and dsh: http://www.netfort.gr.jp/~dancer/software/dsh.html.en
  • Node bootstrap via cloud-init user-data (reformatted from the slide):

    packages:
     - puppet

    # Send pre-generated ssh private keys to the server
    ssh_keys:
      rsa_private: |
        ${SSH_RSA_PRIVATE_KEY}
      rsa_public: ${SSH_RSA_PUBLIC_KEY}
      dsa_private: |
        ${SSH_DSA_PRIVATE_KEY}
      dsa_public: ${SSH_DSA_PUBLIC_KEY}

    # Set up mount points / remove default mount points
    mounts:
     - [ swap, null ]
     - [ ephemeral0, null ]

    # Additional YUM repositories
    repo_additions:
     - source: "lovely-public"
       name: "Lovely Systems, Public Repository for RHEL 6 compatible Distributions"
       filename: lovely-public.repo
       enabled: 1
       gpgcheck: 1
       key: "file:///etc/pki/rpm-gpg/RPM-GPG-KEY-lovely"
       baseurl: "https://yum.lovelysystems.com/public/release"

    runcmd:
     - [ hostname, "${HOST}" ]
     - [ sed, -i, -e, "s/^HOSTNAME=.*/HOSTNAME=${HOST}/", /etc/sysconfig/network ]
     - [ wget, "http://169.254.169.254/latest/meta-data/local-ipv4", -O, /tmp/local-ipv4 ]
     - [ sh, -c, echo "$(/bin/cat /tmp/local-ipv4) ${HOST} ${HOST_NAME}" >> /etc/hosts ]
     - [ rpm, --import, "https://yum.lovelysystems.com/public/RPM-GPG-KEY-lovely" ]
     - [ mkdir, -p, /var/lib/puppet/ssl/private_keys ]
     - [ mkdir, -p, /var/lib/puppet/ssl/public_keys ]
     - [ mkdir, -p, /var/lib/puppet/ssl/certs ]
     ${PUPPET_PRIVATE_KEY}
     - [ mv, /tmp/puppet_private_key.pem, /var/lib/puppet/ssl/private_keys/${HOST}.pem ]
     ${PUPPET_PUBLIC_KEY}
     - [ mv, /tmp/puppet_public_key.pem, /var/lib/puppet/ssl/public_keys/${HOST}.pem ]
     ${PUPPET_CERT}
     - [ mv, /tmp/puppet_cert.pem, /var/lib/puppet/ssl/certs/${HOST}.pem ]
     - [ sh, -c, echo "  server = ${PUPPET_MASTER}" >> /etc/puppet/puppet.conf ]
     - [ sh, -c, echo "  certname = ${HOST}" >> /etc/puppet/puppet.conf ]
     - [ /etc/init.d/puppet, start ]
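To put that user-data to work, each node is launched with it and then addressed as a group. The AMI, key, group and file names below are placeholders, and the commands assume the classic EC2 API tools plus dsh from the links above.

    # Boot a node with the cloud-init user-data (placeholder AMI/key/file names).
    ec2-run-instances ami-xxxxxxxx --instance-type m1.large --key ops-key \
      --availability-zone eu-west-1a --user-data-file node-userdata.yaml

    # Run a command across all freshly bootstrapped hosts with dsh.
    dsh -M -g es-nodes -- uptime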
  • Issues: I/O, ES memory, ES backup, ES replicas, load while indexing, AWS limits.
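One standard way to take pressure off a cluster while indexing — not something the slides spell out, so treat the exact setting names as assumptions for the ES version in use — is to drop replicas and the refresh interval for the duration of a bulk load, then restore them.

    # Before a big load: no replicas, no periodic refresh (dynamic per-index settings).
    curl -XPUT 'http://localhost:9200/tweets-2012-26/_settings' -d '{
      "index": { "number_of_replicas": 0, "refresh_interval": "-1" }
    }'

    # Afterwards: bring replicas and refresh back.
    curl -XPUT 'http://localhost:9200/tweets-2012-26/_settings' -d '{
      "index": { "number_of_replicas": 2, "refresh_interval": "1s" }
    }'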
  • EBS performance http://blog.dt.org
  • Shard allocation
  • Avoid rebalancing (discovery timeout)
  • Uncached facets: https://github.com/lovelysystems/elasticsearch-ls-plugins
  • LUCENE-2205: rework of the TermInfosReader class to remove the Terms[], TermInfos[] and the index-pointer long[] arrays and create a more memory-efficient data structure.
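A sketch of the kind of node and cluster settings the first two bullets point at. The exact keys differ between Elasticsearch versions, so the names below are illustrative rather than taken from this cluster's configuration.

    # Tag each node with its zone, spread replicas across zones via allocation awareness,
    # and give zen discovery a generous ping timeout (0.x-style -Des. flags).
    ./bin/elasticsearch -Des.node.zone=eu-west-1a \
      -Des.cluster.routing.allocation.awareness.attributes=zone \
      -Des.discovery.zen.ping.timeout=30s

    # Throttle rebalancing cluster-wide so a briefly absent node does not trigger a shard shuffle.
    curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
      "transient": { "cluster.routing.allocation.cluster_concurrent_rebalance": 0 }
    }'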
  • 3 app servers / MC (c1.xlarge); 6 ES master nodes (c1.xlarge); 6-node Hadoop cluster (c1.xlarge plus spot instances); 40 ES nodes per zone (m1.large, 8 EBS volumes each).
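Separating master and data roles in the 0.x series was done with the node.master / node.data settings; a sketch, with the flag values as assumptions rather than this cluster's actual config.

    # Dedicated master (holds no data) vs. data-only node, 0.x-style flags (illustrative).
    ./bin/elasticsearch -Des.node.master=true  -Des.node.data=false   # one of the 6 master nodes
    ./bin/elasticsearch -Des.node.master=false -Des.node.data=true    # one of the 40 per-zone data nodes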
  • Everything fine?
  • Cutting the cost:
  • Reduce the amount of data: use a Hadoop/MapRed transform to eliminate spam, irrelevant languages, ...
  • No more time-based indexes
  • Dedicated hardware
  • SSD disks
  • Share hardware for ES and Hadoop
  • Jenkins for Workflows
  • distcp S3 -> HDFS, then transform; https://github.com/lovelysystems/ls-hive; https://github.com/lovelysystems/ls-thrift-py-hadoop
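The S3-to-HDFS hop in that workflow is plain distcp; bucket and path names below are placeholders.

    # Copy a day of raw crawl archives from S3 into HDFS before the Hive transform runs.
    hadoop distcp s3n://crawl-archive/2012-06-25/ hdfs:///crawls/incoming/2012-06-25/

Presumably this is the chain the Jenkins workflows on the previous slide drive: distcp, the Hive transform, then the load into Elasticsearch.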
  • "That's thirty minutes away. I'll be there in ten." @jodok