You know, for search. Querying 24 Billion Documents in 900ms

Who doesn't love building highly available, scalable systems holding multiple terabytes of data? Recently we had the pleasure of cracking some tough nuts, and we'd love to share with the community our findings from designing, building, and operating a 120-node, 6 TB Elasticsearch (and Hadoop) cluster.


Speaker notes:
  • How do I work? Agile leader. I say what I do, and I do what I say. Hands-on. Quality over speed; responsibility to the team. Attract specialists. Not trying to sell something, but DO IT. DELIVER.
  • How I feel: "ich unternehme Dinge" (I undertake things).

    1. You know, for search: querying 24 000 000 000 records in 900ms @jodok
    2. @jodok
    3. source: www.searchmetrics.com
    4. First Iteration
    5. The anatomy of a tweet http://www.readwriteweb.com/archives/what_a_tweet_can_tell_you.php
    6. One c1.xlarge "driver" (bash, find, zcat, curl) feeding 5 x m2.2xlarge on EBS; ES as document store: 5 instances, weekly indexes, 2 replicas, EBS volume. (A minimal indexing sketch follows below.)
    7. Map/Reduce to push to Elasticsearch; via NFS to HDFS storage; no dedicated nodes (HDFS, MAPRED and ES share the same machines)
    8. Disk IO; concatenated gzip files; compression
    9. Hadoop Storage - Index "Driver": master services (Namenode, Jobtracker, Hive, Secondary NN); 3 datanodes with 2 tasktrackers and 6x 500 TB HDFS each; 3 datanodes with 4 tasktrackers and 6x 500 TB HDFS each; additional tasktrackers on spot instances
    10. Adding S3 / External Tables to Hive:

        create external table $tmp_table_name (size bigint, path string)
          ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
          stored as
            INPUTFORMAT "org.apache.hadoop.mapred.lib.NLineInputFormat"
            OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
          location 's3n://...';

        SET ...

        from (
          select transform (size, path)
            using './current.tar.gz/bin/importer transform ${max_lines}'
            as (crawl_ts int, screen_name string, ... num_tweets int)
            ROW FORMAT DELIMITED
              FIELDS TERMINATED BY '\001'
              COLLECTION ITEMS TERMINATED BY '\002'
              MAP KEYS TERMINATED BY '\003'
              LINES TERMINATED BY '\n'
          from $tmp_table_name
        ) f
        INSERT overwrite TABLE crawls PARTITION (crawl_day=${day})
          select crawl_ts, ... user_json, tweets,
    11. https://launchpad.net/ubuntu/+source/cloud-init and http://www.netfort.gr.jp/~dancer/software/dsh.html.en (a dsh usage sketch follows below)
    12. packages:
          - puppet

        # Send pre-generated ssh private keys to the server
        ssh_keys:
          rsa_private: |
            ${SSH_RSA_PRIVATE_KEY}
          rsa_public: ${SSH_RSA_PUBLIC_KEY}
          dsa_private: |
            ${SSH_DSA_PRIVATE_KEY}
          dsa_public: ${SSH_DSA_PUBLIC_KEY}

        # set up mount points
        # remove default mount points
        mounts:
          - [ swap, null ]
          - [ ephemeral0, null ]

        # Additional YUM Repositories
        repo_additions:
          - source: "lovely-public"
            name: "Lovely Systems, Public Repository for RHEL 6 compatible Distributions"
            filename: lovely-public.repo
            enabled: 1
            gpgcheck: 1
            key: "file:///etc/pki/rpm-gpg/RPM-GPG-KEY-lovely"
            baseurl: "https://yum.lovelysystems.com/public/release"

        runcmd:
          - [ hostname, "${HOST}" ]
          - [ sed, -i, -e, "s/^HOSTNAME=.*/HOSTNAME=${HOST}/", /etc/sysconfig/network ]
          - [ wget, "http://169.254.169.254/latest/meta-data/local-ipv4", -O, /tmp/local-ipv4 ]
          - [ sh, -c, echo "$(/bin/cat /tmp/local-ipv4) ${HOST} ${HOST_NAME}" >> /etc/hosts ]
          - [ rpm, --import, "https://yum.lovelysystems.com/public/RPM-GPG-KEY-lovely" ]
          - [ mkdir, -p, /var/lib/puppet/ssl/private_keys ]
          - [ mkdir, -p, /var/lib/puppet/ssl/public_keys ]
          - [ mkdir, -p, /var/lib/puppet/ssl/certs ]
        ${PUPPET_PRIVATE_KEY}
          - [ mv, /tmp/puppet_private_key.pem, /var/lib/puppet/ssl/private_keys/${HOST}.pem ]
        ${PUPPET_PUBLIC_KEY}
          - [ mv, /tmp/puppet_public_key.pem, /var/lib/puppet/ssl/public_keys/${HOST}.pem ]
        ${PUPPET_CERT}
          - [ mv, /tmp/puppet_cert.pem, /var/lib/puppet/ssl/certs/${HOST}.pem ]
          - [ sh, -c, echo " server = ${PUPPET_MASTER}" >> /etc/puppet/puppet.conf ]
          - [ sh, -c, echo " certname = ${HOST}" >> /etc/puppet/puppet.conf ]
          - [ /etc/init.d/puppet, start ]
    13. IO; ES memory; ES backup; ES replicas; load while indexing; AWS limits
    14. EBS performance http://blog.dt.org
    15. Shard allocation; avoid rebalancing (discovery timeout); uncached facets https://github.com/lovelysystems/elasticsearch-ls-plugins; LUCENE-2205, a rework of the TermInfosReader class to remove the Terms[], TermInfos[] and index pointer long[] arrays and create a more memory-efficient data structure. (An illustrative settings sketch follows below.)
    16. 3 AP servers / MC (c1.xlarge); 6 ES master nodes (c1.xlarge); 6-node Hadoop cluster + spot instances; 40 ES nodes per zone (m1.large, 8 EBS volumes)
    17. Everything fine?
    18. Cutting the cost: reduce the amount of data (use a Hadoop/MapRed transform to eliminate spam, irrelevant languages, ...); no more time-based indices; dedicated hardware; SSD disks; share hardware for ES and Hadoop
    19. Jenkins for Workflows
    20. distcp S3 → HDFS → transform (a distcp example follows below); https://github.com/lovelysystems/ls-hive https://github.com/lovelysystems/ls-thrift-py-hadoop
    21. "That's thirty minutes away. I'll be there in ten." @jodok
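
The first iteration (slide 6) pushed the weekly dumps into Elasticsearch with nothing more than bash, find, zcat and curl. A minimal sketch of that loop, assuming the gzipped files already contain newline-delimited _bulk actions; the host and paths are made-up names, not the ones from the talk:

    #!/usr/bin/env bash
    # First-iteration style indexing: find the weekly dumps and stream them
    # into the Elasticsearch bulk API. ES_HOST and DUMP_DIR are hypothetical.
    set -euo pipefail

    ES_HOST="http://es-node:9200"
    DUMP_DIR="/mnt/dumps"

    find "$DUMP_DIR" -name '*.gz' | while read -r f; do
      # zcat decompresses to stdout; curl reads the bulk payload from stdin.
      zcat "$f" | curl -s -XPOST "$ES_HOST/_bulk" --data-binary @- > /dev/null
      echo "indexed $f"
    done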
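Slide 11 points at cloud-init for bootstrapping instances and dsh for running the same command across many of them. A hypothetical dsh invocation (the machine group "esnodes" is invented for the example) that restarts Elasticsearch on all data nodes:

    # -g selects a machine group, -c runs on all machines concurrently,
    # -M prefixes each output line with the machine name.
    dsh -M -c -g esnodes -- sudo /etc/init.d/elasticsearch restart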
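The tuning points on slide 15 (shard allocation, avoiding rebalancing via discovery timeouts) map to settings in elasticsearch.yml. The snippet below is only illustrative: the values are not the ones used for this cluster, the config path depends on the installation, and some setting names differ between Elasticsearch versions:

    # Append illustrative allocation/discovery settings to elasticsearch.yml.
    printf '%s\n' \
      'discovery.zen.ping.timeout: 30s' \
      'discovery.zen.minimum_master_nodes: 4' \
      'cluster.routing.allocation.node_concurrent_recoveries: 2' \
      'cluster.routing.allocation.cluster_concurrent_rebalance: 1' \
      >> /etc/elasticsearch/elasticsearch.yml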
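The slide-20 workflow ships raw data from S3 into HDFS with Hadoop's stock distcp tool before the Hive transform runs, presumably driven by the Jenkins workflows from slide 19. Bucket and target paths below are placeholders:

    # Copy one day of raw crawl files from S3 into HDFS (paths are made up);
    # the Hive transform from slide 10 then reads them from HDFS.
    hadoop distcp \
      s3n://example-crawl-bucket/raw/2012-06-01/ \
      hdfs:///data/crawls/raw/2012-06-01/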
