You know, for search
Querying 24,000,000,000 records in 900 ms




@jodok
source: www.searchmetrics.com
First Iteration
The anatomy of a tweet
                         http://www.readwriteweb.com/archives/what_a_tweet_can_tell_you.php
c1.xlarge: bash scripts using find, zcat and curl (sketch below)

5 x m2.2xlarge with EBS: Elasticsearch as document store
- 5 instances
- weekly indexes
- 2 replicas
- EBS volume
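A minimal sketch of what that first import loop could have looked like, assuming newline-delimited JSON crawl dumps and a weekly index naming scheme; host, paths and index names here are illustrative, not the original scripts:

#!/bin/bash
# hypothetical first-iteration importer: walk the gzipped crawl dumps,
# wrap each JSON line in a bulk "index" action and push it into the
# current weekly index over HTTP
ES="http://localhost:9200"
WEEK=$(date +%Gw%V)              # e.g. 2012w21 -> index tweets-2012w21

find /data/crawls -name '*.json.gz' | while read -r f; do
    # a real run would chunk the stream (e.g. with split) instead of
    # sending one huge bulk request per file
    zcat "$f" \
      | awk '{ print "{\"index\":{}}"; print }' \
      | curl -s -XPOST "$ES/tweets-$WEEK/tweet/_bulk" --data-binary @-
done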
• Map/Reduce to push to Elasticsearch
• via NFS to HDFS storage
• no dedicated nodes

(diagram: HDFS -> MAPRED -> ES)
- Disk IO
- concatenated gzip files (see the note below)
- compression
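A detail worth spelling out: gzip members can simply be appended to one file and still decompress as a single stream, which is presumably what the concatenated dump files rely on:

# two gzip members in one file; zcat (gzip -d) reads both of them
echo foo | gzip >  day.gz
echo bar | gzip >> day.gz
zcat day.gz                      # prints foo and bar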
Hadoop Storage - Index “Driver”

Node 1: Namenode, Datanode, 2 Tasktrackers, 6x 500 TB HDFS
Node 2: Jobtracker, Secondary NN, Datanode, 2 Tasktrackers, 6x 500 TB HDFS
Node 3: Hive, Datanode, 2 Tasktrackers, 6x 500 TB HDFS
Nodes 4-6: Datanode, 4 Tasktrackers, 6x 500 TB HDFS each
plus additional Tasktrackers on Spot Instances
Adding S3 / External Tables to Hive

-- external table over a file listing of (size, path) pairs stored on S3
create external table $tmp_table_name
        (size bigint, path string)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
        stored as
         INPUTFORMAT "org.apache.hadoop.mapred.lib.NLineInputFormat"
         OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
        LOCATION 's3n://...';

SET ...

-- stream each (size, path) pair through the importer's transform step
-- and load the result into a daily partition of the crawls table
from (
     select transform (size, path)
        using './current.tar.gz/bin/importer transform ${max_lines}' as
        (crawl_ts int, screen_name string, ... num_tweets int)
         ROW FORMAT DELIMITED
             FIELDS TERMINATED BY '\001'
             COLLECTION ITEMS TERMINATED BY '\002'
             MAP KEYS TERMINATED BY '\003'
             LINES TERMINATED BY '\n'
                from $tmp_table_name
        ) f
INSERT overwrite TABLE crawls PARTITION (crawl_day='${day}')
     select
            crawl_ts, ... user_json, tweets,
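The ${day}, ${max_lines} and $tmp_table_name placeholders look like template variables filled in by an outer driver; a hypothetical way to render and run the job (file names and values are made up):

# render the template (tmp_table_name would be substituted the same way)
day=2012-05-21
sed -e "s/\${day}/$day/g" \
    -e "s/\${max_lines}/1000/g" \
    import_crawls.hql.tpl > /tmp/import_crawls.hql

# execute the rendered script with the Hive CLI
hive -f /tmp/import_crawls.hql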
https://launchpad.net/ubuntu/+source/cloud-init
http://www.netfort.gr.jp/~dancer/software/dsh.html.en
packages:
 - puppet

# Send pre-generated ssh private keys to the server
ssh_keys:
 rsa_private: |
  ${SSH_RSA_PRIVATE_KEY}
 rsa_public: ${SSH_RSA_PUBLIC_KEY}
 dsa_private: |
  ${SSH_DSA_PRIVATE_KEY}
 dsa_public: ${SSH_DSA_PUBLIC_KEY}

# set up mount points
# remove default mount points
mounts:
 - [ swap, null ]
 - [ ephemeral0, null ]

# Additional YUM Repositories
repo_additions:
- source: "lovely-public"
  name: "Lovely Systems, Public Repository for RHEL 6 compatible Distributions"
  filename: lovely-public.repo
  enabled: 1
  gpgcheck: 1
  key: "file:///etc/pki/rpm-gpg/RPM-GPG-KEY-lovely"
  baseurl: "https://yum.lovelysystems.com/public/release"

runcmd:
 - [ hostname, "${HOST}" ]
 - [ sed, -i, -e, "s/^HOSTNAME=.*/HOSTNAME=${HOST}/", /etc/sysconfig/network ]
 - [ wget, "http://169.254.169.254/latest/meta-data/local-ipv4", -O, /tmp/local-ipv4 ]
 - [ sh, -c, echo "$(/bin/cat /tmp/local-ipv4)    ${HOST}    ${HOST_NAME}" >> /etc/hosts ]

 - [ rpm, --import, "https://yum.lovelysystems.com/public/RPM-GPG-KEY-lovely"]

 - [ mkdir, -p, /var/lib/puppet/ssl/private_keys ]
 - [ mkdir, -p, /var/lib/puppet/ssl/public_keys ]
 - [ mkdir, -p, /var/lib/puppet/ssl/certs ]
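 # the ${PUPPET_*} lines below are template placeholders; they are expected
 # to expand to commands that write the corresponding PEM files to /tmp
 # before the mv commands run (an assumption about the surrounding template)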
${PUPPET_PRIVATE_KEY}
 - [ mv, /tmp/puppet_private_key.pem, /var/lib/puppet/ssl/private_keys/${HOST}.pem ]
${PUPPET_PUBLIC_KEY}
 - [ mv, /tmp/puppet_public_key.pem, /var/lib/puppet/ssl/public_keys/${HOST}.pem ]
${PUPPET_CERT}
 - [ mv, /tmp/puppet_cert.pem, /var/lib/puppet/ssl/certs/${HOST}.pem ]

 - [ sh, -c, echo "    server = ${PUPPET_MASTER}" >> /etc/puppet/puppet.conf ]
 - [ sh, -c, echo "    certname = ${HOST}" >> /etc/puppet/puppet.conf ]
 - [ /etc/init.d/puppet, start ]
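For completeness, one way this user-data could have been handed to a new instance with the EC2 API tools of that era; AMI, key and file names are placeholders and the exact flags are from memory, so treat this as an assumption:

# boot an instance with the cloud-init user-data above
ec2-run-instances ami-xxxxxxxx \
  --instance-type m1.large \
  --key deploy-key \
  --user-data-file user-data.yaml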
- IO
- ES Memory
- ES Backup
- ES Replicas
- Load while indexing
- AWS Limits
EBS performance
source: http://blog.dt.org
• Shard allocation
• Avoid rebalancing (discovery timeout); see the config sketch after this list
• Uncached Facets
  https://github.com/lovelysystems/elasticsearch-ls-plugins
• LUCENE-2205
  Rework of the TermInfosReader class to remove the
  Terms[], TermInfos[], and the index pointer long[] and create
  a more memory efficient data structure.
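A rough sketch of the node settings the first two bullets point at, for an Elasticsearch of that era; the keys and values are assumptions from memory, not the original cluster config:

# append hypothetical allocation / discovery settings to elasticsearch.yml
cat >> /etc/elasticsearch/elasticsearch.yml <<'EOF'
# tag every node with its availability zone and make shard allocation
# zone-aware, so replicas end up spread across zones
node.zone: us-east-1a
cluster.routing.allocation.awareness.attributes: zone

# with 6 dedicated master-eligible nodes, require a quorum of 4 for
# master election to avoid split brains during EC2 network hiccups
discovery.zen.minimum_master_nodes: 4

# be patient during discovery so a slow node does not trigger a full
# rebalance (exact setting name differs between ES versions; assumption)
discovery.zen.ping.timeout: 30s
EOF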
3 AP server / MC (c1.xlarge)
6 ES master nodes (c1.xlarge)
6 node Hadoop cluster + Spot Instances
40 ES nodes per zone (m1.large, 8 EBS volumes)
Everything fine?
Cutting the cost
• Reduce the amount of data:
  use a Hadoop/MapRed transform to
  eliminate spam, irrelevant languages, ...
• no more time-based indices
• Dedicated hardware
• SSD disks
• Share hardware for ES and Hadoop
Jenkins for Workflows
distcp

S3 -> HDFS -> transform

https://github.com/lovelysystems/ls-hive
https://github.com/lovelysystems/ls-thrift-py-hadoop
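The S3-to-HDFS copy in the diagram maps to a plain distcp call; bucket and target paths here are placeholders:

# copy one day's crawl dump from S3 into HDFS before the transform runs
hadoop distcp \
  s3n://crawl-bucket/2012-05-21/ \
  hdfs:///data/crawls/2012-05-21/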
That's thirty
   minutes away.
I'll be there in ten.

      @jodok
