You know, for search
 querying 24 000 000 000 Records in 900ms




                                @jodok
source: www.searchmetrics.com
First Iteration
The anatomy of a tweet
                         http://www.readwriteweb.com/archives/what_a_tweet_can_tell_you.php
ES as document store

c1.xlarge
- bash
  - find
  - zcat
  - curl

5 x m2.2xlarge (EBS)
- 5 instances
  - weekly indexes
  - 2 replicas
  - EBS volume
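
The find / zcat / curl loop of this first iteration can be sketched in Python. This is a minimal sketch under assumptions, not the original bash script: the field names `id` and `crawl_ts`, the `tweets` index prefix and the `tweet` type name are hypothetical, and the code only builds the newline-delimited bulk payload that a tool such as curl would then POST to Elasticsearch's `_bulk` endpoint.

```python
import gzip
import json
import os
from datetime import datetime, timezone


def weekly_index(ts: int, prefix: str = "tweets") -> str:
    """Return a weekly index name (ISO year and week) for a Unix timestamp."""
    year, week, _ = datetime.fromtimestamp(ts, tz=timezone.utc).isocalendar()
    return f"{prefix}-{year}-w{week:02d}"


def find_gz(root: str):
    """Rough equivalent of `find root -name '*.gz'`."""
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            if name.endswith(".gz"):
                yield os.path.join(dirpath, name)


def bulk_body(path: str, doc_type: str = "tweet") -> str:
    """Rough equivalent of `zcat path`, with one JSON document per line
    turned into action/source pairs for the bulk API."""
    lines = []
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for raw in fh:
            raw = raw.strip()
            if not raw:
                continue
            doc = json.loads(raw)
            action = {
                "index": {
                    "_index": weekly_index(doc["crawl_ts"]),
                    "_type": doc_type,  # needed by old ES versions, dropped in newer ones
                    "_id": doc["id"],
                }
            }
            lines.append(json.dumps(action))
            lines.append(json.dumps(doc))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"
```

Routing each document by its own timestamp keeps the weekly indexes consistent even when one input file spans a week boundary.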
• Map/Reduce to push to Elasticsearch
• via NFS to HDFS storage
• no dedicated nodes

[Diagram: HDFS, ES, MAPRED]
- Disk IO
- concatenated gzip files
- compression
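
Concatenated gzip files are several complete gzip members written back to back. A minimal Python sketch of splitting such a blob into its members follows; the function name `iter_gzip_members` is made up for illustration and nothing here is taken from the original importer.

```python
import gzip
import zlib


def iter_gzip_members(blob: bytes):
    """Yield the decompressed payload of each member of a concatenated gzip blob."""
    while blob:
        member = zlib.decompressobj(31)  # 31 = 16 + 15: expect a gzip header and trailer
        payload = member.decompress(blob) + member.flush()
        yield payload
        # Bytes after the end of this member belong to the next one.
        blob = member.unused_data


# Two independently compressed chunks, concatenated the way `cat a.gz b.gz` would.
blob = gzip.compress(b"first\n") + gzip.compress(b"second\n")
members = list(iter_gzip_members(blob))
```

When member boundaries do not matter, `gzip.decompress(blob)` or `gzip.open(...)` reads all members as one stream. Either way a reader cannot seek into the middle of the file, which is one reason such input is hard to split across parallel tasks.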
Hadoop Storage - Index “Driver”

  Namenode         Jobtracker       Hive
  Datanode         Secondary NN     Datanode
  2 Tasktracker    Datanode         2 Tasktracker
  6x 500 TB HDFS   2 Tasktracker    6x 500 TB HDFS
                   6x 500 TB HDFS




  Datanode         Datanode         Datanode
  4 Tasktracker    4 Tasktracker    4 Tasktracker
  6x 500 TB HDFS   6x 500 TB HDFS   6x 500 TB HDFS




                   Tasktracker
                   Spot Instances
Adding S3 / External Tables to Hive

-- external table listing the input files (size, path), one row per file
create external table $tmp_table_name
        (size bigint, path string)
          ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
        stored as
         INPUTFORMAT "org.apache.hadoop.mapred.lib.NLineInputFormat"
         OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
        location 's3n://...';
SET ...
-- stream each (size, path) row through the importer script
from (
     select transform (size, path)
        using './current.tar.gz/bin/importer transform${max_lines}' as
        (crawl_ts int, screen_name string, ... num_tweets int)
         ROW FORMAT DELIMITED
             FIELDS TERMINATED BY '\001'
             COLLECTION ITEMS TERMINATED BY '\002'
             MAP KEYS TERMINATED BY '\003'
             LINES TERMINATED BY '\n'
                from $tmp_table_name
        ) f
-- write the parsed rows into the partition for this crawl day
INSERT overwrite TABLE crawls PARTITION (crawl_day='${day}')
     select
            crawl_ts, ... user_json, tweets,
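
A transform script of this kind reads tab-separated rows on stdin and writes rows in the declared delimiters on stdout. The following is a minimal Python sketch of that contract, assuming the control-character delimiters `\001`, `\002` and `\003`. It is not the actual `importer`, and the emitted columns are placeholders.

```python
import sys

FIELD, ITEM, MAPKEY = "\x01", "\x02", "\x03"  # \001, \002, \003 in the Hive DDL


def encode(value) -> str:
    """Encode one column: maps and lists use the collection delimiters."""
    if isinstance(value, dict):
        return ITEM.join(f"{k}{MAPKEY}{v}" for k, v in value.items())
    if isinstance(value, (list, tuple)):
        return ITEM.join(str(v) for v in value)
    return str(value)


def to_row(fields) -> str:
    """Join encoded columns into one output row (without the line terminator)."""
    return FIELD.join(encode(f) for f in fields)


def parse_input(line: str):
    """Parse one input row of the listing table: size <TAB> path."""
    size, path = line.rstrip("\n").split("\t", 1)
    return int(size), path


def main(stdin=sys.stdin, stdout=sys.stdout):
    for line in stdin:
        size, path = parse_input(line)
        # A real importer would fetch and parse the file at `path` here.
        # This sketch only emits one placeholder row per input file.
        stdout.write(to_row([0, "example_user", [path], {"size": size}]) + "\n")
```

Calling `main()` at the end of a script file makes it usable from `using '...'`. Keeping `encode` separate from the I/O loop lets the row format be tested without a Hive cluster.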
https://launchpad.net/ubuntu/+source/cloud-init
http://www.netfort.gr.jp/~dancer/software/dsh.html.en
packages:
 - puppet

# Send pre-generated ssh private keys to the server
ssh_keys:
 rsa_private: | ${SSH_RSA_PRIVATE_KEY}
 rsa_public: ${SSH_RSA_PUBLIC_KEY}
 dsa_private: | ${SSH_DSA_PRIVATE_KEY}
 dsa_public: ${SSH_DSA_PUBLIC_KEY}

# set up mount points
# remove default mount points
mounts:
 - [ swap, null ]
 - [ ephemeral0, null ]

# Additional YUM Repositories
repo_additions:
- source: "lovely-public"
  name: "Lovely Systems, Public Repository for RHEL 6 compatible Distributions"
  filename: lovely-public.repo
  enabled: 1
  gpgcheck: 1
  key: "file:///etc/pki/rpm-gpg/RPM-GPG-KEY-lovely"
  baseurl: "https://yum.lovelysystems.com/public/release"

runcmd:
 - [ hostname, "${HOST}" ]
 - [ sed, -i, -e, "s/^HOSTNAME=.*/HOSTNAME=${HOST}/", /etc/sysconfig/network ]
 - [ wget, "http://169.254.169.254/latest/meta-data/local-ipv4", -O, /tmp/local-ipv4 ]
 - [ sh, -c, echo "$(/bin/cat /tmp/local-ipv4)    ${HOST}    ${HOST_NAME}" >> /etc/hosts ]

 - [ rpm, --import, "https://yum.lovelysystems.com/public/RPM-GPG-KEY-lovely"]

 - [ mkdir, -p, /var/lib/puppet/ssl/private_keys ]
 - [ mkdir, -p, /var/lib/puppet/ssl/public_keys ]
 - [ mkdir, -p, /var/lib/puppet/ssl/certs ]
${PUPPET_PRIVATE_KEY}
 - [ mv, /tmp/puppet_private_key.pem, /var/lib/puppet/ssl/private_keys/${HOST}.pem ]
${PUPPET_PUBLIC_KEY}
 - [ mv, /tmp/puppet_public_key.pem, /var/lib/puppet/ssl/public_keys/${HOST}.pem ]
${PUPPET_CERT}
 - [ mv, /tmp/puppet_cert.pem, /var/lib/puppet/ssl/certs/${HOST}.pem ]

 - [ sh, -c, echo "    server = ${PUPPET_MASTER}" >> /etc/puppet/puppet.conf ]
 - [ sh, -c, echo "    certname = ${HOST}" >> /etc/puppet/puppet.conf ]
 - [ /etc/init.d/puppet, start ]
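
The `${...}` placeholders in this user-data have to be filled in before an instance is launched. The deck does not say which tool does that. One possible approach, sketched here with Python's `string.Template`, uses hypothetical host and puppet master names and a shortened copy of the template.

```python
import re
from string import Template

# Shortened copy of the user-data above, for illustration only.
USER_DATA = """\
runcmd:
 - [ hostname, "${HOST}" ]
 - [ sh, -c, echo "$(/bin/cat /tmp/local-ipv4)    ${HOST}" >> /etc/hosts ]
 - [ sh, -c, echo "    server = ${PUPPET_MASTER}" >> /etc/puppet/puppet.conf ]
"""


def render_user_data(template: str, values: dict) -> str:
    """Fill ${NAME} placeholders and fail if any of them is left unfilled."""
    # safe_substitute leaves the shell's "$(" untouched instead of raising.
    rendered = Template(template).safe_substitute(values)
    missing = re.findall(r"\$\{(\w+)\}", rendered)
    if missing:
        raise KeyError(f"unfilled placeholders: {sorted(set(missing))}")
    return rendered


rendered = render_user_data(
    USER_DATA, {"HOST": "es01", "PUPPET_MASTER": "puppet.internal"}
)
```

Plain `substitute` would reject the shell command substitution `$(...)` as an invalid placeholder, so the sketch uses `safe_substitute` and checks for leftover `${...}` itself.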
- IO
- ES Memory
- ES Backup
- ES Replicas
- Load while indexing
- AWS Limits
EBS performance




                  http://blog.dt.org
• Shard allocation
• Avoid rebalancing (Discovery Timeout)
• Uncached Facets
  https://github.com/lovelysystems/elasticsearch-ls-plugins
• LUCENE-2205
  Rework of the TermInfosReader class to remove the
  Terms[], TermInfos[], and the index pointer long[] and create
  a more memory efficient data structure.
3 AP server / MC
c1.xlarge

6 ES Master Nodes
c1.xlarge

6 Node Hadoop Cluster
+ Spot Instances

40 ES nodes per zone
m1.large
8 EBS Volumes
Everything fine?
Cutting the cost
• Reduce the amount of data
  use Hadoop/MapRed transform to
  eliminate spam, irrelevant languages, ...
• no more time-based indices
• Dedicated hardware
• SSD disks
• Share hardware for ES and Hadoop
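
Reducing the data in a transform step could look like the following minimal Python sketch. The `lang` and `is_spam` fields and the language allow-list are assumptions for illustration. The deck does not describe how spam or languages were actually detected.

```python
import json

ALLOWED_LANGS = {"en", "de"}  # hypothetical allow-list


def keep(record: dict) -> bool:
    """Keep a record only if it is not flagged as spam and its language is allowed."""
    if record.get("is_spam"):
        return False
    return record.get("lang") in ALLOWED_LANGS


def filter_lines(lines):
    """Yield only the JSON lines worth indexing, and skip unparsable input."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except ValueError:
            continue
        if keep(record):
            yield line
```

Filtering before indexing means the dropped documents never cost index memory, replicas or EBS IO.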
Jenkins for Workflows
distcp




       S3                      HDFS




                  transform




https://github.com/lovelysystems/ls-hive
https://github.com/lovelysystems/ls-thrift-py-hadoop
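
A workflow job for the distcp step could assemble its command line as in this Python sketch. The bucket and path names in the usage note are placeholders, and how the real Jenkins jobs invoke Hadoop is not shown in the deck.

```python
import shlex


def distcp_command(src: str, dst: str, overwrite: bool = False) -> list:
    """Build the argument list for copying src to dst with `hadoop distcp`."""
    cmd = ["hadoop", "distcp"]
    if overwrite:
        cmd.append("-overwrite")
    cmd += [src, dst]
    return cmd


def as_shell(cmd: list) -> str:
    """Render the argument list as one safely quoted shell line, e.g. for a job log."""
    return " ".join(shlex.quote(part) for part in cmd)
```

For example, `distcp_command("s3n://example-bucket/crawls", "hdfs:///crawls")` gives a list that can be passed to `subprocess.run` without going through a shell.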
That's thirty
   minutes away.
I'll be there in ten.

      @jodok

Editor's Notes

  10. How do I work?
      * agile leader
      * I say what I do
      * I do what I say
      * hands-on
      * quality over speed
      * responsibility to team
      * attract specialists
      * not trying to sell something, but DO IT. DELIVER
  16. How I feel: I take action.
  23. How I feel: I take action.