You know, for search
Querying 24 000 000 000 records in 900 ms




@jodok
source: www.searchmetrics.com
First Iteration
The anatomy of a tweet
                         http://www.readwriteweb.com/archives/what_a_tweet_can_tell_you.php
c1.xlarge            5 x m2.2xlarge
EBS                  ES as document store

- bash               - 5 instances
  - find             - weekly indexes
  - zcat             - 2 replicas
  - curl             - EBS volume
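
The find / zcat / curl pipeline and the weekly indexes can be sketched roughly as follows. This is only an illustration, not the script used in the talk: the index prefix `tweets`, the `id` field, and both helper names are assumptions. The request body follows the newline-delimited format of the Elasticsearch bulk API (an action line followed by a source line per document); mapping types, which older Elasticsearch versions expected, are left out.

```python
import json
from datetime import date


def weekly_index_name(day: date, prefix: str = "tweets") -> str:
    """Return a per-week index name, e.g. 'tweets-2012-w07' (naming is hypothetical)."""
    iso_year, iso_week, _ = day.isocalendar()
    return f"{prefix}-{iso_year}-w{iso_week:02d}"


def bulk_body(docs, index: str) -> str:
    """Build a bulk request body: one action line plus one source line per document."""
    lines = []
    for doc in docs:
        # 'id' is an assumed field name used as the document id.
        lines.append(json.dumps({"index": {"_index": index, "_id": doc["id"]}}))
        lines.append(json.dumps(doc))
    # The bulk format requires a trailing newline.
    return "\n".join(lines) + "\n"
```

The resulting string is what a `curl --data-binary` call would POST to the `_bulk` endpoint; one index per week keeps each index bounded and lets old weeks be dropped as a whole.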
• Map/Reduce to push to Elasticsearch
• via NFS to HDFS storage
• no dedicated nodes

(diagram: HDFS, ES, MAPRED)
- Disk IO
- concatenated gzip files
- compression
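
A small sketch of what "concatenated gzip files" means in practice: several gzip members appended into one file still decompress as a single stream. The helper names and sample data are made up for illustration; the only library behaviour relied on is that Python's `gzip.GzipFile` reads multi-member gzip data.

```python
import gzip
import io


def concat_gzip(chunks):
    """Compress each chunk as its own gzip member and append the members."""
    return b"".join(gzip.compress(chunk) for chunk in chunks)


def iter_lines(blob: bytes):
    """Yield decoded lines from a (possibly multi-member) gzip byte string."""
    with gzip.GzipFile(fileobj=io.BytesIO(blob)) as fh:
        for raw in fh:
            yield raw.decode("utf-8").rstrip("\n")


blob = concat_gzip([b"tweet-1\ntweet-2\n", b"tweet-3\n"])
lines = list(iter_lines(blob))
```

A plain gzip stream cannot be split at arbitrary offsets, so a large concatenated file is typically read by a single task, which is one way disk IO and compression end up on the same list of problems.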
Hadoop Storage - Index “Driver”

  Namenode         Jobtracker       Hive
                   Secondary NN
  Datanode         Datanode         Datanode
  2 Tasktracker    2 Tasktracker    2 Tasktracker
  6x 500 TB HDFS   6x 500 TB HDFS   6x 500 TB HDFS




  Datanode         Datanode         Datanode
  4 Tasktracker    4 Tasktracker    4 Tasktracker
  6x 500 TB HDFS   6x 500 TB HDFS   6x 500 TB HDFS




                   Tasktracker
                   Spot Instances
Adding S3 / External Tables to Hive

create external table $tmp_table_name
        (size bigint, path string)
          ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
        stored as
         INPUTFORMAT "org.apache.hadoop.mapred.lib.NLineInputFormat"
         OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
        location 's3n://...';
SET ...
from (
     select transform (size, path)
        using './current.tar.gz/bin/importer transform${max_lines}' as
        (crawl_ts int, screen_name string, ... num_tweets int)
         ROW FORMAT DELIMITED
             FIELDS TERMINATED BY '\001'
             COLLECTION ITEMS TERMINATED BY '\002'
             MAP KEYS TERMINATED BY '\003'
             LINES TERMINATED BY '\n'
                from $tmp_table_name
        ) f
INSERT overwrite TABLE crawls PARTITION (crawl_day='${day}')
     select
            crawl_ts, ... user_json, tweets,
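
The contract between Hive and a transform script can be sketched like this. It is not the `importer` from the slide, whose output columns are elided there: Hive hands the selected columns (here `size`, `path`) to the script as tab-separated lines on stdin, and the script writes rows in the format declared after `as (...)`, here fields separated by the control-A character `\001`. The two output columns below are purely hypothetical.

```python
import sys

FIELD_SEP = "\x01"  # matches FIELDS TERMINATED BY '\001'


def transform(lines):
    """Turn tab-separated (size, path) input rows into control-A delimited rows."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty input lines
        size, path = line.split("\t", 1)
        # Hypothetical output columns: the real importer emits crawl data instead.
        yield FIELD_SEP.join([path, size])


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Entry point a script run by Hive would call: stream stdin to stdout."""
    for row in transform(stdin):
        stdout.write(row + "\n")
```

Because the external table is read with NLineInputFormat, each map task receives a fixed number of input lines, so the list of files is spread across tasks instead of being processed by one.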
https://launchpad.net/ubuntu/+source/cloud-init
http://www.netfort.gr.jp/~dancer/software/dsh.html.en
packages:
 - puppet

# Send pre-generated ssh private keys to the server
ssh_keys:
 rsa_private: | ${SSH_RSA_PRIVATE_KEY}
 rsa_public: ${SSH_RSA_PUBLIC_KEY}
 dsa_private: | ${SSH_DSA_PRIVATE_KEY}
 dsa_public: ${SSH_DSA_PUBLIC_KEY}

# set up mount points
# remove default mount points
mounts:
 - [ swap, null ]
 - [ ephemeral0, null ]

# Additional YUM Repositories
repo_additions:
- source: "lovely-public"
  name: "Lovely Systems, Public Repository for RHEL 6 compatible Distributions"
  filename: lovely-public.repo
  enabled: 1
  gpgcheck: 1
  key: "file:///etc/pki/rpm-gpg/RPM-GPG-KEY-lovely"
  baseurl: "https://yum.lovelysystems.com/public/release"

runcmd:
 - [ hostname, "${HOST}" ]
 - [ sed, -i, -e, "s/^HOSTNAME=.*/HOSTNAME=${HOST}/", /etc/sysconfig/network ]
 - [ wget, "http://169.254.169.254/latest/meta-data/local-ipv4", -O, /tmp/local-ipv4 ]
 - [ sh, -c, echo "$(/bin/cat /tmp/local-ipv4)    ${HOST}    ${HOST_NAME}" >> /etc/hosts ]

 - [ rpm, --import, "https://yum.lovelysystems.com/public/RPM-GPG-KEY-lovely"]

 - [ mkdir, -p, /var/lib/puppet/ssl/private_keys ]
 - [ mkdir, -p, /var/lib/puppet/ssl/public_keys ]
 - [ mkdir, -p, /var/lib/puppet/ssl/certs ]
${PUPPET_PRIVATE_KEY}
 - [ mv, /tmp/puppet_private_key.pem, /var/lib/puppet/ssl/private_keys/${HOST}.pem ]
${PUPPET_PUBLIC_KEY}
 - [ mv, /tmp/puppet_public_key.pem, /var/lib/puppet/ssl/public_keys/${HOST}.pem ]
${PUPPET_CERT}
 - [ mv, /tmp/puppet_cert.pem, /var/lib/puppet/ssl/certs/${HOST}.pem ]

 - [ sh, -c, echo "    server = ${PUPPET_MASTER}" >> /etc/puppet/puppet.conf ]
 - [ sh, -c, echo "    certname = ${HOST}" >> /etc/puppet/puppet.conf ]
 - [ /etc/init.d/puppet, start ]
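
The `${HOST}`-style placeholders above have to be filled in before the user data is handed to an instance. How that was done for the talk is not shown; one possible sketch uses Python's `string.Template`, which happens to use the same `${NAME}` syntax. The template excerpt and the function name are assumptions.

```python
from string import Template

# Two lines borrowed from the runcmd section above; the rest is omitted.
USER_DATA_TEMPLATE = Template(
    "runcmd:\n"
    ' - [ hostname, "${HOST}" ]\n'
    ' - [ sh, -c, echo "    certname = ${HOST}" >> /etc/puppet/puppet.conf ]\n'
)


def render_user_data(host: str) -> str:
    """Fill in the per-instance values; raises KeyError if a placeholder is missing."""
    return USER_DATA_TEMPLATE.substitute(HOST=host)


user_data = render_user_data("es-01")
```

Lines that contain shell command substitution such as `$(...)` would need the dollar sign escaped as `$$` for `substitute()` to accept them.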
- IO
- ES Memory
- ES Backup
- ES Replicas
- Load while indexing
- AWS Limits
EBS performance




                  http://blog.dt.org
• Shard allocation
• Avoid rebalancing (Discovery Timeout)
• Uncached Facets
  https://github.com/lovelysystems/elasticsearch-ls-plugins
• LUCENE-2205
  Rework of the TermInfosReader class to remove the
  Terms[], TermInfos[], and the index pointer long[] and create
  a more memory efficient data structure.
3 AP server / MC
c1.xlarge

6 ES Master Nodes                      6 Node Hadoop Cluster
c1.xlarge                              + Spot Instances

40 ES nodes per zone
m1.large
8 EBS Volumes
Everything fine?
Cutting the cost
• Reduce the amount of Data
  use Hadoop/MapRed transform to
  eliminate SPAM, irrelevant Languages,...
• no more time-based indices
• Dedicated Hardware
• SSD Disks
• Share Hardware for ES and Hadoop
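
The first point, reducing the data before it reaches Elasticsearch, could look roughly like the filter below when run as a map-side transform. The record layout is an assumption: the field names `lang` and `is_spam` and the language whitelist are invented for illustration, and the actual spam detection is not described in the slides.

```python
import json

KEEP_LANGUAGES = {"en", "de"}  # hypothetical whitelist of relevant languages


def keep(record: dict) -> bool:
    """Drop records flagged as spam or written in a language we do not index."""
    if record.get("is_spam"):
        return False
    return record.get("lang") in KEEP_LANGUAGES


def filter_lines(lines):
    """Yield only the JSON lines whose records pass the filter."""
    for line in lines:
        record = json.loads(line)
        if keep(record):
            yield json.dumps(record, sort_keys=True)
```

Every record dropped here is one that never has to be indexed, replicated, or held in memory by the search cluster.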
Jenkins for Workflows
distcp




       S3                      HDFS




                  transform




https://github.com/lovelysystems/ls-hive
https://github.com/lovelysystems/ls-thrift-py-hadoop
That's thirty minutes away.
I'll be there in ten.

      @jodok

Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 

You know, for search. Querying 24 Billion Documents in 900ms

  • 1. You know, for search: querying 24 000 000 000 records in 900 ms. @jodok
  • 13. The anatomy of a tweet http://www.readwriteweb.com/archives/what_a_tweet_can_tell_you.php
  • 14. c1.xlarge 5 x m2.2xlarge EBS ES as document store - bash - 5 instances - find - weekly indexes - zcat - 2 replicas - curl - EBS volume
  • 15. • Map/Reduce to push to Elasticsearch • via NFS to HDFS storage HDFS ES • no dedicated nodes MAPRED
  • 16. - Disk IO - concatenated gzip files - compression
  • 19. Hadoop Storage - Index “Driver”
      - Namenode, Datanode, 2 Tasktracker, 6x 500 TB HDFS
      - Jobtracker, Secondary NN, Datanode, 2 Tasktracker, 6x 500 TB HDFS
      - Hive, Datanode, 2 Tasktracker, 6x 500 TB HDFS
      - Datanode, 4 Tasktracker, 6x 500 TB HDFS
      - Datanode, 4 Tasktracker, 6x 500 TB HDFS
      - Datanode, 4 Tasktracker, 6x 500 TB HDFS
      - Tasktracker, Spot Instances
  • 20. Adding S3 / External Tables to Hive
      create external table $tmp_table_name
        (size bigint, path string)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
        stored as
          INPUTFORMAT "org.apache.hadoop.mapred.lib.NLineInputFormat"
          OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
        location s3n://...;
      SET ...
      from (
        select transform (size, path)
          using './current.tar.gz/bin/importer transform${max_lines}' as
          (crawl_ts int, screen_name string, ... num_tweets int)
          ROW FORMAT DELIMITED
            FIELDS TERMINATED BY '\001'
            COLLECTION ITEMS TERMINATED BY '\002'
            MAP KEYS TERMINATED BY '\003'
            LINES TERMINATED BY '\n'
        from $tmp_table_name
      ) f
      INSERT overwrite TABLE crawls PARTITION (crawl_day='${day}')
      select crawl_ts, ... user_json, tweets,
  • 22.
      packages:
       - puppet

      # Send pre-generated ssh private keys to the server
      ssh_keys:
        rsa_private: |
          ${SSH_RSA_PRIVATE_KEY}
        rsa_public: ${SSH_RSA_PUBLIC_KEY}
        dsa_private: |
          ${SSH_DSA_PRIVATE_KEY}
        dsa_public: ${SSH_DSA_PUBLIC_KEY}

      # set up mount points
      # remove default mount points
      mounts:
       - [ swap, null ]
       - [ ephemeral0, null ]

      # Additional YUM Repositories
      repo_additions:
       - source: "lovely-public"
         name: "Lovely Systems, Public Repository for RHEL 6 compatible Distributions"
         filename: lovely-public.repo
         enabled: 1
         gpgcheck: 1
         key: "file:///etc/pki/rpm-gpg/RPM-GPG-KEY-lovely"
         baseurl: "https://yum.lovelysystems.com/public/release"

      runcmd:
       - [ hostname, "${HOST}" ]
       - [ sed, -i, -e, "s/^HOSTNAME=.*/HOSTNAME=${HOST}/", /etc/sysconfig/network ]
       - [ wget, "http://169.254.169.254/latest/meta-data/local-ipv4", -O, /tmp/local-ipv4 ]
       - [ sh, -c, echo "$(/bin/cat /tmp/local-ipv4) ${HOST} ${HOST_NAME}" >> /etc/hosts ]
       - [ rpm, --import, "https://yum.lovelysystems.com/public/RPM-GPG-KEY-lovely"]
       - [ mkdir, -p, /var/lib/puppet/ssl/private_keys ]
       - [ mkdir, -p, /var/lib/puppet/ssl/public_keys ]
       - [ mkdir, -p, /var/lib/puppet/ssl/certs ]
       ${PUPPET_PRIVATE_KEY}
       - [ mv, /tmp/puppet_private_key.pem, /var/lib/puppet/ssl/private_keys/${HOST}.pem ]
       ${PUPPET_PUBLIC_KEY}
       - [ mv, /tmp/puppet_public_key.pem, /var/lib/puppet/ssl/public_keys/${HOST}.pem ]
       ${PUPPET_CERT}
       - [ mv, /tmp/puppet_cert.pem, /var/lib/puppet/ssl/certs/${HOST}.pem ]
       - [ sh, -c, echo " server = ${PUPPET_MASTER}" >> /etc/puppet/puppet.conf ]
       - [ sh, -c, echo " certname = ${HOST}" >> /etc/puppet/puppet.conf ]
       - [ /etc/init.d/puppet, start ]
  • 23. - IO - ES Memory - ES Backup - ES Replicas - Load while indexing - AWS Limits
  • 24. EBS performance http://blog.dt.org
  • 25. • Shard allocation • Avoid rebalancing (Discovery Timeout) • Uncached Facets https://github.com/lovelysystems/elasticsearch-ls-plugins • LUCENE-2205 Rework of the TermInfosReader class to remove the Terms[], TermInfos[], and the index pointer long[] and create a more memory efficient data structure.
  • 26. 3 AP server / MC c1.xlarge 6 ES Master Nodes 6 Node Hadoop Cluster c1.xlarge + Spot Instances 40 ES nodes per zone m1.large 8 EBS Volumes
  • 33. Cutting the cost • Reduce the amount of data: use a Hadoop/MapRed transform to eliminate spam, irrelevant languages, ... • no more time-based indices • Dedicated hardware • SSD disks • Share hardware for ES and Hadoop
  • 35. distcp S3 HDFS transform https://github.com/lovelysystems/ls-hive https://github.com/lovelysystems/ls-thrift-py-hadoop
  • 38. That's thirty minutes away. I'll be there in ten. @jodok
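
Slide 14 describes the first iteration as a shell pipeline (find, zcat, curl) that pushes gzipped crawl files into Elasticsearch. The same idea can be sketched in Python, assuming each gzip file holds one JSON document per line and that documents are sent through the bulk API. The function names, the `id` field and the index name are hypothetical, not taken from the talk. The metadata keys follow the current bulk format; the Elasticsearch versions available at the time of the talk also expected a `_type`.

```python
import gzip
import io
import json


def bulk_body(lines, index, id_field="id"):
    """Build a bulk request body (NDJSON) from an iterable of JSON lines."""
    out = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        doc = json.loads(line)
        # Each document is preceded by an action line naming index and id.
        action = {"index": {"_index": index, "_id": str(doc[id_field])}}
        out.append(json.dumps(action))
        out.append(json.dumps(doc))
    # The bulk API expects the body to end with a newline.
    return "\n".join(out) + "\n" if out else ""


def bulk_body_from_gzip(raw_bytes, index):
    """Equivalent of `zcat file | ...` for gzip data already in memory."""
    with gzip.open(io.BytesIO(raw_bytes), "rt", encoding="utf-8") as fh:
        return bulk_body(fh, index)
```

The resulting string would be POSTed to the cluster's `_bulk` endpoint, which is the role curl plays on the slide.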
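
Slide 16 lists concatenated gzip files among the pain points without saying what went wrong. One common pitfall, shown here as an assumption rather than as the speaker's diagnosis, is that a reader which stops after the first gzip member silently drops the rest of the file. The helper name `gunzip_all_members` is hypothetical.

```python
import gzip
import zlib


def gunzip_all_members(data):
    """Decompress every member of a concatenated gzip byte string."""
    chunks = []
    while data:
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        member = zlib.decompressobj(16 + zlib.MAX_WBITS)
        chunks.append(member.decompress(data))
        chunks.append(member.flush())
        # Bytes after the end of this member belong to the next one.
        data = member.unused_data
    return b"".join(chunks)
```

Python's `gzip.decompress` and `gzip.open` already read all members, so the explicit loop matters mainly when working with `zlib` directly or with tools that stop at the first member.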
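
The Hive statement on slide 20 hands each `(size, path)` row to an external `importer` script via `TRANSFORM`. Hive streams the input rows to the script as tab-separated lines on stdin and parses what the script writes to stdout using the delimiters declared in the query. The following is a minimal sketch of such a script, not the importer from the talk: `load_records` is a hypothetical callback that turns a path into record dicts, and only the three column names visible on the slide are emitted.

```python
import sys

FIELD_SEP = "\x01"  # corresponds to FIELDS TERMINATED BY '\001'
COLUMNS = ("crawl_ts", "screen_name", "num_tweets")  # elided columns left out


def transform(lines, load_records):
    """Yield one delimited output row per record found in each input file."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        size, path = line.split("\t", 1)
        if int(size) == 0:
            continue  # nothing to import from an empty file
        for record in load_records(path):
            yield FIELD_SEP.join(str(record[c]) for c in COLUMNS)


def main(load_records):
    """Entry point when run under Hive: stdin in, stdout out."""
    for row in transform(sys.stdin, load_records):
        sys.stdout.write(row + "\n")
```

Keeping the row logic in a generator separate from stdin and stdout makes it possible to test the script without a Hive cluster.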
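
The cloud-config on slide 22 is a template: placeholders such as `${HOST}` and `${PUPPET_MASTER}` are filled in before the file is passed to an instance as user data. The talk does not say which tool performs the substitution. As one possible approach, Python's `string.Template` uses the same `${NAME}` syntax; the excerpt and the function name below are illustrative only.

```python
from string import Template

# Shortened excerpt of the slide's cloud-config, used as a template.
USER_DATA = Template("""\
#cloud-config
runcmd:
 - [ hostname, "${HOST}" ]
 - [ sh, -c, echo " server = ${PUPPET_MASTER}" >> /etc/puppet/puppet.conf ]
 - [ sh, -c, echo " certname = ${HOST}" >> /etc/puppet/puppet.conf ]
""")


def render_user_data(host, puppet_master):
    """Fill in the placeholders; a missing value raises KeyError."""
    return USER_DATA.substitute(HOST=host, PUPPET_MASTER=puppet_master)
```

With this approach, shell constructs that contain a literal dollar sign, such as the `$(/bin/cat ...)` command substitution in the full file, would have to be written as `$$(...)` so that `string.Template` leaves them alone.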

Editor's Notes

  10. How do I work? Agile leader; I say what I do; I do what I say; hands on; quality over speed; responsibility to team; attract specialists; not trying to sell something, but DO IT. DELIVER.
  16. How I feel: I take things on.
  23. How I feel: I take things on.