HADOOP AT NOKIA



                            JOSH DEVINS, NOKIA | HADOOP MEETUP, BERLIN, JANUARY 2011



Two parts:
 * technical setup
 * applications

before starting, a question: what are the Hadoop experience levels in the room, from none to some to
lots? and who manages their own cluster?
TECHNICAL SETUP





http://www.flickr.com/photos/josecamoessilva/2873298422/sizes/o/in/photostream/
GLOBAL ARCHITECTURE





Scribe for logging
agents on local machines forward to downstream collector nodes
collectors forward on to more downstream nodes or to final destination(s) like HDFS
buffering at each stage to deal with network outages
since Scribe buffers to local disk, you must consider the storage available on ALL nodes where
Scribe is running to determine your risk of potential data loss
global Scribe not deployed yet, but the London DC is done
do it all over again? we would probably use Flume, but Scribe was chosen before Flume existed; Flume is:
 * much more flexible and easily extensible
 * tunable, with more reliability guarantees (data loss acceptable or not)
 * can also speak the syslog wire protocol, which is nice for compatibility’s sake
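
to make the buffering concrete, here is a minimal sketch of a Scribe agent store in Scribe’s
config format; the collector hostname, port and paths are assumptions, not our actual config:

    port=1463

    # buffer store: try the downstream collector first, spill to local disk on failure
    <store>
      category=default
      type=buffer
      retry_interval=30

      <primary>
        # forward to a downstream collector node (hypothetical host)
        type=network
        remote_host=collector01.example.com
        remote_port=1463
      </primary>

      <secondary>
        # local-disk buffer used during outages; size this on every node running Scribe
        type=file
        fs_type=std
        file_path=/var/scribe/buffer
        max_size=100000000
      </secondary>
    </store>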
DATA NODE HARDWARE

                DC          LONDON                     BERLIN

              cores         12x (w/ HT)                4x 2.00 GHz (w/ HT)

               RAM          48GB                       16GB

               disks        12x 2TB                    4x 1TB

            storage         24TB                       4TB

               LAN          1Gb                        2x 1Gb (bonded)


http://www.flickr.com/photos/torkildr/3462607995/in/photostream/

BERLIN
• HP DL160 G6
• 1x Quad-core Intel Xeon E5504 @ 2.00 GHz (4 cores total)
• 16GB DDR3 RAM
• 4x 1TB 7200 RPM SATA
• 2x 1Gb LAN
• iLO Lights-Out 100 Advanced
MASTER NODE HARDWARE

                DC          LONDON                    BERLIN

              cores         12x (w/ HT)               8x 2.00 GHz (w/ HT)

               RAM          48GB                      32GB

               disks        12x 2TB                   4x 1TB

            storage         24TB                      4TB

               LAN          10Gb                      4x 1Gb (bonded, DRBD/Heartbeat)


BERLIN
• HP DL160 G6
• 2x Quad-core Intel Xeon E5504 @ 2.00 GHz (8 cores total)
• 32GB DDR3 RAM
• 4x 1TB 7200 RPM SATA
• 4x 1Gb Ethernet (2x LAN, 2x DRBD/Heartbeat)
• iLO Lights-Out 100 Advanced (hadoop-master[01-02]-ilo.devbln)
MEANING?

    •   Size

         •   Berlin: 2 master nodes, 13 data nodes, ~17TB HDFS

         •   London: “large enough to handle a year’s worth of activity log
             data, with plans for rapid expansion”

    •   Scribe

         •   250,000 1KB msg/sec

         •   244MB/sec, 14.3GB/hr, 343GB/day


Berlin: 1 rack, 2 switches for...
London: it’s secret!
PHYSICAL OR CLOUD?
  •   Physical
       •   Capital cost
            •   1 rack w/ 2x switches
            •   15x HP DL160 servers
            •   ~€20,000
       •   Annual operating costs
            •   power and cooling: €5,265 @ €0.24/kWh
            •   rent: €3,600
            •   hardware support contract: €2,000 (disks replaced on warranty)
            •   €10,865
  •   Cloud (AWS)
       •   Elastic MR, 15 extra large nodes, 10% utilized: $1,560
       •   S3, 5TB: $7,800
       •   $9,360 or €6,835


http://www.flickr.com/photos/dumbledad/4745475799/

Question: How many run their own clusters of physical hardware vs AWS or virtualized?
the actual decision is completely dependent on many factors, including any existing DC, data
set sizes, etc.
UTILIZATION





here’s what to show your boss if you want hardware
UTILIZATION





here’s what to show your boss if you want the cloud
GRAPHING AND MONITORING
    •   Ganglia for graphing/trending

         •   “native” support in Hadoop to push metrics to Ganglia (config sketch after this slide’s notes)

              •   map or reduce tasks running, slots open, HDFS I/O, etc.

         •   excellent for system graphing like CPU, memory, etc.

         •   scales out horizontally

         •   no configuration - just push metrics from nodes to collectors and they will graph it

    •   Nagios for monitoring

         •   built into our Puppet infrastructure

         •   machines that come up go automatically into Nagios with basic system checks

         •   scriptable to easily check other things like JMX



dashboards we basically always have up: JobTracker, Oozie, Ganglia
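
the Ganglia hookup is configured in hadoop-metrics.properties; a minimal sketch for the 0.20-era
metrics framework, with a placeholder gmond host (GangliaContext31 matches the Ganglia 3.1+ wire
format; older Ganglia uses GangliaContext):

    # /etc/hadoop/hadoop-metrics.properties
    # push DFS, MapReduce and JVM metrics to a Ganglia collector every 10s
    dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
    dfs.period=10
    dfs.servers=gmond01.example.com:8649

    mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
    mapred.period=10
    mapred.servers=gmond01.example.com:8649

    jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
    jvm.period=10
    jvm.servers=gmond01.example.com:8649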
GANGLIA





master nodes are mostly idle
GANGLIA





data nodes overview
GANGLIA




                                              [Ganglia detail graph: annotations mark “start reduce” and “map”]





in the detail view you can actually see the phases of a MapReduce job
not totally accurate here, since there were multiple jobs running at the same time
SCHEDULER





(side note: we use the fair scheduler, which works pretty well; a config sketch follows)
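
a minimal sketch of enabling the fair scheduler on a 0.20-era JobTracker; the pool name and slot
counts are hypothetical, and the allocation file path matches the one Puppet manages later in
this deck:

    <!-- mapred-site.xml: swap the default FIFO scheduler for the fair scheduler -->
    <property>
      <name>mapred.jobtracker.taskScheduler</name>
      <value>org.apache.hadoop.mapred.FairScheduler</value>
    </property>
    <property>
      <name>mapred.fairscheduler.allocation.file</name>
      <value>/etc/hadoop/fair-scheduler.xml</value>
    </property>

    <!-- fair-scheduler.xml: guarantee a pipeline some slots (hypothetical pool) -->
    <allocations>
      <pool name="reporting">
        <minMaps>10</minMaps>
        <minReduces>5</minReduces>
      </pool>
    </allocations>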
INFRASTRUCTURE MANAGEMENT





we use Puppet
Question: Who here has used or knows of Puppet/Chef/etc?

pros
 * used throughout the rest of our infrastructure
 * all configuration of Hadoop and machines is in source control/Subversion
cons
 * no push from a central server
 * can only pull from each node (pssh is your friend: poke Puppet on all nodes, see the sketch below)
 * that’s it, Puppet rules
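
a sketch of that workaround, assuming a hosts file listing the cluster nodes and the Puppet
2.x-era agent command:

    # hosts.txt lists one Hadoop node per line; -i prints each host's output inline
    pssh -h hosts.txt -l root -i 'puppetd --test'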
PUPPET FOR HADOOP
                                   [Diagram: the three numbered steps in the Puppet chain]




more or less there are 3 steps in the Puppet chain
PUPPET FOR HADOOP
               package {
                   'hadoop': ensure => '0.20.2+320-14';
                   'rsync': ensure => installed;
                   'lzo': ensure => installed;
                   # hyphenated resource titles must be quoted to parse correctly
                   'lzo-devel': ensure => installed;
               }

               service {
                   iptables:
                       ensure => stopped,
                       enable => false;
               }

               # Hadoop account
               include account::users::la::hadoop

               file {
                   '/home/hadoop/.ssh/id_rsa':
                       mode => '600',
                       source => 'puppet:///modules/hadoop/home/hadoop/.ssh/id_rsa';
               }



example Puppet manifest
note: we rolled our own RPMs from the Cloudera packages since we didn’t like where
Cloudera put stuff on the servers and wanted a bit more control
PUPPET FOR HADOOP
               file {
                   # raw configuration files
                   '/etc/hadoop/core-site.xml':
                       source => "$src_dir/core-site.xml";
                   '/etc/hadoop/hdfs-site.xml':
                       source => "$src_dir/hdfs-site.xml";
                   '/etc/hadoop/mapred-site.xml':
                       source => "$src_dir/mapred-site.xml";
                   '/etc/hadoop/fair-scheduler.xml':
                       source => "${src_dir}/fair-scheduler.xml";
                   '/etc/hadoop/masters':
                       source => "$src_dir/masters";
                   '/etc/hadoop/slaves':
                       source => "$src_dir/slaves";

                      # templated configuration files
                      '/etc/hadoop/hadoop-env.sh':
                          content => template('hadoop/conf/hadoop-env.sh.erb'),
                          mode => '555';
                      '/etc/hadoop/log4j.properties':
                          content => template('hadoop/conf/log4j.properties.erb');
               }


Hadoop config files
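
for illustration, a tiny hypothetical excerpt of what one of those ERB templates might contain;
the variable names are assumptions, not our actual template:

    # hadoop/conf/hadoop-env.sh.erb (hypothetical excerpt)
    export JAVA_HOME=<%= java_home %>
    export HADOOP_HEAPSIZE=<%= hadoop_heapsize %>
    export HADOOP_LOG_DIR=/var/log/hadoop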
APPLICATIONS





http://www.flickr.com/photos/thomaspurves/1039363039/sizes/o/in/photostream/

that wraps up the setup stuff, any questions on that?
REPORTING





operational - access logs, throughput, general usage, dashboards
business reporting - what are all of the products doing, how do they compare to other
months
ad-hoc - random business queries

almost all of this goes through Pig
there are several pipelines that use Oozie to tie the parts together
lots of parsing and decoding in Java MR jobs, then Pig for the heavy lifting
results mostly go into an RDBMS using Sqoop, for display and querying in other tools (sketch below)
currently using Tableau to do live dashboards
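
a rough sketch of one such pipeline stage; the paths, schema and table names are hypothetical,
not our actual jobs. Pig does the aggregation:

    -- report.pig: aggregate decoded logs by product
    logs = LOAD '/data/decoded/2011/01' AS (ts:long, product:chararray, user:chararray);
    by_product = GROUP logs BY product;
    report = FOREACH by_product GENERATE group AS product, COUNT(logs) AS hits;
    STORE report INTO '/reports/product_hits' USING PigStorage(',');

and Sqoop exports the result to the RDBMS:

    # export the Pig output into a MySQL table for the dashboards
    sqoop export --connect jdbc:mysql://db01.example.com/reports \
        --table product_hits --export-dir /reports/product_hits \
        --input-fields-terminated-by ','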
IKEA!





other than reporting, we also occasionally do some data exploration, which can be quite fun
any guesses what this is a plot of?
geo-searches for Ikea!
[Map: Berlin geo-searches for Ikea, with labeled clusters: Prenzl Berg Yuppies, Ikea Spandau,
Ikea Tempelhof, Ikea Schoenefeld]





Ikea geo-searches bounded to Berlin
can we make any assumptions about what the actual locations are?
kind of, but not much data here
clearly there is a Tempelhof cluster but the others are not very evident
certainly shows the relative popularity of all the locations
Ikea Lichtenberg was not open yet during this time frame
[Map: London geo-searches for Ikea, with labeled clusters: Ikea Edmonton, Ikea Wembley,
Ikea Lakeside, Ikea Croydon]



Ikea geo-searches bounded to London
can we make any assumptions about what the actual locations are?
turns out we can!
using a clustering algorithm like K-Means (maybe from Mahout) we could probably make a good guess; a sketch follows
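
a hedged sketch of that guess using Mahout’s k-means driver (flags as of Mahout ~0.4; the paths
and k are assumptions, and the search coordinates must first be converted to Mahout vectors):

    # cluster (lat, lon) search vectors into 4 candidate store locations
    mahout kmeans \
        -i /mining/ikea/vectors \
        -c /mining/ikea/initial-centroids \
        -o /mining/ikea/clusters \
        -k 4 \
        -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure \
        -cd 0.5 -x 10 -cl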

> this is considering search location, what about time?
Berlin

distribution of searches over days of the week and hours of the day
certainly can make some comments about the hours that Berliners are awake
can we make assumptions about average opening hours?
Berlin

upwards trend a couple hours before opening
can also clearly make some statements about the best time to visit Ikea in Berlin - Sat night!

BERLIN
 * Mon-Fri 10am-9pm
 * Saturday 10am-10pm
London

more data points again so we get smoother results
London

LONDON
 * Mon-Fri 10am-10pm
 * Saturday 9am-10pm
 * Sunday 11am-5pm

> potential revenue stream?
> what to do with this data or data like this?
PRODUCTIZING





taking data and ideas and turning them into something useful, features that mean something
often the next step after data mining and exploration
either static data shipped to devices or web products, or live data that is constantly fed back
to web products/web services
BERLIN





another example of something that can be productized

Berlin
 * traffic sensors
 * map tiles
LOS ANGELES





LA
 * traffic sensors
 * map tiles
BERLIN                                  LA





Starbucks index comes from POI data set, not from the heatmaps you just saw
JOIN US!

    • Nokia is hiring in Berlin!

         • analytics engineers, smart data folks

         • software engineers

         • operations

         • josh.devins@nokia.com

         • www.nokia.com/careers

THANKS!
        JOSH DEVINS | www.joshdevins.net | @joshdevins



