Hadoop 101




© Copyright 2011 EMC Corporation. All rights reserved.                1
Agenda
                                                   •  What is Hadoop ?
                                                   •  History of Hadoop
                                                   •  Hadoop Components
                                                   •  Hadoop Ecosystem
                                                   •  Customer Use Cases
                                                   •  Hadoop Challenges




What is Hadoop?




What is Hadoop?
      The Apache Hadoop software library is a framework that allows for the
      distributed processing of large data sets across clusters of computers
      using a simple programming model. It is designed to scale up from single
      servers to thousands of machines, each offering local computation and
      storage. Rather than rely on hardware to deliver high-availability, the library
      itself is designed to detect and handle failures at the application layer, so
      delivering a highly-available service on top of a cluster of computers, each of
      which may be prone to failures.

          •  Concepts :
                    –  NameNode (aka Master) is responsible for managing the
                       namespace repository (index) for the filesystem; job
                       management is handled by the JobTracker
                    –  DataNode (aka Segment) is responsible for storing blocks of
                       data; a TaskTracker on the same node runs tasks
                    –  MapReduce – Push compute to where data resides



What is Hadoop and Where did it start?

•  Created by Doug Cutting
          –  HDFS (storage) & MapReduce (compute)

          –  Inspired by Google’s MapReduce and Google File
             System (GFS) papers

•  It is now a top-level Apache project backed by large
   open source development community

•  Three related projects from Doug Cutting
          –  Nutch (web crawler)

          –  Lucene (text search library)

          –  Hadoop (split out of Nutch in 2006)




What makes Hadoop Different?
 •  Hadoop is a complete paradigm shift
 •  Bypasses 25yrs of enterprise ceilings
 •  Hadoop also defers some really difficult challenges:
           –  Non-transactional
           –  File System is essentially read-only



 •  “Greater potential for the Hadoop architecture to mature
    and handle the complexity of transactions than for RDBMS
    to figure out failures and data growth”




Confluence of Factors
  •  Hadoop makes analytics on large scale data sets
     more pragmatic
            –  BI Solutions often suffer from garbage-in, garbage-out

            –  Opens up new ways of understanding and thus running lines of business


  •  Classic Architectures won’t scale any further
  •  New sources of information (social media) are too
     big and unwieldy for traditional solutions
            –  5-year enterprise data growth is estimated at 650%, with over 80% of that
               unstructured data (e.g., Facebook collects 100TB per day)


  •  What works for Google & Facebook!

Hadoop vs Relational Solutions
      •  Hadoop is a paradigm shift in the way we think about and manage data
      •  Traditional solutions were not designed with growth in mind
      •  Big-Data accelerates this problem dramatically


 Category          Traditional RDBMS                  Hadoop
 ----------------  ---------------------------------  ----------------------------------
 Scalability       Resource constrained               Linear expansion
                   Re-architecture                    Seamless addition & subtraction
                                                      of nodes
                   ~ 5TB                              ~ 5PB
 Fault Tolerance   Afterthought, many critical        Designed in, tasks are
                   points of failure                  automatically restarted
 Problem Space     Transactional, OLTP                Batch, OLAP (today!)
                   Inability to incorporate new       No bounds
                   sources



History of Hadoop?




History
       •  Google paper: MapReduce: Simplified Data Processing on Large Clusters – 2004
                 –  GFS (2003 paper) & MapReduce framework
                 –  Hadoop split out of Nutch as its own project – 2006

       •  Top Level Apache Open Source Community Project -
          2008
       •  Yahoo, Facebook, and Powerset become the main contributors, with
          Yahoo running over 10K nodes (300K cores) - 2009
       •  Hadoop cluster at Yahoo sets Terasort benchmark standard – Jul 08
           –  209s to sort 1TB (62 seconds NOW)
           –  Cluster Config
               •  910 Nodes - 4 dual core Xeons @ 2.0GHz, 8GB RAM, 4 SATA disks
               •  1Gb Ethernet
               •  40 nodes per rack
               •  8Gb Ethernet uplinks from each rack
               •  RH Server 5.1
               •  Sun JDK 1.6.0_05-b13
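
The TeraSort figures above can be turned into a rough throughput number. This is a back-of-the-envelope sketch only: it assumes 1TB means 10**12 bytes and that all 910 nodes contributed equally, neither of which the slide states. The function name `throughput` is made up for illustration.

```python
# Rough throughput for the July 2008 TeraSort run quoted on the slide.
# Assumption: 1 TB = 10**12 bytes (decimal terabyte).
TOTAL_BYTES = 10**12
NODES = 910


def throughput(seconds, total_bytes=TOTAL_BYTES, nodes=NODES):
    """Return (aggregate bytes/s, per-node bytes/s) for sorting total_bytes."""
    aggregate = total_bytes / seconds
    return aggregate, aggregate / nodes


if __name__ == "__main__":
    agg, per_node = throughput(209)
    # prints: aggregate: 4.78 GB/s, per node: 5.26 MB/s
    print(f"aggregate: {agg / 1e9:.2f} GB/s, per node: {per_node / 1e6:.2f} MB/s")
```

The per-node figure is small compared with what a single disk can stream, which suggests the run was bounded by sort and network shuffle work rather than raw disk bandwidth.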



Hadoop Components




Component Design
[Figure: component stack diagram. Labels recovered from the slide:
 Analytics (R, Mahout); Packages (HBase, Hive, Pig);
 MapReduce (java, python, stream); HDFS; storage (jbod, jbod, jbod, ...)]
Hadoop Components
Two Core Components
•  HDFS: Storage
•  MapReduce: Compute



•  Storage & Compute in One Framework
•  Open Source Project of the Apache Software Foundation
•  Written in Java




HDFS




HDFS Concepts
•  Sits on top of native (ext3, xfs, etc.) file system
•  Performs best with a ‘modest’ number of large files
         –  Millions, rather than billions, of files
         –  Each file typically 100MB or more

•  Files in HDFS are ‘write once’
         –  No random writes to files are allowed
         –  Append support is available in Hadoop 0.21

•  HDFS is optimized for large, streaming reads of files
         –  Rather than random reads




HDFS
•  Hadoop Distributed File System
         –  Data is organized into files & directories
         –  Files are divided into blocks, typically 64-128MB each,
            and distributed across cluster nodes
         –  Block placement is known at runtime by map-reduce so
            computation can be co-located with data
         –  Blocks are replicated (default is 3 copies) to handle
            failure
         –  Checksums are used to ensure data integrity
•  Replication is the one and only strategy for error
   handling, recovery and fault tolerance
         –  Make multiple copies and be happy!
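
A minimal sketch of the block, replica and checksum ideas above, assuming a 128MB block size and 3 replicas as on the slide. The round-robin placement is a deliberate simplification (real HDFS placement is rack-aware), CRC32 is only a stand-in checksum, and all function names are hypothetical.

```python
import math
import zlib

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, upper end of the range on the slide
REPLICATION = 3                 # default number of copies per block


def block_count(file_bytes, block_size=BLOCK_SIZE):
    """Number of blocks needed to hold a file of file_bytes."""
    return math.ceil(file_bytes / block_size)


def raw_storage(file_bytes, replication=REPLICATION):
    """Cluster capacity consumed once every block is replicated."""
    return file_bytes * replication


def place_replicas(block_id, nodes, replication=REPLICATION):
    """Toy round-robin placement; assumes replication <= len(nodes)."""
    start = block_id % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replication)]


def checksum(data: bytes) -> int:
    """Stand-in integrity check: a reader recomputes and compares this value."""
    return zlib.crc32(data)


if __name__ == "__main__":
    one_gib = 1024 ** 3
    print(block_count(one_gib))                         # 8
    print(raw_storage(one_gib) // one_gib)              # 3
    print(place_replicas(1, ["n1", "n2", "n3", "n4"]))  # ['n2', 'n3', 'n4']
```

Losing any single node in this toy layout still leaves two copies of every block, which is the whole fault-tolerance strategy the slide describes.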



Hadoop Architecture - HDFS




HDFS Components
•  NameNode
•  DataNode
•  Standby NameNode
•  Job Tracker
•  Task Tracker




NameNode
•  Provides a centralized repository for the
   namespace
         –  An index of which files are stored in which blocks
•  Responds to client requests (map-reduce jobs) by
   coordinating distribution of tasks (algorithm)
         –  Make multiple copies and be happy!
•  In Memory only
         –  0.23 provides distributed namenode
         –  Namenode recovery must re-build entire meta-data
            repository
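
The in-memory index and its rebuild on recovery can be pictured with a toy model. This is an illustrative sketch, not the NameNode's real data structures: `ToyNameNode` and its methods are hypothetical, and the edit log here is just a Python list.

```python
class ToyNameNode:
    """Toy model: an in-memory map from file path to block ids."""

    def __init__(self):
        self.namespace = {}  # path -> list of block ids (lives in memory only)
        self.edit_log = []   # append-only record of every namespace change

    def add_file(self, path, block_ids):
        self.edit_log.append(("add", path, list(block_ids)))
        self.namespace[path] = list(block_ids)

    def delete_file(self, path):
        self.edit_log.append(("delete", path, None))
        self.namespace.pop(path, None)

    @classmethod
    def recover(cls, edit_log):
        """Rebuild the whole namespace by replaying the log from the start."""
        node = cls()
        for op, path, blocks in edit_log:
            if op == "add":
                node.add_file(path, blocks)
            elif op == "delete":
                node.delete_file(path)
        return node


if __name__ == "__main__":
    nn = ToyNameNode()
    nn.add_file("/logs/a.txt", [1, 2])
    nn.add_file("/logs/b.txt", [3])
    nn.delete_file("/logs/a.txt")
    restored = ToyNameNode.recover(nn.edit_log)
    print(restored.namespace)  # {'/logs/b.txt': [3]}
```

Because recovery replays every entry, restart time grows with the length of the log, which is one way to read the "must re-build entire meta-data repository" bullet.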



Hadoop Architecture - HDFS
•  Block level storage
•  N-Node replication
•  Namenode for
        –  File system index (EditLog)
        –  Access coordination
        –  IPC via TCP/IP

•  Datanode for
        –  Data Block Management
        –  Job Execution (MapReduce)

•  Automated Fault Tolerance



Job Tracker
•  MapReduce jobs are controlled by a software daemon
   known as the JobTracker
•  The JobTracker resides on a single node
         –  Clients submit MapReduce jobs to the JobTracker
         –  The JobTracker assigns Map and Reduce tasks to other
            nodes on the cluster
         –  These nodes each run a software daemon known as the
            TaskTracker
         –  The TaskTracker is responsible for actually instantiating the
            Map or Reduce task, and reporting progress back to the
            JobTracker
•  A Job consists of a collection of Map & Reduce Tasks
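
How a JobTracker might hand tasks to TaskTracker nodes can be sketched as a greedy loop that prefers a node already holding the task's block. This is an assumption-laden illustration, not the actual Hadoop scheduler; `assign_tasks` and its argument shapes are invented for the example.

```python
def assign_tasks(block_locations, free_slots):
    """Greedy, locality-first task assignment (toy model).

    block_locations: {task_id: [nodes holding a replica of the task's block]}
    free_slots:      {node: number of free task slots}
    Returns {task_id: (node, is_data_local)}.
    """
    slots = dict(free_slots)  # work on a copy so the caller's dict is untouched
    assignments = {}
    for task, replicas in block_locations.items():
        local = [n for n in replicas if slots.get(n, 0) > 0]
        if local:
            node = local[0]  # compute goes to where the data resides
        else:
            candidates = [n for n, s in slots.items() if s > 0]
            if not candidates:
                break  # cluster is full; remaining tasks wait
            node = candidates[0]  # fall back to a remote read
        slots[node] -= 1
        assignments[task] = (node, node in replicas)
    return assignments


if __name__ == "__main__":
    plan = assign_tasks(
        {"m0": ["n1", "n2"], "m1": ["n1"], "m2": ["n1"]},
        {"n1": 1, "n2": 1, "n3": 1},
    )
    print(plan)
    # {'m0': ('n1', True), 'm1': ('n2', False), 'm2': ('n3', False)}
```

The example also shows a weakness of naive greedy assignment: giving m0 to n2 instead would have let m1 run data-local.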


MapReduce




Map Reduce Framework
•  Map step
       –  Input records are parsed into
          intermediate key/value pairs
       –  Multiple Maps per Node
                 •  10TB => 128MB/Blk => 82K Maps

•  Reduce step
       –  Each Reducer handles all like
          keys
       –  3 Steps
                 •  Shuffle: All like keys are retrieved
                    from each Mapper
                 •  Sort: Intermediate keys are sorted
                    prior to reduce
                 •  Reduce: Values are processed
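
The "10TB => 128MB/Blk => 82K Maps" figure can be checked with a couple of lines. The sketch assumes one map task per block and binary units (1TB = 1024**4 bytes); `map_task_count` is a hypothetical helper name.

```python
import math

TB = 1024 ** 4
MB = 1024 ** 2


def map_task_count(input_bytes, block_bytes):
    """One map task per block: round up so a partial last block still gets a task."""
    return math.ceil(input_bytes / block_bytes)


if __name__ == "__main__":
    print(map_task_count(10 * TB, 128 * MB))  # 81920, i.e. roughly 82K maps
    print(map_task_count(10 * TB, 64 * MB))   # 163840 with 64MB blocks
```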




Map Reduce




Reduce Task
•  After the Map phase is over, all the intermediate values for a
   given intermediate key are combined together into a list
•  This list is given to a Reducer
         –  There may be a single Reducer, or multiple Reducers
         –  This is specified as part of the job configuration (see later)
         –  All values associated with a particular intermediate key are
            guaranteed to go to the same Reducer
         –  The intermediate keys, and their value lists, are passed to the
            Reducer in sorted key order
         –  This step is known as the ‘shuffle and sort’
•  The Reducer outputs zero or more final key/value pairs
         –  These are written to HDFS
•  In practice, the Reducer usually emits a single key/value pair
   for each input key
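
The map, shuffle-and-sort, and reduce steps can be simulated in a single process. The sketch below is a word count in plain Python, not Hadoop's Java API; the function names and the CRC32-based partitioner are assumptions chosen to make the example deterministic.

```python
import zlib
from collections import defaultdict


def map_phase(records):
    """Map: parse each input record into intermediate (key, value) pairs."""
    for line in records:
        for word in line.split():
            yield word.lower(), 1


def partition(key, num_reducers):
    """Stable hash, so every value for a given key lands on the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle_and_sort(pairs, num_reducers):
    """Group values by key per reducer, then sort each reducer's keys."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)][key].append(value)
    return [sorted(bucket.items()) for bucket in buckets]


def reduce_phase(sorted_groups):
    """Reduce: emit one final (key, value) pair per input key."""
    for key, values in sorted_groups:
        yield key, sum(values)


def run_job(records, num_reducers=2):
    output = {}
    for groups in shuffle_and_sort(map_phase(records), num_reducers):
        output.update(reduce_phase(groups))
    return output


if __name__ == "__main__":
    print(run_job(["the quick fox", "the lazy dog the end"]))
```

Changing `num_reducers` changes which reducer sees which key, but never the final counts, which is the guarantee the slide describes.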


Fault Tolerance
•  The JobTracker will only allocate tasks to active nodes
•  Map-Reduce can compensate for slow running jobs
          –  If a Mapper appears to be running significantly more slowly than the
             others, a new instance of the Mapper will be started on another
             machine, operating on the same data
          –  The results of the first Mapper to finish will be used
          –  Hadoop will kill off the Mapper which is still running

•  Yahoo experiences multiple failures (> 10) of various
   components (drives, cables, servers) every day
          –  Which have exactly 0 impact on operations
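
The "first Mapper to finish wins" behaviour can be mimicked with threads. This is a simplified sketch: both attempts start at the same time, whereas the slide describes launching the backup only after a Mapper looks slow. The names `run_attempt` and `run_with_speculation` and the delay values are invented for the example.

```python
import concurrent.futures as cf
import threading
import time


def run_attempt(name, delay, cancelled, data):
    """Simulated map attempt: works for `delay` seconds unless told to stop."""
    deadline = time.monotonic() + delay
    while time.monotonic() < deadline:
        if cancelled.is_set():
            return None  # this attempt was killed
        time.sleep(0.005)
    return name, sum(data)


def run_with_speculation(data, delays):
    """Run one attempt per entry in delays on the same data; keep the first result."""
    cancelled = threading.Event()
    with cf.ThreadPoolExecutor(max_workers=len(delays)) as pool:
        futures = [
            pool.submit(run_attempt, name, delay, cancelled, data)
            for name, delay in delays.items()
        ]
        done, _ = cf.wait(futures, return_when=cf.FIRST_COMPLETED)
        cancelled.set()  # kill off the attempt that is still running
        return next(iter(done)).result()


if __name__ == "__main__":
    winner = run_with_speculation([1, 2, 3], {"original": 1.0, "backup": 0.02})
    print(winner)  # ('backup', 6)
```

Because both attempts read the same input and the map function is deterministic, it does not matter which one wins; only the latency changes.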




Ecosystem




Hadoop Ecosystem




Ecosystem Distribution by Role
 [Figure: ecosystem grouped by role. Labels recovered from the slide:]
 •  Distribution (Apache)
 •  Reporting
 •  Analytics
 •  Monitoring
 •  Manageability
 •  Training
 •  IDE (Hadoop-ide)
 •  Consulting
 •  Data Integration
 •  Data Visualization
 •  UAP



Hadoop Components (hadoop.apache.org)
•  HDFS: Hadoop Distributed File System
•  MapReduce: Framework for writing scalable data applications
•  Pig: Procedural language that abstracts lower level MapReduce
•  Zookeeper: Highly reliable distributed coordination
•  Hive: System for querying and managing structured data built on top of HDFS (SQL-like query)
•  HBase: Database for random, real time read/write access
•  Oozie: Workflow/coordination to manage jobs
•  Mahout: Scalable machine learning libraries




Technology Adoption Lifecycle



[Figure: adoption curve with stages Innovators/Early Adopters, Early Majority,
 Late Majority, Laggards, and a "Today" marker]




Hbase, Pig, Hive




HBase Overview
     •  HBase is a sparse, distributed, persistent,
        scalable, reliable multi-dimensional map which is
        indexed by row key
               –  Hadoop Database, ~ “No-SQL” database
               –  Many relational features
               –  Scalable: Region Servers
               –  Multiple client access: Java, REST, Thrift
     •  What’s it good for?
               –  Queries against a number of rows that makes your
                  Oracle Server puke!
     •  HBase leverages HDFS for its storage


HBase in Practice
     •  High performance, real-time query
     •  Client is typically a Java program
     •  But HBase supports many other APIs:
               –  JSON: JavaScript Object Notation
               –  REST: Representational State Transfer
               –  Thrift, Avro: serialization/RPC frameworks




HBase – key/value Store
       •  Excellent key-based access to a specific cell or
          sequential cells of data
       •  Column Oriented Architecture (like GPDB)
                 –  Column Families: related attributes often queried
                    together
                 –  Members are stored together
       •  Versioning of cells is used to provide update
          capability
                 –  Change to an existing cell is stored as a new version
                    by timestamp
       •  No transactional guarantee
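
The row-key index, column families and timestamped cell versions can be pictured as nested maps. The class below is a toy in-memory model, not the HBase client API; `ToyTable` and its method names are hypothetical.

```python
from collections import defaultdict


class ToyTable:
    """Sparse map: row key -> 'family:qualifier' -> [(timestamp, value), ...]."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, value, timestamp):
        # An update never overwrites: it is stored as a new version.
        self._rows[row][column].append((timestamp, value))

    def get(self, row, column, timestamp=None):
        """Latest value, or the latest at or before `timestamp` if given."""
        versions = self._rows.get(row, {}).get(column, [])
        visible = [v for v in versions if timestamp is None or v[0] <= timestamp]
        return max(visible, key=lambda v: v[0])[1] if visible else None

    def scan(self, start_row, stop_row):
        """Sequential access: rows in sorted key order, start inclusive, stop exclusive."""
        for row in sorted(self._rows):
            if start_row <= row < stop_row:
                yield row, {c: self.get(row, c) for c in self._rows[row]}


if __name__ == "__main__":
    t = ToyTable()
    t.put("cust#001", "info:status", "OPEN", timestamp=1)
    t.put("cust#001", "info:status", "SHIPPED", timestamp=5)
    t.put("cust#002", "info:status", "OPEN", timestamp=2)
    print(t.get("cust#001", "info:status"))               # SHIPPED
    print(t.get("cust#001", "info:status", timestamp=3))  # OPEN
    print(list(t.scan("cust#001", "cust#002")))
```

Rows that never set a column simply have no entry for it, which is what "sparse" means here.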


Hive
•  Data Warehousing package built on top of Hadoop
•  System for managing and querying structured data
         –  Leverages MapReduce for execution
         –  Utilizes HDFS (or HBase) for storage
•  Data is stored in tables
         –  Consists of separate Schema metastore and data files
•  HiveQL is a sql-like language
         –  Queries are converted into MapReduce jobs




Hive – Basics & Syntax
--- Hive example
-- Tell Hive to use a local repository for mapreduce, not hdfs
-- (set hive to use local, non-hdfs storage)
hive> SET mapred.job.tracker=local;
hive> SET mapred.local.dir=/Users/hardar/Documents/training/HDWorkshop/labs/9.hive/data;
hive> SET hive.exec.mode.local.auto=false;

-- Create repository folders in hdfs (hive storage location, if not using local)
$ hadoop fs -mkdir     /tmp
$ hadoop fs -mkdir     /user/hive/warehouse
$ hadoop fs -chmod g+w /tmp
$ hadoop fs -chmod g+w /user/hive/warehouse

-- Create an orders table
create table orders (orderid bigint, customerid bigint, productid int,
    qty int, rate int, estdlvdate string, status string)
    row format delimited fields terminated by ",";

-- Load data from local file system
load data local inpath '9.hive/data/orders.txt' into table orders;

-- Query
select * from orders;

-- Create a products table
create table products (productid int, description string)
    row format delimited fields terminated by ",";

-- Load data from local file system
load data local inpath '9.hive/data/products.txt' into table products;

-- Query
select * from products;
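
To see what "queries are converted into MapReduce jobs" could mean for the orders table, consider a hypothetical aggregate such as the total of qty * rate per customerid. The Python sketch below imitates the map and reduce steps such a GROUP BY would need. It is not Hive's actual query plan, and the sample rows are made up.

```python
from collections import defaultdict

# Column order taken from the "create table orders" statement on the slide.
ORDER_FIELDS = ["orderid", "customerid", "productid", "qty", "rate",
                "estdlvdate", "status"]


def parse_order(line):
    """Split one comma-delimited row into a dict keyed by column name."""
    return dict(zip(ORDER_FIELDS, line.strip().split(",")))


def map_order(line):
    """Map: emit (group-by key, value to aggregate) for one row."""
    row = parse_order(line)
    return int(row["customerid"]), int(row["qty"]) * int(row["rate"])


def total_by_customer(lines):
    """Shuffle rows by customerid, then reduce each group with a sum."""
    groups = defaultdict(list)
    for line in lines:
        key, value = map_order(line)
        groups[key].append(value)
    return {key: sum(values) for key, values in sorted(groups.items())}


if __name__ == "__main__":
    sample = [  # made-up rows, for illustration only
        "1,100,7,2,10,2011-10-01,OPEN",
        "2,100,8,1,5,2011-10-02,OPEN",
        "3,200,7,3,10,2011-10-03,SHIPPED",
    ]
    print(total_by_customer(sample))  # {100: 25, 200: 30}
```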




Pig
 •  Provides a mechanism for using MapReduce without
    programming in Java
           –  Utilizes HDFS & MapReduce
 •  Allows for a more intuitive means to specify data
    flows
           –  High-level sequential, data flow language
           –  Pig Latin
           –  Python integration
                     •  Comfortable for researchers who are familiar with Perl &
                        Python
 •  Pig is easier to learn & execute, but more limited
    in scope of functionality than Java


Pig – Basics & Syntax
-- file : demographic.pig
--
-- extracts INCOME (in thousands) and ZIPCODE from census data.
-- Filters out ZERO incomes

-- Define a table and load directly from local file
grunt> DEMO_TABLE = LOAD 'data/input/demo_sample.txt' using PigStorage(',') AS
           (gender:chararray, age:int, income:int, zip:chararray);

-- Describe DEMO_TABLE
grunt> describe DEMO_TABLE;

-- Select * from DEMO_TABLE (runs an MR job to dump it)
grunt> dump DEMO_TABLE;

-- Store the data in hdfs
grunt> store DEMO_TABLE into '/gphd/pig/DEMO_TABLE';
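
The data flow named in the script header (load with a schema, drop zero incomes, keep income and zip) can be written out step by step in Python to show what each stage does. This is a single-machine sketch, not Pig Latin; the function names and sample rows are invented.

```python
import csv
import io


def load_demo(text):
    """LOAD ... AS (gender, age, income, zip): apply the schema to each row."""
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        gender, age, income, zip_code = row
        yield {"gender": gender, "age": int(age),
               "income": int(income), "zip": zip_code}


def income_by_zip(rows):
    """Filter out zero incomes, then project (income, zip)."""
    return [(r["income"], r["zip"]) for r in rows if r["income"] > 0]


if __name__ == "__main__":
    sample = "F,34,52,02139\nM,41,0,10001\nF,29,75,94105\n"  # made-up rows
    print(income_by_zip(load_demo(sample)))
    # [(52, '02139'), (75, '94105')]
```

Keeping zip as a string (chararray in the Pig schema) preserves leading zeros such as 02139.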



Others…….




Mahout
 •  Important stuff first: most common pronunciation is “Ma-h-out”,
    rhymes with ‘trout’
 •  Machine Learning Library that Runs on HDFS
 •  4 Primary Use Cases:
           –      Recommendation Mining – People who like X, also like Y
           –      Clustering – Topic based association
           –      Classification – Assign new docs to existing categories
           –      Frequent Item set Mining – Which things will appear together
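
"People who like X, also like Y" can be illustrated with simple co-occurrence counting. The sketch below is a toy, single-machine version and does not use Mahout's API or its algorithms; `cooccurrence` and `recommend` are hypothetical names.

```python
from collections import Counter, defaultdict
from itertools import permutations


def cooccurrence(baskets):
    """Count, for every item, how often each other item appears alongside it."""
    counts = defaultdict(Counter)
    for items in baskets:
        for a, b in permutations(set(items), 2):
            counts[a][b] += 1
    return counts


def recommend(counts, item, top_n=3):
    """Items most often seen with `item`; ties are broken alphabetically."""
    ranked = sorted(counts[item].items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:top_n]]


if __name__ == "__main__":
    likes = [["x", "y"], ["x", "y", "z"], ["x", "z"], ["y", "w"]]
    counts = cooccurrence(likes)
    print(recommend(counts, "x"))  # ['y', 'z']
    print(recommend(counts, "y"))  # ['x', 'w', 'z']
```

The same counts also hint at frequent itemset mining: pairs with a high co-occurrence count are the things that appear together.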




Revolution Analytics R

•  Statistical programming language for Hadoop
         –  Open Source & Revolution R Enterprise
         –  More than just counts and averages
•  Ability to manipulate HDFS directly from R
•  Mimics Java APIs




Hadoop Use Cases




Hadoop Use Cases
•  Internet
         –  Search Index Generation
         –  User Engagement Behavior
         –  Targeting / Advertising Optimizations
         –  Recommendations
•  BioMed
         –  Computational BioMedical Systems
         –  Bioinformatics
         –  Data Mining and Genome Analysis
•  Financial
         –  Prediction Models
         –  Fraud Analysis
         –  Portfolio Risk Management
•  Telecom
         –  Call data records
         –  Set top & DVR streams
•  Social
         –  Recommendations
         –  Network Graphs
         –  Feed Updates
•  Enterprises
         –  Email analysis and image processing
         –  ETL
         –  Reporting & Analytics
         –  Natural Language Processing
•  Media/Newspapers
         –  Image Conversions
•  Agriculture
         –  Process “agri” stream
•  Image
         –  Geo-Spatial processing
•  Education
         –  Systems Research
         –  Statistical analysis of stuff on the web




Greenplum Hadoop Customers
How our customers are using Hadoop


•  Return Path
         –  World leader in email certification & scoring
         –  Uses Hadoop & HBase to store & process ISP data
         –  Replaced Cloudera with Greenplum MR

•  American Express
         –  Early stages of developing Big Data Analytics strategy
         –  Greenplum MR selected over Cloudera
         –  Chose GP b/c of EMC Support & Existing Relationship

•  SunGard
         –  IT company focusing on availability services
         –  Chose Greenplum MR as platform for big-data-analytics-as-a-service
         –  Competes against AWS Elastic MapReduce


Major Telco: CDR Churn Analysis
•  Business problem: Construct a churn model to provide early
   detection of customers who are going to end their contracts
•  Available data
         –  Dependent variable: did a customer leave in a 4-month period?
         –  Independent variables: various features on customer call history
         –  ~120,000 training data points, ~120,000 test data points
•  First attempt
         –  Use R, specifically the Generalised Additive Models (GAM) package
         –  Quickly built a model that matched T-Mobile’s existing model




Challenges




Hadoop Pain Points
•  Integrated Product Suite
         –  No Integrated Hadoop Stack
         –  Hadoop, Pig, Hive, HBase, Zookeeper, Oozie, Mahout…
•  Interoperability
         –  No Industry standard ETL and BI Stack Integration
         –  Informatica, Microstrategy, Business Objects …
•  Monitoring
         –  Poor Job and Application Monitoring Solution
         –  Non-existent Performance Monitoring
•  Operability and Manageability
         –  Complex System Configuration and Manageability
         –  No Data Format Interoperability & Storage Abstractions
•  Performance
         –  Poor Dimensional Lookup Performance
         –  Very poor Random Access and Serving Performance




Data Co-Processing
[Figure: Greenplum Database (SQL DB Engine; Compute, Storage, Network) and
 Hadoop (MapReduce Engine; Compute, Storage) side by side, linked by parallel
 data exchange. Layers above both:
   Analytic Productivity: Applications, Tools, Chorus
   Data Computing Interfaces: SQL, MapReduce, In-Database Analytics,
   Parallel Data Loading (batch or real-time)
 Layer below both:
   All Data Types: unstructured, structured, temporal, geospatial, sensor,
   spatial data]
Questions……?
                                                       &
                                                 THANK YOU




© Copyright 2011 EMC Corporation. All rights reserved.         50

Azure Monitor & Application Insight to monitor Infrastructure & Application
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2
 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 

Hadoop 101

  • 1. Hadoop 101 © Copyright 2011 EMC Corporation. All rights reserved. 1
  • 2. Agenda •  What is Hadoop? •  History of Hadoop •  Hadoop Components •  Hadoop Ecosystem •  Customer Use Cases •  Hadoop Challenges
  • 3. What is Hadoop?
  • 4. What is Hadoop? The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures. •  Concepts: –  NameNode (aka Master) manages the namespace repository (index) for the filesystem; jobs are managed by a separate daemon, the JobTracker –  DataNode (aka Segment) stores blocks of data and runs tasks –  MapReduce: push compute to where the data resides
  • 5. What is Hadoop and Where did it start? •  Created by Doug Cutting –  HDFS (storage) & MapReduce (compute) –  Inspired by Google’s MapReduce and Google File System (GFS) papers •  It is now a top-level Apache project backed by a large open source development community •  Three major subprojects –  Nutch –  Lucene –  Hadoop
  • 6. What makes Hadoop Different? •  Hadoop is a complete paradigm shift •  Bypasses 25 years of enterprise ceilings •  Hadoop also defers some really difficult challenges: –  Non-transactional –  File system is essentially read-only •  “Greater potential for the Hadoop architecture to mature and handle the complexity of transactions than for RDBMS to figure out failures and data growth”
  • 7. Confluence of Factors •  Hadoop makes analytics on large-scale data sets more pragmatic –  BI solutions often suffer from garbage-in, garbage-out –  Opens up new ways of understanding and thus running lines of business •  Classic architectures won’t scale any further •  New sources of information (social media) are too big and unwieldy for traditional solutions –  5-year enterprise data growth is estimated at 650%, with over 80% of that unstructured data, e.g. Facebook collects 100TB per day •  What works for Google & Facebook!
  • 8. Hadoop vs Relational Solutions •  Hadoop is a paradigm shift in the way we think about and manage data •  Traditional solutions were not designed with growth in mind •  Big Data accelerates this problem dramatically
      Category | Traditional RDBMS | Hadoop
      Scalability | Resource constrained; re-architecture; ~5TB | Linear expansion; seamless addition & subtraction of nodes; ~5PB
      Fault Tolerance | Afterthought, many critical points of failure | Designed in, tasks are automatically restarted
      Problem Space | Transactional, OLTP; inability to incorporate new sources | Batch, OLAP (today!); no bounds
  • 9. History of Hadoop
  • 10. History •  Google papers: The Google File System (2003) and MapReduce: Simplified Data Processing on Large Clusters (2004) –  GFS & MapReduce framework •  Top-level Apache open source community project – 2008 •  Yahoo, Facebook, and Powerset become the main contributors, with Yahoo running over 10K nodes (300K cores) – 2009 •  Hadoop cluster at Yahoo sets the TeraSort benchmark standard – Jul 2008 –  209s to sort 1TB (62 seconds now) –  Cluster config •  910 nodes, each with 4 dual-core Xeons @ 2.0GHz, 8GB RAM, 4 SATA disks •  1Gb Ethernet •  40 nodes per rack •  8Gb Ethernet uplinks from each rack •  RH Server 5.1 •  Sun JDK 1.6.0_05-b13
  • 11. Hadoop Components
  • 12. Analytics Component Design [Diagram: R, Mahout, MapReduce packages (python, java, stream), HBase, Hive, and Pig layered on HDFS, which sits on JBOD storage]
  • 13. Hadoop Components Two Core Components: HDFS (Storage) and MapReduce (Compute) •  Storage & Compute in One Framework •  Open Source Project of the Apache Software Foundation •  Written in Java
  • 14. HDFS
  • 15. HDFS Concepts •  Sits on top of a native (ext3, xfs, etc.) file system •  Performs best with a ‘modest’ number of large files –  Millions, rather than billions, of files –  Each file typically 100MB or more •  Files in HDFS are ‘write once’ –  No random writes to files are allowed –  Append support is available in Hadoop 0.21 •  HDFS is optimized for large, streaming reads of files –  Rather than random reads
  • 16. HDFS •  Hadoop Distributed File System –  Data is organized into files & directories –  Files are divided into blocks, typically 64-128MB each, and distributed across cluster nodes –  Block placement is known at runtime by MapReduce so computation can be co-located with data –  Blocks are replicated (default is 3 copies) to handle failure –  Checksums are used to ensure data integrity •  Replication is the one and only strategy for error handling, recovery and fault tolerance –  Make multiple copies and be happy!
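The block and replication figures on this slide reduce to simple arithmetic. The following is a minimal Python sketch, assuming a 128MB block size and the default replication factor of 3; the function name `hdfs_footprint` is hypothetical and is not part of any Hadoop API.

```python
import math

MB = 1024 ** 2

def hdfs_footprint(file_bytes, block_bytes=128 * MB, replication=3):
    """Return (block_count, raw_bytes_stored) for a single file.

    Illustrative only: block size and replication are assumed defaults.
    """
    # The last block may be partial, so round the block count up.
    blocks = math.ceil(file_bytes / block_bytes)
    # Every byte of the file is stored once per replica.
    raw = file_bytes * replication
    return blocks, raw

blocks, raw = hdfs_footprint(1024 * MB)  # a 1GB file
print(blocks, raw // MB)  # 8 blocks, 3072 MB of raw storage
```

Under these assumptions a 1GB file occupies 8 blocks and 3GB of raw disk across the cluster, which is the cost of relying on replication alone for fault tolerance.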
  • 17. Hadoop Architecture - HDFS
  • 18. HDFS Components •  NameNode •  DataNode •  Standby NameNode •  Job Tracker •  Task Tracker
  • 19. NameNode •  Provides a centralized repository for the namespace –  An index of what files are stored in which blocks •  Responds to client requests (MapReduce jobs) by coordinating distribution of tasks (algorithm) –  Make multiple copies and be happy! •  In memory only –  0.23 provides a distributed NameNode –  NameNode recovery must rebuild the entire metadata repository
  • 20. Hadoop Architecture - HDFS •  Block-level storage •  N-node replication •  NameNode for –  File system index (EditLog) –  Access coordination –  IPC via TCP/IP •  DataNode for Put –  Data block management –  Job execution (MapReduce) •  Automated fault tolerance
  • 21. Job Tracker •  MapReduce jobs are controlled by a software daemon known as the JobTracker •  The JobTracker resides on a single node –  Clients submit MapReduce jobs to the JobTracker –  The JobTracker assigns Map and Reduce tasks to other nodes on the cluster –  These nodes each run a software daemon known as the TaskTracker –  The TaskTracker is responsible for actually instantiating the Map or Reduce task, and reporting progress back to the JobTracker •  A Job consists of a collection of Map & Reduce tasks
  • 22. MapReduce
  • 23. Map Reduce Framework •  Map step –  Input records are parsed into intermediate key/value pairs –  Multiple Maps per node •  10TB => 128MB/Blk => 82K Maps •  Reduce step –  Each Reducer handles all like keys –  3 steps •  Shuffle: all like keys are retrieved from each Mapper •  Sort: intermediate keys are sorted prior to reduce •  Reduce: values are processed
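The map, shuffle, sort, and reduce steps on this slide can be imitated in a single process. The following word-count sketch in Python is only an illustration of the data flow; the function names are hypothetical and nothing here uses the Hadoop API or runs on a cluster.

```python
from collections import defaultdict

def map_step(record):
    """Parse one input record into intermediate (key, value) pairs."""
    return [(word.lower(), 1) for word in record.split()]

def shuffle_and_sort(pairs):
    """Group values by key, then order the keys, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_step(key, values):
    """Process all values for one key into a final (key, value) pair."""
    return key, sum(values)

def run_job(records):
    """Run map over every record, then shuffle/sort, then reduce per key."""
    intermediate = [pair for record in records for pair in map_step(record)]
    return [reduce_step(key, values) for key, values in shuffle_and_sort(intermediate)]

print(run_job(["Hadoop stores data", "Hadoop processes data"]))
# [('data', 2), ('hadoop', 2), ('processes', 1), ('stores', 1)]
```

The slide's map count follows the same kind of bookkeeping: 10TB is 10 × 1024 × 1024 MB, and dividing by 128MB per block gives 81,920 input blocks, or roughly 82K map tasks at one map per block.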
  • 24. Map Reduce
  • 25. Reduce Task •  After the Map phase is over, all the intermediate values for a given intermediate key are combined together into a list •  This list is given to a Reducer –  There may be a single Reducer, or multiple Reducers –  This is specified as part of the job configuration (see later) –  All values associated with a particular intermediate key are guaranteed to go to the same Reducer –  The intermediate keys, and their value lists, are passed to the Reducer in sorted key order –  This step is known as the ‘shuffle and sort’ •  The Reducer outputs zero or more final key/value pairs –  These are written to HDFS •  In practice, the Reducer usually emits a single key/value pair for each input key
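The guarantee that every value for a given intermediate key reaches the same Reducer comes from partitioning on the key. The sketch below shows the idea in Python, assuming a CRC32 checksum as a stand-in hash (Hadoop itself partitions on the key's Java hash code); `partition` and `route` are hypothetical names.

```python
import zlib

def partition(key, num_reducers):
    """Pick a reducer index for a key; the same key always maps to the same index."""
    # CRC32 is an assumed stand-in for a stable hash of the key.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def route(pairs, num_reducers):
    """Distribute intermediate (key, value) pairs into per-reducer buckets."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    # Each reducer receives its keys in sorted key order.
    return [sorted(bucket) for bucket in buckets]

pairs = [("b", 1), ("a", 1), ("b", 2), ("c", 1), ("a", 3)]
for index, bucket in enumerate(route(pairs, num_reducers=2)):
    print(index, bucket)
```

Because the reducer index depends only on the key, the number of Reducers can be changed in the job configuration without breaking the guarantee.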
  • 26. Fault Tolerance •  HDFS will only allocate jobs to active nodes •  MapReduce can compensate for slow-running jobs –  If a Mapper appears to be running significantly more slowly than the others, a new instance of the Mapper will be started on another machine, operating on the same data –  The results of the first Mapper to finish will be used –  Hadoop will kill off the Mapper which is still running •  Yahoo experiences multiple failures (> 10) of various components (drives, cables, servers) every day –  Which have exactly 0 impact on operations
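The slow-Mapper behaviour described on this slide has two parts: spotting a straggler, and keeping whichever attempt finishes first. The Python sketch below models both deterministically; the threshold (progress below half the median) and all names are assumptions made for illustration, not Hadoop's actual scheduling policy.

```python
from statistics import median

def find_stragglers(progress, lag=0.5):
    """Return task ids whose progress is below `lag` times the median progress."""
    typical = median(progress.values())
    return sorted(task for task, done in progress.items() if done < lag * typical)

def first_to_finish(finish_times):
    """Return (winner, losers): the earliest attempt is used, the rest are killed."""
    winner = min(finish_times, key=finish_times.get)
    losers = sorted(attempt for attempt in finish_times if attempt != winner)
    return winner, losers

# Fraction of work completed by each running map task (made-up numbers).
progress = {"map_0": 0.90, "map_1": 0.85, "map_2": 0.20}
print(find_stragglers(progress))  # ['map_2']

# Finish times in seconds for the original and the backup attempt (made-up numbers).
print(first_to_finish({"map_2_attempt_0": 95.0, "map_2_attempt_1": 40.0}))
# ('map_2_attempt_1', ['map_2_attempt_0'])
```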
  • 27. Ecosystem
  • 28. Hadoop Ecosystem
  • 29. Ecosystem Distribution by Role [Diagram: ecosystem vendors grouped by role: Distribution, Reporting, Analytics, Monitoring, Manageability, Training, IDE, Consulting, Data Integration, Data Visualization; visible labels include Apache, Hadoop-ide, and UAP]
  • 30. Hadoop Components (hadoop.apache.org) HDFS •  Hadoop Distributed File System MapReduce •  Framework for writing scalable data applications Pig •  Procedural language that abstracts lower-level MapReduce ZooKeeper •  Highly reliable distributed coordination Hive •  System for querying and managing structured data built on top of HDFS (SQL-like query) HBase •  Database for random, real-time read/write access Oozie •  Workflow/coordination to manage jobs Mahout •  Scalable machine learning libraries
  • 31. Technology Adoption Lifecycle [Diagram: adoption curve spanning Innovators/Early Adopters, Early Majority, Late Majority, and Laggards, with a “Today” marker]
  • 32. HBase, Pig, Hive
  • 33. HBase Overview •  HBase is a sparse, distributed, persistent, scalable, reliable multi-dimensional map which is indexed by row key –  Hadoop Database, ~ “No-SQL” database –  Many relational features –  Scalable: Region Servers –  Multiple client access: Java, REST, Thrift •  What’s it good for? –  Queries against a number of rows that makes your Oracle Server puke! •  HBase leverages HDFS for its storage
  • 34. HBase in Practice •  High-performance, real-time query •  Client is typically a Java program •  But HBase supports many other APIs: –  JSON: JavaScript Object Notation –  REST: Representational State Transfer –  Thrift, Avro: Frameworks..
  • 35. HBase – key/value Store •  Excellent key-based access to a specific cell or sequential cells of data •  Column-oriented architecture (like GPDB) –  Column families group related attributes that are often queried together –  Members are stored together •  Versioning of cells is used to provide update capability –  A change to an existing cell is stored as a new version by timestamp •  No transactional guarantee
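Cell versioning is easier to see with a toy model. The Python sketch below treats the table as a sparse map from (row key, column) to timestamped versions, so an update never overwrites older data. `VersionedTable` and its methods are hypothetical and are not the HBase client API.

```python
import time

class VersionedTable:
    """Toy sparse map: (row key, 'family:qualifier') -> {timestamp: value}."""

    def __init__(self):
        self._cells = {}

    def put(self, row, column, value, timestamp=None):
        """Store a value as a new version instead of overwriting the cell."""
        ts = timestamp if timestamp is not None else time.time_ns()
        self._cells.setdefault((row, column), {})[ts] = value

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before `timestamp` (latest if None)."""
        versions = self._cells.get((row, column), {})
        eligible = [ts for ts in versions if timestamp is None or ts <= timestamp]
        return versions[max(eligible)] if eligible else None

table = VersionedTable()
table.put("cust42", "info:status", "trial", timestamp=1)
table.put("cust42", "info:status", "active", timestamp=2)
print(table.get("cust42", "info:status"))               # active
print(table.get("cust42", "info:status", timestamp=1))  # trial
```

Rows and columns that were never written take no space in the dictionary, which mirrors the "sparse" property mentioned in the overview slide.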
  • 36. Hive •  Data warehousing package built on top of Hadoop •  System for managing and querying structured data –  Leverages MapReduce for execution –  Utilizes HDFS (or HBase) for storage •  Data is stored in tables –  Consists of a separate schema metastore and data files •  HiveQL is a SQL-like language –  Queries are converted into MapReduce jobs
  • 37. Hive – Basics & Syntax
      -- Hive example
      -- tell Hive to use a local (non-HDFS) repository for MapReduce
      hive> SET mapred.job.tracker=local;
      hive> SET mapred.local.dir=/Users/hardar/Documents/training/HDWorkshop/labs/9.hive/data;
      hive> SET hive.exec.mode.local.auto=false;
      -- set up the Hive storage location in HDFS (if not using local): create repository folders
      $ hadoop fs -mkdir /tmp
      $ hadoop fs -mkdir /user/hive/warehouse
      $ hadoop fs -chmod g+w /tmp
      $ hadoop fs -chmod g+w /user/hive/warehouse
      -- create an orders table
      create table orders (orderid bigint, customerid bigint, productid int, qty int, rate int, estdlvdate string, status string) row format delimited fields terminated by ",";
      -- load some data from the local file system
      load data local inpath '9.hive/data/orders.txt' into table orders;
      -- query
      select * from orders;
      -- create a products table
      create table products (productid int, description string) row format delimited fields terminated by ",";
      -- load some data from the local file system
      load data local inpath '9.hive/data/products.txt' into table products;
      -- query
      select * from products;
  • 38. Pig •  Provides a mechanism for using MapReduce without programming in Java –  Utilizes HDFS & MapReduce •  Allows for a more intuitive means to specify data flows –  High-level sequential, data flow language –  Pig Latin –  Python integration •  Comfortable for researchers who are familiar with Perl & Python •  Pig is easier to learn & execute, but more limited in scope of functionality than Java
  • 39. Pig – Basics & Syntax
      -- file: demographic.pig
      -- extracts INCOME (in thousands) and ZIPCODE from census data. Filters out ZERO incomes
      -- define a table and load directly from a local file
      grunt> DEMO_TABLE = LOAD 'data/input/demo_sample.txt' using PigStorage(',') AS (gender:chararray, age:int, income:int, zip:chararray);
      -- describe DEMO_TABLE
      grunt> describe DEMO_TABLE;
      -- run MR job to dump DEMO_TABLE (the equivalent of select * from)
      grunt> dump DEMO_TABLE;
      -- store DEMO_TABLE in HDFS
      grunt> store DEMO_TABLE into '/gphd/pig/DEMO_TABLE';
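The script header says it extracts income and ZIP code and filters out zero incomes, although the statements shown stop at load, describe, dump, and store. The Python sketch below walks through that data flow on a few made-up rows, reusing the schema from the LOAD statement; it is an analogy for what the Pig steps would do, not Pig code, and the function names are hypothetical.

```python
import csv
import io

# Field names and types taken from the LOAD ... AS (...) clause in the slide.
SCHEMA = [("gender", str), ("age", int), ("income", int), ("zip", str)]

def load(lines):
    """Parse comma-delimited lines into dicts typed by SCHEMA."""
    for row in csv.reader(lines):
        yield {name: cast(field) for (name, cast), field in zip(SCHEMA, row)}

def income_by_zip(records):
    """Drop zero incomes and keep only the (income, zip) fields."""
    return [(r["income"], r["zip"]) for r in records if r["income"] != 0]

# Made-up sample rows standing in for data/input/demo_sample.txt.
sample = io.StringIO("F,34,72,94105\nM,51,0,10001\nF,29,48,60614\n")
print(income_by_zip(load(sample)))
# [(72, '94105'), (48, '60614')]
```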
  • 40. Others
  • 41. Mahout •  Important stuff first: the most common pronunciation is “Ma-h-out”, which rhymes with ‘trout’ •  Machine learning library that runs on HDFS •  4 primary use cases: –  Recommendation mining: people who like X also like Y –  Clustering: topic-based association –  Classification: assign new docs to existing categories –  Frequent item set mining: which things will appear together
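The "people who like X also like Y" use case can be approximated by counting how often items are liked by the same user. The Python sketch below does this on made-up data; it is a single-machine illustration of co-occurrence counting, not Mahout's API or its actual algorithms, and the function names are hypothetical.

```python
from collections import Counter
from itertools import permutations

def co_occurrence(baskets):
    """Count how often each ordered pair of items is liked by the same user."""
    counts = Counter()
    for items in baskets:
        counts.update(permutations(sorted(set(items)), 2))
    return counts

def also_like(item, baskets, top_n=2):
    """Items most often liked together with `item`, most frequent first."""
    counts = co_occurrence(baskets)
    related = [(other, n) for (first, other), n in counts.items() if first == item]
    # Sort by descending count, then by name so ties are deterministic.
    related.sort(key=lambda pair: (-pair[1], pair[0]))
    return [other for other, _ in related[:top_n]]

# One list of liked items per user (made-up data).
likes = [["X", "Y", "Z"], ["X", "Y"], ["Y", "Z"], ["X", "W"]]
print(also_like("X", likes))  # ['Y', 'W']
```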
  • 42. Revolution Analytics R •  Statistical programming language for Hadoop –  Open source & Revolution R Enterprise –  More than just counts and averages •  Ability to manipulate HDFS directly from R •  Mimics Java APIs
  • 43. Hadoop Use Cases
  • 44. Hadoop Use Cases •  Internet –  Search Index Generation –  User Engagement Behavior –  Targeting / Advertising Optimizations –  Recommendations •  Social –  Recommendations –  Network Graphs –  Feed Updates •  Enterprises –  Email analysis and image processing –  ETL –  Reporting & Analytics –  Natural Language Processing •  BioMed –  Computational BioMedical Systems –  Bioinformatics –  Data Mining and Genome Analysis •  Media/Newspapers –  Image Conversions •  Financial –  Prediction Models –  Fraud Analysis –  Portfolio Risk Management •  Agriculture –  Process “agri” stream •  Image –  Geo-Spatial processing •  Telecom –  Call data records –  Set top & DVR streams •  Education –  Systems Research –  Statistical analysis of stuff on the web
  • 45. Greenplum Hadoop Customers: How our customers are using Hadoop •  Return Path –  World’s leader in email certification & scoring –  Uses Hadoop & HBase to store & process ISP data –  Replaced Cloudera with Greenplum MR •  American Express –  Early stages of developing Big Data Analytics strategy –  Greenplum MR selected over Cloudera –  Chose GP because of EMC support & existing relationship •  SunGard –  IT company focusing on availability services –  Chose Greenplum MR as platform for big-data-analytics-as-a-service –  Competes against AWS Elastic MapReduce
  • 46. Major Telco: CDR Churn Analysis •  Business problem: construct a churn model to provide early detection of customers who are going to end their contracts •  Available data –  Dependent variable: did a customer leave in a 4-month period? –  Independent variables: various features on customer call history –  ~120,000 training data points, ~120,000 test data points •  First attempt –  Use R, specifically the Generalised Additive Models (GAM) package –  Quickly built a model that matched T-Mobile’s existing model
  • 47. Challenges
  • 48. Hadoop Pain Points
      Integrated Product Suite: •  No integrated Hadoop stack •  Hadoop, Pig, Hive, HBase, ZooKeeper, Oozie, Mahout…
      Interoperability: •  No industry-standard ETL and BI stack integration •  Informatica, MicroStrategy, Business Objects…
      Monitoring: •  Poor job and application monitoring solution •  Non-existent performance monitoring
      Operability and Manageability: •  Complex system configuration and manageability •  No data format interoperability & storage abstractions
      Performance: •  Poor dimensional lookup performance •  Very poor random access and serving performance
  • 49. Data Co-Processing [Diagram: Greenplum Database (SQL DB engine) and Hadoop (MapReduce engine), each with its own compute and storage, connected by parallel data exchange over the network; above them sit Data Computing Interfaces (SQL, MapReduce, In-Database Analytics, Parallel Data Loading, batch or real-time) and Analytic Productivity (Applications, Tools, Chorus)] All Data Types •  unstructured data •  geospatial data •  structured data •  sensor data •  temporal data •  spatial data
  • 50. Questions? Thank you