Hadoop for Enterprise
rev 7




Rajesh Nadipalli
Mar 2012
rajesh.nadipalli@gmail.com
Hadoop getting attention
•   Feb 2012: Microsoft and Hortonworks partner to develop an Excel
    plug-in for Hadoop

•   Jan 2012: Oracle announces Big Data Appliance with Cloudera’s
    Hadoop distribution

•   Dec 2011: EMC releases Unified Analytics Platform, which includes
    the Greenplum Apache Hadoop distribution

•   Oct 2011: Microsoft plans to add Hadoop support to SQL Server 2012

•   May 2010: IBM introduces Hadoop-based InfoSphere BigInsights




In this Presentation…
  Big Data – Big Opportunities
  Hadoop for Enterprise – Reference Architecture
  Map Reduce Overview
  Hive
  References




BIG DATA – BIG OPPORTUNITIES




Big Data - Business Opportunity
Enterprises today are challenged with:
   Exponential data growth
   Complex data needs: structured and unstructured
   The need for real-time insights through key indicators
   Heterogeneous environments: private and public clouds
   Tighter budgets and the need to do more with less

        Traditional relational databases are not
        able to scale and meet these challenges


http://blogs.forrester.com/brian_hopkins/11-08-29-big_data_brewer_and_a_couple_of_webinars




Big Data – 4 V’s (Forrester)




Why Hadoop?
      Hadoop provides…
         A distributed file system
         Parallel computing across many nodes
         Support for structured and unstructured content
         Fault tolerance and linear scalability
         Open source under the Apache Software Foundation
         Increasing support from vendors
         Key philosophy: “moving compute is cheaper than moving data”
Forrester regards Hadoop as the nucleus of the next-generation EDW in the cloud.

Some users of Hadoop…
                                                              http://wiki.apache.org/hadoop/PoweredBy




     • Use Hadoop to store copies of internal log and dimension data sources, and use it as
     a source for reporting/analytics and machine learning.
     • Currently we have 2 major clusters: a 1100-machine cluster with 8800 cores and
     about 12 PB raw storage, and a 300-machine cluster with 2400 cores and about 3 PB raw
     storage.
     • Each (commodity) node has 8 cores and 12 TB of storage.


     • Hadoop is used to analyze search logs and do mining work on the web page
     database.
     • We handle about 3000 TB per week. Our clusters vary from 10 to 500 nodes.



     • 532-node cluster (8 * 532 cores, 5.3 PB).
     • Heavy usage of Java MapReduce, Pig, Hive, HBase
     • Used for search optimization and research

     • 5-machine cluster (8 cores/machine, 5 TB/machine storage)
     • Existing 19-virtual-machine cluster (2 cores/machine, 30 TB storage)
     • Predominantly Hive and Streaming API based jobs (~20,000 jobs a week)
     • Daily batch ETL; log analysis; data mining; machine learning




HADOOP REFERENCE ARCHITECTURE




Hadoop for Enterprise – Technology Stack
[Diagram: technology stack, top layer first]
   User Experience: ad-hoc queries, notifications/alerts, embedded search, analytics
   Data Access: Hive, Pig, Excel, Datameer, R (Rhipe, RBits)
   Data Processing: MapReduce
   Hadoop Data Store: HBase (NoSQL DB) on top of HDFS
   Data import: Sqoop
   Data Sources: applications (internal), databases, cloud, log files, RSS feeds, others
   Spanning the layers: Pentaho (scheduling, integrations), Zookeeper (orchestration, quorum)

Hadoop for BI – Reference Architecture
[Diagram: reference architecture, left to right]
   Data: RDBMS, Excel, XML, JSON, binary, CSV, log, Java objects
   Hadoop Distributed Computing Environment: MapReduce on an N-node scalable cluster, backed by the Hadoop File System (HDFS)
   Import: results are imported into an RDBMS
   Enterprise Apps: dashboards; ERP and enterprise apps


http://www.oracle.com/technetwork/database/focus-areas/bi-datawarehousing/wp-big-data-with-oracle-521209.pdf?ssSourceSiteId=ocomen




   Oracle’s Big Data Solution




• Oracle sees Hadoop as well suited to sourcing unstructured data and running MapReduce.
• It recommends using Oracle Database for the final analysis stage.
• Oracle Data Integrator can issue Hive queries (ETL).
• Oraoop, a wrapper on top of Sqoop for Oracle databases, is also available (see
References).

DATA PROCESSING




Hadoop Mapreduce Overview
[Diagram: Map Reduce process]
   DATA (from HDFS) → Split → Map (one map task each on Node 1, Node 2, Node 3) → Reduce → RESULTS
   Each node maps its own split of the input; the reduce step merges the map outputs into the final results.
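The split, map, and reduce stages in the diagram can be sketched in plain Python. This is a single-process illustration, not Hadoop code: the word-count job, the three-way split, and the function names `split_input`, `map_phase`, `shuffle`, and `reduce_phase` are all assumptions made for the example.

```python
from collections import defaultdict


def split_input(records, num_nodes):
    """Deal records round-robin into one chunk per (simulated) node."""
    chunks = [[] for _ in range(num_nodes)]
    for index, record in enumerate(records):
        chunks[index % num_nodes].append(record)
    return chunks


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in the chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped_chunks):
    """Group the values emitted by all mappers by key."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: collapse each key's values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(records, num_nodes=3):
    chunks = split_input(records, num_nodes)
    # On a real cluster each chunk would be mapped on a different node.
    mapped = [map_phase(chunk) for chunk in chunks]
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    lines = ["hadoop stores data", "hadoop maps data", "hadoop reduces data"]
    print(word_count(lines))
```

The map step needs only its own chunk, which is why it can run where the data already sits; only the grouped intermediate pairs have to move between nodes.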



Map Reduce Tips
   The first step is to understand what data you
    have and how to feed it into the Hadoop
    distributed computing environment.

   Use distributed applications to run analytics
    over the massive data sets and surface
    opportunities at the same time.

   Hadoop stores your information for future
    queries, which supports both exploration and
    historical reference.
DATA STORE




HDFS
   Distributed file system consisting of
    ◦ A single “Namenode” that holds the file
      system metadata
    ◦ Several “Datanodes” that hold the data
 Designed to run on commodity hardware
 Data is imported as blocks (64 MB)
 Blocks are replicated (typically 3
  copies) to protect against hardware failures
 Access via Java APIs or the hadoop
  command line ($hadoop fs…)
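The block and replica figures above translate into simple arithmetic. The sketch below is illustrative only: `hdfs_footprint` is a made-up helper, and it assumes the 64 MB block size and 3 replicas quoted on this slide.

```python
import math

BLOCK_SIZE_MB = 64   # block size quoted on the slide
REPLICATION = 3      # typical replica count quoted on the slide


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (blocks, block_replicas, raw_storage_mb) for one file."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    block_replicas = blocks * replication
    # The last block only occupies the bytes it holds,
    # so raw storage is file size times the replica count.
    raw_storage_mb = file_size_mb * replication
    return blocks, block_replicas, raw_storage_mb


if __name__ == "__main__":
    blocks, replicas, raw_mb = hdfs_footprint(1000)
    print(f"1000 MB file: {blocks} blocks, {replicas} block replicas, {raw_mb} MB raw storage")
```

This is also why cluster sizes are usually quoted as raw storage: usable capacity is roughly the raw figure divided by the replica count.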

http://hadoop.apache.org/common/docs/current/hdfs_design.html




HDFS architecture




Namenode failover is planned for the next Hadoop revision; a failover Namenode called “Avatar” (AvatarNode, from Facebook) is one implementation

HBase
   Distributed, column-oriented database
    (NoSQL)
   Fault tolerant
   Low latency
   HDFS aware
   Access via Java APIs or REST APIs
   Not a replacement for an RDBMS
   HBase is recommended when
    ◦ Data is looked up by key (or key range)
    ◦ Data does not conform to a fixed schema (for
      instance, when attributes vary from record
      to record)

Hbase Architecture
[Diagram: HBase architecture]
   Zookeeper sits above the HBase Master and “Avatar” (failover of master)
   The master sits above the Region Servers



    Zookeeper maintains quorum and knows which server
     is the master
    Master keeps track of regions and region servers
    Region servers store table regions
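The quorum that Zookeeper maintains is a strict majority of its servers. The sketch below works through that arithmetic; the function names are invented for the example and are not part of any Zookeeper API.

```python
def quorum_size(ensemble_size):
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many servers can fail while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    for servers in (3, 4, 5):
        print(servers, quorum_size(servers), tolerated_failures(servers))
```

A 4-server ensemble tolerates no more failures than a 3-server one, which is why odd ensemble sizes are the usual choice.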

Hbase Column Storage
HBase stores data like tags for a key;
 for example:


Row         Column Family   Column          Cell
Star Wars   Cast            Cast:Actor1     Harrison Ford
Star Wars   Cast            Cast:Actor2     Carrie Fisher
Star Wars   Reviews         Reviews:IMDB    Review URL
Star Wars   Reviews         Reviews:ET      Review URL2
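One way to picture this layout is as nested maps: row key, then column family, then column, then cell. The sketch below models that in plain Python and is not the HBase client API; the second row and the helper names `get_cell` and `scan` are hypothetical.

```python
from bisect import bisect_left

# row key -> column family -> column qualifier -> cell value
table = {
    "Star Wars": {
        "Cast": {"Actor1": "Harrison Ford", "Actor2": "Carrie Fisher"},
        "Reviews": {"IMDB": "Review URL", "ET": "Review URL2"},
    },
    # Hypothetical second row: rows need not share the same columns.
    "Raiders of the Lost Ark": {
        "Cast": {"Actor1": "Harrison Ford"},
    },
}


def get_cell(table, row_key, family, qualifier):
    """Point lookup by row key, then family and qualifier; None if absent."""
    return table.get(row_key, {}).get(family, {}).get(qualifier)


def scan(table, start_key, stop_key):
    """Return row keys in [start_key, stop_key), in sorted key order."""
    keys = sorted(table)
    lo = bisect_left(keys, start_key)
    hi = bisect_left(keys, stop_key)
    return keys[lo:hi]


if __name__ == "__main__":
    print(get_cell(table, "Star Wars", "Cast", "Actor2"))
    print(scan(table, "R", "S"))
```

Both access paths start from the row key, which matches the earlier advice to use HBase when data is looked up by key or key range.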



DATA ACCESS




Hive Overview
 Data warehouse software built on top
  of Hadoop
 HiveQL provides a SQL-like interface;
  queries run as MapReduce jobs
 Provides structure to HDFS data,
  similar to an Oracle external table
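The external-table idea, structure applied when the data is read rather than when it is stored, can be sketched in a few lines of Python. This is not Hive code: the log lines, the `logs` table name in the comment, and the column names are assumptions for illustration.

```python
from collections import Counter

# Raw file contents as they might sit in HDFS: no structure beyond a delimiter.
RAW_LINES = [
    "2012-03-01\t/home\t200",
    "2012-03-01\t/search\t200",
    "2012-03-02\t/home\t404",
]

# Table definition kept apart from the data, like an external table's column list.
SCHEMA = [("day", str), ("url", str), ("status", int)]


def read_rows(lines, schema, delimiter="\t"):
    """Apply the schema while reading; the underlying lines are never rewritten."""
    for line in lines:
        fields = line.split(delimiter)
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}


def hits_per_url(lines, schema):
    """Rough equivalent of: SELECT url, COUNT(*) FROM logs GROUP BY url."""
    return dict(Counter(row["url"] for row in read_rows(lines, schema)))


if __name__ == "__main__":
    print(hits_per_url(RAW_LINES, SCHEMA))
```

In Hive the grouping step would be carried out by a MapReduce job instead of an in-memory counter.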




Hive Architecture

[Diagram: Hive architecture]
   Hive CLI: browse and query
   Hive QL: parser and execution, alongside the Hive Metastore
   SerDe (Map Reduce)
   HDFS


Pig Overview
 Pig is a layer on top of MapReduce for
  statisticians (programmers)
 It provides several standard operators:
  join, order by, etc.
 It allows user-defined functions (UDFs) to be
  included
 UDFs can be written in Java or Python
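A Python UDF is an ordinary function plus a declaration of the schema it returns. The sketch below is a rough illustration: `extract_domain` is a hypothetical UDF, and `output_schema` is a local stand-in defined here so the example runs on its own. Pig's Python (Jython) UDF support uses a similarly named `outputSchema` decorator.

```python
def output_schema(schema):
    """Stand-in decorator that records the declared Pig schema on the function."""
    def wrap(func):
        func.output_schema = schema
        return func
    return wrap


@output_schema("domain:chararray")
def extract_domain(url):
    """Return the host part of a URL, or None for empty input."""
    if not url:
        return None
    without_scheme = url.split("://", 1)[-1]
    return without_scheme.split("/", 1)[0].lower()


if __name__ == "__main__":
    print(extract_domain("http://www.pentaho.com/big-data/"))
```

Once registered in a Pig script, such a function would be applied to each record alongside the built-in operators.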




http://www.datameer.com/




Datameer Overview
Key philosophy: Business users understand Excel; let them
  do the grouping, sorting, filtering, and aggregation

Key Steps:
 Datameer’s source is MapReduce output.
 Datameer takes a quick sample of 5000 records.
 The end user is then presented with an Excel-like interface on top
  of these 5000 records. Here the end user can define
  filters, formulas, groupings, aggregations, and joins across
  sheets (even joining Hadoop data with data from a relational
  database table).
 Once the end user has defined the desired end
  result, they submit a job to run on the complete data set.
 Datameer then builds the necessary MapReduce jobs and
  runs them on the complete data set.
 Finally, the user gets the results and can build charts, tables, etc.,
  all in the browser.
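The steps above follow a design-on-a-sample, run-on-everything pattern, sketched below in plain Python. It says nothing about Datameer's internals: the (region, amount) records, the `analysis` logic, and the function names are assumptions chosen for illustration.

```python
import random

SAMPLE_SIZE = 5000  # sample size quoted on the slide


def take_sample(records, size=SAMPLE_SIZE, seed=0):
    """Pick up to `size` records for interactive design work."""
    records = list(records)
    if len(records) <= size:
        return records
    return random.Random(seed).sample(records, size)


def analysis(records):
    """The user's spreadsheet-style logic: filter, group, aggregate."""
    totals = {}
    for region, amount in records:
        if amount > 0:  # filter out non-positive amounts
            totals[region] = totals.get(region, 0) + amount  # group and sum
    return totals


def run(records):
    records = list(records)
    preview = analysis(take_sample(records))  # quick feedback on the sample
    full = analysis(records)                  # the submitted job on all data
    return preview, full


if __name__ == "__main__":
    data = [("east", 10), ("west", -5), ("east", 5)]
    print(run(data))
```

The same `analysis` function serves both passes, so whatever the user defines on the sample is exactly what runs on the complete data set.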
http://www.informationweek.com/news/software/info_management/232601675?cid=RSSfeed_IWK_News



Excel Integration
Microsoft announced Excel integration with
 Hadoop (Feb 2012), together with Hortonworks

Key Highlights:
 Microsoft and Hortonworks will deliver a Hive
  ODBC driver that will enable integration with
  Excel
 Microsoft’s PowerPivot in-memory plug-in
  for Excel will handle larger data sets
 There is also a plan for a JavaScript framework
  for Hadoop enabling Ajax-like iterative

INTEGRATION, SCHEDULING




Pentaho Data Integration
 Pentaho is rated a “strong
  performer” by Forrester (Feb 2012)
 It makes building MapReduce jobs easy via
  its Data Integration IDE
 It can read from and write to HDFS, and run
  MapReduce jobs and Pig scripts
 The IDE has several standard
  connectors and transformations, and allows
  custom Java code
 http://www.pentaho.com/big-data/

http://www.youtube.com/watch?v=KZe1UugxXcs&feature=player_embedded



       Pentaho Data Integration
[Diagram: Pentaho Data Integration workflow]
   1. Build Mapper
   2. Build Reducer
   3. Run Map Reduce




Talend - ETL
 Talend is another ETL
  development, scheduling, and
  monitoring tool
 It supports HDFS, Pig, Hive, and Sqoop
 http://www.talend.com/products-big-data/




Talend ETL – with Hadoop

      • Can invoke Hadoop calls (generates Hive
      queries)
      • See right slide “Processing”




USER EXPERIENCE




User Experience
This layer of the stack is generally custom
  development. However, some tools
  that work with Hadoop are:
 Tableau for data analysis &
  visualizations
 SAS Enterprise Miner
 IBM BigInsights




REFERENCES




References
   http://hadoop.apache.org/
   http://www.cloudera.com/
   http://www-01.ibm.com/software/data/bigdata/
   http://www.cs.duke.edu/starfish/index.html
   http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
   http://karmasphere.com/Download/karmasphere-studio-community-virtual-appliance-for-ibm.html
   http://www.slideshare.net/zshao/hive-data-warehousing-analytics-on-hadoop-presentation
   http://www.slideshare.net/trihug/trihug-november-pig-talk-by-alan-gates?from=ss_embed
   http://www.trihug.org/
   http://www.cisco.com/en/US/prod/collateral/switches/ps9441/ps9670/white_paper_c11-690561.html
   http://www.cloudera.com/wp-content/uploads/2011/01/oraoopuserguide-With-OraHive.pdf

http://www.cisco.com/en/US/prod/collateral/switches/ps9441/ps9670/white_paper_c11-690561.html




http://wiki.apache.org/hadoop/PoweredBy



Key Hadoop Players




MAP-R
 No Namenode single point of failure
 Performance improvements (claimed 5 times
  faster than HDFS)
 Snapshots, multi-site copies
 Ships its own, extended MapReduce
  implementation
 Uses 8 KB blocks instead of the 64 MB
  block size of HDFS

Open Topics – why there are adoption issues
   Security – no concept of roles
   Backup, Recovery
   ACID not supported




Thank You to my viewers




Questions / Comments

Rajesh.nadipalli@gmail.com
 
Keynote from ApacheCon NA 2011
Keynote from ApacheCon NA 2011Keynote from ApacheCon NA 2011
Keynote from ApacheCon NA 2011Hortonworks
 
Architecting the Future of Big Data and Search
Architecting the Future of Big Data and SearchArchitecting the Future of Big Data and Search
Architecting the Future of Big Data and SearchHortonworks
 

Similar to Hadoop For Enterprises (20)

Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Microsoft's Hadoop Story
Microsoft's Hadoop StoryMicrosoft's Hadoop Story
Microsoft's Hadoop Story
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Big data hadoop ecosystem and nosql
Big data hadoop ecosystem and nosqlBig data hadoop ecosystem and nosql
Big data hadoop ecosystem and nosql
 
Big Data in the Microsoft Platform
Big Data in the Microsoft PlatformBig Data in the Microsoft Platform
Big Data in the Microsoft Platform
 
Cloud computing era
Cloud computing eraCloud computing era
Cloud computing era
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
 
Zh tw cloud computing era
Zh tw cloud computing eraZh tw cloud computing era
Zh tw cloud computing era
 
An introduction to apache drill presentation
An introduction to apache drill presentationAn introduction to apache drill presentation
An introduction to apache drill presentation
 
Above the cloud: Big Data and BI
Above the cloud: Big Data and BIAbove the cloud: Big Data and BI
Above the cloud: Big Data and BI
 
EclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionEclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An Introduction
 
Inroduction to Big Data
Inroduction to Big DataInroduction to Big Data
Inroduction to Big Data
 
What is hadoop
What is hadoopWhat is hadoop
What is hadoop
 
Hadoop ecosystem framework n hadoop in live environment
Hadoop ecosystem framework  n hadoop in live environmentHadoop ecosystem framework  n hadoop in live environment
Hadoop ecosystem framework n hadoop in live environment
 
Hadoop and Big Data: Revealed
Hadoop and Big Data: RevealedHadoop and Big Data: Revealed
Hadoop and Big Data: Revealed
 
Hadoop Family and Ecosystem
Hadoop Family and EcosystemHadoop Family and Ecosystem
Hadoop Family and Ecosystem
 
Hadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderHadoop and mysql by Chris Schneider
Hadoop and mysql by Chris Schneider
 
Meetup Oracle Database BCN: 2.1 Data Management Trends
Meetup Oracle Database BCN: 2.1 Data Management TrendsMeetup Oracle Database BCN: 2.1 Data Management Trends
Meetup Oracle Database BCN: 2.1 Data Management Trends
 
Keynote from ApacheCon NA 2011
Keynote from ApacheCon NA 2011Keynote from ApacheCon NA 2011
Keynote from ApacheCon NA 2011
 
Architecting the Future of Big Data and Search
Architecting the Future of Big Data and SearchArchitecting the Future of Big Data and Search
Architecting the Future of Big Data and Search
 

More from nvvrajesh

Apache ranger meetup
Apache ranger meetupApache ranger meetup
Apache ranger meetupnvvrajesh
 
HdInsight essentials Hadoop on Microsoft Platform
HdInsight essentials Hadoop on Microsoft PlatformHdInsight essentials Hadoop on Microsoft Platform
HdInsight essentials Hadoop on Microsoft Platformnvvrajesh
 
Information management and enterprise architecture
Information management and enterprise architectureInformation management and enterprise architecture
Information management and enterprise architecturenvvrajesh
 
Pentaho bi suite overview presentation
Pentaho bi suite overview   presentationPentaho bi suite overview   presentation
Pentaho bi suite overview presentationnvvrajesh
 
Social Networking for Non-Profits
Social Networking for Non-ProfitsSocial Networking for Non-Profits
Social Networking for Non-Profitsnvvrajesh
 
Oracle business intelligence overview
Oracle business intelligence overviewOracle business intelligence overview
Oracle business intelligence overviewnvvrajesh
 
BI the Agile Way
BI the Agile WayBI the Agile Way
BI the Agile Waynvvrajesh
 
Agile Process in a Nutshell
Agile Process in a NutshellAgile Process in a Nutshell
Agile Process in a Nutshellnvvrajesh
 

More from nvvrajesh (8)

Apache ranger meetup
Apache ranger meetupApache ranger meetup
Apache ranger meetup
 
HdInsight essentials Hadoop on Microsoft Platform
HdInsight essentials Hadoop on Microsoft PlatformHdInsight essentials Hadoop on Microsoft Platform
HdInsight essentials Hadoop on Microsoft Platform
 
Information management and enterprise architecture
Information management and enterprise architectureInformation management and enterprise architecture
Information management and enterprise architecture
 
Pentaho bi suite overview presentation
Pentaho bi suite overview   presentationPentaho bi suite overview   presentation
Pentaho bi suite overview presentation
 
Social Networking for Non-Profits
Social Networking for Non-ProfitsSocial Networking for Non-Profits
Social Networking for Non-Profits
 
Oracle business intelligence overview
Oracle business intelligence overviewOracle business intelligence overview
Oracle business intelligence overview
 
BI the Agile Way
BI the Agile WayBI the Agile Way
BI the Agile Way
 
Agile Process in a Nutshell
Agile Process in a NutshellAgile Process in a Nutshell
Agile Process in a Nutshell
 

Recently uploaded

Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 

Recently uploaded (20)

Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 

Hadoop For Enterprises

  • 1. Hadoop for Enterprise (rev 7). Rajesh Nadipalli, Mar 2012, rajesh.nadipalli@gmail.com
  • 2. Hadoop getting attention
    ◦ Feb 2012: Microsoft and Hortonworks in partnership to develop an Excel plug-in for Hadoop
    ◦ Jan 2012: Oracle announces Big Data Appliance with Cloudera’s Hadoop distribution
    ◦ Dec 2011: EMC released Unified Analytics Platform, which includes the Greenplum Apache Hadoop distribution
    ◦ Oct 2011: Microsoft plans to add Hadoop support to SQL Server 2012
    ◦ May 2010: IBM introduces Hadoop-based InfoSphere BigInsights
  • 3. In this Presentation…
    ◦ Big Data – Big Opportunities
    ◦ Hadoop for Enterprise – Reference Arch
    ◦ Map Reduce Overview
    ◦ Hive
    ◦ References
  • 4. BIG DATA – BIG OPPORTUNITIES
  • 5. Big Data - Business Opportunity
    Enterprises today are challenged with:
    ◦ Exponential data growth
    ◦ Complex data needs: structured and unstructured
    ◦ Real-time insights with key indicators
    ◦ Heterogeneous environment: private and public clouds
    ◦ Tighter budgets and the need to do more with less
    Traditional relational databases are not able to scale and meet these challenges.
  • 7. Why Hadoop?
    Hadoop provides:
    ◦ Distributed file system
    ◦ Parallel computing across several nodes
    ◦ Support for structured and unstructured content
    ◦ Fault tolerance and linear scalability
    ◦ Open source under the Apache Software Foundation
    ◦ Increasing support from vendors
    ◦ Key philosophy: “moving compute is cheaper than moving data”
    Forrester regards Hadoop as the nucleus of the next-generation EDW in the cloud.
  • 8. Some users of Hadoop… (source: http://wiki.apache.org/hadoop/PoweredBy)
    ◦ Use Hadoop to store copies of internal log and dimension data sources, and use it as a source for reporting/analytics and machine learning.
    ◦ Currently we have 2 major clusters: a 1100-machine cluster with 8800 cores and about 12 PB raw storage, and a 300-machine cluster with 2400 cores and about 3 PB raw storage.
    ◦ Each (commodity) node has 8 cores and 12 TB of storage.
    ◦ Hadoop is used to analyze search logs and do some mining work on the web page database.
    ◦ We handle about 3000 TB per week. Our clusters vary from 10 to 500 nodes.
    ◦ 532-node cluster (8 * 532 cores, 5.3 PB).
    ◦ Heavy usage of Java MapReduce, Pig, Hive, HBase.
    ◦ Using it for search optimization and research.
    ◦ 5-machine cluster (8 cores/machine, 5 TB/machine storage).
    ◦ Existing 19-virtual-machine cluster (2 cores/machine, 30 TB storage).
    ◦ Predominantly Hive and Streaming API based jobs (~20,000 jobs a week).
    ◦ Daily batch ETL; log analysis; data mining; machine learning.
  • 9. HADOOP REFERENCE ARCHITECTURE
  • 10. Hadoop for Enterprise – Technology Stack (layers, top to bottom)
    ◦ User Experience: ad-hoc queries, notifications/alerts, embedded search, analytics
    ◦ Data Access: Excel, R (Rhipe, RBits), Hive, Pig, Datameer
    ◦ Data Processing (Hadoop): MapReduce
    ◦ Data Store (Hadoop): HBase (NoSQL DB), HDFS
    ◦ Sqoop (between the data store and the data sources)
    ◦ Data Sources: applications (internal), databases, log files, RSS feeds, cloud, others
    ◦ Alongside the stack: Zookeeper (orchestration, quorum), Pentaho (scheduling, integrations)
  • 11. Hadoop for BI – Reference Architecture
    ◦ Data: RDBMS, XML, JSON, binary, CSV, log, Java objects
    ◦ Hadoop Distributed Computing Environment: MapReduce on an N-node scalable cluster, over the Hadoop File System (HDFS)
    ◦ Enterprise Apps: dashboards, Excel, ERP / enterprise apps, import RDBMS
  • 12. Oracle’s Big Data Solution (source: http://www.oracle.com/technetwork/database/focus-areas/bi-datawarehousing/wp-big-data-with-oracle-521209.pdf?ssSourceSiteId=ocomen)
    ◦ Oracle sees Hadoop as good for unstructured sourcing and map reduce.
    ◦ It recommends using the Oracle database for the final analyze stage.
    ◦ Oracle Data Integrator can make Hive queries (ETL).
    ◦ There is a wrapper on top of Sqoop for Oracle databases called Oraoop (see References).
  • 13. DATA PROCESSING
  • 14. Hadoop Mapreduce Overview
    [Diagram: Map Reduce process. Data from HDFS is split across Node 1, Node 2, and Node 3, each running a Map task; a Reduce step combines the map outputs into the results.]
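The split, map, and reduce flow on this slide can be imitated in a single process. The following is a minimal sketch, not Hadoop's API: the function names, the word-count task, and the three-way split are assumptions chosen for illustration, and the "nodes" are just list slices.

```python
from collections import defaultdict

def split_input(lines, num_splits):
    """Deal the input lines out to num_splits pretend nodes."""
    return [lines[i::num_splits] for i in range(num_splits)]

def map_phase(lines):
    """Map step: emit a (word, 1) pair for every word on this node."""
    return [(word.lower(), 1) for line in lines for word in line.split()]

def shuffle(mapped_outputs):
    """Group the values from all nodes by key before reducing."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce step: sum the counts collected for each word."""
    return {key: sum(values) for key, values in grouped.items()}

# Illustrative input; in Hadoop this would be blocks read from HDFS.
data = ["hadoop stores data", "hadoop processes data", "hive queries data"]
splits = split_input(data, 3)
mapped = [map_phase(node_lines) for node_lines in splits]
results = reduce_phase(shuffle(mapped))
print(results)
```

In a real cluster the map calls run in parallel on the nodes that hold the data, which is the "moving compute is cheaper than moving data" idea from the earlier slide.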
  • 15. Map Reduce Tips
    ◦ The first step is to understand what data you have and how to feed it into the Hadoop distributed computing environment.
    ◦ Use distributed applications to analyze the massive data sets and to surface opportunities in them.
    ◦ Hadoop stores your information for future queries, which supports both exploration and historical reference.
  • 16. DATA STORE
  • 17. HDFS
    ◦ Distributed file system consisting of:
      ◦ A single node called the “Namenode”, which holds the metadata
      ◦ Several “Datanodes”
    ◦ Designed to run on commodity hardware
    ◦ Data gets imported as blocks (64 MB)
    ◦ These blocks are replicated (typically 3 copies) to protect against hardware failures
    ◦ Access via Java APIs or the hadoop command line ($hadoop fs…)
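To make the block and replication figures concrete, here is a small worked calculation. It is a sketch under the slide's numbers (64 MB blocks, 3 replicas); the function name and the 1000 MB example file are assumptions for illustration.

```python
import math

BLOCK_SIZE_MB = 64   # block size quoted on the slide
REPLICATION = 3      # typical replica count quoted on the slide

def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (blocks, block replicas, raw storage in MB) for one file."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = blocks * replication
    # The final block only occupies the bytes it actually holds,
    # so raw storage is the file size times the replica count.
    raw_mb = file_size_mb * replication
    return blocks, replicas, raw_mb

blocks, replicas, raw_mb = hdfs_footprint(1000)
print(blocks, replicas, raw_mb)  # 16 blocks, 48 block replicas, 3000 MB raw
```

The factor of 3 in raw storage is why the cluster sizes quoted earlier are given as raw capacity.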
  • 18. HDFS architecture (source: http://hadoop.apache.org/common/docs/current/hdfs_design.html)
    The next Hadoop revision has a failover Namenode called “Avatar”.
  • 19. HBase
    ◦ Distributed, column-oriented database (NoSQL)
    ◦ Failure-tolerant
    ◦ Low latency
    ◦ HDFS aware
    ◦ Access via Java APIs or REST APIs
    ◦ It is not a replacement for an RDBMS
    ◦ HBase is recommended when:
      ◦ Data is searched by key (or key range)
      ◦ Data does not conform to a schema (for instance, if attributes change from record to record)
  • 20. HBase Architecture
    [Diagram: Zookeeper; HBase Master with an “Avatar” (failover of master); four Region Servers.]
    ◦ Zookeeper maintains quorum and knows which server is the master
    ◦ Master keeps track of regions and region servers
    ◦ Region servers store table regions
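One way to picture "regions and region servers" is as a sorted list of row-key ranges, each hosted by one server. The sketch below is a simplification and not HBase client code: the start keys, server names, and function name are hypothetical.

```python
import bisect

# Hypothetical region table: each region covers the keys from its start key
# up to (but not including) the next start key. Names are illustrative only.
REGION_START_KEYS = ["", "g", "n", "t"]          # sorted region start keys
REGION_SERVERS = ["rs1", "rs2", "rs3", "rs4"]    # server hosting each region

def locate_region_server(row_key):
    """Return the server whose region's key range contains row_key."""
    index = bisect.bisect_right(REGION_START_KEYS, row_key) - 1
    return REGION_SERVERS[index]

for key in ["alien", "gladiator", "star wars", "zulu"]:
    print(key, "->", locate_region_server(key))
```

Because regions are contiguous key ranges, lookups by key or key range touch few servers, which matches the earlier advice to use HBase when data is searched by key.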
  • 21. HBase Column Storage
    HBase stores data like tags for a key; for example:
    Row       | Column Family | Column      | Cell
    Star Wars | Cast          | Cast:Actor1 | Harrison Ford
    Star Wars | Cast          | Cast:Actor2 | Carrie Fisher
    Star Wars | Reviews       | Review:IMDB | Review URL
    Star Wars | Reviews       | Review:ET   | Review URL2
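The "tags for a key" idea maps naturally onto nested dictionaries. This is a minimal in-memory sketch, not the HBase API: the helper names are hypothetical, and the second row is invented purely to show that rows need not share the same columns.

```python
# Hypothetical in-memory model: row key -> "family:qualifier" -> cell value.
table = {
    "Star Wars": {
        "Cast:Actor1": "Harrison Ford",
        "Cast:Actor2": "Carrie Fisher",
        "Review:IMDB": "Review URL",
        "Review:ET": "Review URL2",
    },
    # Illustrative extra row with fewer columns: no fixed schema is required.
    "Blade Runner": {"Cast:Actor1": "Harrison Ford"},
}

def get_family(table, row_key, family):
    """Return all cells of one column family for a row."""
    row = table.get(row_key, {})
    prefix = family + ":"
    return {col: val for col, val in row.items() if col.startswith(prefix)}

def scan(table, start_key, stop_key):
    """Return row keys in [start_key, stop_key), in sorted key order."""
    return [key for key in sorted(table) if start_key <= key < stop_key]

print(get_family(table, "Star Wars", "Cast"))
print(scan(table, "A", "C"))
```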
  • 22. DATA ACCESS
  • 23. Hive Overview
    ◦ Data warehouse software built on top of Hadoop
    ◦ HiveQL provides a SQL-like interface; queries are executed as map reduce jobs
    ◦ Provides structure to HDFS data, similar to an Oracle external table
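"Providing structure to HDFS data" means the schema is applied when the file is read, not when it is written. The sketch below shows that idea in plain Python; it is not Hive code, and the schema, the tab-delimited sample text, and the function name are assumptions for illustration.

```python
import csv
import io

# Hypothetical schema, declared separately from the data (as a table
# definition would be): column names and the type to cast each field to.
SCHEMA = [("page", str), ("user_id", int), ("duration_s", float)]

# Stand-in for a raw tab-delimited file that already sits in HDFS.
RAW_FILE = "home\t42\t3.5\nsearch\t7\t12.0\nhome\t7\t1.25\n"

def read_with_schema(raw_text, schema, delimiter="\t"):
    """Yield one dict per line, applying the schema at read time."""
    reader = csv.reader(io.StringIO(raw_text), delimiter=delimiter)
    for fields in reader:
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}

rows = list(read_with_schema(RAW_FILE, SCHEMA))
home_visits = sum(1 for row in rows if row["page"] == "home")
print(rows[0], home_visits)
```

The raw file is never rewritten; changing the schema only changes how later reads interpret it, which is the property shared with an external table.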
  • 24. Hive Architecture
    [Diagram: Hive CLI (browse, query, HiveQL); Hive components: Parser, Metastore, Execution (Map Reduce), SerDe; all on top of HDFS.]
  • 25. Pig Overview
    ◦ Pig is a layer on top of map-reduce for statisticians (programmers)
    ◦ It provides several standard operators: join, order by, etc.
    ◦ It allows user-defined functions (UDFs) to be included
    ◦ UDFs can be written in Java or Python
  • 26. Datameer Overview (source: http://www.datameer.com/)
    Key philosophy: business users understand Excel; let them do the grouping, sorting, filtering, and aggregates.
    Key steps:
    ◦ Datameer’s source is a mapreduce output.
    ◦ Datameer takes a quick sample of 5000 records.
    ◦ The end user is then presented with an Excel-like interface on top of these 5000 records. Here the end user can define filters, formulas, grouping, aggregations, and joins across sheets (even joining Hadoop data with data from a relational database table).
    ◦ Once the end user has defined the desired end result, they can submit a job to run on the complete data set.
    ◦ Datameer then builds the necessary map reduce jobs and runs them on the complete data set.
    ◦ Finally, the user gets the results and can build charts, tables, etc., all in the browser.
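The "design on a sample, then run on everything" workflow described above can be sketched as follows. This is not Datameer's implementation: the record layout, the aggregation, and the function names are hypothetical, and only the 5000-record sample size comes from the slide.

```python
import random
from collections import defaultdict

SAMPLE_SIZE = 5000  # sample size quoted on the slide

def take_sample(records, size=SAMPLE_SIZE, seed=0):
    """Return up to `size` records chosen at random (seeded for repeatability)."""
    records = list(records)
    if len(records) <= size:
        return records
    return random.Random(seed).sample(records, size)

def total_by_region(records):
    """The user-defined 'sheet logic': filter, group, and aggregate."""
    totals = defaultdict(float)
    for record in records:
        if record["amount"] > 0:
            totals[record["region"]] += record["amount"]
    return dict(totals)

# Illustrative data set standing in for a mapreduce output.
full_data = [
    {"region": "east" if i % 2 else "west", "amount": float(i % 10)}
    for i in range(20000)
]

preview = total_by_region(take_sample(full_data))  # fast feedback on the sample
result = total_by_region(full_data)                # same logic, complete data set
print(preview, result)
```

The design point is that the logic is defined once and is independent of how much data it runs on, so the preview and the full job cannot drift apart.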
  • 27. Excel Integration (source: http://www.informationweek.com/news/software/info_management/232601675?cid=RSSfeed_IWK_News)
    Microsoft announced Excel integration with Hadoop (Feb 2012), together with Hortonworks. Key highlights:
    ◦ Microsoft and Hortonworks will deliver a Hive ODBC driver that will enable integration with Excel
    ◦ Microsoft’s PowerPivot in-memory plug-in for Excel will handle larger data sets
    ◦ There is also a plan for a JavaScript framework for Hadoop enabling Ajax-like iterative…
  • 28. INTEGRATION, SCHEDULING
  • 29. Pentaho Data Integration
    ◦ Pentaho is rated a “strong performer” by Forrester (Feb 2012)
    ◦ It makes building MapReduce jobs easy via its Data Integration IDE
    ◦ It can read/write to HDFS and run map reduce and Pig scripts
    ◦ The IDE has several standard connectors and transformations, and allows custom Java code
    ◦ http://www.pentaho.com/big-data/
  • 30. Pentaho Data Integration (source: http://www.youtube.com/watch?v=KZe1UugxXcs&feature=player_embedded)
    ◦ Step 1: Build Mapper
    ◦ Step 2: Build Reducer
    ◦ Step 3: Run Map Reduce
  • 31. Talend - ETL
    ◦ Talend is another ETL development, scheduling, and monitoring tool
    ◦ It supports HDFS, Pig, Hive, Sqoop
    ◦ http://www.talend.com/products-big-data/
  • 32. Talend ETL – with Hadoop
    ◦ Can invoke Hadoop calls (generates Hive queries)
    ◦ See “Processing” on the right of the slide
  • 33. USER EXPERIENCE
  • 34. User Experience
    This layer of the stack is generally custom development. However, some tools that work with Hadoop are:
    ◦ Tableau for data analysis and visualizations
    ◦ SAS Enterprise Miner
    ◦ IBM BigInsights
  • 35. REFERENCES
  • 36. References
    ◦ http://hadoop.apache.org/
    ◦ http://www.cloudera.com/
    ◦ http://www-01.ibm.com/software/data/bigdata/
    ◦ http://www.cs.duke.edu/starfish/index.html
    ◦ http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
    ◦ http://karmasphere.com/Download/karmasphere-studio-community-virtual-appliance-for-ibm.html
    ◦ http://www.slideshare.net/zshao/hive-data-warehousing-analytics-on-hadoop-presentation
    ◦ http://www.slideshare.net/trihug/trihug-november-pig-talk-by-alan-gates?from=ss_embed
    ◦ http://www.trihug.org/
    ◦ http://www.cisco.com/en/US/prod/collateral/switches/ps9441/ps9670/white_paper_c11-690561.html
    ◦ http://www.cloudera.com/wp-content/uploads/2011/01/oraoopuserguide-With-OraHive.pdf
  • 39. MapR
    ◦ No single point of failure for the name node
    ◦ Performance improvements (5 times faster than HDFS)
    ◦ Snapshots, multi-site copies
    ◦ It has a separate (extended) Mapreduce
    ◦ MapR uses 8K blocks instead of the 64 MB block size of HDFS
  • 40. Open Topics – why there is an adoption issue
    ◦ Security: no concept of roles
    ◦ Backup and recovery
    ◦ ACID not supported
  • 41. Thank You to my viewers

Editor's Notes

  1. http://www.zdnet.com/blog/big-data/hadoop-20-mapreduce-in-its-place-hdfs-all-grown-up/267
     http://www.informationweek.com/news/software/info_management/231900633
     http://www.informationweek.com/news/software/info_management/232400021
     http://www.informationweek.com/news/software/info_management/232300181
     http://www-01.ibm.com/software/data/infosphere/biginsights/basic.html
     http://www.informationweek.com/news/galleries/software/enterprise_apps/232500290?pgno=1
  2. http://www.informationweek.com/news/software/bi/229900002?cid=RSSfeed_IWK_News
  3. http://www.informationweek.com/news/software/bi/229900002?cid=RSSfeed_IWK_News
  4. Hadoop implements the core features that are at the heart of most modern EDWs: cloud-facing architectures, MPP, in-database analytics, mixed workload management, and a hybrid storage layer
  5. http://wiki.apache.org/hadoop/PoweredBy
  6. HBase is a full-fledged database (albeit not relational) which uses HDFS as storage. This means you can run interactive queries and updates on your dataset. Sqoop takes data from any DB that supports JDBC and moves it into HDFS. If you haven’t already, check out Toad® for Cloud Databases, our free, fully functional, commercial-grade cloud solution. With Toad for Cloud Databases, you can easily generate queries, migrate, browse, and edit data, as well as create reports and tables, all in a familiar SQL view. Finally, everyone can experience the productivity gains and cost benefits of NoSQL and big data without the headaches. Toad for Cloud Databases provides unrivaled support for Apache Hive, Apache HBase, Apache Cassandra, MongoDB, Amazon SimpleDB, Microsoft Azure Table Services, Microsoft SQL Azure, and all open database connectivity (ODBC)-enabled relational databases (including Oracle, SQL Server, MySQL, DB2, and others).
  7. Netflix uses a similar reference architecture for movie recommendations. Hadoop is not suited for low latency. Facebook does use HBase for messaging, which is close to real-time functionality.
  8. http://facility9.com/nosql/glossary/
  9. Weka read this… it is similar… Mahoot is AI…