The Hadoop Ecosystem




© Hortonworks Inc. 2012
What is Big Data?
• Big does not always mean petabytes

• Big means too big for traditional
  systems to handle efficiently
Big Data Facts
• Twitter generates 8TB of data every day

• eBay data warehouse is 10+ PB

• Facebook data warehouse is 36+ PB

• Yahoo! has 100+ PB data

• Google scans and indexes 500+ PB data
Data Types
• Structured
  – Pre-defined schema
  – Example: relational database tables
• Semi-structured
  – Has some structure, but no fixed schema
  – Cannot be stored directly in the rows and tables of a
    relational database
  – Examples: logs, tweets
• Unstructured
  – Irregular or missing structure
  – Examples: free-form text, reports, customer feedback
    forms
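To make the three categories concrete, here is a small illustrative sketch (the log line and feedback text are made-up examples, not from any real system): a structured record fits a fixed schema, a semi-structured log line has recognizable parts but no enforced schema, and unstructured text has nothing reliable to parse.

```python
import re

# Structured: a fixed schema, like a row in a relational table.
structured_row = {"id": 1, "name": "Alice", "signup_date": "2012-01-15"}

# Semi-structured: a log line has recognizable parts (timestamp,
# level, message) but no schema enforced by the storage layer.
log_line = "2012-06-01 12:00:00 WARN disk usage at 91%"
match = re.match(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\w+) (.*)", log_line)
timestamp, level, message = match.groups()

# Unstructured: free-form text with no reliable structure to extract.
feedback = "Loved the product, but shipping took way too long!"

print(level)  # WARN
```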


Characteristics of Big Data

• Volume

• Velocity

• Variety

• Value

Problems with Legacy Solutions
• Expensive
   – Scaling up costs lots of $$

• Rigid

• Stale data




Hadoop Approach

• Process data locally

• Expect hardware failures

• Handle failover gracefully

• Replicate each piece of data to a small
  group of nodes (versus mirroring the entire
  database)
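The last bullet can be made concrete with a rough back-of-the-envelope sketch (plain Python, not Hadoop code; the dataset and cluster sizes are hypothetical). With block-level replication, raw storage scales with the replication factor (3 by default in HDFS), not with the number of nodes, unlike full mirroring:

```python
def replicated_storage_tb(data_tb, replication_factor=3):
    """Raw storage needed when each block is kept on a small group of
    `replication_factor` nodes, as HDFS does by default."""
    return data_tb * replication_factor

def mirrored_storage_tb(data_tb, num_nodes):
    """Raw storage if every node held a full copy of the data."""
    return data_tb * num_nodes

data_tb = 100   # hypothetical dataset size
nodes = 50      # hypothetical cluster size

print(replicated_storage_tb(data_tb))       # 300 TB raw, regardless of cluster size
print(mirrored_storage_tb(data_tb, nodes))  # 5000 TB raw
```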
Compare with RDBMS

[Figure: comparison of Hadoop with RDBMS]
Hadoop Core Components

[Figure: Hadoop core components]

Hadoop Cluster – Basic Configuration

[Figure: basic Hadoop cluster configuration]
MapReduce In Action

[Figure: logical view of a MapReduce job]

[Figure: physical view of a MapReduce job]
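The logical flow of a MapReduce job can be sketched in a few lines of plain Python: map emits key/value pairs, the shuffle groups values by key, and reduce aggregates each group. This is only an in-process illustration of the word-count example; real Hadoop distributes each phase across the cluster's physical nodes.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit (word, 1) for every word in the input."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["the quick brown fox", "the lazy dog", "The end"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["the"])  # 3
```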
Hadoop Ecosystem

[Figure: Hortonworks Data Platform stack, spanning Develop,
Analyze, and Visualize workloads, with Operate on one side and
Integrate on the other]

• Distributed Storage (HDFS)
• Distributed Processing (MapReduce)
• Scripting (Pig)
• Query (Hive)
• NoSQL Column DB (HBase)
• Metadata Management (HCatalog)
• Workflow & Scheduling (Oozie)
• Management & Monitoring (Ambari, ZooKeeper)
• Data Extraction & Load (Sqoop, Talend, WebHDFS, WebHCatalog)
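As a taste of the integration layer, WebHDFS exposes HDFS over a plain REST API. The sketch below only builds a request URL, so it needs no cluster; the hostname, path, and user are hypothetical, and 50070 is assumed as the NameNode web port that was the default in this era of Hadoop.

```python
def webhdfs_url(host, path, op, port=50070, user=None):
    """Build a WebHDFS REST URL of the form
    http://<host>:<port>/webhdfs/v1<path>?op=<OP>[&user.name=<user>]"""
    url = "http://%s:%d/webhdfs/v1%s?op=%s" % (host, port, path, op)
    if user:
        url += "&user.name=%s" % user
    return url

# Hypothetical host and path, for illustration only.
url = webhdfs_url("namenode.example.com", "/data/logs", "LISTSTATUS", user="hdfs")
print(url)
```

A client would issue this as an HTTP GET and receive the directory listing as JSON.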
What Next?

1   Download Hortonworks Data Platform
    hortonworks.com/download

2   Use the getting-started guide
    hortonworks.com/get-started

3   Learn more… get support

    Hortonworks Training
    • Expert role-based training
    • Courses for admins, developers, and operators
    • Certification program
    • Custom onsite options
    hortonworks.com/training

    Hortonworks Support
    • Full-lifecycle technical support across four service levels
    • Delivered by Apache Hadoop experts/committers
    • Forward-compatible
    hortonworks.com/support
Hortonworks Support Subscriptions
Objective: help organizations successfully develop
and deploy solutions based on Apache Hadoop
• Full-lifecycle technical support available
  – Developer support for design, development and POCs
  – Production support for staging and production environments
      – Up to 24x7 with 1-hour response times

• Delivered by the Apache Hadoop experts
  – Backed by development team that has released every major version of
    Apache Hadoop since 0.1

• Forward-compatibility
  – Hortonworks’ leadership role helps ensure bug fixes and patches can be
    included in future versions of Hadoop projects



Hortonworks Training
Objective: help organizations overcome Hadoop
knowledge gaps
• Expert role-based training for developers,
  administrators & data analysts
  – Heavy emphasis on hands-on labs
  – Extensive schedule of public training courses available
    (hortonworks.com/training)

• Comprehensive certification programs



• Customized, on-site courses available


NYC Meetup – Introduction to the Hadoop Ecosystem


Editor's Notes

  • #2 Hi, my name is Abhijit Lele. I am a Solutions Engineer at Hortonworks. I help our customers understand and achieve their business and technical goals with Hadoop and the big data ecosystem in general.
  • #8 So if we turn our original assumptions on their heads, we can come up with an alternate set of rules that allows for a new way of thinking about large data stores.