From Zero to Hadoop
March 19, 2013
Agenda
• Hadoop Ecosystem Overview
• Hadoop Core Technical Overview
  • HDFS
  • MapReduce
• Hadoop in the Enterprise
  • Cluster Planning
  • Cluster Management with Cloudera Manager
Hadoop Ecosystem Overview
What Are All These Things?
Hadoop Ecosystem

[Diagram: the CDH stack shown as a pipeline of Ingest, Store, Explore, Process, Analyze, and Serve stages. Storage: HDFS and HBase. Batch compute: MapReduce and MapReduce2. Real-time access: Impala. Resource management and coordination: YARN and ZooKeeper. Integration and access: Sqoop, Flume, Hive, Pig, Mahout, DataFu, Hue, Oozie, Whirr, FUSE-DFS, and WebHDFS/HttpFS. Management: Cloudera Manager and Cloudera Navigator, plus connectors for BI, ETL, and RDBMS tools.]
Sqoop

Performs bidirectional data transfers between Hadoop and almost any SQL database with a JDBC driver.
FlumeNG

A streaming data collection and aggregation system for massive volumes of data, such as RPC services, Log4J, Syslog, etc.

[Diagram: multiple clients sending events through Flume agents.]
HBase
• A low-latency, distributed, non-SQL database built on HDFS
• A "columnar database"
Hive
• Relational database abstraction using a SQL-like dialect called HiveQL
• Statements are executed as one or more MapReduce jobs

SELECT s.word, s.freq, k.freq
FROM shakespeare s
JOIN kjv k ON (s.word = k.word)
WHERE s.freq >= 5;
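What the HiveQL join on this slide does can be sketched in plain Python: match rows from two word-frequency tables on the word, keep only sufficiently frequent words, and emit both frequencies. The table contents below are made-up sample data, not the real training datasets.

```python
# Plain-Python sketch of the slide's join: two word -> frequency tables,
# joined on the word, filtered on the first table's frequency.
shakespeare = {"the": 25000, "sword": 12, "rare": 4}
other_corpus = {"the": 60000, "sword": 7, "plague": 9}

result = [
    (word, s_freq, other_corpus[word])
    for word, s_freq in shakespeare.items()
    if word in other_corpus and s_freq >= 5   # JOIN ON word, WHERE s.freq >= 5
]
# result == [("the", 25000, 60000), ("sword", 12, 7)]
```

Hive compiles the real query into MapReduce jobs that do the same thing at scale; the point here is only the join/filter semantics.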
Pig
• High-level scripting language for executing one or more MapReduce jobs
• Created to simplify authoring of MapReduce jobs
• Can be extended with user-defined functions

emps = LOAD 'people.txt' AS (id, name, salary);
rich = FILTER emps BY salary > 200000;
sorted_rich = ORDER rich BY salary DESC;
STORE sorted_rich INTO 'rich_people.txt';
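The Pig script on this slide is just a filter followed by a sort; a plain-Python equivalent (with made-up sample records standing in for people.txt) looks like:

```python
# Each tuple is (id, name, salary), mirroring the Pig schema.
emps = [(1, "ann", 250000), (2, "bob", 150000), (3, "cat", 300000)]

# FILTER emps BY salary > 200000;
rich = [e for e in emps if e[2] > 200000]

# ORDER rich BY salary DESC;
sorted_rich = sorted(rich, key=lambda e: e[2], reverse=True)
# sorted_rich == [(3, "cat", 300000), (1, "ann", 250000)]
```

Pig generates the equivalent MapReduce jobs for you, which is the entire point of the language.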
Oozie

A workflow engine and scheduler built specifically for large-scale job orchestration on a Hadoop cluster.
ZooKeeper
• ZooKeeper is a distributed consensus engine
• Provides well-defined concurrent access semantics:
  • Leader election
  • Service discovery
  • Distributed locking / mutual exclusion
  • Message board / mailboxes
Mahout

A machine learning library with algorithms for:
• Recommendation based on user behavior
• Clustering of related documents into groups
• Classification based on existing categorized documents
• Frequent item-set mining (shopping cart contents)
Hadoop Security
• Authentication is secured by MIT Kerberos v5 and integrated with LDAP
• Provides identity, authentication, and authorization
• Useful for multitenancy or secure environments
Hadoop Core Technical Overview
Only the Good Parts
Components of HDFS
• NameNode – Holds all metadata for HDFS
  • Needs to be a highly reliable machine
    • RAID drives – typically RAID 10
    • Dual power supplies
    • Dual network cards – bonded
    • The more memory the better – typically 36GB to 64GB
• Secondary NameNode – Provides checkpointing for the NameNode. The same hardware as the NameNode should be used.
Components of HDFS – Contd.
• DataNodes – Hardware will depend on the specific needs of the cluster
  • No RAID needed; JBOD (just a bunch of disks) is used
  • Typical ratio is:
    • 1 hard drive
    • 2 cores
    • 4GB of RAM
HDFS Architecture Overview

[Diagram: Host 1 runs the NameNode and Host 2 the Secondary NameNode; Hosts 3 through n each run a DataNode.]
HDFS Block Replication

[Diagram: a file split into five 64MB blocks (block size = 64MB, replication factor = 3); each block is replicated to 3 of the 5 DataNodes, so every block exists on three separate nodes.]
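The arithmetic behind the slide's picture is simple and worth making explicit: a file is split into ceil(size / block_size) blocks, and each block is stored replication-factor times across the DataNodes. A minimal sketch using the slide's numbers (64MB blocks, replication factor 3; the 320MB file size is an example):

```python
import math

BLOCK_SIZE_MB = 64
REPLICATION = 3

def num_blocks(file_size_mb):
    """A file is split into ceil(size / block_size) HDFS blocks."""
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)

def raw_storage_mb(file_size_mb):
    """Each block is stored REPLICATION times, so raw usage is size x RF."""
    return file_size_mb * REPLICATION

blocks = num_blocks(320)          # 5 blocks, as in the diagram
replicas = blocks * REPLICATION   # 15 block copies spread across the nodes
```

This is also why usable HDFS capacity is roughly one third of raw disk capacity at the default replication factor.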
MapReduce – Map
• Records from the data source (lines out of files, rows of a database, etc.) are fed into the map function as key/value pairs: e.g., (filename, line).
• map() produces one or more intermediate values along with an output key from the input.

[Diagram: the Map Task emits intermediate (key, value) pairs; the Shuffle Phase groups all intermediate values by key; the Reduce Task emits the final (key, values).]
MapReduce – Reduce
• After the map phase is over, all the intermediate values for a given output key are combined together into a list.
• reduce() combines those intermediate values into one or more final values for that same output key.

[Diagram: the same map, shuffle, and reduce data flow as on the previous slide.]
MapReduce – Shuffle and Sort
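The map, shuffle-and-sort, and reduce phases described above can be simulated in a few lines of Python; word count is the canonical example. Nothing here is Hadoop-specific, it only illustrates the data flow:

```python
from collections import defaultdict

def map_phase(records):
    """map(): emit one (word, 1) pair per word in each input line."""
    for _key, line in records:  # records are (key, value) pairs, e.g. (filename, line)
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle & sort: group all intermediate values by key, sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(groups):
    """reduce(): collapse each key's value list into one final value."""
    return {key: sum(values) for key, values in groups}

records = [("f1", "the cat sat"), ("f2", "the cat ran")]
counts = reduce_phase(shuffle_phase(map_phase(records)))
# counts == {"the": 2, "cat": 2, "sat": 1, "ran": 1}
```

In real Hadoop the shuffle happens across the network between map and reduce tasks, and the framework guarantees each reducer sees all values for its keys in sorted key order.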
Hadoop In the Enterprise
How It Works In The Real World
Networking
• One of the most important things to consider when setting up a Hadoop cluster
• Typically a top-of-rack switch is used with Hadoop, along with a core switch
• Be careful about oversubscribing the backplane of the switch!
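Oversubscription is just the ratio of the traffic the servers in a rack can generate to the uplink bandwidth toward the core. A worked example with purely illustrative numbers (not a sizing recommendation): 40 nodes with 10Gb/s NICs behind a top-of-rack switch that has 2 x 40Gb/s core uplinks.

```python
# Illustrative oversubscription calculation for a top-of-rack switch.
nodes, nic_gbps = 40, 10          # hypothetical rack: 40 nodes at 10Gb/s each
uplinks, uplink_gbps = 2, 40      # hypothetical 2 x 40Gb/s uplinks to the core

server_bandwidth = nodes * nic_gbps       # 400 Gb/s the rack can generate
core_bandwidth = uplinks * uplink_gbps    # 80 Gb/s toward the core

oversubscription = server_bandwidth / core_bandwidth  # 5.0, i.e. 5:1
```

MapReduce's shuffle phase moves a lot of data between racks, so high ratios here directly slow down jobs.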
Hadoop Typical Data Pipeline

[Diagram: original source data flows from the data sources into HDFS via Sqoop and Flume; Oozie orchestrates Pig, Hive, and MapReduce processing; the result/calculated data is exported via Sqoop to data marts and the data warehouse.]
Hadoop Use Cases

Industry         Advanced Analytics Use Case      Data Processing Use Case
Web              Social Network Analysis          Clickstream Sessionization
Media            Content Optimization             Clickstream Sessionization
Telco            Network Analytics                Mediation
Retail           Loyalty & Promotions Analysis    Data Factory
Financial        Fraud Analysis                   Trade Reconciliation
Federal          Entity Analysis                  SIGINT
Bioinformatics   Sequencing Analysis              Genome Mapping
Hadoop in the Enterprise

[Diagram: operators use management tools, engineers use enterprise IDEs, analysts use BI/analytics tools, and business users consume reporting. Hadoop sits alongside the enterprise data warehouse and web applications, ingesting logs, files, web data, and relational databases, and serving customers through existing applications.]
Cloudera Manager
End-to-End Administration for CDH
1. Manage – Easily deploy, configure & optimize clusters
2. Monitor – Maintain a central view of all activity
3. Diagnose – Easily identify and resolve issues
4. Integrate – Use Cloudera Manager with existing tools
Install A Cluster In 3 Simple Steps
Cloudera Manager Key Features
1. Find Nodes – Enter the names of the hosts which will be included in the Hadoop cluster. Click Continue.
2. Install Components – Cloudera Manager automatically installs the CDH components on the hosts you specified.
3. Assign Roles – Verify the roles of the nodes within your cluster. Make changes as necessary.
View Service Health & Performance
Cloudera Manager Key Features

Monitor & Diagnose Cluster Workloads
Cloudera Manager Key Features

Visualize Health Status With Heatmaps
Cloudera Manager Key Features

Rolling Upgrades
Cloudera Manager Key Features
Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

If you are interested in Hadoop and its capabilities, but you are not sure where to begin, this is the session for you. Learn the basics of Hadoop, see how to spin up a development cluster in the cloud or on-premise, and start exploring ETL processing with SQL and other familiar tools.

Speaker notes:

• Pool commodity servers in a single hierarchical namespace. Designed for large files that are written once and read many times. The example here shows what happens with a replication factor of 3: each data block is present on at least 3 separate DataNodes. A typical Hadoop node is eight cores with 16GB of RAM and four 1TB SATA disks. The default block size is 64MB, though most folks now set it to 128MB.
• Apache Hadoop is a new solution in your existing infrastructure. It does not replace any major existing investment. Hadoop brings data that you're already generating into context and integrates it with your business. You get access to key information about how your business is operating by pulling together web and application logs, unstructured files, web data, and relational data. Hadoop is used by your team to analyze this data and deliver it to business users directly and via existing data management technologies.