Big Data Analytics with Hadoop

Hadoop is a flexible and available architecture for large-scale computation and data processing on a network of commodity hardware.

    Presentation Transcript

    • Open for Business…
    • 2 Big Data Analytics with Hadoop WHO AM I • Big Data / Analytics / BI & Cloud Solutions Specialist • http://www.linkedin.com/in/JulioPhilippe • Skills: Big Data Analytics, Business Intelligence, Data Warehousing, IT Transformation, IT Solutions, Cloud Computing, Datacenter Optimization, Business Development, Architecture, Hadoop, Management, Mentoring
    • 3 Big Data Analytics with Hadoop « Data is not born relevant; it becomes relevant! » BIG DATA MANAGEMENT INSIGHT
    • 4 Big Data Analytics with Hadoop DATA-DRIVEN ON-LINE WEBSITES • To run the apps: messages, posts, blog entries, video clips, maps, web graph... • To give the data context: friends networks, social networks, collaborative filtering... • To keep the applications running: web logs, system logs, system metrics, database query logs...
    • 5 Big Data Analytics with Hadoop BIG DATA – NOT ONLY DATA VOLUME • Improve analytics and statistics models • Extract business value by analyzing large volumes of multi-structured data from various sources such as databases, websites, blogs, social media, smart sensors... • Build efficient architectures that are massively parallel, highly scalable and available to handle very large data volumes up to several petabytes • Thematics: Web Technologies, Database Scale-out, Relational Data Analytics, Distributed Data Analytics, Distributed File Systems, Real Time Analytics
    • 6 Big Data Analytics with Hadoop • Digital marketing optimization (e.g., web analytics, attribution, golden path analysis) • Data exploration and discovery (e.g., identifying new data-driven products, new markets) • Fraud detection and prevention (e.g., revenue protection, site integrity & uptime) • Social network and relationship analysis (e.g., influencer marketing, outsourcing, attrition prediction) • Machine-generated data analytics (e.g., remote device insight, remote sensing, location-based intelligence) • Data retention (e.g. long term retention of data, data archiving) BIG DATA APPLICATIONS DOMAINS
    • 7 Big Data Analytics with Hadoop SOME BIG DATA USE CASES BY INDUSTRY
      - Energy: smart meter analytics; distribution load forecasting & scheduling; condition-based maintenance
      - Telecommunications: network performance; new products & services creation; Call Detail Records (CDRs) analysis; customer relationship management
      - Retail: dynamic price optimization; localized assortment; supply-chain management; customer relationship management
      - Manufacturing: supply chain management; customer care call centers; preventive maintenance and repairs; customer relationship management
      - Banking: fraud detection; trade surveillance; compliance and regulatory; customer relationship management
      - Insurance: catastrophe modeling; claims fraud; reputation management; customer relationship management
      - Public: fraud detection; fighting criminality; threats detection; cyber security
      - Media: large-scale clickstream analytics; abuse and click-fraud prevention; social graph analysis and profile segmentation; campaign management and loyalty programs
      - Healthcare: clinical trials data analysis; patient care quality and program analysis; supply chain management; drug discovery and development analysis
    • 8 Big Data Analytics with Hadoop TOP 10 BIG DATA SOURCES 1. Social network profiles 2. Social influencers 3. Activity-generated data 4. SaaS & Cloud Apps 5. Public web information 6. MapReduce results 7. Data Warehouse appliances 8. NoSQL databases 9. Network and in-stream monitoring technologies 10. Legacy documents
    • 9 Big Data Analytics with Hadoop NEW DATA AND MANAGEMENT ECONOMICS • Storage trends: new data structures (distributed file systems, NoSQL databases, NewSQL…) • Compute trends: new analytics (massively parallel processing, algorithms…) • (Diagram: evolution from the proprietary and dedicated data warehouse, OLTP as the data warehouse, and the general-purpose data warehouse toward object storage, distributed file systems, federated/sharded and master/master or master/slave replication, the enterprise data warehouse for multi-structured data, and the logical data warehouse with Master Data Management, Data Quality and Data Integration)
    • 10 Big Data Analytics with Hadoop DISTRIBUTED FILE SYSTEMS • Systems that permanently store data • Data is divided into logical units (files, shards, chunks, blocks…) • A file path joins file and directory names into a relative or absolute address to identify a file • Support access to files and remote servers • Support concurrency • Support distribution • Support replication • NFS, GPFS, Hadoop DFS, GlusterFS, MogileFS, MooseFS… • (Diagram: an application talking to a master node that coordinates several slave nodes)
    • 11 Big Data Analytics with Hadoop UNSTRUCTURED DATA AND OBJECT STORAGE • Metadata values are specific to each individual type • Enables automated management of content • Ensures integrity, retention and authenticity • Unstructured Data + Metadata = Object Storage (Diagram: video, image, audio and text content, each stored as a binary object with a UID and its metadata)
    • 12 Big Data Analytics with Hadoop NOSQL DATABASES CATEGORIES • NoSQL = Not only SQL • Popular name for a subset of structured storage software that is designed with the intention of delivering increased optimization for high-performance operations on large datasets • Basically available, scalable, eventually consistent • Easy to use • Tolerant of scale by way of horizontal distribution • Key-Value: Redis, Riak (Basho), CouchBase, Voldemort (LinkedIn), MemcacheDB… • Column: BigTable (Google), HBase, Cassandra (DataStax), Hypertable… • Document: MongoDB (10Gen), CouchDB, Terrastore, SimpleDB (AWS)… • Graph: Neo4j (Neo Technology), Jena, InfiniteGraph (Objectivity), FlockDB (Twitter)…
    • 13 Big Data Analytics with Hadoop WHAT IS HADOOP? “Flexible and available architecture for large scale computation and data processing on a network of commodity hardware” Open Source Software + Commodity Hardware = IT Cost Reduction
    • 14 Big Data Analytics with Hadoop • Searching • Log processing • Recommendation systems • Analytics • Video and Image analysis • Data Retention WHAT IS HADOOP USED FOR?
    • 15 Big Data Analytics with Hadoop • Top level Apache Foundation project • Large, active user base, mailing lists, user groups • Very active development, strong development team • http://wiki.apache.org/hadoop/PoweredBy#L WHO USES HADOOP?
    • 16 Big Data Analytics with Hadoop MOVING COMPUTATION TO STORAGE • General purpose storage servers • Combine servers with disks & networking to reduce latency • Specialized software enables general purpose system designs to provide high performance data services • (Diagram: moving data processing to storage, from legacy applications in front of a storage array (SAN, NAS) reached through servers and the network, to emerging and next-generation designs where data processing and metadata management run on the storage itself)
    • 17 Big Data Analytics with Hadoop BIG DATA ARCHITECTURE • BI & DWH architecture (conventional): SQL based; high availability; enterprise database; right design for structured data; current storage hardware (SAN, NAS, DAS) • Analytics architecture (next generation): not only SQL based; high scalability, availability and flexibility; compute and storage in the same box to reduce network latency; right design for semi-structured and unstructured data • (Diagram: conventional stack of app servers, database servers, network switches, SAN switch and storage array vs. next-generation stack of edge nodes, network switches and data nodes)
    • 18 Big Data Analytics with Hadoop APACHE HADOOP 2.0 ECOSYSTEM http://incubator.apache.org/projects/ • Storage: HDFS • Cluster resource management: YARN / MR v2 • Data processing: MR v1, Tez, OpenMPI, Spark, S4 • Data access: Pig, Hive, HBase (database), Giraph (graph), Storm (streaming), HCatalog • Serialization: Avro, Thrift • Security: Knox • Management: Oozie, ZooKeeper, Chukwa, Kafka • Management ops: Ambari, BigTop, Whirr • Development: Crunch, MRUnit, HDT
    • 19 Big Data Analytics with Hadoop HADOOP COMMON • Hadoop Common is a set of utilities that support the Hadoop subprojects. • Hadoop Common includes Filesystem, RPC, and Serialization libraries.
    • 20 Big Data Analytics with Hadoop HDFS & MAPREDUCE • Hadoop Distributed File System - A scalable, fault-tolerant, high-performance distributed file system - Asynchronous replication - Write-once and read-many (WORM) - Hadoop cluster with 3 DataNodes minimum - Data divided into 64MB (default) or 128MB blocks, each block replicated 3 times (default) - No RAID required for DataNodes - Interfaces: Java, Thrift, C Library, FUSE, WebDAV, HTTP, FTP - NameNode holds filesystem metadata - Files are broken up and spread over the DataNodes • Hadoop MapReduce - Software framework for distributed computation - Input | Map() | Copy/Sort | Reduce() | Output - JobTracker schedules and manages jobs - TaskTracker executes individual map() and reduce() tasks on each cluster node (see the sketch below) • (Diagram: master node and worker nodes)
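
A minimal sketch of the Map and Reduce phases described above, using the classic word-count example with the org.apache.hadoop.mapreduce API; the class names and whitespace tokenization are illustrative, not taken from the deck:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit (word, 1) for every token found in the input split.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce phase: after the copy/sort step, sum the counts collected for each word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```
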
    • 21 Big Data Analytics with Hadoop HDFS - READ FILE 1. The client API calculates the block index based on the offset of the file pointer and makes a request to the NameNode 2. The NameNode replies with the DataNodes that have a copy of that block 3. The client contacts the DataNodes directly without going through the NameNode 4. The DataNodes read the blocks 5. The DataNodes respond to the client about the success • (Diagram: a client on an edge node reading a file via the NameNode and DataNodes)
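
For illustration, a small sketch of how a client on an edge node might read an HDFS file through the Java FileSystem API; the NameNode URI and file path are placeholders, not values from the deck:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        // fs.defaultFS normally comes from core-site.xml on the edge node;
        // the URI below is only a placeholder.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        try (FileSystem fs = FileSystem.get(conf);
             FSDataInputStream in = fs.open(new Path("/data/sample.txt"));
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Block locations are resolved via the NameNode;
                // the bytes themselves are streamed from the DataNodes.
                System.out.println(line);
            }
        }
    }
}
```
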
    • 22 Big Data Analytics with Hadoop HDFS - WRITE FILE 1. The client contacts the NameNode, which designates one of the replicas as the primary 2. The response of the NameNode says which replica is the primary and which are the secondary replicas 3. The client pushes its changes to all DataNodes in any order, and each change is stored in a buffer on each DataNode 4. The client sends a “commit” request to the primary, which determines an update order and then pushes this order to all other secondaries 5. After all secondaries complete the commit, the primary responds to the client about the success 6. All changes of block distribution and metadata are written to an operation log file at the NameNode • (Diagram: a client on an edge node writing a file through the primary and secondary replica DataNodes)
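
A corresponding write-path sketch with the same Java FileSystem API; the path is illustrative, and the comments restate the replication behavior described above rather than anything new:

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath if present.
        Configuration conf = new Configuration();

        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/data/output/report.txt"),
                                                true /* overwrite */)) {
            // The client streams the data; HDFS replicates each block to the
            // DataNodes chosen by the NameNode (3 copies by default).
            out.write("hello hadoop\n".getBytes(StandardCharsets.UTF_8));
            out.hsync();  // flush the buffered data to the DataNodes
        }
    }
}
```
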
    • 23 Big Data Analytics with Hadoop MAPREDUCE - EXEC FILE The client program is copied to each node 1. The JobTracker determines the number of splits from the input path and selects some TaskTrackers based on their network proximity to the data sources 2. The JobTracker sends the task requests to those selected TaskTrackers 3. Each TaskTracker starts the map phase processing by extracting the input data from the splits 4. When a map task completes, the TaskTracker notifies the JobTracker; when all the TaskTrackers are done, the JobTracker notifies the selected TaskTrackers for the reduce phase 5. Each TaskTracker reads the region files remotely and invokes the reduce function, which collects the key/aggregated value into the output file (one per reducer node) 6. After both phases complete, the JobTracker unblocks the client program • (Diagram: a client on an edge node submits a job to the JobTracker, which drives map and reduce tasks on the TaskTrackers)
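
A possible driver that submits the word-count job from the earlier sketch and blocks until the framework unblocks it (step 6); input and output paths come from the command line:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);      // from the earlier sketch
        job.setCombinerClass(WordCountReducer.class);   // optional map-side aggregation
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input splits are derived from this path; the framework schedules
        // map tasks close to the blocks that hold them.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Blocks until the job finishes.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```
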
    • 24 Big Data Analytics with Hadoop MR VS. YARN ARCHITECTURE • MR: MapReduce • YARN: Yet Another Resource Negotiator • MR v1: clients submit jobs to a JobTracker (co-located with the NameNode); TaskTrackers run the tasks on the DataNodes • YARN / MR v2: clients submit applications to the ResourceManager; NodeManagers on the DataNodes host a per-application ApplicationMaster and the Containers that run its tasks
    • 25 Big Data Analytics with Hadoop HBASE • It's not a relational database (no joins) • Sparse data: nulls are stored for free • Semi-structured or unstructured data • Data changes through time • Versioned data • Scalable: goal of billions of rows x millions of columns • Clone of BigTable (Google) • Implemented in Java (clients: Java, C++, Ruby...) • Data is stored column-oriented • Distributed over many servers • Tolerant of machine failure • Layered over HDFS • Strong consistency • Example: (Table, Row_Key, Family, Column, Timestamp) = Cell (Value); a region holds rows such as Enclosure1 (timestamp 12: Animal:Type=Zebra, Animal:Size=Medium, Repair:Cost=1000€; timestamp 11: Animal:Type=Lion, Animal:Size=Big) and Enclosure2 (timestamp 13: Animal:Type=Monkey, Animal:Size=Small, Repair:Cost=1500€); see the client sketch below
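
To make the cell model concrete, a hedged sketch using the classic (pre-1.0) HBase Java client against a hypothetical "Zoo" table with the row key, families and columns from the example above; the table name and values are assumptions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class ZooTableExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();  // picks up hbase-site.xml
        HTable table = new HTable(conf, "Zoo");             // hypothetical table name

        // Write one row: row key "Enclosure1", families "Animal" and "Repair".
        Put put = new Put(Bytes.toBytes("Enclosure1"));
        put.add(Bytes.toBytes("Animal"), Bytes.toBytes("Type"), Bytes.toBytes("Zebra"));
        put.add(Bytes.toBytes("Animal"), Bytes.toBytes("Size"), Bytes.toBytes("Medium"));
        put.add(Bytes.toBytes("Repair"), Bytes.toBytes("Cost"), Bytes.toBytes("1000"));
        table.put(put);

        // Read it back; only the requested cells are returned.
        Get get = new Get(Bytes.toBytes("Enclosure1"));
        Result result = table.get(get);
        String type = Bytes.toString(
                result.getValue(Bytes.toBytes("Animal"), Bytes.toBytes("Type")));
        System.out.println("Animal:Type = " + type);

        table.close();
    }
}
```
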
    • 26 Big Data Analytics with Hadoop HBASE • Table - Regions for scalability, defined by row [start_key, end_key) - Store for efficiency, 1 per Family - 1..n StoreFiles (HFile format on HDFS) • Everything is byte • Rows are ordered sequentially by key • Special tables -ROOT- , .META. - Tell clients where to find user data Source: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
    • 27 Big Data Analytics with Hadoop DATA ACCESS • HIVE: data warehouse infrastructure that provides data summarization and ad hoc querying on top of Hadoop - MapReduce for execution - HDFS for storage - MetaStore: table/partition properties; Thrift API with current clients in PHP (web interface), Python and Java (query engine and CLI); metadata stored in any SQL backend - Hive Query Language: basic SQL (Select, From, Join, Group By), equi-join, multi-table insert, multi-group-by, batch query (see the sketch below) • PIG: a high-level data-flow language and execution framework for parallel computation - Pig Latin: data processing language with a compiler that translates to MapReduce - Simple to write MapReduce programs, abstracts you from specific details, focused on data processing, data flow and data manipulation • HCATALOG: table and storage management service for data created using Apache Hadoop - Provides a shared schema and data type mechanism - Provides a table abstraction so that users need not be concerned with where or how their data is stored - Provides interoperability across data processing tools such as Pig, MapReduce and Hive - HCatalog DDL (Data Definition Language) and HCatalog CLI (Command Line Interface)
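
As one hedged illustration of batch HiveQL access, a Java client querying HiveServer2 over JDBC; the host, credentials and the web_logs table are assumptions, not from the deck:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; host, port and database are placeholders.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default", "user", "");
             Statement stmt = conn.createStatement()) {

            // Batch-style HiveQL: the SELECT / GROUP BY is compiled to MapReduce jobs.
            ResultSet rs = stmt.executeQuery(
                    "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page");
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```
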
    • 28 Big Data Analytics with Hadoop DATA TRANSFER • SQOOP: data import/export between an RDBMS and the Hadoop cluster - A tool designed to help users import data from existing relational databases into their Hadoop clusters - Automatic data import - Easy import of data from many databases to Hadoop - Generates code for use in MapReduce applications • FLUME: a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data (log files to the Hadoop cluster) - Simple and flexible architecture based on streaming data flows (batching, compression, filtering, transformation) - Robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms - The system is centrally managed and allows for intelligent dynamic management
    • 29 Big Data Analytics with Hadoop MANAGEMENT • OOZIE: a server-based Workflow Engine specialized in running workflow jobs with actions that execute Hadoop MapReduce and Pig jobs - A server-based Coordinator Engine specialized in running workflows based on time and data triggers - A server-based Bundle Engine that provides a higher-level abstraction to batch a set of coordinator applications; the user can start/stop/suspend/resume/rerun a set of coordinator jobs at the bundle level, resulting in better and easier operational control • CHUKWA: a data collection system for managing large distributed systems - Built on HDFS and MapReduce - Toolkit for displaying, monitoring and analyzing the log files • ZOOKEEPER: a high-performance coordination service for distributed applications - A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services (see the sketch below)
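
A small ZooKeeper client sketch showing the kind of shared configuration and coordination described above; the ensemble hosts, znode path and payload are illustrative:

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
    public static void main(String[] args) throws Exception {
        final CountDownLatch connected = new CountDownLatch(1);

        // Connect to the ZooKeeper ensemble (host list is a placeholder).
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30000,
                new Watcher() {
                    public void process(WatchedEvent event) {
                        if (event.getState() == Event.KeeperState.SyncConnected) {
                            connected.countDown();
                        }
                    }
                });
        connected.await();

        // Store a small piece of shared configuration under a znode...
        String path = "/demo-config";
        if (zk.exists(path, false) == null) {
            zk.create(path, "batch.size=64".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // ...and read it back; every client in the cluster sees the same value.
        byte[] data = zk.getData(path, false, null);
        System.out.println(new String(data));

        zk.close();
    }
}
```
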
    • 30 Big Data Analytics with Hadoop MACHINE LEARNING • Apache Mahout is an Apache project to produce free implementations of distributed or otherwise scalable machine learning algorithms on the Hadoop platform • Mahout machine learning algorithms: – Recommendation mining: takes users’ behavior and finds items a specified user might like – Clustering: takes e.g. text documents and groups them based on related document topics – Classification: learns from existing categorized documents what documents of a specific category look like and is able to assign unlabeled documents to the appropriate category – Frequent item set mining: takes a set of item groups (e.g. terms in a query session, shopping cart contents) and identifies which individual items typically appear together
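
A minimal, non-distributed sketch of recommendation mining with Mahout's in-memory "Taste" recommender API; Mahout also provides Hadoop-based distributed implementations, and the ratings file, user ID and similarity choice here are assumptions made for illustration:

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommendationExample {
    public static void main(String[] args) throws Exception {
        // ratings.csv: userID,itemID,preference (hypothetical local sample file).
        DataModel model = new FileDataModel(new File("ratings.csv"));

        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender =
                new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 items for user 42, based on the behavior of similar users.
        List<RecommendedItem> items = recommender.recommend(42L, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " : " + item.getValue());
        }
    }
}
```
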
    • 31 Big Data Analytics with Hadoop SERIALIZATION • Avro: a data serialization system that provides dynamic integration with scripting languages • Avro Data - Expressive - Smaller and faster - Dynamic - Schema stored with the data - APIs permit reading and creating - Includes a file format and a textual encoding • Avro RPC - Leverages versioning support - Provides cross-language access to Hadoop services
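
A short Avro sketch showing a dynamically built record written to and read from a container file, where the schema travels with the data as described above; the ClickEvent schema and values are hypothetical:

```java
import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroExample {
    private static final String SCHEMA_JSON =
        "{\"type\":\"record\",\"name\":\"ClickEvent\",\"fields\":["
      + "{\"name\":\"url\",\"type\":\"string\"},"
      + "{\"name\":\"count\",\"type\":\"long\"}]}";

    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

        // Build a record dynamically; no generated classes required.
        GenericRecord event = new GenericData.Record(schema);
        event.put("url", "http://example.com/index.html");
        event.put("count", 42L);

        // Write an Avro container file; the schema is stored with the data.
        File file = new File("clicks.avro");
        try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, file);
            writer.append(event);
        }

        // Read it back using the schema embedded in the file.
        try (DataFileReader<GenericRecord> reader =
                     new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
            for (GenericRecord rec : reader) {
                System.out.println(rec.get("url") + " -> " + rec.get("count"));
            }
        }
    }
}
```
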
    • 32 Big Data Analytics with Hadoop MANAGEMENT OPS • Apache Whirr is a set of libraries for running cloud services • A common service API • Provision, Install, Configure and Manage • Deploy clusters on demand for processing or for testing • Command line for deploying clusters • Ambari is a web-based set of tools for – Deploying – Administering – Monitoring Apache Hadoop clusters WHIRR AMBARI
    • 33 Big Data Analytics with Hadoop OTHER APACHE HADOOP PROJECTS https://incubator.apache.org/projects/
      - TEZ: an effort to develop a generic application framework which can be used to process arbitrarily complex data-processing tasks, and also a reusable set of data-processing primitives which can be used by other projects
      - GORA: an ORM framework for column stores such as Apache HBase and Apache Cassandra with a specific focus on Hadoop
      - DRILL: a distributed system for interactive analysis of large-scale datasets, inspired by Google's Dremel
      - LUCENE: Lucene.NET is a source code, class-per-class, API-per-API and algorithmic port of the Java Lucene search engine to the C# and .NET platform utilizing the Microsoft .NET Framework
      - BLUR: a search platform capable of searching massive amounts of data in a cloud computing environment
      - GIRAPH: a large-scale, fault-tolerant, Bulk Synchronous Parallel (BSP)-based graph processing framework
      - HAMA: a distributed computing framework based on BSP (Bulk Synchronous Parallel) computing techniques for massive scientific computations, e.g., matrix, graph and network algorithms
      - ACCUMULO: a distributed key/value store that provides expressive, cell-level access labels
      - CRUNCH: a Java library for writing, testing, and running pipelines of MapReduce jobs on Apache Hadoop
      - MRUNIT: a library to support unit testing of Hadoop MapReduce jobs
      - HADOOP DEVELOPMENT TOOLS: Eclipse-based tools for developing applications on the Hadoop platform
      - BIGTOP: a project for the development of packaging and tests of the Hadoop ecosystem
      - THRIFT: cross-language serialization and RPC framework
      - KNOX: Knox Gateway is a system that provides a single point of secure access for Apache Hadoop clusters
      - KAFKA: a distributed publish-subscribe system for processing large amounts of streaming data
      - CASSANDRA: a columnar NoSQL store with scalability, availability and performance capabilities
      - FALCON: a data processing and management solution for Hadoop designed for data motion, coordination of data pipelines, lifecycle management, and data discovery
      - SENTRY: a highly modular system for providing fine-grained role-based authorization to both data and metadata stored on an Apache Hadoop cluster
      - STORM: a distributed, fault-tolerant, and high-performance real-time computation system that provides strong guarantees on the processing of data
      - S4: S4 (Simple Scalable Streaming System) is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous, unbounded streams of data
      - SPARK: an open source system for fast and flexible large-scale data analysis; Spark provides a general-purpose runtime that supports low-latency execution in several forms
    • 34 Big Data Analytics with Hadoop SOME HADOOP SERVICES PROVIDERS: Cloudera, Hortonworks, MapR, DataStax, Microsoft, VMware http://wiki.apache.org/hadoop/Distributions%20and%20Commercial%20Support
    • 35 Big Data Analytics with Hadoop CLOUDERA IMPALA • Impala: real-time SQL queries, native distributed query engine, optimized for low latency • Answers as fast as you can ask • Everyone can ask questions of all the data; Big Data storage and analytics together • Unified storage: supports HDFS and HBase, flexible file formats • Unified Metastore • Unified security • Unified client interface: ODBC, SQL syntax, Hue, Beeswax • (Diagram: SQL applications connect via ODBC; the Hive Metastore, YARN, NameNode and State Store coordinate DataNodes, each running a Query Planner, Query Coordinator and Query Exec Engine over HBase and HDFS) Source: http://cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html
    • 36 Big Data Analytics with Hadoop CLOUDERA HUE • HUE (aka Hadoop User Experience) • Open source project started at Cloudera • HUE is a web UI for Hadoop • Platform for building custom applications with a nice UI library • User Admin: Account management for HUE users. • File Browser: Browse HDFS; change permissions and ownership; upload, download, view and edit files. • Job Designer: Create MapReduce jobs, which can be templates that prompt for parameters when they are submitted. • Job Browser: View jobs, tasks, counters, logs, etc. • Beeswax: Wizards to help create Hive tables, load data, run and manage Hive queries, and download results in Excel format. • Help: Documentation and help Source : http://blog.cloudera.com/blog/category/hue/
    • 37 Big Data Analytics with Hadoop • The Stinger Initiative is a broad, community- based effort to drive the future of Apache Hive • Stinger delivers 100x performance improvements at petabyte scale with familiar SQL semantics HORTONWORKS STINGER INITIATIVE Interactive query for Apache Hive Source: http://hortonworks.com/labs/stinger/
    • 38 Big Data Analytics with Hadoop MAPR HADOOP MapR Distribution Editions Source : http://www.mapr.com/products/why-mapr MapR Distribution for Apache Hadoop Advantages Source : http://www.mapr.com/products/mapr-editions
    • 39 Big Data Analytics with Hadoop MICROSOFT POLYBASE • PolyBase is a technology in the data processing engine of SQL Server Parallel Data Warehouse (PDW) designed as the simplest way to combine non-relational data and traditional relational data in your analysis • PolyBase provides the easiest and broadest way to access Hadoop with the standard SQL query language, without needing to learn MapReduce • PolyBase moves data in parallel to and from Hadoop and PDW, allowing end users to perform their analysis without the help of IT • (Diagram: a SQL query runs through PolyBase against both the RDBMS (Microsoft SQL Server Parallel Data Warehouse) and Hadoop (HDFS) and returns combined results) Source: http://www.microsoft.com/en-us/sqlserver/solutions-technologies/data-warehousing/polybase.aspx
    • 40 Big Data Analytics with Hadoop VMWARE HADOOP VIRTUALIZATION EXTENSION • HADOOP VIRTUALIZATION EXTENSION (HVE) is designed to enhance the reliability and performance of virtualized Hadoop clusters with an extended topology layer and refined locality-related policies • HVE consists of hooks and extensions to data-locality-related components of Hadoop, including network topology, replica placement/choosing/removal, the balancer and task scheduling; it touches all sub-projects of Hadoop: Common, HDFS and MapReduce • (Diagram: deployment options range from one Hadoop node per physical server to multiple virtual-machine nodes per server, either multiple combined compute and data nodes or multiple compute nodes with one data node per server) Source: http://www.vmware.com/products/big-data-extensions
    • 41 Big Data Analytics with Hadoop HADOOP AND CASSANDRA INTEGRATION • The Brisk solution from DataStax combines the real-time capabilities of Cassandra with the analytical power of Hadoop • The Hadoop Distributed File System (HDFS) is replaced by the Apache Cassandra File System (CFS); data is stored in CFS • Blocks are compressed with Snappy • Hive Metastore in Cassandra – automatically maps Cassandra column families to Hive tables • (Diagram: Hive, Pig and other tools on top of the Job/Task Tracker and Hive Metastore, running over the Cassandra CFS layer)
    • 42 Big Data Analytics with Hadoop HIGH AVAILABILITY SOLUTIONS • Automatic block replication on 3 DataNodes, with rack awareness • NameNode and Standby NameNode • Quorum Journal Manager and Zookeeper • Disaster recovery by replication • Hive and NameNode metastore backups • HDFS Snapshots • (Diagram: an active NameNode/JobTracker and a standby NameNode, with worker nodes each running DataNode and TaskTracker daemons)
    • 43 Big Data Analytics with Hadoop HA* WITH CONVENTIONAL SHARED STORAGE • NameNode and Standby NameNode • Shared storage array • NameNode and Standby NameNode servers: the active and standby nodes should have equivalent hardware • Shared storage array: a shared directory to which both the active and standby node servers have read/write access; the shared storage array supports NFS and is mounted on each node; currently only a single shared edits directory is supported; the availability of the system rests on the redundancy of the shared edits directory, with multiple network paths to the storage and redundancy in the storage itself (disk, network, and power); a high-quality dedicated storage array is recommended • (Diagram: active and standby NameNode servers, each with 2 CPUs of 6 cores, 96GB RAM, 6 x 600GB 15K HDD in RAID10 and 2 x 1GbE ports, attached over SAS 6Gbps to a shared storage array with dual active/active controllers and 12 x 600GB 15K RPM disks in RAID10) * High Availability
    • 44 Big Data Analytics with Hadoop HA* WITH JOURNAL MANAGER AND ZOOKEEPER • NameNode and Standby NameNode • Automatic failover with Zookeeper: quorum, ZKFC (ZKFailoverController) • Quorum Journal Manager for reliable edit log storage • NameNode and Standby NameNode servers: the active and standby nodes should have equivalent hardware • Quorum Journal Manager: in order for the standby node to keep its state synchronized with the active node, both nodes communicate with a group of separate daemons called "JournalNodes" • Zookeeper: a highly available service for maintaining small amounts of coordination data, notifying clients of changes in that data, and monitoring clients for failures; Zookeeper detects a failure of the active node server and automatically activates the standby node server • (Diagram: active and standby NameNode servers, each with 2 CPUs of 6 cores, 96GB RAM, 6 x 600GB 15K HDD in RAID10, 2 x 1GbE ports, and HA software running 3 JournalNode and 3 Zookeeper daemons) * High Availability
    • 45 Big Data Analytics with Hadoop HA* WITH DISASTER RECOVERY • DistCp: tool for parallelized copying of large amounts of data • Large inter-cluster copy • Based on MapReduce • (Diagram: parallelized data replication between Hadoop cluster #1 / namespace #1 at the primary site and Hadoop cluster #2 / namespace #2 at the disaster recovery site) • Moving an Elephant: Large Scale Hadoop Data Migration at Facebook http://www.facebook.com/notes/paul-yang/moving-an-elephant-large-scale-hadoop-data-migration-at-facebook/10150246275318920 * High Availability
    • 46 Big Data Analytics with Hadoop SECURITY • HDFS, KERBEROS & ACTIVE DIRECTORY - Files permissions like Unix (owner, group, mode) - User identity: simple or super-user - Kerberos connectivity: users authenticate to the edge of the cluster with Kerberos; user and group access is maintained in cluster-specific access control lists - Microsoft Active Directory connectivity • KNOX - Knox is a system that provides a single point of authentication and access for Apache Hadoop services in a cluster - Knox simplifies Hadoop security for users who access the cluster data and execute jobs, and for operators who control access and manage the cluster - Knox runs as a server or cluster of servers that serve one or more Hadoop clusters
    • 47 Big Data Analytics with Hadoop BACKUP • Block replication is not a form of backup • HDFS Snapshots • When a file is deleted by a user or an application, it is not immediately removed from HDFS – The file is moved to the /trash directory – The file can be restored quickly as long as it remains in /trash – A file remains in /trash for a configurable amount of time • When a file is corrupted, a restore from backup is necessary – Create incremental backups of HDFS files – Use a time-stamp on the backup files – Use a staging area to store the backup files – Move the backup files to tape if necessary • Backing up with the DistCp or copyToLocal commands (see the sketch below)
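
A hedged sketch of two of the backup building blocks mentioned above, HDFS snapshots and copying files out of the cluster to a staging area, via the Java FileSystem API; it assumes an administrator has already made the directory snapshottable, and all paths and names are illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBackupExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Point-in-time snapshot of a directory that an admin has made
        // snapshottable (e.g. with "hdfs dfsadmin -allowSnapshot /data/warehouse").
        fs.createSnapshot(new Path("/data/warehouse"), "backup-2014-01-31");

        // Copy a file out of the cluster to a local staging area,
        // for example before moving it to tape.
        fs.copyToLocalFile(new Path("/data/warehouse/part-00000"),
                           new Path("/backup/staging/part-00000"));

        fs.close();
    }
}
```
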
    • 48 Big Data Analytics with Hadoop ARCHIVING • Hadoop Archives, or HAR files, are a file archiving facility that packs files into HDFS blocks more efficiently • Reduce the NameNode memory usage while still allowing transparent access to files • An effective solution for the small files problem – http://developer.yahoo.com/blogs/hadoop/posts/2010/07/hadoop_archive_file_compaction/ – http://www.cloudera.com/blog/2009/02/the-small-files-problem/ • Archives are immutable: rename, delete and create return an error • A Hadoop Archive is exposed as a file system; MapReduce can use all the logical input files in Hadoop Archives as input • (Diagram: HAR file layout with a master index, an index and the data)
    • 49 Big Data Analytics with Hadoop DATA COMPRESSION • LZO: https://github.com/toddlipcon/hadoop-lzo – With compression enabled, the store file applies a compression algorithm to blocks as they are written (during flushes and compactions), so blocks must be decompressed when read – Compression reduces the number of bytes written to and read from HDFS – Compression effectively improves the efficiency of network bandwidth and disk space – Compression reduces the size of data that needs to be read when issuing a read • SNAPPY: Hadoop-Snappy is a project for Hadoop that provides access to Snappy compression (http://code.google.com/p/snappy/) – Hadoop-Snappy can be used as an add-on for recent (released) versions of Hadoop that do not yet provide Snappy codec support – Hadoop-Snappy is kept in sync with Hadoop Common – Snappy is a compression/decompression library; it does not aim for maximum compression, or for compatibility with any other compression library (see the sketch below)
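
A small sketch of enabling Snappy compression on a MapReduce job, both for the intermediate map output (less shuffle traffic) and for the final output written to HDFS; it assumes the native Snappy library is installed on the cluster nodes, and the helper method name is illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedOutputJob {
    public static void configureCompression(Job job) {
        Configuration conf = job.getConfiguration();

        // Compress intermediate map output to cut shuffle traffic.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                      SnappyCodec.class, CompressionCodec.class);

        // Compress the final job output written to HDFS.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
    }
}
```
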
    • 50 Big Data Analytics with Hadoop • Advanced Key Management - Stores keys separate from the encrypted data • Transparent Data Encryption - Protects data at rest resulting in minimal performance impact • Process Based Access Controls - Restricts access to specific processes rather than by OS user • Encrypt and Decrypt Unstructured Data - Secures sensitive data that could be considered damaging if exposed outside the business • Automation Tools - Rapid distributed deployment from ten to thousands of data nodes DATA ENCRYPTION Gazzang Source: http://www.gazzang.com/solutions/securing-big-data/apache-hadoop
    • 51 Big Data Analytics with Hadoop HDFS OVER NFS & CIFS PROTOCOLS • Make an HDFS filesystem available across networks as an exported share • Over NFS – Install a FUSE server – Mount the HDFS filesystem over FUSE – Export the HDFS filesystem • Over CIFS – Install a FUSE server – Install a SAMBA server on the FUSE server – Mount the HDFS filesystem over SAMBA – Export the HDFS filesystem • FUSE is a framework which makes it possible to implement a filesystem in a userspace program; features include a simple yet comprehensive API, secure mounting by non-root users, and multi-threaded operation • SAMBA is a suite of Linux applications that speak the SMB (Server Message Block) protocol; many operating systems, including Windows, use SMB to perform client-server networking
    • 52 Big Data Analytics with Hadoop GANGLIA MONITORING • Ganglia by itself is a highly scalable cluster monitoring tool; it provides visual information on the state of individual machines in a cluster, or summary information for a cluster or sets of clusters, and the ability to view different time windows • 2 daemons: GMOND & GMETAD • GMOND collects or receives metric data on each DataNode • 1 GMETAD per grid polls 1 GMOND per cluster for data • A node belongs to a cluster; a cluster belongs to a grid • (Diagram: a grid of clusters, each node running gmond, a gmetad polling one gmond per cluster with failover, and an Apache web frontend serving web clients) Source: http://ganglia.sourceforge.net/
    • 53 Big Data Analytics with Hadoop NAGIOS MONITORING • A system and network monitoring application • Nagios is an open source tool especially developed to monitor hosts and services, designed to report network incidents before end users and clients notice them • Nagios watches the hosts and services we specify and alerts when things go bad and when they recover • Initially developed for server and application monitoring, it is widely used to monitor hosts and network availability • (Diagram: users reach a common graphical interface (Apache, CGI) backed by Nagios plugins and a database that monitor the Hadoop infrastructure) Source: http://www.nagios.org/
    • 54 Big Data Analytics with Hadoop CROWBAR SOFTWARE FOR HADOOP • A modular, open source framework that accelerates multi-node deployments, simplifies maintenance, and streamlines ongoing updates • Deploy a Hadoop cluster in hours instead of days • Use or build barclamps to install and configure software modules • Built on Opscode Chef Server capabilities • Supports a cloud operations model to interact, modify, and build based on changing needs • Download the open source software: https://github.com/dellcloudedge/crowbar • Active community: http://lists.us.dell.com/mailman/listinfo/crowbar • Resources on the Wiki: https://github.com/dellcloudedge/crowbar/wiki
    • 55 Big Data Analytics with Hadoop LOGICAL DATA WAREHOUSE WITH HADOOP • (Diagram: data sources (files, web data, RDBMS) feed, via data transfer and data integration, an unstructured and structured data warehouse on the NoSQL side (MPP, NoSQL engine, distributed file systems, share-nothing architecture, algorithms) and a structured data warehouse on the SQL side (MPP, in-memory, column databases, SQL engine, share-nothing or shared-disk architecture via SAN, e.g. Oracle Exadata, SAP HANA, IBM Netezza, Teradata, EMC Greenplum, Microsoft PDW, HP Vertica…); data quality and master data management span both; consumers include engineers (development), data scientists (data modeling), administrators (data management), analysts (BI/analytics, exploration, visualization), business users (activity reporting) and mobile clients (mobile apps))
    • 56 Big Data Analytics with Hadoop INFRASTRUCTURE RECOMMENDATIONS
      General
      • Determine the volume of data that needs to be stored and processed
      • Determine the rate at which the data is expected to grow
      • Determine when the cluster needs to grow, and whether there will be a need for additional processing and storage
      • For greater power efficiency and higher ROI over time, choose machines with more capacity; this helps reduce the frequency of new machines being added
      • For high availability, 2 Admin servers in different racks are recommended
      • For high availability, a minimum of 2 EdgeNodes in different racks is recommended
      Storage
      • The number of disks should be based on the amount of raw data required
      • Check the rate of growth of data and try to avoid adding new machines every year; for example, depending on the net new data per year, it may be worthwhile using 12 hard drives per server rather than 6, to accommodate a larger amount of new data in the existing cluster
      DataNode
      • Each DataNode runs a TaskTracker daemon
      • 2 CPUs of 6 cores are recommended for each DataNode in most cases
      • To increase I/O performance use SAS 15K RPM disks; otherwise SATA/SAS NL 7.2K RPM disks at a better price are sufficient
      • Using RAID is not recommended; a JBOD configuration is required, since HDFS provides built-in redundancy by replicating blocks across multiple nodes (a x3 replication factor is recommended)
      • 48GB RAM per server is recommended in most cases
      • For tmp, log, etc., add 20% to the usable disk space
      • The ratio between usable data and raw data is 3.6
      NameNode
      • The NameNode runs a JobTracker daemon
      • A copy of the NameNode metadata is stored on a separate machine; losing the NameNode metadata would mean all data in HDFS is lost, so use the Standby NameNode for high availability
      • The NameNode is not commodity hardware and needs sufficient RAM and disk performance; the amount of RAM allocated to the NameNode limits the size of the cluster
      • Having plenty of extra NameNode memory is highly recommended, so that the cluster can grow without having to add more memory to the NameNode, which would require a restart
      • 96GB of RAM per server is recommended in most cases for a large cluster
      • Using RAID10 and 15K RPM disks is highly recommended
      • The NameNode and the secondary NameNode have the same server configuration
      Network
      • Use a 10GbE switch per rack according to the performance requirements of the Hadoop cluster
      • Use low-latency 10GbE switches across multiple racks
      • For high availability, 2 network switches on top of each rack are recommended
    • 57 Big Data Analytics with Hadoop INFRASTRUCTURE SIZING • Add 1 or 2 control racks (24RU) integrating NameNode, Secondary NameNode, EdgeNodes and AdminNodes, depending on the required availability of the Hadoop cluster
      - Replication_blocks: number of replication blocks; 3 recommended (example: 3)
      - Usable_data_volume: data source volume (business data) (example: 400TB)
      - Temp_data_volume_ratio: tmp, log…; 20% of Usable_data_volume (example: 1.2)
      - Data_compression_ratio: data compression ratio (example: 0.6)
      - Raw_data_volume: Usable_data_volume x Replication_blocks x Temp_data_volume_ratio x Data_compression_ratio (example: 400 x 3 x 1.2 x 0.6 = 864TB)
      - Rack_units_per_rack: number of rack units in one rack (example: 42RU)
      - Rack_units_switch: number of rack units for the network switches in one rack (example: 2RU)
      - Rack_units_per_DataNode: number of rack units for one DataNode (example: 2RU)
      - DataNode_volume: raw data volume in one DataNode (example: 12 x 2TB disks = 24TB)
      - DataNodes: number of DataNodes = Raw_data_volume / DataNode_volume (example: 864 / 24 = 36 DataNodes)
      - Data_racks: number of data racks = (DataNodes x Rack_units_per_DataNode) / (Rack_units_per_rack – Rack_units_switch) (example: (36 x 2) / (42 – 2) = 2 data racks); see the sizing sketch below
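
A tiny sketch that simply applies the sizing formulas from the table above; the input values are the example figures (400TB usable data, x3 replication, 20% temp overhead, 0.6 compression ratio, 24TB per DataNode):

```java
public class HadoopSizingEstimate {
    public static void main(String[] args) {
        // Inputs taken from the sizing table above.
        double usableDataTb       = 400;   // business data to store
        int    replicationFactor  = 3;     // HDFS block replication
        double tempDataRatio      = 1.2;   // +20% for tmp, logs, ...
        double compressionRatio   = 0.6;   // effective size after compression
        double dataNodeCapacityTb = 24;    // 12 x 2TB disks per DataNode
        int    ruPerRack = 42, ruPerSwitch = 2, ruPerDataNode = 2;

        double rawDataTb = usableDataTb * replicationFactor * tempDataRatio * compressionRatio;
        int dataNodes    = (int) Math.ceil(rawDataTb / dataNodeCapacityTb);
        int dataRacks    = (int) Math.ceil(dataNodes * ruPerDataNode
                                           / (double) (ruPerRack - ruPerSwitch));

        // Example figures: 400 x 3 x 1.2 x 0.6 = 864 TB -> 36 DataNodes -> 2 data racks
        System.out.printf("Raw data: %.0f TB, DataNodes: %d, Data racks: %d%n",
                          rawDataTb, dataNodes, dataRacks);
    }
}
```
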
    • 58 Big Data Analytics with Hadoop HADOOP ARCHITECTURE 2 x EdgeNode • 2 CPU 6 core • 96 GB RAM • 6 x HDD 600GB 15K (Raid10) • 2 x 10GbE Ports 3 to n x DataNode • 2 CPU 6 core • 48 GB RAM • 12 x HDD 3TB 7.5K • 2 x 10GbE Ports Network Switch 2 x NameNode / Standby NameNode • 2 CPU 6 core • 96 GB RAM • 6 x HDD 600GB 15K (Raid10) • 2 x 10GbE Ports 1 x AdminNode • 2 CPU 6 core • 48GB RAM • 6 x HDD 600GB 15K (Raid10) • 2 x 10GbE Ports Edge Nodes Control Nodes Worker Nodes Example
    • 59 Big Data Analytics with Hadoop RACKS CONFIGURATION OVERVIEW EdgeNodes NameNode DataNodes DataNodes DataNodes DataNodes Standby NameNode EdgeNodes Rack1 Rack2 Rack3 Rack4 Rack5 Rack6 Admin Node Admin Node Switches Switches Switches Switches Switches Switches Cloud Control Racks Data Racks
    • 60 Big Data Analytics with Hadoop POC CONFIGURATION • Architecture example • The exact configuration and sizing is designed depending on the customer’s needs • The AdminNode is on the Standby NameNode server • Zookeeper processes are on the NameNode and Standby NameNode servers 1 x EdgeNode • 2 CPU 6 core • 32 GB RAM • 6 x HDD 600GB 15K Raid10 • 2 x 10GbE Ports 3 x DataNode • 2 CPU 6 core • 48 GB RAM • 12 x HDD 1TB 7.2K • 2 x 10GbE Ports Network Switch 1 x NameNode 1 x Standby NameNode • 2 CPU 6 core • 96 GB RAM • 6 x HDD 600GB 15K Raid10 • 2 x 10GbE Ports Edge Nodes Control Nodes Worker Nodes Example
    • 61 Big Data Analytics with Hadoop HADOOP BENCHMARKS • Designing appropriate hardware for a Hadoop cluster requires benchmarking or a POC and careful planning to fully understand the workload; however, Hadoop clusters are commonly heterogeneous, and it is recommended to deploy initial hardware with balanced specifications when getting started • HiBench, a Hadoop benchmark suite constructed by Intel, is used intensively for Hadoop benchmarking, tuning & optimization • A set of representative Hadoop programs including both micro-benchmarks and more "real world" applications such as search, machine learning and Hive queries • Source: Intel Cloud Builder Guide to Cloud Design and Deployment on Intel Platform – Apache Hadoop - February 2012
    • 62 Big Data Analytics with Hadoop HIBENCH
      - Microbenchmarks / Sort: sorts its binary input data, which is generated using the Hadoop RandomTextWriter example
      - Microbenchmarks / WordCount: counts the occurrence of each word in the input data, which is generated using Hadoop RandomTextWriter
      - Microbenchmarks / TeraSort: a standard benchmark for large-size data sorting; the data is generated by the TeraGen program
      - Microbenchmarks / DFSIO: computes the aggregated bandwidth by sampling the number of bytes read/written at fixed time intervals in each map task
      - Web Search / Nutch Indexing: tests the indexing subsystem in Nutch, a popular Apache open-source search engine; the crawler subsystem in Nutch is used to crawl an in-house Wikipedia mirror and generates 8.4 GB of compressed data (about 2.4 million web pages) as workload input
      - Web Search / Page Rank: an open-source implementation of the page-rank algorithm, a link analysis algorithm used widely in web search engines
      - Machine Learning / K-Means Clustering: a typical application area of MapReduce for large-scale data mining and machine learning
      - Machine Learning / Bayesian Classification: tests the naive Bayesian trainer (a well-known classification algorithm for knowledge discovery and data mining) in Mahout, an Apache open-source machine-learning library
      - Analytical Query / Hive Join: models complex analytic queries of structured (relational) tables by computing both the average and sum for each group by joining two different tables
      - Analytical Query / Hive Aggregation: models complex analytic queries of structured (relational) tables by computing the sum of each group over a single read-only table
    • 63 Big Data Analytics with Hadoop HADOOP TERASORT WORKFLOW • Teragen is a utility included with Hadoop for creating the data sets that Terasort will use; it exploits the parallel framework within Hadoop to quickly create large data sets that can be manipulated, and the time to create a given data set is an important data point when tracking the performance of a Hadoop environment • Terasort benchmarks the HDFS and MapReduce functions of the Hadoop cluster; it is a compute-intensive operation that uses the Teragen output as its input, reads that data into the system's physical memory, sorts it and writes it back out to HDFS, exercising all portions of the Hadoop environment during these operations • Teravalidate is used to ensure the data produced by Terasort is accurate; it runs across the Terasort output, verifies all data is properly sorted with no errors, and reports the status of the results to the user • (Diagram: a control node starts Terasort, map tasks on n nodes create and sort the data, reduce tasks on n nodes combine the sorted results, and the control node completes the run)
    • THANK YOU