Big Data Analytics with Hadoop: Presentation Transcript
Open for Business…
2 Who Am I
• Big Data/Analytics & Cloud Solutions Specialist
• Social Media
  – http://www.linkedin.com/in/JulioPhilippe
  – http://twitter.com/PhilippeJulio
• My main professional areas of interest are
  – Big Data, Analytics, Business Intelligence, Data Warehousing
  – IT Strategy, IT Transformation
  – Next Generation Computing Solutions
  – Cloud Computing, Datacenter optimization
3 Big Data Analytics with Hadoop« Data don’t spring relevant,they become though ! »Data Management Insight
4 Data-Driven On-Line Websites
• To run the apps: messages, posts, blog entries, video clips, maps, web graph...
• To give the data context: friends networks, social networks, collaborative filtering...
• To keep the applications running: web logs, system logs, system metrics, database query logs...
5 Big Data – What Does it Mean?
• Volume: Big data comes in one size: large. Enterprises are awash with data, easily amassing terabytes and even petabytes of information. (TB, Records, Transactions, Tables, Files)
• Velocity: Often time-sensitive, big data must be used as it is streaming in to the enterprise in order to maximize its value to the business. (Batch, Near time, Real time, Streams)
• Variety: Big data extends beyond structured data, including semi-structured and unstructured data of all varieties: text, audio, video, click streams, log files and more. (Structured, Unstructured, Semi-structured)
• Veracity: Quality and provenance of received data. (Good, Bad, Undefined, Inconsistency, Incompleteness, Ambiguity)
• Value
6 Big Data Analytics with HadoopBig Data Thematic• Hardware and software technologiesfor managing big volumes of data• Focus on Web 2.0 Technologies• Database Scale-out• Relational Data Analytics• Distributed Data Analytics• Distributed File Systems• Real Time Analytics
7 Big Data Analytics with HadoopBig Data Applications Domains• Digital marketing optimization (e.g., web analytics,attribution, golden path analysis)• Data exploration and discovery (e.g., data scientists,identifying new data-driven products, new markets)• Fraud detection and prevention (e.g., revenue protection,site integrity & uptime)• Social network and relationship analysis (e.g., influencermarketing, outsourcing, attrition prediction)• Machine-generated data analytics (e.g., Remote deviceinsight, remote sensing, location-based intelligence)• Data retention (e.g. long term retention, Data archiving)
8 Business Drivers
• Telecommunications: a more reliable network where we can predict and prevent failure
• Finance: better and deeper understanding of risk to avoid credit crisis – Basel III
• Media: more content that is lined up with your personal preferences
• Government: government services that are based on hard data, not just gut
• Retail: a personal experience with products and offers that are just what you need
• Life Science: better targeted medicines with fewer complications and side effects
9 Big Data Analytics with HadoopTop 10 Big Data Sources1. Social network profiles2. Social influencers3. Activity-generated data4. SaaS & Cloud Apps5. Public web information6. MapReduce results7. Data warehouse appliances8. Columnar/NoSQL databases9. Network and in-stream monitoring technologies10.Legacy documents
10 Big Data Analytics with HadoopNew Data and Management EconomicsStorage TrendsNew Data StructureDistributed File Systems, NoSQL Database, NewSQL…)Compute TrendsNew Analytics(Massively Parallel Processing, MapReduce, Algorithms…)Proprietary and dedicateddata warehouseOLTP is thedata warehouseGeneral purposedata warehouseObjects storageDistributed File Systems Federated/ShardedMaster/MasterMaster/SlaveEnterprisedata warehouseMulti-structuredDataLogicaldata warehouseData Management
11 Distributed File Systems
• Systems that permanently store data
• Data is divided into logical units (files, shards, chunks, blocks…)
• A file path joins file and directory names into a relative or absolute address to identify a file
• Support access to files and remote servers
• Support concurrency
• Support distribution
• Support replication
• Examples: NFS, GPFS, Hadoop DFS, GlusterFS, MogileFS, MooseFS…
12 Big Data Analytics with HadoopUnstructured Data and Object Storage• Metadata values arespecific to eachindividual type• Enables automatedmanagement ofcontent• Ensure integrity,retention andauthenticityMetadata 101000101010100111010101100010110100…110010UIDVideoMetadata 111000101010100111010101100010110100…101010UIDImageMetadata 101100101010100111010101100010110100…110111UIDAudioMetadata 101011101010100111010101100010111110…110011UIDTextUnstructured Data + Metadata = Object Storage
13 NoSQL Database Categories
• NoSQL = Not only SQL
• Popular name for a subset of structured storage software that is designed with the intention of delivering increased optimization for high-performance operations on large datasets
• Basically available, scalable, eventually consistent
• Easy to use
• Tolerant of scale by way of horizontal distribution
• Key-Value: Redis, Riak (Basho), CouchBase, Voldemort (LinkedIn), MemcacheDB…
• Column: BigTable (Google), HBase, Cassandra (DataStax), Hypertable…
• Document: MongoDB (10Gen), CouchDB, Terrastore, SimpleDB (AWS)…
• Graph: Neo4j (Neo Technology), Jena, InfiniteGraph (Objectivity), FlockDB (Twitter)…
14 Big Data Analytics with HadoopWhat Is Hadoop ?“Flexible and availablearchitecture for large scalecomputation and dataprocessing on a network ofcommodity hardware”Open Source Software + Hardware Commodity= IT Costs Reduction
15 Big Data Analytics with HadoopWhat Is Hadoop used for ?• Searching• Log processing• Recommendation systems• Analytics• Video and Image analysis• Data Retention
16 Big Data Analytics with HadoopWho Used Hadoop ?• Top level Apache Foundation project• Large, active user base, mailing lists, usergroups• Very active development, strongdevelopment team• http://wiki.apache.org/hadoop/PoweredBy#L
17 Big Data Analytics with HadoopInfrastructure as a ServiceGeneral Purpose Storage Servers• Combine server with disks & networking• Specialized software enables general purpose systems designs to providehigh performance data servicesData moves to the infrastructureApplicationData ServicesMetadata MgntStorageLegacyApplicationData ServicesMetadata MgntStorageEmergingApplicationData ServicesMetadata MgntStorageFuture
18 Big Data Analytics with HadoopBig Data ArchitectureBI & DWH Architecture - Conventional• SQL based• High availability• Enterprise database• Right design for structured data• Current storage hardware (SAN, NAS, DAS)Analytics Architecture – Next Generation• Not only SQL based• High scalability, availability and flexibility• Compute and storage in the same boxfor reducing the network latency• Right design for semi-structured andunstructured dataDataNodesNetworkSwitchesEdgeNodesAppServersDatabaseServersNetworkSwitchesSANSwitchStorageArray
19 Big Data Analytics with HadoopHadoop EcosystemHardwareInstallation/DeploymentOperating System (RedHat, Suse, Ubuntu, Windows)Java Virtual MachineSecurityManagement/MonitoringBackup/Archiving/RecoveryData AccessOrchestrationOozieSqoopFlumeClient AccessHueHivePigData MiningMahoutData Storage Data ProcessingHDFS MapReduceNetworkCoordinationZooKeeperHBaseDevelopmentMetadataServices
20 Big Data Analytics with HadoopHadoop Common• Hadoop Common is a set of utilities thatsupport the Hadoop subprojects.• Hadoop Common includes Filesystem,RPC, and serialization libraries.
21 HDFS & MapReduce
• Hadoop Distributed File System
  - A scalable, fault-tolerant, high-performance distributed file system
  - Asynchronous replication
  - Write-once and read-many (WORM)
  - Hadoop cluster with 3 DataNodes minimum
  - Data divided into 64MB (default) or 128MB blocks, each block replicated 3 times (default)
  - No RAID required for DataNode
  - Interfaces: Java, Thrift, C Library, FUSE, WebDAV, HTTP, FTP
  - NameNode holds filesystem metadata
  - Files are broken up and spread over the DataNodes
• Hadoop MapReduce
  - Software framework for distributed computation
  - Input | Map() | Copy/Sort | Reduce() | Output
  - JobTracker schedules and manages jobs
  - TaskTracker executes individual map() and reduce() tasks on each cluster node
Diagram: Master Node, Worker Nodes
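The "Input | Map() | Copy/Sort | Reduce() | Output" pipeline can be illustrated with a single-process word count. This is only a sketch of the data flow, not Hadoop's API: the function names (`map_phase`, `shuffle_sort`, `reduce_phase`, `word_count`) are hypothetical, and a real job would run the map and reduce steps as distributed tasks.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map(): emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_sort(pairs):
    """Copy/Sort: group all values by key and return the groups in key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce(): collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the whole pipeline in one process (illustration only)."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return [reduce_phase(key, values) for key, values in shuffle_sort(mapped)]


if __name__ == "__main__":
    print(word_count(["big data", "big cluster"]))
```

In a cluster, each map call would run on the node holding the input block, and the Copy/Sort step would move intermediate pairs across the network to the reducers.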
22 HDFS - Read File
1. The client API calculates the block index based on the offset of the file pointer and makes a request to the NameNode
2. The NameNode replies with the DataNodes that hold a copy of that block
3. The client contacts the DataNodes directly, without going through the NameNode
4. The DataNodes read the blocks
5. The DataNodes report success to the client
Diagram: Client (EdgeNode), NameNode, DataNodes; flow: Read File, 1. Calculate block index, 2., 3. Contact DataNode, 4. Read, 5., End
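Step 1 of the read path is plain integer arithmetic on the file offset. The sketch below assumes the 64MB default block size quoted earlier in the deck; `block_index` and `blocks_for_range` are hypothetical helper names, not part of the HDFS client API.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # assumption: the 64MB default block size


def block_index(offset, block_size=BLOCK_SIZE):
    """Return the index of the block that contains the given byte offset."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return offset // block_size


def blocks_for_range(offset, length, block_size=BLOCK_SIZE):
    """Return the indices of every block touched by reading `length` bytes."""
    if length <= 0:
        return []
    first = block_index(offset, block_size)
    last = block_index(offset + length - 1, block_size)
    return list(range(first, last + 1))


if __name__ == "__main__":
    # A 2-byte read that straddles the boundary between block 0 and block 1.
    print(blocks_for_range(BLOCK_SIZE - 1, 2))
```

The client would then ask the NameNode which DataNodes hold each of these block indices.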
23 HDFS – Write File
1. The client contacts the NameNode, which designates one of the replicas as the primary
2. The NameNode's response identifies the primary and the secondary replicas
3. The client pushes its changes to all DataNodes in any order; each DataNode stores the change in a buffer
4. The client sends a “commit” request to the primary, which determines an update order and pushes this order to all the secondaries
5. After all secondaries complete the commit, the primary reports success to the client
6. All changes to block distribution and metadata are written to an operation log file at the NameNode
Diagram: Client (EdgeNode), NameNode, Primary DataNode, DataNodes; flow: Write File, 1. Select primary replica, 2., 3. Push to DataNode, 4. Commit, 5. Write, 6. Write metadata, End
24 MapReduce – Exec File
The client program is copied to each node
1. The JobTracker determines the number of splits from the input path and selects some TaskTrackers based on their network proximity to the data sources
2. The JobTracker sends the task requests to the selected TaskTrackers
3. Each TaskTracker starts the map phase processing by extracting the input data from the splits
4. When a map task completes, the TaskTracker notifies the JobTracker. When all the TaskTrackers are done, the JobTracker notifies the selected TaskTrackers for the reduce phase
5. Each TaskTracker reads the region files remotely and invokes the reduce function, which collects the key/aggregated value into the output file (one per reducer node)
6. After both phases complete, the JobTracker unblocks the client program
Diagram: Client (EdgeNode), JobTracker, TaskTrackers; flow: Exec File, 1. Select TaskTrackers from splits, 2. Send request to TaskTracker, 3. Map, 4. End, 5. Reduce, 6. End
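Steps 1 and 2 can be sketched as a small scheduling routine. This is a deliberately simplified illustration, not the JobTracker's actual algorithm: it assumes fixed-size splits and reduces "network proximity" to "the TaskTracker runs on a host that stores the split". The names `number_of_splits` and `assign_tasks` are hypothetical.

```python
import math


def number_of_splits(file_size, split_size):
    """Number of fixed-size input splits needed to cover the file."""
    return math.ceil(file_size / split_size)


def assign_tasks(split_locations, tasktrackers):
    """Assign each split to a TaskTracker, preferring hosts that store it.

    split_locations maps a split id to the set of hosts holding its data.
    When no TaskTracker is local, fall back to the least-loaded one.
    """
    load = {tracker: 0 for tracker in tasktrackers}
    assignment = {}
    for split_id, hosts in sorted(split_locations.items()):
        local = [tracker for tracker in tasktrackers if tracker in hosts]
        candidates = local or list(tasktrackers)
        chosen = min(candidates, key=lambda tracker: load[tracker])
        assignment[split_id] = chosen
        load[chosen] += 1
    return assignment


if __name__ == "__main__":
    locations = {0: {"a", "b"}, 1: {"a"}, 2: {"c"}}
    print(number_of_splits(200, 64))
    print(assign_tasks(locations, ["a", "b"]))
```

Split 2 has no local TaskTracker in this example, so it goes to whichever tracker currently has the fewest tasks.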
25 HBase
• It's not a relational database (no joins)
• Sparse data – nulls are stored for free
• Semi-structured or unstructured data
• Data changes through time
• Versioned data
• Scalable – goal of billions of rows x millions of columns
• Clone of BigTable (Google)
• Implemented in Java (clients: Java, C++, Ruby...)
• Data is stored “column-oriented”
• Distributed over many servers
• Tolerant of machine failure
• Layered over HDFS
• Strong consistency
(Table, Row_Key, Family, Column, Timestamp) = Cell (Value)
Example region (column families Animal and Repair):
Row Key    | Timestamp | Animal:Type | Animal:Size | Repair:Cost
Enclosure1 | 12        | Zebra       | Medium      | 1000€
Enclosure1 | 11        | Lion        | Big         |
Enclosure2 | 13        | Monkey      | Small       | 1500€
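The mapping (Table, Row_Key, Family, Column, Timestamp) = Cell (Value) behaves like a sparse, versioned, nested map. The toy in-memory model below only illustrates that idea under that assumption. `SparseTable` is a hypothetical class, not the HBase client API, and the sample cells are a reading of the zoo example on the slide.

```python
from collections import defaultdict


class SparseTable:
    """Toy model of a (row, family, column, timestamp) -> value store."""

    def __init__(self):
        # cells[row][(family, column)] is a dict of {timestamp: value}.
        # Absent cells take no space, which is why nulls are "free".
        self._cells = defaultdict(lambda: defaultdict(dict))

    def put(self, row, family, column, timestamp, value):
        """Store one version of a cell."""
        self._cells[row][(family, column)][timestamp] = value

    def get(self, row, family, column, timestamp=None):
        """Return the newest version at or before `timestamp` (or newest overall)."""
        versions = self._cells.get(row, {}).get((family, column), {})
        if timestamp is not None:
            versions = {ts: v for ts, v in versions.items() if ts <= timestamp}
        if not versions:
            return None
        return versions[max(versions)]


if __name__ == "__main__":
    table = SparseTable()
    table.put("Enclosure1", "Animal", "Type", 11, "Lion")
    table.put("Enclosure1", "Animal", "Type", 12, "Zebra")
    table.put("Enclosure1", "Repair", "Cost", 12, "1000€")
    print(table.get("Enclosure1", "Animal", "Type"))
    print(table.get("Enclosure1", "Animal", "Type", timestamp=11))
    print(table.get("Enclosure1", "Repair", "Cost", timestamp=11))
```

Reading with an explicit timestamp returns the state of the cell as of that version, which is how the model captures data that changes through time.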
26 HBase
• Table
  - Regions for scalability, defined by row [start_key, end_key)
  - Store for efficiency, 1 per Family
  - 1..n StoreFiles (HFile format on HDFS)
• Everything is bytes
• Rows are ordered sequentially by key
• Special tables -ROOT- and .META.
  - Tell clients where to find user data
http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
27 Hive
• Data warehouse infrastructure that provides data summarization and ad hoc querying on top of Hadoop
  - MapReduce for execution
  - HDFS for storage
• MetaStore
  - Table/partition properties
  - Thrift API: current clients in PHP (web interface), Python (interface to Hive), Java (query engine and CLI)
  - Metadata stored in any SQL backend
• Hive Query Language
  - Basic SQL: Select, From, Join, Group By
  - Equi-Join, Multi-Table Insert, Multi-Group-By
  - Batch query
28 Big Data Analytics with Hadoop• A high-level data-flowlanguage and executionframework for parallelcomputation• Simple to write MapReduceprogram• Abstracts you from specificdetail• Focus on data processing• Data flow• For data manipulationPIG Language ExamplePig
29 Sqoop
• Data import/export
• Sqoop is a tool designed to help users import large data sets from existing relational databases into their Hadoop clusters
• Automatic data import
• Easy import of data from many databases to Hadoop
• Generates code for use in MapReduce applications
Diagram: Relational Database, Hadoop Cluster
30 ZooKeeper
• A high-performance coordination service for distributed applications
• ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services
• All servers store a copy of the data
• A leader is elected at startup
• Followers service clients; all updates go through the leader
• Update responses are sent when a majority of servers have persisted the change
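The "majority of servers" rule reduces to simple arithmetic on the ensemble size. The helpers below are a hypothetical illustration of that rule only. They are not ZooKeeper code and do not model leader election or the update protocol itself.

```python
def quorum_size(ensemble_size):
    """Smallest number of servers that forms a strict majority."""
    if ensemble_size < 1:
        raise ValueError("ensemble must contain at least one server")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many servers can fail while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)


def is_committed(acks, ensemble_size):
    """True once enough servers have persisted the change."""
    return acks >= quorum_size(ensemble_size)


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(n, quorum_size(n), tolerated_failures(n))
```

An ensemble of 4 tolerates no more failures than an ensemble of 3, which is why odd ensemble sizes are the usual choice.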
31 Big Data Analytics with HadoopFlume• Apache Flume is a distributed,reliable, and available service forefficiently collecting,aggregating, and moving largeamounts of log data• Simple and flexible architecturebased on streaming data flows• Robust fault tolerant withtunable reliability mechanismsand many failover and recoverymechanisms• The system is centrallymanaged and allows forintelligent dynamicmanagement
32 Oozie
• Oozie is a server-based Bundle Engine that provides a higher-level Oozie abstraction that batches a set of coordinator applications. The user can start/stop/suspend/resume/rerun a set of coordinator jobs at the bundle level, resulting in better and easier operational control
• Oozie is a server-based Coordinator Engine specialized in running workflows based on time and data triggers. It can continuously run workflows based on time (e.g. run it every hour) and data availability (e.g. wait for my input data to exist before running my workflow)
• Oozie is a server-based Workflow Engine specialized in running workflow jobs with actions that execute Hadoop Map/Reduce and Pig jobs
33 Hue
• HUE (aka Hadoop User Experience)
• Open source project started at Cloudera
• HUE is a web UI for Hadoop
• Platform for building custom applications with a nice UI library
• User Admin: Account management for HUE users.
• File Browser: Browse HDFS; change permissions and ownership; upload, download, view and edit files.
• Job Designer: Create MapReduce jobs, which can be templates that prompt for parameters when they are submitted.
• Job Browser: View jobs, tasks, counters, logs, etc.
• Beeswax: Wizards to help create Hive tables, load data, run and manage Hive queries, and download results in Excel format.
• Help: Documentation and help
34 Big Data Analytics with HadoopMahout• Apache Mahout is an Apache project to produce freeimplementations of distributed or otherwise scalable machinelearning algorithms on the Hadoop platform• Mahout machine learning algorithms:– Recommendation mining, takes users’ behavior and find items saidspecified user might like– Clustering, takes e.g. text documents and groups them based on relateddocument topics– Classification, learns from existing categorized documents what specificcategory documents look like and is able to assign unlabeled documentsto the appropriate category– Frequent item set mining, takes a set of item groups (e.g. terms in aquery session, shopping cart content) and identifies, which individualitems typically appear together
35 Avro
• A data serialization system that provides dynamic integration with scripting languages
• Avro Data
  - Expressive
  - Smaller and faster
  - Dynamic
  - Schema stored with data
  - APIs permit reading and creating
  - Includes a file format and a textual encoding
• Avro RPC
  - Leverages versioning support
  - Provides cross-language access for Hadoop services
36 Chukwa
• A data collection system for managing large distributed systems
• Built on HDFS and MapReduce
• Toolkit for displaying, monitoring and analyzing the log files
37 Big Data Analytics with HadoopWhirr• Apache Whirr is a set of libraries for running cloudservices• A cloud-neutral way to run services• A common service API. The details of provisioningare particular to the service. Talk to cloud provider• Smart defaults for services• Provision, Install, Configure and Manage• Deploy clusters on demand for processing or fortesting. Ideal if you are building applications on topof components of the Hadoop stack• Command line for deploying clusters
38 Big Data Analytics with HadoopAmbari• Ambari is a web-based set of tools for deploying, administeringand monitoring Apache Hadoop clusters• Easily Install a Hadoop Cluster– Ambari provides an easy-to-use, step-by-step wizard for installingHadoop services across any number of nodes.– Ambari leverages Puppet to perform installation and configuration ofHadoop services for the cluster.• Manage a Hadoop Cluster– Ambari provides central management for starting, stopping, andreconfiguring Hadoop services across the entire cluster.• Monitor a Hadoop Cluster– Ambari provides a dashboard for monitoring health and status of theHadoop cluster. Ambari leverages Ganglia to collect system metrics.– Ambari sends email alerts when your attention is needed (e.g., a nodegoes down, remaining disk space is low…). Ambari leverages Nagios tomonitor and trigger alerts
39 HCatalog
• Table and storage management service for data created using Apache Hadoop
• Provides a shared schema and data type mechanism
• Provides a table abstraction so that users need not be concerned with where or how their data is stored
• Provides interoperability across data processing tools such as Pig, MapReduce and Hive
• HCatalog DDL (Data Definition Language)
• HCatalog CLI (Command Line Interface)
40 Big Data Analytics with HadoopTempleton• Templeton provides a REST-like web API for HCatalogand related Hadoopcomponents• Developers make HTTPrequests to access Hadoop,MapReduce, Pig, Hive andHCatalog DDL from withinapplications• Data and code used by Templeton is maintained in HDFS• HCatalog DDL commands are executed directly when requested.MapReduce, Pig, and Hive jobs are placed in queue by Templeton• Can be monitored for progress or stopped as required• Developers specify a location in HDFS into which Templeton shouldplace Pig, Hive, and MapReduce results.
41 Big Data Analytics with HadoopHigh Availability Solutions• Each block is replicated 3 times (recommended)• Secondary NameNode• Disaster Recovery with DistCpNameNodeJobTrackerDataNodeTaskTrackerDataNodeTaskTrackerDataNodeTaskTrackerName Node(active)WorkerNodeWorkerNodeWorkerNodeNameNodeJobTrackerSecondary NN(passive)
42 Big Data Analytics with HadoopHA - Secondary NameNode• NameNode and Secondary NameNode Servers: the servers on which you run theActive and Standby NameNodes should have equivalent hardware to each other, andequivalent hardware to what would be used in a non-HA cluster.• Shared Storage Array: you will need to have a shared directory which both NameNodemachines can have read/write access to. Typically this is a remote filer which supportsNFS and is mounted on each of the NameNode machines. Currently only a singleshared edits directory is supported. Thus, the availability of the system is limited by theavailability of this shared edits directory, and therefore in order to remove all singlepoints of failure there needs to be redundancy for the shared edits directory.Specifically, multiple network paths to the storage, and redundancy in the storage itself(disk, network, and power). Because of this, it is recommended that the shared storageserver be a high-quality dedicated storage array rather than a simple Linux server.PassiveNodeActiveNodeSAS (6Gbps)Share Storage Arraywith dual active/activecontrollers12 x 600GB HD 15K RPM (Raid10)Network NetworkSAS (6Gbps)Shared Storage ArrayNameNode Server• 2 CPU 6 core• 96GB RAM• 6 x HDD 600GB 15K(Raid10)• 2 x 1GbE PortsSecondary NameNode Server• 2 CPU 6 core• 96GB RAM• 6 x HDD 600GB 15K (Raid10)• 2 x 1GbE Ports
43 Big Data Analytics with HadoopHA - Disaster Recovery• DistCp: tool for parallelized copying of large amounts of data• Large inter-cluster copy• Based on MapReduceHadoop Cluster #1NameSpace #1Hadoop Cluster #2NameSpace #2Parallelized Data ReplicationDisaster Recovery SitePrimary SiteMap ( )Racks RacksMoving an Elephant: Large Scale Hadoop Data Migration at Facebookhttp://www.facebook.com/notes/paul-yang/moving-an-elephant-large-scale-hadoop-data-migration-at-facebook/10150246275318920
44 Big Data Analytics with HadoopHadoop Security• Files permissions–Files permissions like Unix (owner, group, mode)• User identity–Simple–Super-user• Kerberos connectivity–Users authenticate to the edge of the cluster with Kerberos(via GSSAPI)• Users and group access is maintained in cluster specificaccess control lists• Microsoft Active Directory connectivity
45 Hadoop Archiving
• Hadoop Archives, or HAR files, are a file archiving facility that packs files into HDFS blocks more efficiently
• Reduce the NameNode memory usage while still allowing transparent access to files
• An effective solution for the small files problem
  – http://developer.yahoo.com/blogs/hadoop/posts/2010/07/hadoop_archive_file_compaction/
  – http://www.cloudera.com/blog/2009/02/the-small-files-problem/
• Archives are immutable
• Renames, deletes and creates return an error
• Hadoop Archives are exposed as a file system, so MapReduce can use all the logical input files in a Hadoop Archive as input
46 Big Data Analytics with HadoopHadoop Compression - LZO• LZO Compression:https://github.com/toddlipcon/hadoop-lzo• By enabling compression, the store file uses acompression algorithm on blocks as they are written(during flushes and compactions) and thus must bedecompressed when reading• Compression reduces the number of bytes writtento/read from HDFS• Compression effectively improves the efficiency ofnetwork bandwidth and disk space• Compression reduces the size of data needed to beread when issuing a read
47 Big Data Analytics with HadoopHadoop Compression - Snappy• Hadoop Snappy is a project for Hadoop that provide access tothe snappy compression. http://code.google.com/p/snappy/• This project is integrated into Hadoop Common (Jun. 2011)• Hadoop-Snappy can be used as an add-on for recent(released) versions of Hadoop that do not provide SnappyCodec support yet• Hadoop-Snappy is being kept in synch with HadoopCommon• Snappy is a compression/decompression library. It does notaim for maximum compression, or compatibility with anyother compression library
48 Hadoop Backup
• Block replication is not a form of backup
• When a file is deleted by a user or an application, it is not immediately removed from HDFS
  – The file is moved to the /trash directory
  – The file can be restored quickly as long as it remains in /trash
  – A file remains in /trash for a configurable amount of time
• When a file is corrupted, restoring from backup is necessary
  – Create incremental backups of HDFS files
  – Use the time-stamp of the file
  – Use a staging area to store the backup files
  – Move the backup files to tape if necessary
  – http://blog.rapleaf.com/dev/2009/06/05/backing-up-hadoops-hdfs/
• Backing up with DistCp or copyToLocal commands
  – http://blog.rapleaf.com/dev/2009/06/11/multiple-ways-of-copying-data-out-of-hdfs/
  – http://hadoop.apache.org/common/docs/current/distcp.html
Figure: HAR File Layout (Master Index, Index, Data)
49 Big Data Analytics with HadoopData EncryptionGazzang
50 HDFS over NFS & CIFS Protocols
• Make an HDFS filesystem available across networks as an exported share
• Over NFS
  – Install FUSE server
  – Mount HDFS filesystem over FUSE
  – Export HDFS filesystem
  – http://mapredit.blogspot.fr/2011/11/nfs-exported-hdfs-cdh3.html
• Over CIFS
  – Install FUSE server
  – Install SAMBA server on FUSE server
  – Mount HDFS filesystem over SAMBA
  – Export HDFS filesystem
  – http://mapredit.blogspot.fr/2011/12/export-hdfs-over-cifs-samba3.html
• FUSE is a framework which makes it possible to implement a filesystem in a userspace program. Features include:
  – Simple yet comprehensive API
  – Secure mounting by non-root user
  – Multi-threaded operation
• SAMBA is a suite of Linux applications that speak the SMB (Server Message Block) protocol. Many operating systems, including Windows, use SMB to perform client-server networking.
51 Ganglia Monitoring
• Ganglia: Ganglia by itself is a highly scalable cluster monitoring tool, and provides visual information on the state of individual machines in a cluster or summary information for a cluster or sets of clusters. Ganglia provides the ability to view different time windows into the past, normally one hour, one day, one week, one month, and so on. http://ganglia.sourceforge.net/
• 2 daemons: GMOND & GMETAD
• GMOND collects or receives metric data on each DataNode
• 1 GMETAD per grid
• GMETAD polls 1 GMOND per cluster for data
• A node belongs to a cluster
• A cluster belongs to a grid
Diagram: Grid of Clusters 1 to 3, each with GMOND nodes; GMETAD polls each cluster with failover; Apache Web Frontend, Web Client
52 Nagios Monitoring
• A system and network monitoring application
• Nagios is an open source tool developed to monitor hosts (disk usage…) and services, and designed to report network incidents before end users or clients notice them
• Nagios watches the hosts and services we specify and alerts when things go bad and when they recover
• Initially developed for server and application monitoring, it is widely used to monitor network availability
Diagram: Users, Apache, CGI, Common Graphical Interface, Nagios, Plugins, Database, Hadoop infrastructure
53 Crowbar Software for Hadoop
A modular, open source framework that accelerates multi-node deployments, simplifies maintenance, and streamlines ongoing updates
• Deploy a Hadoop cluster in hours instead of days
• Use or build barclamps to install and configure software modules
• Supports a cloud operations model to interact, modify, and build based on changing needs
• Download the open source software: https://github.com/dellcloudedge/crowbar
• Active community: http://lists.us.dell.com/mailman/listinfo/crowbar
• Resources on the Wiki: https://github.com/dellcloudedge/crowbar/wiki
Diagram: Opscode Chef Server Capabilities, Solutions Components
54 Big Data Analytics with HadoopLogical Data WarehouseFiles Web Data RDBMSDevelopment BI / AnalyticsActivityReportingENGINEERS ANALYSTS BUSINESS USERSMobile AppsMOBILE CLIENTSData ModelingDATA SCIENTISTSADMINISTRATORDataManagementUnstructured and structured Data WarehouseMPP, No SQL Engine, Distributed File SystemsShare-Nothing Architecture, AlgorithmsStructured Data WarehouseMPP, In-Memory, Columns Database, SQLEngine, Share-Nothing Architectureor Share-Disk Architecture via SANData IntegrationData Analysis & VisualizationData QualityMaster DataManagementData sourcesDataTransferSQLNoSQL
55 Hadoop Sizing
Feature | Description
Replication_blocks | Number of replication blocks; 3 recommended
Usable_data_volume | Data source volume (business data)
Temp_data_volume_ratio | tmp, log…; 20% of Usable_data_volume
Data_compression_ratio | Data compression ratio
Raw_data_volume | Usable_data_volume x Replication_blocks x Temp_data_volume_ratio x Data_compression_ratio
Rack_units_per_rack | Number of rack units in a rack
Rack_units_switch | Number of rack units for the network switches in one rack
Rack_units_per_DataNode | Number of rack units for one DataNode
DataNode_volume | Raw data volume in one DataNode
DataNodes | Number of DataNodes; Raw_data_volume / DataNode_volume
Racks | Number of racks; (DataNodes x Rack_units_per_DataNode) / (Rack_units_per_rack – Rack_units_switch)
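The sizing formulas can be turned into a small calculator. This is a sketch under stated assumptions, not the author's tool. The slide does not say how the ratios enter the product, so the code assumes the 20% temporary-data ratio is applied as a factor of (1 + 0.20) and that the compression ratio is compressed size divided by original size (1.0 means no compression). Node and rack counts are rounded up, and all function names are hypothetical.

```python
import math


def raw_data_volume(usable_tb, replication=3, temp_ratio=0.20, compression_ratio=1.0):
    """Raw capacity needed, in the same unit as usable_tb.

    Assumptions: temp_ratio adds overhead as (1 + temp_ratio), and
    compression_ratio is compressed size / original size.
    """
    return usable_tb * replication * (1 + temp_ratio) * compression_ratio


def datanodes_needed(raw_tb, datanode_volume_tb):
    """Number of DataNodes, rounded up to a whole machine."""
    return math.ceil(raw_tb / datanode_volume_tb)


def racks_needed(datanodes, ru_per_datanode, ru_per_rack, ru_switch):
    """Number of racks once switch space is reserved in every rack."""
    usable_units = ru_per_rack - ru_switch
    return math.ceil(datanodes * ru_per_datanode / usable_units)


if __name__ == "__main__":
    raw = raw_data_volume(usable_tb=100)
    nodes = datanodes_needed(raw, datanode_volume_tb=36)  # e.g. 12 x 3TB disks
    racks = racks_needed(nodes, ru_per_datanode=2, ru_per_rack=42, ru_switch=2)
    print(raw, nodes, racks)
```

The example inputs (100 TB of business data, 2U DataNodes, 42U racks with 2U reserved for switches) are illustrative values, not figures from the deck.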
56 Big Data Analytics with HadoopHadoop Architecture Overview2 x EdgeNode• 2 CPU 6 core• 96 GB RAM• 6 x HDD 600GB 15K (Raid10)• 2 x 10GbE Ports3 to n DataNode• 2 CPU 6 core• 48 GB RAM• 12 x HDD 3TB 7.5K• 2 x 10GbE PortsNetwork Switches2 x NameNode / Secondary NameNode• 2 CPU 6 core• 96 GB RAM• 6 x HDD 600GB 15K (Raid10)• 2 x 10GbE Ports1 x AdminNode• 2 CPU 6 core• 48GB RAM• 6 x HDD 600GB 15K (Raid10)• 2 x 10GbE PortsEdge Nodes Control Nodes Worker Nodes
57 POC Configuration
• Architecture example
• The exact configuration and sizing is designed depending on the customer’s needs
• Edge Nodes: 1 x EdgeNode (2 CPU 6 core, 32 GB RAM, 6 x HDD 600GB 15K Raid10)
• Control Nodes: 1 x NameNode, 1 x Secondary NameNode / AdminNode (2 CPU 6 core, 96 GB RAM, 6 x HDD 600GB 15K Raid10)
• Worker Nodes: 3 x DataNode, or 5 if ZooKeeper is used (2 CPU 6 core, 48 GB RAM, 12 x HDD 1TB 7.2K)
• Network Switch
58 Hadoop Benchmarks
• Designing appropriate hardware for a Hadoop cluster requires benchmarking or a POC and careful planning to fully understand the workload. However, Hadoop clusters are commonly heterogeneous, and it is recommended to deploy initial hardware with balanced specifications when getting started
• HiBench, a Hadoop benchmark suite constructed by Intel, is used intensively for Hadoop benchmarking, tuning & optimizations both inside Intel and by our customers/partners
• A set of representative Hadoop programs including both micro-benchmarks and more "real world" applications such as search, machine learning and Hive queries
• HiBench 2.1 available under Apache License 2.0 at https://github.com/hibench/HiBench-2.1
• Intel Cloud Builder Guide to Cloud Design and Deployment on Intel Platform – Apache Hadoop - February 2012 http://www.intel.in/content/dam/www/public/us/en/documents/guides/cloud-builders-xeon-apache-hadoop-guide.pdf
59 HiBench
Category | Workload | Description
Microbenchmarks | Sort | This workload sorts its binary input data, which is generated using the Hadoop RandomTextWriter example.
Microbenchmarks | WordCount | This workload counts the occurrences of each word in the input data, which is generated using Hadoop RandomTextWriter.
Microbenchmarks | TeraSort | A standard benchmark for large-size data sorting; the input is generated by the TeraGen program.
Microbenchmarks | DFSIO | Computes the aggregated bandwidth by sampling the number of bytes read/written at fixed time intervals in each map task.
Web Search | Nutch Indexing | This workload tests the indexing subsystem in Nutch, a popular Apache open-source search engine. The crawler subsystem in Nutch is used to crawl an in-house Wikipedia mirror and generates 8.4 GB of compressed data (for about 2.4 million web pages) in total as workload input.
Web Search | Page Rank | This workload is an open-source implementation of the page-rank algorithm, a link analysis algorithm used widely in Web search engines.
Machine Learning | K-Means Clustering | Typical application area of MapReduce for large-scale data mining and machine learning.
Machine Learning | Bayesian Classification | This workload tests the naive Bayesian (a well-known classification algorithm for knowledge discovery and data mining) trainer in Mahout, which is an Apache open-source machine-learning library.
Analytical Query | Hive Join | This workload models complex analytic queries of structured (relational) tables by computing both the average and sum for each group by joining two different tables.
Analytical Query | Hive Aggregation | This workload models complex analytic queries of structured (relational) tables by computing the sum of each group over a single read-only table.
60 Hadoop TeraSort Workflow
• Teragen is a utility included with Hadoop for creating the data sets that will be used by Terasort. Teragen utilizes the parallel framework within Hadoop to quickly create large data sets that can be manipulated. The time to create a given data set is an important point when tracking performance of a Hadoop environment
• The Terasort benchmark tests HDFS and MapReduce functions in the Hadoop cluster. Terasort is a compute-intensive operation that utilizes the Teragen output as the Terasort input. Terasort will read the data created by Teragen into the system’s physical memory, then sort it and write it back out to HDFS. Terasort will exercise all portions of the Hadoop environment during these operations
• Teravalidate is used to ensure the data produced by Terasort is accurate. It will run across the Terasort output data, verify that all data is properly sorted with no errors produced, and let the user know the status of the results
Diagram: Start Terasort (Control, 1 node); Create Data (Map, n nodes, n tasks); Start Sorts (Control, 1 node); Sorts Data (Map, n nodes, n tasks); Start Reduce (Control, 1 node); Combine Sorts (Reduce, n nodes, n tasks); Sorts Results (Control, 1 node); Complete Terasort
61 Big Data Analytics with HadoopArchitecture OverviewEdgeNodeNameNodeDataNodesSwitches Switches Switches Switches Switches SwitchesDataNodes DataNodes DataNodes DataNodes DataNodesSecondaryNameNodeEdgeNodeRack1 Rack2 Rack3 Rack4 Rack5 Rack6Admin Node Admin Node