ENTER THE AGE OF HADOOP™ SUPERCOMPUTING
Introducing Cray cluster supercomputers for Hadoop™

Hadoop, and the elephant logo, are trademarks of the Apache Software Foundation.

Background: The Baby Elephant in the Room
6/10/13

Hadoop widely perceived as high potential, not yet high value, but that’s about to change…

Current Perception of Hadoop
• Synonymous with Big Data and openness
• Excellent for the three “V’s” (Volume, Velocity, & Variety)
• Capable of huge scale with ad-hoc infrastructure

Current Reality of Hadoop
• Many experimenting (64% of F1000, in 2012)
• Much expertise in Warehousing – little beyond that
• Bottlenecks Data Scientist, so performance not yet an issue

Current Trajectory of Hadoop
• Industry Momentum – Vendors, analysts, organizations, etc.
• More Users – Beyond Data scientists, Business & Ops users
• More Complexity – Near real-time, complex algorithms, etc.

Untapped Hadoop Potential: Most Organizations just scratching the Surface

[Figure: stages of Hadoop use; axes: Complexity, Value]
• Store – Warehouse: Search for what happened
• Report – Batch Reporting: Report what happened
• Analyze – Business Intelligence: Determine why it happened
• Monitor – Real-Time Ops. Monitoring: Know when it’s happening
• Predict – Predictive Analytics: Predict it will happen

Figure annotations:
• In 2013 – Most organizations, using Hadoop, are approximately here.
• High Value Hadoop – Ideally, most organizations aspire to be here, but…

Realizing Hadoop Potential: Increasing Value adds Complexity
Internal use Only - Hadoop CS300 Launch Draft

Type of Users
• Store: Data Scientists
• Report: Data Scientists, Analysts
• Analyze: Data Scientists, Analysts, Business
• Monitor: Data Scientists, Analysts, Business, Ops
• Predict: Data Scientists, Analysts, Business, Ops

# of Users
• Store: Few
• Report: Few
• Analyze: Medium
• Monitor: High
• Predict: High

Algorithms
• Store: Few Crude
• Report: Few Basic
• Analyze: Complex, 3rd party apps
• Monitor: Many Complex, 3rd party apps
• Predict: Many Advanced, 3rd party apps

Latency
• Store: Infrequent Batch
• Report: Frequent Batch
• Analyze: Frequent Batch, Some Real-time
• Monitor: Frequent Batch, Much Real-time
• Predict: Frequent Batch, Much Real-time

Data Types
• Store: Unstructured, Binary, Semi-Structured
• Report: Unstructured, Binary, Semi-Structured
• Analyze: Unstructured, Binary, Semi-Structured, Big Table-RDBMS
• Monitor: Unstructured, Binary, Semi-Structured, Big Table-RDBMS
• Predict: Unstructured, Binary, Semi-Structured, Big Table-RDBMS

Data Volume
• Store: Medium-High
• Report: Medium-High
• Analyze: High
• Monitor: High
• Predict: High

Value
• Store: Low
• Report: Medium-High Value
• Analyze: High Value
• Monitor: High Value
• Predict: High Value

For High Value Hadoop: Performance is Critical

More Complexity
• Big & Fast – Oceans of data, correlated with streams
• Algorithms – Complexity increases, with analysis & predictive
• Sprawl – Hadoop clusters sprawl, with performance gaps

More Users
• Friendly Tools – Analysis tools extend use beyond “scientists”
• Business Users – Non-technical users demanding access
• Frequency – Users will require more frequent queries/joins

More Data
• Variety – Unstructured, Big Table, binary, etc. all incorporated
• Volume – Explosive growth, as new sources added
• Velocity – Near real-time required, as Hadoop broadens

Beyond performance, other attributes are required for High Value Hadoop…

For High Value Hadoop: Reliability & Maintainability are also Critical

Reliability
• Fast ROI – Need quick & predictable time to production
• Confidence – Ability to meet future objectives
• Staffing – Must not overburden, or require erratic ramp-up

Maintenance
• Support – Must not rely upon an army of vendors
• Management – Integrated, holistic, management
• Change – Controlled risk with upgrades or config. changes

Yet, few of today’s organizations are prepared for High Value Hadoop…

For High Value Hadoop: Ad-hoc Infrastructure is limiting

Experimental Attributes

Great for Skunk works
• Repurpose – Easy to allocate/reallocate existing resources
• Flexible – Leverage virtualization to grow up & spin down
• Cheap Start – Minimal investment to validate potential value

High Value Attributes

Performance
• Mixed – OK for simple storage & reporting, bad for Higher Value apps
• Inefficient – Typically poor use of raw hardware performance
• Not HPC – Will never reach performance of HPC system

Reliable
• Slow ROI – Uncertainty and long lead times to production
• F.U.D. – Low confidence in ability to meet future objectives
• Chaotic Staffing – Requires large cross-functional team commit

Maintenance
• Support – Many vendors and many “indeterminate” issues
• Management – Many tools with poor integration & correlation
• Change – Cascading risk with every upgrade or config. change

Cray Cluster Supercomputers for Hadoop: Purpose-Built, Turnkey, Hadoop Solutions

Best Hadoop Distribution
• Security – Comprehensive, and fast, encryption
• Performance – Faster Hive, Cache acceleration, etc.
• Management – Intel Manager for Hadoop Software

Performance of a Cray
• Proven HPC – Cray HPC technology and expertise
• Vast Scale – Grow to meet any mission requirements
• Holistic Design – Balanced: Compute, networking, & Storage

Turnkey Solution
• Reliable – Rapid ROI… runs as-advertised
• Support – One throat to choke, for the whole stack
• Maintenance – Update & evolve, without concerns

High Value Hadoop
• Performance – Power to accommodate current & future goals
• Reliability – Will meet any challenge, without surprises
• Maintenance – Easy to maintain & accommodate change

Cray Cluster Supercomputers for Hadoop: Intel Distribution for Apache Hadoop
6/11/13

HPC-Centric
• HPC Investments – Performance, Management, and Security
• Hardware Optimizations – Enhancements throughout IDH stack
• Lustre – IDH working on Lustre support & improvements

Performance
• HPC Performance – Emphasis on hardware optimization vs. virtualization
• SQL – Significant HBase performance optimizations
• MapReduce – Faster job launches = more frequent analysis
• Others – Optimized text search (Lucene), SSD cache, etc.

Security
• Cell Level Security – Critical for multiuser/multitenant
• API & Services – Intel Expressway for security across all Hadoop services
• Role-Based – Intel Manager access control
• Fast Encryption – Hardware-optimized encryption/decryption

Cray Cluster Supercomputers for Hadoop: Simple Linux Utility for Resource Management

Results
• Efficiency & ROI – Improved utilization & Economies of scale
• Multitenant – Necessary for supporting multiple internal or external orgs.
• No Cluster Sprawl – One data repository, with many uses & users

Resource Management
• Hadoop Diversity – Allocate Hadoop libraries by job
• Multipurpose – Essential, for multiuse (e.g. Scientific compute & Hadoop)

Job Scheduling
• Service Levels – Predictable, as workload grows
• Prioritization – Ensure important jobs perform

Cray Cluster Supercomputers for Hadoop: Purpose-Built, Turnkey, Hadoop Solutions

Flexible Architecture
• Multipurpose – Analytics design, with Compute-intensive capabilities
• HDFS – Rack mount servers, with direct-attach SSD/SATA
• Lustre – Greenblade, with Sonexion
• Cooling – Air-cooled, or liquid-cooled, depending on reqts.

Hadoop Software Stack
• Hadoop – Latest, Cray-optimized Intel Distribution of Hadoop
• Workload Management – SLURM Job Scheduling & Resource Mgt.
• OS – Cray-optimized Linux OS
• ACE – Advanced Cluster Engine Management Software

Cray Service
• Validation – Entire configuration and stack validated, in Cray labs
• Implementation – Cray services facilitate a production framework
• Support – One primary point of contact for all solution issues

Cray Cluster Supercomputers for Hadoop: Enables High Value Hadoop

Performance
• Support More Data – Enormous Variety, Volume and Velocity
• Support More Users – Enables user growth, while maintaining SLAs
• Support More Complexity – Both big & fast, with complex algorithms

Reliability
• Reliable Service Levels – Integrated workload management
• Change – Controls risk with upgrades & config. changes
• Secure – Integrated, and thorough, security

Maintainability
• Support – One primary point of contact for all solution issues
• Management – Integrated and holistic management
• Staffing – Doesn’t overburden staff or require erratic ramp-ups

Turnkey
• Best of Breed – Integrated HPC Hadoop Solution
• It just works – Validated and optimized, by Cray, without surprises
• Rapid ROI – Rapid deployment & predictable production schedules