Productionizing Hadoop: 7 Architectural Best Practices
  • Best practices in productionizing Hadoop
  • Question Q26
  • Base: 634 Business Intelligence users and planners
  • Image source: Google (http://www.google.com/)
  • Image source: istockphoto
  • 1 year = 525,948.766 minutes
    1 - .9999 = .0001
    1 - .9995 = .0005
    4 nines = .0001 x 525949 = 52 minutes per year
    5 nines = .00001 x 525949 = 5 minutes
    99.95 = .0005 x 525949 = 263 minutes = 4 hours and
  • With MapR, Hadoop is lights-out data center ready. MapR provides five nines (99.999%) of availability, including support for rolling upgrades, self-healing, and automated stateful failover. MapR is the only distribution that provides these capabilities. MapR also provides dependable data storage with full data protection and business continuity features, including point-in-time recovery to protect against application and user errors. End-to-end checksumming means data corruption is automatically detected and corrected by MapR’s self-healing capabilities. Mirroring across sites is fully supported. All these features support lights-out data center operations: every two weeks an administrator can take a MapR report and a shopping cart full of drives and replace the failed drives.
  • Image source: istockphoto
  • Image source: istockphoto.com
  • Image source: istockphoto
  • Image source (clothing): istockphoto
    Image source (To Tell the Truth logo): Wikimedia
  • MapR enables integration by providing industry-standard interfaces
    More 3rd party solutions work with MapR than any other distribution
    Proprietary connectors not needed
    NFS: All file-based applications can read and write data. Examples: Linux utilities, file browsers, Informatica UltraMessaging
    ODBC 3.52: All BI applications can leverage Hive. Examples: Excel, Crystal Reports, Tableau, MicroStrategy
    Linux PAM: Any authentication provider can be used. Examples: LDAP, Kerberos, 3rd party
  • Image source: Mike Gualtieri
  • Image source: istockphoto
  • A recent Wall Street Journal article said that “MapR is Cheaper than Free.” We provide a very powerful ROI that encompasses hard-dollar opex and capex savings and provides value across multiple dimensions.
  • Source: NASA (http://www.nasa.gov/)

Presentation Transcript

  • Productionizing Hadoop: 7 Architectural Best Practices Mike Gualtieri, Principal Analyst
  • #BigData
  • Big Data has momentum
    “What best describes your firm's current usage/plans to adopt Big Data technologies and solutions?”
    Response categories: Implemented, not expanding; Expanding/upgrading implementation; Planning to implement in the next 12 months; Planning to implement in more than 1 year; Interested but no plans
    Values shown: 7%, 13%, 7%, 17%, 31%
    20% have implemented some big data technology
    37% are planning some big data technology project
    Base: 634 business intelligence users and planners
    Source: Forrsights BI/Big Data Survey, Q3 2012
    © 2013 Forrester Research, Inc. Reproduction Prohibited
  • Forrester definition: “Big Data is the frontier of a firm’s ability to store, process, and access (SPA) all of the data it needs to operate, make decisions, reduce risks, and serve customers.”
  • Firms seek more value in data, struggle to wrangle it, & seek lower cost solutions
    “What are the main business and technical requirements or inadequacies of earlier-generation BI technologies that lead you to consider new BI techniques and technologies?”
    Other: 2%
    Don't know: 3%
    Earlier generation technology is too expensive: 21%
    The velocity of data is too high for earlier technologies: 22%
    We can achieve (or are achieving) significant cost reductions by changing our data management and analytic architecture: 28%
    Data changes or becomes available much faster than we can process in support of business decisions: 32%
    The number of data formats that we must be able to deal with exceeds our ability to cost-effectively integrate: 32%
    Analysis requirements change too fast to keep up with: 36%
    We want to access data that was not accessible for us with existing technologies: 36%
    Data volumes have grown beyond what we can cost effectively manage: 38%
    We don't know what our entire data universe contains, we need new ways to explore data and discover patterns and…: 41%
  • Integrating data from a variety of data sources is a top challenge
  • Big Data architecture must support three core capabilities (SPA):
    Store: Can you capture and store all your data++?
    Process: Do you have the compute power to cleanse, enrich, & analyze your data++?
    Access: Can you retrieve, search, integrate, and visualize all your data++?
  • #Production
  • Production: How can you keep your Big Data operations running smoothly?
  • Productionizing Big Data can be complex because of:
    Integration with heterogeneous infrastructure
    Use of multiple analytical software applications
    Reliance on 3rd-party cloud services
    Always available modeling and visualization sandboxes
    Increasing volume, velocity, variety of data from multiple data sources
    Compute intensive analytics
  • Production: Big Data production requires sound architecture.
  • The 7 architectural qualities of Big Data production platforms (quality: what it means)
    1. Experience: Users’ perceptions of the usefulness, usability, and desirability of the application.
    2. Availability: The readiness of the service or application to perform its functions when needed
    3. Performance: The speed to perform functions to meet business and user expectations
    4. Scalability: Handle increasing volumes of data, transactions, services, and applications.
    5. Adaptability: The ease with which an application or service can be changed or extended
    6. Security: Supports the security properties of confidentiality, integrity, authentication, authorization, and nonrepudiation
    7. Economy: Minimize cost to build, operate, & change an application or service without compromising its business value
  • 1. Experience: Operational experience is critical to production.
  • Best practices: User experience
    Usefulness, usability, and desirability of applications require ease of use with power
    Developers: Standard Tools; Linux Commands; Direct Access with NFS
    Administrators: Visibility; Self Healing; Architectural Simplicity
  • Easy Workflow Management: Workload Automation with Cisco Tidal Enterprise Scheduler
    Detailed, dependency-driven event execution
    Point-and-click dynamic variables and parameters
    Scalable, extensible architecture
    Granular notification and alerts
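"Dependency-driven execution" means a job starts only after the jobs it depends on have finished. The sketch below illustrates that idea with Python's standard `graphlib` module. It is not the Cisco Tidal Enterprise Scheduler API, and the job names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs; each job maps to the set of jobs it depends on.
JOBS = {
    "ingest": set(),
    "cleanse": {"ingest"},
    "aggregate": {"cleanse"},
    "report": {"aggregate"},
    "archive": {"ingest"},
}


def execution_order(jobs: dict[str, set[str]]) -> list[str]:
    """Return an order in which every job runs after its dependencies.

    Raises graphlib.CycleError if the dependencies form a cycle.
    """
    return list(TopologicalSorter(jobs).static_order())


if __name__ == "__main__":
    print(execution_order(JOBS))
```

A production scheduler adds triggers, retries, and alerting on top of this ordering; the sketch covers only the ordering step.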
  • 2. Availability: High-availability strategy and architecture are often overlooked in proof-of-concepts.
  • What does high availability mean?
    Uptime %*            Downtime per year
    99.999% (5 nines)    5.26 minutes
    99.99% (4 nines)     52.6 minutes
    99.5%                1.83 days
    99% (2 nines)        3.65 days
    98%                  7.30 days
    95%                  18.25 days
    *Uptime calculations assume no scheduled downtime.
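The downtime figures above follow from simple arithmetic: downtime per year is the fraction of time the service is not up, multiplied by the length of a year. The sketch below assumes a 365-day year and, like the slide, no scheduled downtime. The function names are illustrative.

```python
# Sketch: yearly downtime implied by an uptime percentage.
# Assumes a 365-day year and no scheduled downtime.
MINUTES_PER_DAY = 24 * 60
MINUTES_PER_YEAR = 365 * MINUTES_PER_DAY  # 525,600


def downtime_minutes_per_year(uptime_percent: float) -> float:
    """Return the minutes of downtime per year allowed at the given uptime."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return (100 - uptime_percent) / 100 * MINUTES_PER_YEAR


def describe(uptime_percent: float) -> str:
    """Format downtime in minutes when under a day, otherwise in days."""
    minutes = downtime_minutes_per_year(uptime_percent)
    if minutes < MINUTES_PER_DAY:
        return f"{uptime_percent}%: {minutes:.3f} minutes"
    return f"{uptime_percent}%: {minutes / MINUTES_PER_DAY:.3f} days"


if __name__ == "__main__":
    for pct in (99.999, 99.99, 99.5, 99, 98, 95):
        print(describe(pct))
```

The results agree with the slide's table up to rounding. Each additional nine cuts the allowed downtime by a factor of ten.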
  • High Availability and Dependability
    Reliable Compute: Automated stateful failover; Automated re-replication; Self-healing from HW and SW failures; Load balancing; Rolling upgrades; No lost jobs or data; Five nines (99.999%) of uptime
    Dependable Storage: Business continuity with snapshots and mirrors; Recover to a point in time; End-to-end checksumming; Strong consistency; Data safe; Mirror across sites to meet Recovery Time Objectives
    ©MapR Technologies - Confidential
  • 3. Performance: Unexpected latencies can emerge from rapid fluctuations in volume, velocity, & variety of data and interactions of the larger Big Data ecosystem.
  • World Record Performance
    New Minute Sort World Record: 1.5 TB in 1 minute on 2103 nodes (previous record: 1.4 TB)
    Benchmark: MapR 2.1.1 / CDH 4.1.1 / MapR speed increase
    Terasort (1x replication, compression disabled)
      Total: 13m 35s / 26m 6s / 1.9x
      Map: 7m 58s / 21m 8s / 2.7x
      Reduce: 13m 32s / 23m 37s / 1.7x
    DFSIO throughput/node
      Read: 1003 MB/s / 656 MB/s / 1.5x
      Write: 924 MB/s / 654 MB/s / 1.4x
    YCSB (50% read, 50% update)
      Throughput: 36,584.4 op/s / 12,500.5 op/s / 2.9x
      Runtime: 3.80 hr / 11.11 hr / 2.9x
    YCSB (95% read, 5% update)
      Throughput: 24,704.3 op/s / 10,776.4 op/s / 2.3x
      Runtime: 0.56 hr / 1.29 hr / 2.3x
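The "speed increase" figures for the Terasort rows can be checked from the two timing columns: divide the slower runtime by the faster one. The sketch below does that for durations written as on the slide (for example `13m 35s`). The helper names are illustrative.

```python
import re


def to_seconds(duration: str) -> int:
    """Convert a duration such as '13m 35s' to seconds."""
    match = re.fullmatch(r"(\d+)m\s+(\d+)s", duration.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {duration!r}")
    minutes, seconds = map(int, match.groups())
    return minutes * 60 + seconds


def speedup(faster: str, slower: str) -> float:
    """Return how many times faster the first run is than the second."""
    return to_seconds(slower) / to_seconds(faster)


if __name__ == "__main__":
    # Terasort timings from the slide: (phase, MapR 2.1.1, CDH 4.1.1)
    for phase, mapr, cdh in [
        ("Total", "13m 35s", "26m 6s"),
        ("Map", "7m 58s", "21m 8s"),
        ("Reduce", "13m 32s", "23m 37s"),
    ]:
        print(f"{phase}: {speedup(mapr, cdh):.1f}x")
```

For throughput rows the ratio runs the other way (higher is better), so the speed increase is the MapR value divided by the CDH value.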
  • 4. Scalability: Scalability is as much about scaling up as it is about scaling down.
  • MapR’s Relative Scale
    Testing completed on 10 node cluster, 2x Quad-Core, 24G DRAM, 12 x 1TB SATA Drives @ 7200 rpm
    [Chart: file creates/s versus files (M), other distribution compared with MapR distribution]
    Scale Advantage: 4600x
  • 5. Adaptability: Firms have barely scratched the surface of what is possible with Big Data analytics. Change is always in the wind.
  • I am a data scientist. I am a data scientist. I am a data scientist. Data scientists will constantly have new requirements
  • Compress… to accelerate the pace of discovery
    Production must address and help compress the full Big Data analytics life cycle
  • Direct Integration with Existing Applications
    100% POSIX compliant
    Industry standard APIs - NFS, ODBC, LDAP, REST
    More 3rd party solutions
    Proprietary connectors unnecessary
    Language neutral
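In practice, NFS access means an ordinary program can read cluster data with standard file I/O and no Hadoop-specific client library. The sketch below shows what that looks like. The mount point `/mnt/cluster` and the log path are hypothetical, and the function is plain Python with nothing MapR-specific in it.

```python
from pathlib import Path


def count_matching_lines(path: Path, needle: str) -> int:
    """Count lines containing `needle` using only standard file I/O."""
    with path.open("r", encoding="utf-8", errors="replace") as handle:
        return sum(1 for line in handle if needle in line)


if __name__ == "__main__":
    # Hypothetical path: assumes the cluster is NFS-mounted at /mnt/cluster.
    log_file = Path("/mnt/cluster/logs/app.log")
    if log_file.exists():
        print(count_matching_lines(log_file, "ERROR"))
```

The same property is what lets Linux utilities and file browsers work against the cluster unchanged.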
  • 6. Security: A breach can devastate an organization's reputation with customers or have legal repercussions.
  • All, some, or none of these 6 security properties may apply to Big Data
    Confidentiality: Information is available only to the people intended to use it or see it
    Integrity: Information is only changed in appropriate ways by people authorized to change it
    Readiness: Applications are available when needed and perform acceptably
    Authentication: A person’s identity is determined before access is granted if anonymous people are not allowed
    Authorization: People are allowed or denied access to applications or application resources
    Nonrepudiation: A person cannot perform an action and then later deny performing that action
  • Securing Big Data: Corporate Security Requirements
    Authentication: Wire-level security
    Authorization (Access Control): Standard: UID, GID based; Granular: File, Table, Column Family, Column, Cell
    Integration into Existing Environments: Kerberos or non-Kerberos; Use existing Directory for credential lookups
    Seamless Access with Single Sign-On
  • 7. Economy: Every architectural decision has an impact on the return on investment for Big Data analytics platforms.
  • Beware of pilot programs that don’t scale economically
    [Chart labels: Production Sweet Spot; Business value of big data; Investment; People-intensive platforms; Technology-intensive platforms]
  • Maximizing Economic Value
    Analytics: Ability to perform broader and deeper analytics; Real-time streaming; Mission critical SLAs; Cloud based analysis
    Ease of Development
    Ease of Administration
    Value of Uptime
    Value of Data Protection
    Hardware Efficiency
    First Class Support
  • One Platform for Big Data …
    99.999% HA; Data Protection; Disaster Recovery; Scalability & Performance; Enterprise Integration; Multi-tenancy
    Map Reduce; File-Based Applications; SQL; Database; Search; Stream Processing
    Batch; Interactive; Real-time
  • The 7 qualities of Big Data production platforms (quality: what it means)
    1. Experience: Users’ perceptions of the usefulness, usability, and desirability of the application.
    2. Availability: The readiness of the service or application to perform its functions when needed
    3. Performance: The speed to perform functions to meet business and user expectations
    4. Scalability: Handle increasing or decreasing volumes of transactions, services, and data
    5. Adaptability: The ease with which an application or service can be changed or extended
    6. Security: Supports the security properties of confidentiality, integrity, authentication, authorization, and nonrepudiation
    7. Economy: Minimize cost to build, operate, & change an application or service without compromising its business value
  • Big Data is about innovation, but not if you don’t productionize it.
    Collectors: Capture; Store
    Journalists: Reports; Dashboards
    Innovators: Predictive analytics
    Operations; Business Intelligence; Predictive Power
  • Frontier: Big data is about pushing limits. Exponential growth in data means the frontier is vast.
  • Thank you Mike Gualtieri mgualtieri@forrester.com Twitter: @mgualtieri