NetApp is in the process of moving a petabyte-scale database of customer support information from a traditional relational data warehouse to a Hadoop-based application stack. This talk will explore the application requirements and the resulting hardware and software architecture. Particular attention will be paid to trade-offs in the storage stack, with data on the various approaches considered and benchmarked, and on the resulting final architecture. Attendees will learn about the range of architectures available when contemplating a large Hadoop project and the process NetApp used to choose among the alternatives.
2. Agenda
NetApp: Drowning in Data
Technology Assessment
Business Drivers to Choose E-Series
Solution Architecture
Performance Benchmarks
Best Practices
Questions
3. The AutoSupport Family
The foundation of NetApp Support strategies
Catch issues before they become critical
Secure automated “call-home” service
System monitoring and nonintrusive alerting
RMA requests without customer action
Enables faster incident management
“My AutoSupport Upgrade Advisor tool does all the hard work for
me, saving me 4 to 5 hours of work per storage system and
providing an upgrade plan that’s complete and easy to follow.”
4. AutoSupport Capabilities
Diagram: NetApp storage systems send AutoSupport messages (HTTPS) and customer messages (email) into the AutoSupport database, which drives a risk detection & automation engine. Reactive outputs: automatic replacement-parts dispatch and automatic case creation. Proactive outputs: customer environment assessment and optimization, plus sizing and modeling, drawing on the customer install base and NetApp and partner usage. Storage administrators use the My AutoSupport customer portal (proactive and predictive).
5. Business Challenges
Gateways:
– 600K ASUPs every week
– 40% arrive over the weekend
– 0.5% growth week over week
ETL:
– Data must be parsed and loaded within 15 minutes
Data Warehouse:
– Only 5% of the data goes into the data warehouse; the rest is unstructured, yet it is growing 6-8 TB per month
– The Oracle DBMS is struggling to scale; maintenance and backups are challenging
– No easy way to access this unstructured content
Reporting:
– Numerous mining requests currently go unsatisfied
– Huge untapped potential of valuable information for lead generation, supportability, and BI
Finally, the incoming load doubles every 16 months!
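The 16-month doubling time is consistent with the ~68% annual data-growth rate cited later in the deck. As a quick sanity check (my arithmetic, not a figure from the talk):

\[ (1+g)^{16/12} = 2 \quad\Rightarrow\quad 1+g = 2^{3/4} \approx 1.68 \quad\Rightarrow\quad g \approx 68\%\ \text{CAGR}. \]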
6. Incoming AutoSupport Volumes and TB Consumption
Chart: flat-file storage requirement, total and projected usage in TB, Jan 2005 through Jan 2016 (0-3500 TB scale), doubling every 16 months.
At the projected rate of growth, the total storage requirement will double every 16 months.
As of June 2011:
– ~600,000 events archived each week
– ~3 TB of disk space used each week
– Events growing at 40% year over year; disk use growing faster
– Expanding products & features
Cost model: > $15M per year in ecosystem costs
7. Big Data is Expensive
Growth Rates (CAGR)
– Data: +68%
– Cost/byte: -30%
– Net cost: +30%
Budget is flat
8. Problem Summary
1. Data Growing at 68% CAGR
2. Current implementation will not survive
much longer
– We will fail to meet SLAs on ingest of new data
– To meet business-critical SLAs, we would have to limit the scope of the data warehouse
3. Many new opportunities / requirements
9. New Functionality Needed
Chart: new use cases plotted by required latency (weeks down to seconds) and data scale (gigabytes up to petabytes): product analysis, service performance planning, cross-sell & up-sell, customer intelligence, sales, license management, proactive support, customer self-service, and product development.
12. Requirements used for POC & RFP
Cost Effective
Highly Scalable
Adaptive
New Analytical Capabilities
13. POC Tests
Log Data: report analysis for an event across the entire install base, spanning 6 months to 1 year of logs (benchmarks used 25% of the install base and 2 months of data)
– I/O bound
Counter Manager: analysis generally restricted to one system or one cluster for a single month (benchmarks used 2 days of data from 25% of the install base)
– Trending across the install base is generally rare and ad hoc
– More CPU bound (some tools query large numbers of counters)
16. Prime Hadoop Use Cases in POC
Use case: Logs (EMS) – find occurrences of a pattern across all log files from the last 6 months
– Workload type: I/O bound
– Current capabilities: one month of data is worth 24 B records; of these, some 100 M records are loaded into the DW per month, and it takes 4 days to load a week; no ad-hoc capability exists to mine the pending records
– How Hadoop can help: the POC shows a 10-node cluster could process one month of data within 20 minutes

Use case: CM – find hot disks by disk type, system model, etc.
– Workload type: CPU bound
– Current capabilities: up to 10 M records in a single CM file; 200 B records in a month; no capability exists today in the back-end infrastructure to process these
– How Hadoop can help: achieved a throughput of 3 M records per second during the POC; a 100-node cluster is projected to process one month of data in 1.8 hours
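To make the EMS log workload concrete, here is a minimal sketch of a Hadoop MapReduce job that counts occurrences of a regular-expression pattern across log files. This is an illustration, not code from the talk: the class names, the ems.pattern property, and the example paths are hypothetical.

import java.io.IOException;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Counts log lines matching a regex across every file under the input path.
public class EmsPatternCount {

  // Map: emit (pattern, 1) for each line that matches.
  public static class MatchMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {
    private Pattern pattern;
    private final LongWritable ONE = new LongWritable(1);

    @Override
    protected void setup(Context context) {
      pattern = Pattern.compile(context.getConfiguration().get("ems.pattern"));
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      if (pattern.matcher(line.toString()).find()) {
        context.write(new Text(pattern.pattern()), ONE);
      }
    }
  }

  // Reduce (and combine): sum the match counts.
  public static class SumReducer
      extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> counts, Context context)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable c : counts) {
        sum += c.get();
      }
      context.write(key, new LongWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("ems.pattern", args[0]);                     // e.g. "disk\\.failed"
    Job job = new Job(conf, "ems-pattern-count");
    job.setJarByClass(EmsPatternCount.class);
    job.setMapperClass(MatchMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[1])); // hypothetical log dir
    FileOutputFormat.setOutputPath(job, new Path(args[2]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}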
18. ASUP.Next Hadoop Architecture
Diagram: ASUPs, logs, performance data, and raw config are ingested through Flume into HDFS; Pig jobs analyze the data; a REST layer provides lookup and subscribe interfaces for config tools and downstream consumers of metrics, analytics, and BI.
19. NetApp Open Solution for Hadoop
Easy to Deploy, Manage, Scale
Performance; Resilience; Density
Performance
– Bandwidth for streaming
– IOPS for metadata
– Reduced cluster network congestion
Capacity and density
– 4 servers and 120 TB fit in 8U
– Fully serviceable storage system
Reliability
– Hardware RAID and hot swap prevent job restarts on media failure
– Reliable metadata (NameNode)
– Enterprise-class fit and finish
Enterprise-Class Hadoop
21. NetApp Storage Solution Architecture
Key Attributes:
– Storage is protected by in-box RAID
  Shared spare pool defers replacement of drives
  Rebuild does not consume network bandwidth
– Storage is striped
  Maximizes performance by minimizing unequal storage utilization
– Reliable storage allows an HDFS replication count of 2 (a configuration sketch follows)
  Fewer disks
  Less space, power, cooling, cost, …
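A minimal sketch of how the reduced replication count can be requested, using the standard dfs.replication client property (illustrative code, not from the talk; the class name and argument are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TwoReplicaWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The array's hardware RAID already protects against media failure,
    // so HDFS replication can drop from the default of 3 to 2.
    conf.set("dfs.replication", "2");
    FileSystem fs = FileSystem.get(conf);
    // Files created through this handle are written with 2 replicas.
    fs.create(new Path(args[0])).close();
  }
}

Cluster-wide, the same property would normally be set in hdfs-site.xml; the hardware RAID underneath is what makes two replicas, rather than HDFS's default of three, acceptable.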
24. NetApp Storage Solution Architecture
RESULTS ARE PRELIMINARY
Performance concerns
– Initial testing has focused on TestDFSIO
Per-disk comparison
– 14 disks/server in the array configuration
– 6 disks/server direct-attached
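TestDFSIO is the stock HDFS throughput benchmark shipped in Hadoop's test jar: it writes or reads a set of files and reports aggregate throughput and average I/O rate. A typical invocation looks like the following (file counts and sizes are illustrative; the exact jar name varies by Hadoop version):

hadoop jar hadoop-test.jar TestDFSIO -write -nrFiles 16 -fileSize 1000
hadoop jar hadoop-test.jar TestDFSIO -read -nrFiles 16 -fileSize 1000
hadoop jar hadoop-test.jar TestDFSIO -clean

-fileSize is in MB, so this writes and then re-reads 16 files of 1 GB each before cleaning up the benchmark's working directory.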
25. NetApp Storage Solution Architecture
Minimizing TCO
– Disk rebuild
Handled in the controller
Minimal impact to performance
No network bandwidth consumed
– Server uptime
Very high
– Hardware maintenance
Swap out dead disks as routine, not exception
Swap out of stateless servers is painless
27. Take Aways
NetApp assessed multiple traditional DB technologies to solve its Big Data problem and determined Hadoop was the best fit
Moved from direct-attached disks to array-based storage to improve TCO
The overall architecture supports scale-out growth
AutoSupport (resident in the Data ONTAP operating system of every NetApp storage system) constantly monitors, troubleshoots, and reports on the health of NetApp systems. In addition to using AutoSupport for case generation and part dispatch, NetApp's risk-prognosis ecosystem (developed through innovations in people, process, and technology) delivers exemplary storage uptime and customer satisfaction. Risks handled include issues in the areas of configuration, interoperability, and other errors induced in the storage system by unintentional operations. The NetApp support site has knowledgebase articles and support bulletins to help SAMs (Support Account Managers) and FSEs (Field Support Engineers) drive adoption and awareness and help customers actively mitigate risks.
The current data warehouse will reach the limits of its capacity and processing capability for future Data ONTAP releases, leading to missed SLAs. The current environment has limited reporting capabilities, despite large demand for ASUP reporting. Processing all performance data for analysis is not possible due to the size and scale of the data, and the data is doubling every 16 months.