OSC2012: Big Data Using Open Source: Netapp Project - Technical

Presented at the Open Source Conference 2012, organized by Accenture and Red Hat on December 14, 2012. This presentation discusses an open source Big Data case study.

By Jonathan Bender, Consultant, Accenture Technology Labs

Transcript

  • 1. Open source Big Data case study: Building a platform for remote device support at NetApp (Part II – Technical)
  • 2. Topics
      • Big Data Perspective
      • Case Study: NetApp AutoSupport
      • Technology Primer
      • Design Overview
    Copyright © 2012 Accenture. All rights reserved.
  • 3. Big Data
    The concept is disruptive. The technology is disruptive. And markets and clients are being impacted.
    [Figure: word cloud. Source: Wordle for Credit Suisse, "Does Size Matter Only?", September 2011]
  • 4. Shifts in Data and Analytics
    The changing landscape and required winning strategies are creating shifts within Big Data collection and analytics.
    Data Explosion
      • Unstructured data is doubling every 3 months
      • 2011 saw 47% growth overall
      • By 2015, the number of networked devices will be 2x the global population
    Monetization
      • Growth of enterprise data monetization services
      • Large retailers monetizing their own data to provide insights to suppliers
    Data-led Innovation
      • De-coupling data from applications
      • Disparate external data shaping context
      • Cost-effective mobilization of massive-scale data
    Social Media
      • Growing market for scrubbed, aggregate data from social media and blogs
      • Greater focus on data that provides insight into a customer’s digital persona
    Technology
      • Commodity-priced storage and compute
      • Emergence of open source and big data technologies solving production problems at scale
    Data Mobilization
      • Novel approaches to analyzing unstructured data, creating shorter time from data to insight
      • Shift towards data consumption in multiple environments (business apps, mobile, social)
  • 5. The Big Data Approach
      • Treat data as a strategic asset and seek to maximize its value to the organization
      • Invest in common services, data platforms and tools
      • Rapidly prototype, deliver, and measure value-added data services; evolve over time
    Culture
      • Data-driven decision making
      • Experimentation and continuous improvement with academic rigor
      • End-to-end ownership of services
      • Sharing of platform, tools and code
  • 6. Topics
      • Big Data Perspective
      • Case Study: NetApp AutoSupport
      • Technology Primer
      • Design Overview
  • 7. Client Context
    NetApp, Inc.
      • Industry: Data storage, data management
      • 77% of Fortune 500 companies are customers
      • Creator of Data ONTAP, an industry-leading storage OS
  • 8. AutoSupport
      • Secure automated “call-home” service
      • Catch issues before they become critical
      • System monitoring and alerting
      • RMA requests without customer action
      • Faster incident management
    [Diagram labels: Storage Devices, AutoSupport Messages, AutoSupport Data Warehouse]
  • 9. Business Challenges
      • Increase in response times / lower availability for services
      • Incoming data volume doubling every 16 months
      • Proliferation of ad hoc datamarts and point solutions
      • Unable to analyze full AutoSupport contents efficiently
    [Diagram: AutoSupport data architecture by layer: Source, Extract, Stage, Transform, Integrate, Presentation]
    [Chart: AutoSupport Flat-File Storage Requirement. Total usage and projected total usage (TB), January 2005 to January 2016, on a 0 to 3,500 TB scale]
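The "doubling every 16 months" figure on slide 9 implies exponential growth, V(t) = V0 · 2^(t/16) with t in months. The sketch below works through that arithmetic. The 500 TB starting volume and the function name `projected_volume` are illustrative assumptions, not values taken from the slides.

```python
def projected_volume(initial_tb: float, months_elapsed: float,
                     doubling_months: float = 16.0) -> float:
    """Volume after months_elapsed, assuming steady exponential growth
    with a fixed doubling period (16 months, per slide 9)."""
    return initial_tb * 2 ** (months_elapsed / doubling_months)


# Hypothetical starting point of 500 TB; not a figure from the slides.
for months in (0, 16, 32, 48):
    print(months, projected_volume(500.0, months))
```

Under this assumption the volume grows eightfold in four years, which is why the slide treats storage and processing cost as a scaling problem rather than a one-off capacity purchase.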
  • 10. Solution Design Goals
    Improve data access and technology cost effectiveness and performance.
      • Improve system response times and data availability
      • Expose common data services for consumption across business units
      • Standardize key business metrics into a common rules repository
      • Lower operational costs as the ecosystem continues to scale
      • Provide more granular analytical capabilities
  • 11. Role of Open Source
    The platform is composed of open source technologies purpose-built for large-scale storage, processing and analysis.
    [Figure: actual Big Data solution blueprint for a hybrid deployment]
  • 12. Topics
      • Big Data Perspective
      • Case Study: NetApp AutoSupport
      • Technology Primer
      • Design Overview
  • 13. Technology Primer – Hadoop
    Hadoop Distributed Filesystem (HDFS)
      • Divides files into smaller “blocks”, stored across machines
      • Automated replication, fault tolerance
    Hadoop MapReduce
      • Parallel processing for large datasets across machines
      • Breaks a job into tasks, using a simple map() and reduce() paradigm for data flows
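To make the HDFS bullets on slide 13 concrete, here is a toy sketch of block splitting and replica placement. It assumes a 64 MB block size and a replication factor of 3, and it places replicas round-robin, which is a deliberate simplification and not HDFS's actual rack-aware placement policy. The function and node names are hypothetical.

```python
from typing import Dict, List


def split_into_blocks(file_size_mb: int, block_size_mb: int = 64) -> List[int]:
    """Return the size of each block; only the last block may be smaller."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks: int, nodes: List[str],
                   replication: int = 3) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct nodes, round-robin.

    Simplified illustration only: real HDFS placement also considers racks.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


blocks = split_into_blocks(200)  # a hypothetical 200 MB file
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
print(blocks)
print(placement)
```

Because every block lives on three different nodes in this sketch, losing any single node still leaves two copies of each block, which is the fault-tolerance property the slide refers to.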
  • 14. Technology Primer – MapReduce
    MapReduce (Simple Example – Word Count)
      • Map(key, value)
      • Reduce(key, List<value> values)
    Input: “One fish, two fish, red fish, blue fish.”
      • Map phase: mappers (m) emit <one,1> <fish,1>, <two,1> <fish,1>, <red,1> <fish,1>, <blue,1> <fish,1>
      • Shuffle phase: pairs are grouped by key and sent to reducers (r): <one,1>, <two,1>, <red,1>, <blue,1>, and four <fish,1>
      • Reduce output shown for “fish”: <fish,4>
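The word count example on slide 14 can be sketched in plain Python as a single-process imitation of the map, shuffle, and reduce phases. This is not Hadoop API code; the function names and the choice to split the input into two lines are assumptions made for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in the line."""
    for word in re.findall(r"[a-z]+", line.lower()):
        yield (word, 1)


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework would between phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Sum the counts collected for one word."""
    return (key, sum(values))


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["One fish, two fish,", "red fish, blue fish."])
print(counts)  # {'one': 1, 'fish': 4, 'two': 1, 'red': 1, 'blue': 1}
```

In a real cluster each call to `map_phase` and `reduce_phase` could run on a different machine; only the shuffle requires moving data between them.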
  • 15. Technology Primer – NoSQL
      • “Not only” SQL
          • Catch-all term for various non-relational database systems
      • Typical areas of differentiation
          • Data model semantics (e.g. Database, Document, Key-Value)
          • CAP trade-offs (Consistency, Availability, Partition-Tolerance)
          • Scale-out architecture (e.g. Sharding, Distributed hash)
          • Query language
    Examples: HBase, Cassandra, MongoDB, Neo4j, etc.
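As a rough illustration of the "Sharding, Distributed hash" bullet on slide 15, the sketch below routes keys to shards by hashing them. It shows simple hash-modulo sharding, not the partitioning scheme of any particular database named on the slide. The key format and shard count are hypothetical.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash (hash-modulo sharding)."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_shards


# Hypothetical system identifiers, spread over 4 shards.
for system_id in ("system-001", "system-002", "system-003"):
    print(system_id, "->", shard_for(system_id, 4))
```

One known drawback of this simple scheme is that changing `num_shards` remaps most keys; consistent hashing is a common way to limit that movement.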
  • 16. Topics
      • Big Data Perspective
      • Case Study: NetApp AutoSupport
      • Technology Primer
      • Design Overview
  • 17. Data Pipeline Overview
    [Diagram components: Incoming Messages, Ingestion, Core Data Processing, Data Service Interface, Ad hoc analytics, ETL]
  • 18. Data Ingestion
    Technologies
      • Apache Flume, Apache Hadoop, Drools BRMS, JMS
    Capabilities
      • Handle dynamic data volumes
      • Normalization of disparate file formats
      • Real-time aggregation of documents
      • JMS alerts for critical messages
    [Diagram: documents from the front-end HTTP/SMTP gateway enter through a Flume client and pass through tiers of Flume agents (routing tier, parsing tier with a rules engine, aggregation and sink tier); aggregated files are written to HDFS, and notifications are sent over JMS]
  • 19. Core Data Processing
    Technologies
      • MapReduce, HBase, Solr, Avro
    Capabilities
      • Parallel processing for increased throughput
      • Efficient storage of complex data objects in Avro
    [Diagram: documents gathered from Flume are read from HDFS; map and reduce tasks parse the text contents, then transform and derive data objects; derived objects are written to data stores: Solr (search indexes), HBase (primary storage), and Hive (data warehouse)]
  • 20. Data Services
    Technologies
      • Apache HBase, Solr, Tomcat
    Capabilities
      • Unified web services API for end users
      • Support for complex queries and searches across multiple dimensions with Solr
      • Access to both raw and derived content for a given system
  • 21. Analytics / ETL
    Technologies
      • Apache Hive, Pig, Datameer (ad hoc analytics)
      • Pentaho (ETL / data integration)
    Capabilities
      • Analytical environment for both business analysts and “power users”
      • Hive or Pig as higher-level query languages
      • Datameer for analytics with a spreadsheet UI
      • ETL through Pentaho MapReduce (runs the Pentaho ETL server inside a MapReduce job)
  • 22. Successes and Challenges
    Successes
      • Web service interface contracts simplified integration with user tools and allowed for flexibility in internal implementation
      • Open source core allowed for rapid iteration
      • Met or exceeded all SLAs using commodity hardware, significantly driving down costs
    Challenges
      • Monitoring a large distributed system requires discipline and a strong operations team
      • Shared storage systems and Big Data technologies don’t always play well together
      • “Schemaless” systems can become a headache to maintain, especially with complex data models
  • 23. Thank you
    Jonathan Bender, Consultant, Accenture Technology Labs
    jonathan.bender@accenture.com