Case Study: Implementing Hadoop and Elastic Map Reduce on Scale-out Object Storage


Published in: Technology

  1. Cloudian® S3 Cloud Storage Platform
     Case Study: Implementing Hadoop and Elastic Map Reduce on Scale-out Object Storage
     Paul Turner, Cloudian Inc.
     June 11th 2014
  2. About Cloudian
     • Hybrid cloud storage startup in Silicon Valley
       – Strong venture backing: Goldman Sachs, Intel Capital
       – Solid management with storage, big data, enterprise software and telco expertise
       – 50 employees, offices in Foster City, Japan and China
     • Production-hardened product
     • Target market: mid-size to large enterprises and regional service providers
     • Go-to-market (GTM): traditional storage distribution/VARs
     [Figure: Cloudian partners]
  3. The Challenge
     • Business problem: analysis of log data from our customer systems to improve support (classic ‘Internet of Things’ content)
     • Existing system required transformation of the data into HDFS for analytics (slow and costly)
     Goal: reduce cost and provide faster results
     6/16/2014
  4. Use Case: Support Analytics
     • Abnormal Operations Analysis
       – Compare system statistics and usage patterns to previous normal results
     • End User Analysis to root cause issues
       – Identify all operations for a particular user and review patterns and any faults
     • Trend Analysis for Capacity Planning and Traffic Patterns
       – Build capacity and traffic trend lines based on statistical analysis of all traffic
     Log volume:
     • 100 tps S3 server = 83 million lines of info log = 3.5 GB/day
     • 10-server system = 35 GB/day ≈ 1 TB/month
     • 100 customer systems ⇒ 1.2 PB annually
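The log-volume estimate on the Support Analytics slide can be checked with a few lines of arithmetic. This is a minimal sketch: the per-server figure, server count, and customer count come from the slide, while the 30-day month, 365-day year, and decimal units (1 TB = 1000 GB) are assumptions made here for illustration.

```python
# Figures taken from the slide.
GB_PER_SERVER_PER_DAY = 3.5
SERVERS_PER_SYSTEM = 10
CUSTOMER_SYSTEMS = 100

# Assumptions (not in the slide): 30-day month, 365-day year, decimal units.
DAYS_PER_MONTH = 30
DAYS_PER_YEAR = 365


def system_gb_per_day() -> float:
    """Daily log volume of one 10-server system, in GB."""
    return GB_PER_SERVER_PER_DAY * SERVERS_PER_SYSTEM


def system_tb_per_month() -> float:
    """Monthly log volume of one system, in TB."""
    return system_gb_per_day() * DAYS_PER_MONTH / 1000


def fleet_pb_per_year() -> float:
    """Annual log volume across all customer systems, in PB."""
    return system_gb_per_day() * DAYS_PER_YEAR * CUSTOMER_SYSTEMS / 1_000_000


if __name__ == "__main__":
    print(f"one system: {system_gb_per_day():.1f} GB/day")
    print(f"one system: {system_tb_per_month():.2f} TB/month")
    print(f"all systems: {fleet_pb_per_year():.2f} PB/year")
```

Under these assumptions the result lands between 1.2 and 1.3 PB per year, which agrees with the slide's rounded 1.2 PB figure (12 months at about 1 TB per system).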
  5. Traditional Big Data Flow
     [Diagram. Sources: Consumer Activity (Events, GPS, WiFi); Social Media; Device Tracking and Logs (Event, Configuration, Usage, Performance). Components: Event Processing Platform, Content Storage, Big Data Storage Platform, Analytics Platform. Flows: Real Time Events, Big Data, Result of analysis.]
  6. Traditional Big Data Flow
     [Diagram. Logs and config pass through the Event Processing Platform, Content Storage (Object, NAS), and the Analytics Platform (HDFS).]
     • Wasted storage: the same data is stored once for content and again for analytics
     • Transforming data into HDFS can be costly
     • High overhead of HDFS (3-copy replication) for content that may be of poor quality
  7. S3 and Hadoop
     • Apache Hadoop has supported S3 since Jan 2008
     • Well proven by Amazon with Elastic MapReduce
     • State of the art, and advancing quickly to make Hadoop over S3 much easier, e.g. Netflix Genie
  8. Cloudian Approach
     [Diagram. Event Processing Platform, Analytics, Cloudian HyperStore Storage.]
     • No redundant storage of data
     • HyperStore scales out with your data, adding nodes for I/O
     • Analyze more: allows efficient bulk data analysis in place
     • Takes advantage of multi-core CPUs, which makes sense for MapReduce
     • Can feed smarter data to subsequent analytic systems
     • Faster time to decision
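To make the in-place MapReduce idea concrete, here is a minimal sketch of the "identify all operations for a particular user" analysis written as a map step and a reduce step in plain Python. The log line layout (`timestamp user operation status`, separated by whitespace) is a hypothetical format chosen for illustration; the slides do not show the actual log format, and a real job would run these steps under Hadoop rather than in a single process.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Key = Tuple[str, str]  # (user, operation)


def map_line(line: str) -> Iterator[Tuple[Key, int]]:
    """Map step: emit ((user, operation), 1) for each well-formed log line.

    Assumes the hypothetical layout 'timestamp user operation status'.
    """
    fields = line.split()
    if len(fields) < 4:
        return  # skip malformed lines instead of failing the whole job
    _timestamp, user, operation, _status = fields[:4]
    yield (user, operation), 1


def reduce_counts(pairs: Iterable[Tuple[Key, int]]) -> Dict[Key, int]:
    """Reduce step: sum the emitted counts per (user, operation) key."""
    totals: Dict[Key, int] = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def count_operations(lines: Iterable[str]) -> Dict[Key, int]:
    """Run the map and reduce steps over an iterable of log lines."""
    return reduce_counts(pair for line in lines for pair in map_line(line))


if __name__ == "__main__":
    sample = [
        "2014-06-11T10:00:00 alice PUT 200",
        "2014-06-11T10:00:01 alice GET 200",
        "2014-06-11T10:00:02 bob GET 404",
        "2014-06-11T10:00:03 alice GET 200",
    ]
    for (user, operation), count in sorted(count_operations(sample).items()):
        print(user, operation, count)
```

Because each line is mapped independently, the map step parallelizes across cores and nodes, which is the property the slide points to.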
  9. Cloudian Hadoop Configuration
     • Hadoop 2.2
     • Configured for native S3 file system (etc/hadoop/core-site.xml)
       – S3N: a native file system for reading and writing regular files on S3. The advantage of this file system is that you can access files on S3 that were written with other tools. Conversely, other tools can access files written using Hadoop.
     • Configure Hadoop to use Cloudian (etc/hadoop/
       – s3service.s3-endpoint=CLOUDIAN_ENDPOINT
       – s3service.s3-endpoint-http-port=CLOUDIAN_PORT
     Note: you can also dedicate a bucket for Hadoop analytics, and Hadoop will then chunk the content into blocks for storage, as HDFS does.
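The two configuration steps on the Hadoop Configuration slide can be sketched as a small generator script. This is an illustration under stated assumptions, not the presenter's actual configuration: the slide names only the two `s3service.*` endpoint keys, so the `fs.s3n.awsAccessKeyId` and `fs.s3n.awsSecretAccessKey` property names (the credential properties Hadoop uses for the S3N file system) are added here, and the endpoint host, port, and key values are placeholders. The path of the endpoint properties file is truncated in the slide, so the sketch only renders its contents.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def render_core_site(properties: Dict[str, str]) -> str:
    """Render Hadoop <configuration> XML, one <property> per entry."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


def render_endpoint_properties(endpoint: str, http_port: int) -> str:
    """Render the two endpoint keys listed on the slide as key=value lines."""
    return (
        f"s3service.s3-endpoint={endpoint}\n"
        f"s3service.s3-endpoint-http-port={http_port}\n"
    )


if __name__ == "__main__":
    # Placeholder credentials and endpoint; replace with real values.
    print(render_core_site({
        "fs.s3n.awsAccessKeyId": "YOUR_ACCESS_KEY",
        "fs.s3n.awsSecretAccessKey": "YOUR_SECRET_KEY",
    }))
    print(render_endpoint_properties("s3.example.internal", 80))
```

With both pieces in place, a job would address data with `s3n://` URIs while the endpoint keys direct the S3 client to the on-premises store instead of Amazon.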
  10. Cloudian HyperStore® Software
      [Diagram. Access interfaces: S3, NFS.]
      • Scalable peer-to-peer architecture
      • Multi-data center replication
      • Multi-Tenancy and Chargeback
      • Hybrid cloud-ready (any S3 cloud)
      • 100s of supported applications
      • Optimized for any workload
      • Storage for OpenStack & CloudStack
  11. Elastic, Distributed and Reliable
      • NoSQL database distributes and replicates data
      • Logical ring: data is automatically replicated to multiple nodes. The location of data can be designated, for instance, across multiple datacenters (DC1, DC2) and per rack.
      • In theory, the number of nodes in a logical ring can be up to 2^127 (almost infinite).
      • Data load can be rebalanced when a node is added or removed.
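The logical ring described on this slide can be illustrated with a generic consistent-hashing sketch. It is not Cloudian's implementation: the MD5-based token function, the `Ring` class, and the node names are assumptions made for illustration, and the ring size reads the slide's figure as 2^127 token positions. Each key hashes to a token, and its replicas go to the next nodes found walking clockwise around the ring.

```python
import bisect
import hashlib
from typing import Iterable, List

RING_SIZE = 2 ** 127  # token space, reading the slide's figure as 2^127


def token_for(key: str) -> int:
    """Hash a key (or node name) to a position on the ring."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


class Ring:
    """Hypothetical token ring: one token per node, N replicas per key."""

    def __init__(self, nodes: Iterable[str], replicas: int = 3) -> None:
        unique = set(nodes)
        if not unique:
            raise ValueError("ring needs at least one node")
        self.replicas = min(replicas, len(unique))
        self._ring = sorted((token_for(node), node) for node in unique)
        self._tokens = [token for token, _ in self._ring]

    def nodes_for(self, key: str) -> List[str]:
        """Return the nodes holding replicas of key, walking clockwise."""
        # First node whose token is >= the key's token, wrapping at the end.
        start = bisect.bisect_left(self._tokens, token_for(key)) % len(self._ring)
        return [
            self._ring[(start + i) % len(self._ring)][1]
            for i in range(self.replicas)
        ]


if __name__ == "__main__":
    ring = Ring(["node-a", "node-b", "node-c", "node-d"], replicas=3)
    print(ring.nodes_for("bucket/logs/2014-06-11.log"))
```

When a node joins or leaves, only keys whose tokens fall in the ranges next to that node change owners, which is why the load can be rebalanced without moving all of the data. Datacenter-aware or rack-aware placement would need an additional rule when choosing successors and is not modeled here.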
  12. Enhanced HyperStore® Technology
      • Policies tailored for different object types
      • Optimized for all data
      • Chunking for better performance
      • Erasure Coding for deep archive efficiency
      • Reliable storage across multi-node failures
      [Diagram. HyperStore (Patent Pending): Small Objects, Large Objects, Active Content, File System, NoSQL DB, Erasure Coding, Deep Archives.]
  13. Cloudian Complete S3 API
      • Core REST API: Get, Put, Post, Head, Delete
      • Multi-part uploads: allows uploading large objects in multiple parts
      • Versioning: multiple versions of the same object
      • Bucket Lifecycle: auto-expiration using rules
      • Server-side encryption: managed by Cloudian
      • Location Constraint: assign data to a specific region (e.g. for HIPAA compliance)
      • Bucket Website: create buckets as websites to host web content
      • Access control lists (ACLs): define access rights to buckets and objects
      • And more...
      [Table. S3 API support by product: Cloudian, AmpliData, Basho, Caringo, Cleversafe, EMC Atmos, NetApp Bycast, Scality, OpenStack Swift.]
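The multi-part upload item above amounts to splitting a large object into numbered byte ranges that are uploaded separately. The sketch below plans those ranges without contacting any service. The limits it uses (5 MiB minimum for every part except the last, at most 10,000 parts) are Amazon S3's documented limits; whether Cloudian applies the same values is an assumption, and `plan_parts` is a hypothetical helper, not part of any S3 SDK.

```python
from typing import List, Tuple

MIN_PART_SIZE = 5 * 1024 * 1024  # Amazon S3 minimum for all parts but the last
MAX_PARTS = 10_000               # Amazon S3 maximum number of parts per upload

Part = Tuple[int, int, int]  # (part_number, offset, length)


def plan_parts(object_size: int, part_size: int = 8 * 1024 * 1024) -> List[Part]:
    """Split an object of object_size bytes into multipart upload ranges."""
    if object_size <= 0:
        raise ValueError("object_size must be positive")
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the 5 MiB minimum")

    # Grow the part size when the object would otherwise exceed MAX_PARTS.
    smallest_allowed = -(-object_size // MAX_PARTS)  # ceiling division
    part_size = max(part_size, smallest_allowed)

    parts: List[Part] = []
    offset = 0
    number = 1  # S3 part numbers start at 1
    while offset < object_size:
        length = min(part_size, object_size - offset)
        parts.append((number, offset, length))
        offset += length
        number += 1
    return parts


if __name__ == "__main__":
    mib = 1024 * 1024
    for number, offset, length in plan_parts(20 * mib):
        print(f"part {number}: bytes {offset}..{offset + length - 1}")
```

Each planned range would then be sent as one part of the upload, and the upload completed once all parts have been acknowledged.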
  14. Seamless tiering to Amazon S3, Glacier and other S3 Service Providers
      • Cloudian deployed as an on-premises S3 cloud behind the firewall
      • Automatically migrates data to AWS using Bucket Lifecycle Policies
        – Optional migration to Glacier
        – Metadata maintained for search/list of objects
      • Configurable to reduce overhead
      • Reads/writes to migrated objects: restore by default, with an option to redirect to AWS or another S3 service provider
      [Diagram. An S3 client/application talks to the on-premises S3 store; content is migrated or restored across the firewall to Amazon S3 and Amazon Glacier via Bucket Lifecycle Policies, with an option to redirect migrated content.]
  15. Big Data Storage Platform
      [Diagram. Sources: Consumer Activity (Events, GPS, WiFi); Social Media; Business Tracking (goods, inventory, campaign, sales). Event Processing Platform: Input I/F, CEP Engine (Filter, Judge, Aggregate), Recommend, Real Time Analysis. Big Data Storage Platform: Content Storage, Data Analysis and Storage Platform, Big Data Analysis (Analyze, Recommend). Outcome: Smarter Business.]
  16. Future Work
      • Delivery of Cloudian Hadoop-ready object storage (2H CY14)
      • Integration with key Hadoop distributions
      • Locality awareness
      • Potentially use new drive technology for processing (e.g. HGST Ethernet drive)
      • Find out more: Booth 139
  17. Cloudian® S3 Cloud Storage Platform
      Thank You! Questions?
      “The Leading Provider of Hybrid Cloud Storage”