Big Unstructured Data


Presentation at CloudExpo West 2011. Topic: object "cloud" storage for Big Unstructured Data

Published in: Technology, Business
  • A Quick History of Big Data & Hadoop
    ▪ Facebook, Yahoo! and Google found themselves collecting data on an unprecedented scale. They were the first massive companies collecting tons of data from millions of users.
    ▪ They quickly overwhelmed traditional data systems and techniques like Oracle and MySQL. Even the best, most expensive vendors using the biggest hardware could barely keep up and certainly couldn’t give them tools to powerfully analyze their influx of data.
    ▪ In the early 2000s their armies of PhDs developed new techniques like MapReduce, BigTable and Google File System to handle their big data. Initially these techniques were held proprietary. But…
    ▪ Around 2005 Facebook, Yahoo! and Google started sharing white papers describing their big data technologies.
    ▪ In 2006 Doug Cutting started the Hadoop project as an open source version of these technologies.
    ▪ Companies in every industry now find themselves with big data problems because their ability to collect data grows every day.
    ▪ A thriving ecosystem of companies, projects and individuals has emerged to tackle big data problems.
    What is Big Data?
    Big Data generally means having so much data that you overwhelm your traditional systems and techniques. Systems that worked last year, and that felt nimble when launched, suddenly feel sluggish as the burden of massive data loads crushes them. Systems work… engineers make heroic efforts to guarantee that they do. But they never feel agile or responsive again. Although big data is often in the many-terabyte, petabyte and exabyte range, there is no official size threshold. In fact, some of the best big data problems don’t involve massive amounts of data… they just require massive amounts of processing on that data.
    Signs of a Big Data Problem
    ▪ Batch jobs that take too long to run… what if you had that business intelligence in a matter of minutes?
    ▪ CPU-bound database or data warehouse servers
    ▪ Repeated emergency meetings to discuss scaling of the data systems
    ▪ Long waits just to move data around
    ▪ Business managers asking for insight that IT can’t provide
    What is Hadoop?
    Hadoop is an open source software platform that makes big data look like normal data. It makes it possible to do very complex analysis against very large data sets that would overwhelm even the biggest and most expensive database installations.
    The problem with traditional databases and techniques is that they invariably centralize data. A massive Java/MySQL app is architected such that many Java computing machines sit around a central MySQL machine (or even a MySQL cluster). Scaling to thousands of Java compute machines means that your central MySQL installation gets hammered by requests. In essence you’re running a distributed denial of service (DDoS) attack on your own systems! At a fundamental level it’s the disk IO bottleneck that prevents your system from scaling.
    Hadoop changes the core architecture of computing problems. Instead of a centralized data store it chunks the massive data sets and stores those chunks all across a cluster of machines. Then, and this is key, it sends compute jobs *out to* the data! So the compute jobs run where the data is. This system leverages disk IO across the entire cluster by putting the data close to the CPU that needs it.
    Hadoop is built on two main components… MapReduce and the Hadoop Distributed File System (HDFS). MapReduce is what chunks processing code out to the cluster. HDFS is what chunks data out to the cluster.
    MapReduce is a way of programming computational problems. While MapReduce jobs can be written in many languages, most are written in Java. So MapReduce isn’t a language. It’s a way of thinking about computing problems. It’s another way to skin a cat. If you have business computations encoded in PL/SQL, Java, stored procedures or some arcane XML BI syntax, MapReduce can accomplish the same task.
    HDFS allows tens, hundreds or thousands of servers to share files. It’s like creating one hard drive from thousands. It’s redundant… by default each block of data is stored on three machines… if one machine goes down HDFS automatically copies the affected data to another machine. And it’s smart… HDFS knows which machines hold which segments of data, so computing processes can be scheduled close to the data they need.
    Hadoop provides the framework to use MapReduce and HDFS to run massive compute jobs. When you stand up a cluster of 1000 machines, Hadoop keeps track of which one is running which job, where its data is stored, etc.
    Massive Investment Momentum in the Big Data Space
    Big deals are flowing within the Big Data space as enterprises across all industries encounter the same data problems that Facebook, Yahoo! and Google did ten years ago. This is good for all enterprises as it means that the tools will continue to mature and industry-specific solutions will emerge.
    ▪ EMC to spend $3B on big data in 2011 after spending $3B in 2010
    ▪ IBM invests $100M in big data
    ▪ Yahoo! spins Hortonworks out at a $200M valuation
    ▪ HP acquires Vertica
    The list of deals, big and small, goes on and on. Venture capital is pursuing the space more aggressively than it did the social media space because big data points of pain aren’t tied to discretionary marketing budgets… they’re core to an enterprise’s existence.
    The space is still nascent with many investments being made in toolsets which will compete and often lose against open source community-developed solutions. SocketWare chooses a strategy of deep industry insight to create hard-to-replicate products.
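To make the "way of thinking" in the notes above concrete, here is a minimal single-process sketch of the map, shuffle and reduce steps applied to word counting. It is only an illustration of the programming model: it does not use Hadoop or its Java API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are invented for this example. In a real cluster each chunk would be mapped on the machine that stores it.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    """Run map over every chunk, then shuffle and reduce the combined output."""
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
```

Calling `word_count(["big data", "Big unstructured data"])` counts words across both chunks as if they had been stored on different machines.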
  • ----- Meeting Notes (9/1/11 11:01) ----- datacenter here, in powerbus

    1. 1. Big “Unstructured” Data in the Cloud A Case for Optimized Object Storage
    2. 2. Agenda <ul><li>Introduction: storage facts and trends </li></ul><ul><li>Big Data for Analytics vs. Big “Unstructured” Data </li></ul><ul><li>Object Storage for Big “Unstructured” Data </li></ul><ul><li>AmpliStor: Optimized Object Storage </li></ul><ul><li>Cost Reduction through Erasure Coding </li></ul><ul><li>Use Case: Montreux Jazz </li></ul><ul><li>Questions </li></ul>Amplidata Confidential
    3. 3. <ul><li>Introduction: storage facts and trends </li></ul>
    4. 4. Introduction, facts and trends <ul><li>Studies show that data storage capacities will likely increase by over 30X in the coming decade to over 35 Zettabytes </li></ul>30X 35ZB Time Storage Consumption High-capacity drives Less Staff / TB Unstructured Data 2020
    5. 5. Introduction, facts and trends The number of qualified people to manage this data will stay flat (~1.5X) Time Capacity / Budget Efficiency: automate & reduce overhead Storage Requirements Storage Budget
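The two trend figures above (30X data growth per decade against roughly 1.5X staff growth) can be turned into a yearly rate and a per-person workload factor with a little arithmetic. The sketch below assumes smooth compound growth over ten years, which the slides do not state; the function names are made up for illustration.

```python
def annual_growth_factor(total_factor, years):
    """Yearly multiplier that compounds to total_factor after the given years."""
    return total_factor ** (1 / years)


def capacity_per_person_factor(data_factor, staff_factor):
    """How much more data each person must manage once both trends play out."""
    return data_factor / staff_factor


# Figures taken from the slides: 30X data growth, ~1.5X staff growth, 10 years.
yearly = annual_growth_factor(30, 10)
per_person = capacity_per_person_factor(30, 1.5)
```

Under that assumption, data grows by roughly 40% per year and each administrator ends up responsible for about 20 times as much capacity.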
    6. 6. Introduction, facts and trends <ul><li>Much of that growth (80%) is driven by unstructured data : billions of large objects </li></ul>Active Archives Online Images Large Files Medical Images Online Storage Online Movies
    7. 7. Introduction, facts and trends <ul><li>Storage currently accounts for 37-40% of overall data center energy consumption from hardware </li></ul><ul><li>Energy consumption will influence technology procurement criteria </li></ul>Data Center Power Usage
    8. 8. Introduction, facts and trends <ul><li>Data migration will soon take longer than the lifetime of media </li></ul><ul><li>“ It’s like painting the Golden Gate Bridge, but the bridge is continuously getting longer ” </li></ul>
    9. 9. Introduction, facts and trends <ul><li>There is a growing interest in Object Storage </li></ul><ul><li>Erasure coding is the proclaimed successor of RAID </li></ul>
    10. 10. <ul><li>Big Data for Analytics vs. Big “Unstructured” Data </li></ul>
    11. 11. Big Data for Analytics <ul><li>In the 1990s, we experienced an explosion of data captured for analytics purposes: </li></ul><ul><ul><li>Academic Research </li></ul></ul><ul><ul><li>Chemical R&D facilities </li></ul></ul><ul><ul><li>Geo-industry, oil & gas </li></ul></ul><ul><ul><li>… </li></ul></ul>
    12. 12. Big Data for Analytics <ul><li>Data is captured as many small log files & concatenated as “Big Data” </li></ul><ul><li>Relational databases were not optimal: </li></ul><ul><ul><li>Too much data, too big </li></ul></ul><ul><ul><li>Not performant for analytics </li></ul></ul><ul><li>This stimulated innovations: </li></ul><ul><ul><li>Hadoop, MapReduce, GFS </li></ul></ul><ul><ul><li>… </li></ul></ul><ul><li>=> Big Data for Analytics </li></ul>
    13. 13. Big Data Evolution <ul><li>Today, the Big Data trend refers to both Big Data for Analytics and Big Unstructured Data: </li></ul><ul><ul><li>Fundamentally different </li></ul></ul><ul><ul><li>Lots of similarities </li></ul></ul><ul><li>Unstructured data is traditionally stored on host file systems, but: </li></ul><ul><ul><li>File systems do not scale up to the size we need </li></ul></ul><ul><ul><li>File systems do not meet performance requirements </li></ul></ul>
    14. 14. Big Unstructured Data <ul><li>80% of data growth comes from unstructured data </li></ul><ul><li>Unstructured data takes all shapes based on specific industries </li></ul><ul><ul><li>Healthcare: medical images </li></ul></ul><ul><ul><li>Travel and hospitality: surveillance video footage </li></ul></ul><ul><ul><li>Retail and manufacturing: design data and product images </li></ul></ul><ul><ul><li>Huge amounts of documents generated in any corporation </li></ul></ul><ul><ul><li>… </li></ul></ul>Source: Oraclestorageguy
    15. 15. Big Unstructured Data <ul><li>Most unstructured data is archived, often to tape (cost) </li></ul><ul><li>Data archives are a burden (Grandma’s Attic) </li></ul>
    16. 16. Big Unstructured Data <ul><li>Big Unstructured Data represents the next generation analytics that can help businesses make more informed decisions related to: </li></ul><ul><ul><li>Product strategy </li></ul></ul><ul><ul><li>Marketing </li></ul></ul><ul><ul><li>Research </li></ul></ul><ul><ul><li>Historical trends </li></ul></ul><ul><ul><li>… </li></ul></ul>
    17. 17. Big Unstructured Data <ul><li>Companies are starting to see the value of the data in their archives: </li></ul><ul><ul><li>Documents of individuals can be valuable for others </li></ul></ul><ul><li>Some companies have legal reasons to keep data available </li></ul><ul><li>Unexplored analytics opportunities </li></ul>
    18. 18. Big Unstructured Data But how do we store all this data in a cost-efficient way?
    19. 19. Big Unstructured Data <ul><li>What are the requirements? </li></ul><ul><ul><li>Tape is not an option: latency is key </li></ul></ul><ul><ul><li>Data has to be always available online </li></ul></ul><ul><ul><li>Direct interface to the applications </li></ul></ul><ul><ul><li>Petabyte scalability </li></ul></ul><ul><ul><li>Extreme reliability, integrity </li></ul></ul><ul><ul><li>Cost-efficient </li></ul></ul><ul><ul><li>Security </li></ul></ul><ul><ul><li>Disk Storage </li></ul></ul><ul><ul><li>+ REST API, Cloud-enabled </li></ul></ul><ul><ul><li>+ Erasure Coding </li></ul></ul><ul><ul><li>= Optimized Object Storage </li></ul></ul>
    20. 20. <ul><li>Object Storage for </li></ul><ul><li>Big Unstructured Data </li></ul>
    21. 21. Disk vs. Tape <ul><li>Tape has several obvious advantages over disk & there will always be use cases for tape </li></ul><ul><li>But disks enable live archives with instant data accessibility </li></ul><ul><li>More arguments for disk-based archives </li></ul><ul><ul><li>Disks can be powered down </li></ul></ul><ul><ul><li>Tape requires replication </li></ul></ul><ul><ul><li>Data integrity? </li></ul></ul><ul><ul><li>Massive migration projects </li></ul></ul><ul><ul><li>… </li></ul></ul>
    22. 22. Storage Clouds <ul><li>Storage Cloud infrastructures </li></ul><ul><ul><li>Private or public setup </li></ul></ul><ul><ul><li>Provide highest availability </li></ul></ul><ul><li>Applications </li></ul><ul><ul><li>File systems are obsolete </li></ul></ul><ul><ul><li>Use REST API </li></ul></ul>REST API Application Application Application
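As a rough illustration of what "use a REST API" instead of a file system means for an application, the sketch below builds HTTP requests that store and fetch one object by name. Everything specific here is an assumption made for the example: the host `storage.example.com`, the `/<namespace>/<object>` URL layout and the helper names are hypothetical and are not taken from the AmpliStor API. The requests are only constructed, not sent.

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint, used only for illustration.
BASE_URL = "http://storage.example.com"


def object_url(namespace, name):
    """Build the URL of one object in a flat namespace (assumed layout)."""
    return f"{BASE_URL}/{urllib.parse.quote(namespace)}/{urllib.parse.quote(name)}"


def put_object_request(namespace, name, data):
    """Prepare an HTTP PUT that would upload the object's bytes."""
    return urllib.request.Request(
        object_url(namespace, name),
        data=data,
        method="PUT",
        headers={"Content-Type": "application/octet-stream"},
    )


def get_object_request(namespace, name):
    """Prepare an HTTP GET that would download the object."""
    return urllib.request.Request(object_url(namespace, name), method="GET")
```

An application would pass such a request to `urllib.request.urlopen` to talk to the storage cloud; there are no directories, mounts or file handles involved, only object names and HTTP verbs.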
    23. 23. Petabyte Scalability <ul><li>Object Storage systems will scale: </li></ul><ul><ul><li>Beyond petabytes of data </li></ul></ul><ul><ul><li>Beyond billions of data objects </li></ul></ul><ul><li>Systems should scale uniformly </li></ul><ul><ul><li>Add resources incrementally </li></ul></ul><ul><ul><li>Scale performance and capacity separately </li></ul></ul>
    24. 24. Petabyte Scalability <ul><li>Scalable metadata repository (capacity & performance) </li></ul><ul><li>Lightweight metadata, designed to scale up to billions of objects </li></ul><ul><li>Flat namespace </li></ul>
    25. 25. Data Integrity <ul><li>Ensuring the integrity of long-term unstructured data archive requires new data protection algorithms, to: </li></ul><ul><ul><li>Address the increasing capacity of disk drives </li></ul></ul><ul><ul><li>Solve issues related to long RAID rebuild windows </li></ul></ul><ul><li>“ Object storage systems based on erasure-coding can not only protect data from higher numbers of drive failures, but also against the failure of entire storage modules. ” </li></ul>
    26. 26. Cost-efficient <ul><li>Power, cooling and floor-space requirements are paramount concerns: erasure coding drastically reduces overhead numbers </li></ul><ul><li>Systems need to be self-managing </li></ul><ul><li>The system needs to be hardware independent: data migration needs to be an automatic, continuous background process. </li></ul>
    27. 27. Cost-efficient <ul><li>Eliminate the need for manual disk swaps: move to higher-level container management tasks. </li></ul><ul><li>The system should automatically manage allocation to the underlying disks </li></ul>
    28. 28. Security <ul><li>Multi-tenant authentication/authorisation </li></ul><ul><ul><li>Read </li></ul></ul><ul><ul><li>Read/Write </li></ul></ul><ul><ul><li>List </li></ul></ul><ul><li>Auditing & Logging </li></ul><ul><li>Secure protocols/encryption (HTTPS) </li></ul><ul><li>Individual disks cannot be misused </li></ul><ul><ul><li>Data is encoded and spread </li></ul></ul>
    29. 29. <ul><li>Amplidata Object Storage </li></ul>
    30. 30. AmpliStor for Big Unstructured Data <ul><li>Turnkey storage solution for BIG Unstructured Data </li></ul><ul><ul><li>System scales beyond petabytes with a Global Object Namespace </li></ul></ul><ul><ul><li>Throughput scales with amount of resources </li></ul></ul><ul><li>Policy-Driven Storage Durability </li></ul><ul><ul><li>“Ten 9’s” of Durability (99.99999999%) and beyond through policies </li></ul></ul><ul><ul><li>Eliminates the reliability exposures of RAID on high-density disk drives </li></ul></ul><ul><ul><li>Eliminates data corruption or loss due to bit errors </li></ul></ul><ul><li>50-70% improvement in Storage Efficiency </li></ul><ul><ul><li>70% reduction in storage footprint compared to “Three copies in the cloud” </li></ul></ul><ul><ul><li>50% reduction in storage footprint compared to mirrored RAID </li></ul></ul><ul><ul><li>Drives proportional reductions in data center floor space & power </li></ul></ul><ul><li>Automated Management </li></ul><ul><ul><li>Self-healing design manages data integrity assurance and auto-repairs data </li></ul></ul><ul><li>50-70% reduction in TCO </li></ul><ul><ul><li>Storage footprint (Capex), power, data center space & management costs </li></ul></ul>
    31. 31. Big Unstructured Data Use Cases <ul><li>Online Applications </li></ul><ul><ul><li>SaaS applications managing large-scale rich media </li></ul></ul><ul><ul><li>Photography & video within social media </li></ul></ul><ul><ul><li>Tens of petabytes are becoming common – RAID is insufficient, triple-mirrors too expensive </li></ul></ul><ul><li>Storage Clouds </li></ul><ul><ul><li>Online file sharing & backup services </li></ul></ul><ul><ul><li>Cloud Service Providers building competitors to Amazon S3 </li></ul></ul><ul><ul><li>Corporate private cloud repositories for unstructured data </li></ul></ul><ul><li>Media & Entertainment </li></ul><ul><ul><li>Online video repositories (HD video driving huge capacities) </li></ul></ul><ul><ul><li>New tier that fills the void between fast/expensive SAN (post-production) & tape archives </li></ul></ul><ul><li>Others </li></ul><ul><ul><li>Video surveillance, medical imaging, satellite imaging, backups & BIG DATA archives </li></ul></ul>
    32. 32. Erasure Coding, simply explained <ul><li>BitSpread encodes data in linear equations </li></ul><ul><li>Distributes the equations across disks, storage nodes, racks, data centers </li></ul><ul><li>Original data can always be uniquely determined from a subset of the equations </li></ul><ul><li>BitSpread uses 4K variables independent of object size </li></ul><ul><li>Extra blocks can be generated without knowing what is missing </li></ul>[Figure: BitSpread simplified mathematics. Original object: 75. Decomposed object: X=7, Y=5. Series of equations: X+Y=12, X-Y=2, 2X+Y=19. Any 2 out of 3 equations uniquely determine the object.]
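The simplified example on this slide can be checked directly: the object 75 is split into X=7 and Y=5, three equations are stored, and any two of them are enough to recover X and Y. The sketch below solves each pair of equations with Cramer's rule. It mirrors only the toy arithmetic on the slide and is not the BitSpread algorithm; the names `EQUATIONS`, `solve_pair` and `recover_from_every_pair` are invented for this example.

```python
from fractions import Fraction
from itertools import combinations

# Each equation a*X + b*Y = c is stored as (a, b, c); values are from the slide.
EQUATIONS = [(1, 1, 12), (1, -1, 2), (2, 1, 19)]


def solve_pair(eq1, eq2):
    """Solve two linear equations in X and Y exactly using Cramer's rule."""
    a1, b1, c1 = eq1
    a2, b2, c2 = eq2
    det = a1 * b2 - a2 * b1
    if det == 0:
        raise ValueError("equations are not independent")
    x = Fraction(c1 * b2 - c2 * b1, det)
    y = Fraction(a1 * c2 - a2 * c1, det)
    return x, y


def recover_from_every_pair(equations):
    """Return the (X, Y) solution obtained from each pair of surviving equations."""
    return [solve_pair(eq1, eq2) for eq1, eq2 in combinations(equations, 2)]
```

Losing any one of the three equations (for instance, the disk that holds it) therefore still leaves enough information to rebuild the original object.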
    33. 33. Core Software Technology Components <ul><li>BitSpread – Distributed Encoder/Decoder </li></ul><ul><ul><li>RAID replacement technology based on unique variant of Erasure Coding </li></ul></ul><ul><ul><li>“ Dial-in” fault tolerance through namespace level policies </li></ul></ul><ul><ul><ul><li>Namespace1: 16/4 policy protects against any 4 failures in 16 disks </li></ul></ul></ul><ul><ul><ul><li>Namespace2: 18/6 policy protects against any 6 failures in 18 disks </li></ul></ul></ul><ul><ul><li>Provides availability and reliability even during failures </li></ul></ul><ul><ul><li>Policies can be dynamically changed </li></ul></ul><ul><li>BitDynamics – Maintenance & Self-Healing Agent </li></ul><ul><ul><li>Out of band operations agent for disk monitoring, integrity verification & object self-healing </li></ul></ul><ul><ul><li>Performs automated tasks: scrubs, verifies, self-heals, repairs & optimizes data on disk </li></ul></ul>
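The 16/4 and 18/6 policies above can be compared with a small calculation. The sketch assumes that a policy written as spread/safety stores `spread` fragments of which any `spread - safety` are sufficient to rebuild the object, and that disks fail independently with the same probability; both are simplifying assumptions for illustration, not statements about how BitSpread is implemented. The function names are hypothetical.

```python
from math import comb


def storage_overhead(spread, safety):
    """Raw capacity consumed per unit of user data under a spread/safety policy."""
    return spread / (spread - safety)


def loss_probability(spread, safety, p_fail):
    """Chance that more than `safety` of `spread` disks fail (independent failures)."""
    return sum(
        comb(spread, failed) * p_fail**failed * (1 - p_fail) ** (spread - failed)
        for failed in range(safety + 1, spread + 1)
    )


# Policies from the slide, plus three full copies for comparison.
overhead_16_4 = storage_overhead(16, 4)
overhead_18_6 = storage_overhead(18, 6)
overhead_three_copies = 3.0
```

Under these assumptions, an 18/6 policy costs somewhat more raw capacity than 16/4 but tolerates two additional failures, and both use far less raw capacity than keeping three copies.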
    34. 34. AmpliStor System <ul><li>Controller Nodes (3+) </li></ul><ul><ul><li>Dual, quad-core Xeon processors, 16GB RAM, 2 x 200GB SSD, 2 x 10 Gigabit Ethernet network interfaces </li></ul></ul><ul><ul><li>Object Based Interfaces: http/REST API, C API, Python CLI, WebDav </li></ul></ul><ul><ul><li>3 Controllers per System (minimum) – can be scaled up for performance (fully shared metadata & storage pool) </li></ul></ul><ul><li>AS20 Low Power Storage Nodes (8+) </li></ul><ul><ul><li>1 U rack mount chassis with 20TB capacity </li></ul></ul><ul><ul><li>2 x 1 Gigabit network interfaces </li></ul></ul><ul><ul><li>Low power processor (Intel Atom) </li></ul></ul><ul><ul><li>10 x 2 TB low-power “Green” SATA disk drives </li></ul></ul><ul><ul><li>Low power: 65 - 140 watts per node utilization </li></ul></ul><ul><ul><li>(3.5 - 7 watts per TB) </li></ul></ul>
    35. 35. AmpliStor: Dense, Fast & Power-efficient <ul><li>High-Density Rack Definition </li></ul><ul><ul><li>Single 44U rack: </li></ul></ul><ul><ul><li>(3) Controller Nodes & (36) Storage Nodes </li></ul></ul><ul><ul><li>OR </li></ul></ul><ul><ul><li>(42) Storage Nodes </li></ul></ul><ul><ul><li>(2) 48-port Ethernet switches </li></ul></ul><ul><li>Storage Density </li></ul><ul><ul><li>Up to 420 disk drives in a single rack </li></ul></ul><ul><ul><li>840TB raw capacity / 525TB usable capacity protected against 4 simultaneous failures </li></ul></ul><ul><li>Power </li></ul><ul><ul><li>Nominal / peak usage 4.2 / 6.6 kW </li></ul></ul><ul><ul><li>2 x 30A / 240VAC circuit power supplies </li></ul></ul><ul><li>Performance </li></ul><ul><ul><li>3 x 10GbE ports to customer network </li></ul></ul><ul><ul><li>This provides 1.3 GB/sec aggregate throughput </li></ul></ul>[Rack diagram: storage nodes, controller nodes and Ethernet switches; 2 x 10GbE (expansion racks), 3 x 10GbE (customer network)]
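The density and power figures on this slide follow from a few multiplications, sketched below for the 42-storage-node rack. All inputs are copied from the slides (10 drives of 2 TB per node, 525 TB usable, 4.2 kW nominal and 6.6 kW peak); the derived ratios are simple arithmetic on those figures and are not vendor-published numbers. The constant and function names are invented for this example.

```python
# Inputs as stated on the slides for a rack filled with storage nodes.
STORAGE_NODES = 42
DRIVES_PER_NODE = 10
TB_PER_DRIVE = 2
USABLE_TB = 525
NOMINAL_WATTS = 4200
PEAK_WATTS = 6600


def rack_figures():
    """Derive drive count, raw capacity and per-TB ratios for one rack."""
    drives = STORAGE_NODES * DRIVES_PER_NODE
    raw_tb = drives * TB_PER_DRIVE
    return {
        "drives": drives,
        "raw_tb": raw_tb,
        "raw_per_usable": raw_tb / USABLE_TB,
        "nominal_watts_per_raw_tb": NOMINAL_WATTS / raw_tb,
        "peak_watts_per_raw_tb": PEAK_WATTS / raw_tb,
    }
```

The result reproduces the 420 drives and 840 TB raw capacity quoted above, and shows the raw-to-usable ratio implied by the 525 TB usable figure.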
    36. 36. AmpliStor Summary Advantages <ul><li>Ultra-Durable & Efficient platform </li></ul><ul><ul><li>Our erasure-coding implementation provides the most flexible & efficient storage durability </li></ul></ul><ul><ul><li>Dial-in Ten 9’s durability and higher through policies </li></ul></ul><ul><li>Performance </li></ul><ul><ul><li>We can demonstrate throughput of 1.3 GB/sec per rack today and scale-up controllers for higher throughput </li></ul></ul><ul><li>Power & Density </li></ul><ul><ul><li>AmpliStor provides 50-70% better power efficiency & density than competitors </li></ul></ul><ul><li>Pricing & TCO </li></ul><ul><ul><li>50-70% TCO reduction compared to alternative storage with high-durability </li></ul></ul>
    37. 37. Amplidata Background <ul><li>Technology was incubated since 2005 at Incubaid ( </li></ul><ul><li>Amplidata Incorporated in 2008 </li></ul><ul><li>Designed by Founders of DCT (became NetBackup Puredisk deduplication technology - acquired by Veritas/Symantec) </li></ul><ul><li>Belgium based R&D (Lochristi, outside Gent) </li></ul><ul><li>US Headquarters in Redwood City, CA </li></ul><ul><li>World Wide Support centers in Redwood City, CA; Belgium, Egypt, India, (Taiwan in Q4) </li></ul>
    38. 38. <ul><li>AmpliStor Use Case: </li></ul><ul><li>Montreux Jazz </li></ul>
    39. 39. Montreux Jazz, an invaluable research asset <ul><li>45 years of Montreux Jazz festivals </li></ul><ul><ul><li>5000 hours of video (2000 critical) </li></ul></ul><ul><ul><li>5000 hours of high quality audio </li></ul></ul><ul><ul><li>3000 concert descriptions </li></ul></ul><ul><ul><li>High-def video formats used since 1991 </li></ul></ul><ul><ul><li>Also a collection of photos, press releases, … </li></ul></ul><ul><li>Selected AmpliStor as the scale-out Archive system </li></ul><ul><ul><li>Collaboration with the University of Lausanne, Switzerland (EPFL) </li></ul></ul><ul><ul><li>Acquired a 1PB AmpliStor system </li></ul></ul><ul><li>The 3 main objectives: </li></ul><ul><ul><li>Save the recordings in a secure archive (static archive) </li></ul></ul><ul><ul><li>Make the archive available for cultural and scientific projects (live archive) </li></ul></ul><ul><ul><li>Scale and maintain the archives </li></ul></ul><ul><ul><li>Enable end-user access in a series of Jazz Cafés </li></ul></ul>
    40. 40. Tom Leyden, Director of Alliances & Marketing Thank You!