Storage 2.0 (Unstructured Data)


Published in: Technology

  1. Vikas Deolaliker 2008
  2. Executive Summary - I
     Opportunity
     - Fixed content mining is computationally intensive. A purpose-built appliance with integration hooks into back-end data warehousing systems, plus add-ons/plug-ins for the most popular BI clients, will meet the requirements of departments and of small and medium-sized businesses.
     - The key value multiplier for such a product is its ability to integrate seamlessly with existing enterprise systems and to generate reports that can be printed or imported into tools such as Excel and Access.
     - The market for such an appliance is expected to reach $200M in 2010 (not counting the storage/server pull-through).
     Industry
     - Unstructured data, or “fixed content”, refers to digital content that is generated outside a business context, i.e. the data has no schema and is not stored in databases. Normally, all non-transactional data such as email, IM, media, web content, metadata and customer-generated content is considered unstructured.
     - Businesses increasingly want to add information from unstructured data to their library of information sources to improve decision making. The business intelligence industry offers numerous products that enable discovery, interception, metadata extraction, semantic analysis, storage and information lifecycle management (ILM) of unstructured data. Companies such as Interwoven, Vignette, Informatica, Manugistics, IBM and Oracle have products that enable warehousing of content that is considered “unstructured”. Business Objects recently acquired Inxight Software for text analytics.
     - Storage system vendors such as EMC and NetApp have created a new category of storage systems called “Content Addressable Storage”, or CAS. IBM recently acquired XIV to offer a competing product, Nextra.
     - SNIA, a storage industry standards body, has an initiative called XAM (eXtensible Access Method). XAM-compliant products offer a programmable interface that lets archiving applications query (search), retrieve and control access to fixed content. It also allows compliance software to access fixed content for SOX and other regulatory compliance tests.
  3. Executive Summary - II
     Market
     - Fixed content is stored on NAS appliances and clusters. The market for rich media NAS is growing fast and is expected to surpass $1B in 2009. EMC is the current leader in NAS with 36% market share and is expected to lead the NAS market for fixed content as well.
     - BI on fixed content is an emerging market. The market for fixed content BI software is still in its infancy, with annual revenues under $20M. It came into focus with the acquisition of Inxight Software by Business Objects. This software mostly runs on Windows and accesses storage using iSCSI over IP.
     - Content Service Providers (CSPs) such as Google and Yahoo store fixed content in clusters of servers that run their own proprietary filesystems, a.k.a. Storage 2.0. This market is closed because those filesystems are a source of differentiation for the CSPs.
  4. The Trend: According to IDC, transactional data is growing at 32.3% while fixed content (unstructured data) is growing at 63.7%. Replicated or back-up data is growing at 43% p.a. (Source: IDC, Dec 2007)
     - Unstructured data growth means growth in file services over LAN/WAN.
     - Replicated data growth is lower than expected.
     - Structured data growth is lowest because most of the data generated today is outside a transactional context.
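To see what these rates imply if they were to hold for several years, here is a minimal sketch that compounds the three IDC growth figures and reports each category's share of the total. The equal starting volumes and the assumption of constant rates are illustrative only; the slide gives growth rates, not absolute volumes.

```python
# Annual growth rates quoted on the slide (IDC, Dec 2007).
RATES = {"transactional": 0.323, "fixed_content": 0.637, "replicated": 0.43}


def project(volume, annual_rate, years):
    """Volume after compounding annual_rate for the given number of years."""
    return volume * (1 + annual_rate) ** years


def shares(start_volumes, years):
    """Share of total data per category after `years` of constant growth."""
    grown = {k: project(v, RATES[k], years) for k, v in start_volumes.items()}
    total = sum(grown.values())
    return {k: v / total for k, v in grown.items()}


if __name__ == "__main__":
    # Hypothetical starting point: one unit of each category.
    start = {"transactional": 1.0, "fixed_content": 1.0, "replicated": 1.0}
    for category, share in shares(start, years=3).items():
        print(f"{category}: {share:.1%}")
```

Whatever the starting mix, the category with the highest rate gains share every year, which is the point the slide makes about fixed content.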
  5. The Opportunity: Client can take a three-pronged strategy to enter the fixed content market: (a) develop asset management software, (b) develop Storage 2.0 infrastructure, (c) develop a purpose-built BI appliance for fixed content.
     - Client can enter the market with media asset management software. The software pulls through server and storage infrastructure.
       - Digital Asset Management
         - Workflow and ILM management of fixed content
         - Content intelligence functionality such as text analytics, search and visualization
       - File System Infrastructure (a.k.a. Storage 2.0)
         - Enhanced NAS storage for low-latency, API-based access to media content in filesystems
       - Business can add modules to the BI pipeline and build a purpose-built appliance for fixed content.

     Digital Asset Management Software, IDC, 2008 ($M)

     | Platform                | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011  | 2012  | 2007 Share (%) | 2007-2012 CAGR (%) | 2012 Share (%) |
     |-------------------------|------|------|------|------|------|------|-------|-------|----------------|--------------------|----------------|
     | Unix                    | 82   | 97   | 113  | 131  | 153  | 179  | 210   | 250   | 22.6           | 17.2               | 20.9           |
     | Linux/other open source | 6    | 7    | 9    | 11   | 14   | 17   | 22    | 26    | 1.8            | 23.9               | 2.1            |
     | Windows 32 and 64       | 197  | 238  | 285  | 341  | 409  | 492  | 590   | 703   | 57.3           | 19.8               | 58.8           |
     | Other                   | 64   | 77   | 91   | 108  | 129  | 153  | 182   | 218   | 18.3           | 19.0               | 18.2           |
     | Total                   | 349  | 418  | 498  | 591  | 705  | 841  | 1,004 | 1,196 | 100            | 19.2               | 100            |
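The CAGR and share columns of the IDC figures can be reproduced from the yearly values. The sketch below uses the standard compound annual growth rate formula, (end / start)^(1 / years) - 1; the function names are illustrative. Because the published yearly figures are rounded to whole $M, recomputed values can differ from the printed ones in the last digit.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values `years` apart."""
    return (end_value / start_value) ** (1 / years) - 1


def share(part, total):
    """Fraction of the total contributed by one row."""
    return part / total


if __name__ == "__main__":
    # 2007 and 2012 values in $M, taken from the slide's table.
    rows = {
        "Unix": (113, 250),
        "Windows 32 and 64": (285, 703),
        "Total": (498, 1196),
    }
    for name, (v2007, v2012) in rows.items():
        print(f"{name}: CAGR {cagr(v2007, v2012, 5):.1%}, "
              f"2012 share {share(v2012, 1196):.1%}")
```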
  6. The Market: The fixed content market totals approximately $1B. The majority of the market is storage hardware and software. NAS is the dominant storage architecture for fixed content. Emerging products such as Storage 2.0, fixed media asset management software and BI appliances are still at a nascent stage.
     - Hardware (~$1B)
       - EMC, NetApp, HDS, IBM, HP
       - Sell directly to CSPs like Yahoo, Google, eBay
     - Software (~$20M)
       - Asset Management: HP, IBM
       - Business Intelligence: Business Objects (Inxight)
       - Fixed Content Search/Retrieval: Endeca, Lucene, Microsoft (FAST)
       - Content Editing: Adobe, Microsoft, Apple
  7. Storage 2.0: Focus has shifted from I/O in Storage 1.0 to higher-level file services. The driver is no longer access protocol but content preference.

     | Feature             | Storage 1.0                                 | Storage 2.0                                     |
     |---------------------|---------------------------------------------|-------------------------------------------------|
     | Management          | Local application                           | Web application                                 |
     | Access              | SCSI over FC, GbE or bus                    | SCSI over IP                                    |
     | Provisioning        | LUN-level granularity                       | Filesystem size                                 |
     | Virtualization      | LUN, volume manager                         | Object level                                    |
     | Application profile | Write many, read one                        | WORM (write once, read many)                    |
     | Oversubscription    | 1:1 (provisioning is equal to allocation)   | N:1 (provisioning can be n times allocation)    |
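The oversubscription contrast on this slide (1:1 versus N:1) can be sketched as a thin-provisioned pool: volumes may be provisioned up to N times the physical capacity, and physical space is consumed only when data is written. The `ThinPool` class and its method names are hypothetical and do not correspond to any vendor API.

```python
class ThinPool:
    """Toy storage pool where provisioned size may exceed physical capacity."""

    def __init__(self, physical_gb, oversubscription=1.0):
        self.physical_gb = physical_gb
        # Total size that may be promised to volumes (N times physical).
        self.limit_gb = physical_gb * oversubscription
        self.provisioned = {}  # volume name -> promised size in GB
        self.allocated = {}    # volume name -> GB actually written

    def provision(self, name, size_gb):
        if sum(self.provisioned.values()) + size_gb > self.limit_gb:
            raise ValueError("provisioning limit exceeded")
        self.provisioned[name] = size_gb
        self.allocated[name] = 0

    def write(self, name, gb):
        if self.allocated[name] + gb > self.provisioned[name]:
            raise ValueError("volume full")
        if sum(self.allocated.values()) + gb > self.physical_gb:
            raise RuntimeError("pool out of physical space")
        self.allocated[name] += gb


if __name__ == "__main__":
    pool = ThinPool(physical_gb=100, oversubscription=3)  # N:1 with N = 3
    pool.provision("media", 150)
    pool.provision("mail", 150)
    pool.write("media", 60)
    print(pool.provisioned, pool.allocated)
```

With `oversubscription=1.0` the same class models the Storage 1.0 case, where nothing can be provisioned beyond what physically exists.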
  8. The Infrastructure Play: Filesystems are turning into a platform with programming interfaces, data routers, load balancers, autonomic functions, analyzers and parsers.
     - Filesystem 1.0: Most functional blocks are in kernel space. Data and control have the same path. Block-level semantics are exposed to applications.
     - Filesystem 2.0: Most functional blocks are in user space. Data and control have separate paths. Block-level semantics are hidden from the application.
     [Diagram: Filesystem 1.0 and Filesystem 2.0 stacks split into kernel space and user space. Component labels: Block Driver, Disk, Object Cache, File Driver, Buffer, FileSys, iNode Cache, VFS, Process/Socket, Tools, ClusterFS, Name Server, MetaData Server, Client]
  9. Filesystem 2.0: A content-centric filesystem in which the primitive is no longer a block but a file.
     - Content Addressable Storage
       - The next generation of NAS devices implements a CAS filesystem. The focus shifts from a block device driver to a file driver that manages the “chunks” of a file on multiple underlying block devices.
       - NAS with CAS does not use FC disks; it prefers SAS/SATA disks for their low cost. It replaces the backplane with a high-performance interconnect with RDMA.
       - It provides an API for applications. When an application queries the metadata server for a file, it is given a fileID instead of an iNode with block-level addresses. It uses file addresses to locate a file, which creates a need for a file data router.
       - The actual data can be stored on an existing node-level filesystem or a cluster filesystem.
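The fileID and chunking ideas on this slide can be sketched in a few lines. This is an illustration under stated assumptions, not the design of any product named in the deck: the fileID is taken to be a SHA-256 digest of the content (a common CAS convention that the slide does not specify), chunks are placed round-robin, and `ChunkStore` and `MetadataServer` are hypothetical names standing in for the store nodes and the metadata server.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; deliberately tiny so the example is easy to follow


class ChunkStore:
    """Stands in for one underlying block device or store node."""

    def __init__(self):
        self.chunks = {}

    def put(self, data):
        key = hashlib.sha256(data).hexdigest()
        self.chunks[key] = data
        return key

    def get(self, key):
        return self.chunks[key]


class MetadataServer:
    """Maps a fileID to the chunks that make up the file."""

    def __init__(self, stores):
        self.stores = stores
        self.files = {}  # file_id -> list of (store index, chunk key)

    def write(self, data):
        file_id = hashlib.sha256(data).hexdigest()  # address derived from content
        placement = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            index = (offset // CHUNK_SIZE) % len(self.stores)  # round-robin
            placement.append((index, self.stores[index].put(chunk)))
        self.files[file_id] = placement
        return file_id

    def read(self, file_id):
        # The application never sees block addresses, only the fileID.
        return b"".join(self.stores[i].get(k) for i, k in self.files[file_id])


if __name__ == "__main__":
    server = MetadataServer([ChunkStore(), ChunkStore()])
    fid = server.write(b"hello fixed content")
    print(fid[:12], server.read(fid))
```

Writing the same bytes twice yields the same fileID, which is what makes this style of storage a natural fit for content that does not change after it is written.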
  10. BI Play: Fixed content is being warehoused in the enterprise, and information from this content is being analyzed and delivered to the client in many ways. Client’s opportunity lies in adding support for fixed content across the BI pipeline.
     [Diagram: BI pipeline with fixed content support. Stage labels: Warehousing, Mining Tools, Analytics, Visualization, Delivery. Other labels: Web Service, Reports, Real Time, Portal, Supply Chain, Customer Relationship, Financials, Sales Force, Human Resources, Image/Video Query, Search, Media Search, Report Generators, OLAP, Router, ETL, Metadata Extraction, Scatter/Gather, Data Storage, Workflow Integration, Fixed Content Support]
  11. BI Modules I: The BI pipeline needs to be enhanced to support fixed content, which makes up 54% of the data that exists in the enterprise.
     - Warehouse
       - Existing warehousing techniques depend on the ETL (Extract, Transform, Load) methodology, i.e. changing the format of the data to make it amenable to further processing.
       - Fixed content warehousing needs tools such as transcoders, metadata annotation tools, caching and Variable Bit Rate (VBR) encoding.
     - Tools
       - Fixed content needs tools such as search that covers both content and metadata. Semantic analysis is likely to end up in this domain once it is well specified by industry bodies.
       - Mashup is a tool that will probably end up in the BI domain once it is standardized. Current XHR-based mashup is for web browsers only.
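Search over both content and metadata, as described under Tools, can be sketched as a small inverted index in which metadata fields are indexed as `key:value` terms next to the words of the content. The class name, the `key:value` term convention and the sample documents are assumptions made for illustration.

```python
from collections import defaultdict


class FixedContentIndex:
    """Toy inverted index covering text content and metadata fields."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of matching items

    def add(self, doc_id, text="", metadata=None):
        for token in text.lower().split():
            self.postings[token].add(doc_id)
        for key, value in (metadata or {}).items():
            # Metadata is searchable as "key:value", e.g. "type:video".
            self.postings[f"{key.lower()}:{str(value).lower()}"].add(doc_id)

    def search(self, *terms):
        """Return ids of items that match every term."""
        sets = [self.postings.get(term.lower(), set()) for term in terms]
        return set.intersection(*sets) if sets else set()


if __name__ == "__main__":
    index = FixedContentIndex()
    index.add("mail-1", "quarterly revenue forecast", {"type": "email"})
    index.add("vid-7", "", {"type": "video", "topic": "revenue"})
    print(index.search("revenue"))        # matched through content
    print(index.search("topic:revenue"))  # matched through metadata
```

Media items such as video have little or no text of their own, so in this sketch they are reachable only through the metadata attached during warehousing.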
  12. BI Modules II:
     - Analytics
       - Fixed content analytics requires recognition and analysis of all media types: text, voice and video. Streaming video is a challenge.
       - The visualization back end is likely to end up in this domain. Visualization transcodes data based on the access device.
     - Delivery
       - The end user can ask for data to be delivered as a real-time stream, a downloadable file, a visual image or a web service.
  13. Content Management Play: Digitization and automation are redefining the upstream portion of the value chain. SOA is increasingly being used as the integration technology to drive the C&P framework.
     [Diagram labels: Film, DVD, Camera, File, Music, Broadcast, Cinema, Cable/Sat, Print, Tape, Internet, Storage, Distribution Channels, Content, Streaming, Download, Repackaging, iSCSI, Block, File, Storage, Search, Workflow Orchestration, Processing, Repository, Processing, Distribution]
     Attribute lists shown in the diagram:
     - Static/Dynamic; Structured/Un; Format(s); DRM
     - Batch/Auto; Metadata Tagging/Cataloging; Domain Specifics; Ontologies
     - Adaptation for devices; Integration with User Interface; Formats; Encryption
  14. The Product: Enhancing existing storage products to make them metadata-aware and global-namespace-aware will enable Business to enter the fixed content market.
     - Target CSPs with an enhanced FS for unstructured data.
       - The FS expects nodes to be shared memory. Enable support for COTS servers.
       - Separate the name server, head node and store nodes.
     - Target the enterprise unstructured ILM market with xFrame.
       - Enhance xFrame to include unstructured data ILM by moving time-sensitive data into near storage and tagging stale data for archival.
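The ILM behavior described for xFrame (move time-sensitive data to near storage, tag stale data for archival) amounts to a tiering policy. The sketch below is not xFrame's logic, which the slide does not describe. It assumes that "time-sensitive" means recently accessed, and the 30-day and 365-day thresholds and the tier names are hypothetical.

```python
from datetime import datetime, timedelta


def classify(last_access, now,
             hot_window=timedelta(days=30),
             stale_after=timedelta(days=365)):
    """Pick a storage tier for an item from the time since its last access."""
    age = now - last_access
    if age <= hot_window:
        return "near"     # time-sensitive: keep on near storage
    if age >= stale_after:
        return "archive"  # stale: tag for archival
    return "primary"      # everything else stays where it is


if __name__ == "__main__":
    now = datetime(2008, 6, 1)
    items = {
        "campaign.mov": datetime(2008, 5, 20),
        "q1-report.pdf": datetime(2008, 1, 1),
        "2005-archive.zip": datetime(2006, 1, 1),
    }
    for name, last_access in items.items():
        print(name, "->", classify(last_access, now))
```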
  15. GTM Strategy: Enterprises, CSPs and SMBs form the three segments of the fixed content market.

     | To target the segment | Make alliances with and/or make enhancements to              | With the following offering                |
     |-----------------------|--------------------------------------------------------------|--------------------------------------------|
     | Enterprises           | Fixed content asset management software companies            | Enhanced FS on cluster with enhanced xFrame |
     | CSPs                  | Integrate with CSPs' proprietary metadata servers and tools  | Server and storage                         |
     | SMBs                  | Create a CSP-In-A-Box solution                               | CSP-In-A-Box                               |
  16. Unstructured Data, Storage 2.0. Vikas Deolaliker 2008