Fixed content mining is a computationally intensive operation. A purpose-built appliance with adequate integration hooks into back-end data warehousing systems, plus add-ons or plug-ins for the most popular BI clients, will meet the requirements of departments and small and medium-sized businesses.
The key value multiplier for such a product is its ability to integrate seamlessly with existing enterprise systems and to generate reports that can be printed or imported into tools such as Excel and Access.
The market for such an appliance is expected to reach $200M in 2010 (not counting the storage/server pull-through).
Unstructured data, or “Fixed Content”, refers to digital content that is generated outside a transactional business context, i.e. the data does not have a schema and is not stored in databases. Normally, all non-transactional data such as email, IM, media, web content, metadata and customer-generated content is considered unstructured.
Businesses increasingly want to add information from “unstructured data” to their library of information sources to improve business decision making. The Business Intelligence industry offers numerous products that enable discovery, interception, metadata extraction, semantic analysis, storage and information lifecycle management (ILM) of unstructured data. Companies such as Interwoven, Vignette, Informatica, Manugistics, IBM and Oracle have products that enable warehousing of content that is considered “unstructured”. Business Objects (BusObj) recently acquired Inxight Software for text analytics.
Storage system vendors such as EMC and NetApp have created a new category of storage systems called “Content Addressable Storage”, or CAS. IBM recently acquired XIV, whose Nextra system gives it an offering to compete with EMC.
SNIA, a storage industry standards body, has an initiative called XAM (eXtensible Access Method). XAM-compliant products offer a programmable interface for archiving applications to query (search), retrieve and control access to fixed content. XAM also allows compliance software to access fixed content for SOX and other regulatory compliance tests.
Fixed content is stored on NAS appliances and clusters. The market for rich media NAS is growing fast and is expected to surpass $1B in 2009. EMC is the current leader in NAS with 36% market share and is expected to lead the NAS market for fixed content as well.
BI on fixed content is an emerging market. The market for fixed content BI software is still in its infancy, with annual revenues under $20M. It came into focus with the acquisition of Inxight Software by Business Objects. This software mostly runs on Windows and accesses storage using iSCSI over IP.
Content Service Providers (CSPs) such as Google and Yahoo store fixed content in clusters of servers that run their own proprietary filesystems, a.k.a. Storage 2.0. This market is proprietary, as those filesystems are a source of differentiation for the CSPs.
The Trend: According to IDC, transactional data is growing at 32.3% while fixed content (unstructured data) is growing at 63.7%. Replicated or back-up data is growing at 43% p.a.
Source: IDC, December 2007
Unstructured data growth means growth in file services over LAN/WAN.
Replicated data growth is lower than expected.
Structured data growth is lowest because most of the data generated today is outside a transactional context.
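To make the gap between these growth rates concrete, the following Python sketch compounds the three IDC figures quoted above. It assumes each rate stays constant from year to year, which the source does not claim; the function names are illustrative only.

```python
import math

# Annual growth rates quoted from IDC above, expressed as fractions.
GROWTH = {
    "transactional": 0.323,
    "fixed_content": 0.637,
    "replicated": 0.43,
}


def doubling_time_years(rate: float) -> float:
    """Years for a data volume to double at a constant annual growth rate."""
    return math.log(2) / math.log(1 + rate)


def projected_multiple(rate: float, years: int) -> float:
    """How many times larger a volume is after `years` of constant growth."""
    return (1 + rate) ** years


if __name__ == "__main__":
    for name, rate in GROWTH.items():
        print(
            f"{name}: doubles in {doubling_time_years(rate):.2f} years, "
            f"x{projected_multiple(rate, 3):.2f} after 3 years"
        )
```

Under the constant-rate assumption, fixed content doubles in well under two years, noticeably faster than transactional data, which is the point the bullets above make qualitatively.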
The Opportunity: Client can take a three-pronged strategy to enter the fixed content market: (a) develop asset management software, (b) develop Storage 2.0 infrastructure, and (c) develop a purpose-built BI appliance for fixed content.
[Chart: Digital Asset Management Software, IDC, 2008, $M]
Client can enter the market with media asset management software. The software pulls through server and storage infrastructure.
Digital Asset Management
- Workflow and ILM management of fixed content
- Content Intelligence Functionality such as text analytics, search, visualization
File System Infrastructure (a.k.a Storage 2.0)
- Enhanced NAS storage for low latency, API based access to media content in filesystems
Business can add modules to the BI pipeline and build a purpose-built appliance for fixed content.
The Market: The total fixed content market is approximately $1B. The majority of the market is storage hardware and software. NAS is the dominant storage architecture for fixed content. Emerging products such as Storage 2.0, fixed media asset management software and BI appliances are currently at a nascent stage.
EMC, NetApp, HDS, IBM, HP
Sell directly to CSPs like Yahoo, Google, eBay
Asset Management: HP, IBM
Business Intelligence: BusObj (Inxight)
Fixed Content Search/Retrieval: Endeca, Lucene, Microsoft (FAST)
Content Editing: Adobe, Microsoft, Apple
Storage 2.0: The focus has shifted from I/O in Storage 1.0 to higher-level file services. The driver is no longer access protocol but content preference.
Feature | Storage 1.0 | Storage 2.0
- Management | Local application | Web application
- Access | SCSI over FC, GbE or bus | SCSI over IP
- Provisioning | LUN-level granularity | Filesystem size
- Virtualization | LUN, volume manager | Object level
- Application profile | Write Many Read One | WORM
- Oversubscription | 1:1 (provisioning is equal to allocation) | N:1 (provisioning can be n times allocation)
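The oversubscription row can be illustrated with a small Python sketch. It models a storage pool that promises capacity to filesystems up to a configurable ratio of what is physically allocated. The `Pool` class, its fields and the 4:1 ratio in the usage note are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    """A storage pool that may promise more capacity than it has allocated."""

    physical_gb: float            # capacity actually allocated on disk
    provisioned_gb: float = 0.0   # capacity promised to filesystems so far

    def provision(self, size_gb: float, max_ratio: float) -> bool:
        """Promise `size_gb` if the pool stays within max_ratio:1 oversubscription."""
        if (self.provisioned_gb + size_gb) / self.physical_gb > max_ratio:
            return False
        self.provisioned_gb += size_gb
        return True

    @property
    def oversubscription(self) -> float:
        """Current ratio of provisioned to allocated capacity."""
        return self.provisioned_gb / self.physical_gb
```

With `max_ratio=1.0` the pool behaves like Storage 1.0 (a 100 GB pool can promise at most 100 GB); with an assumed `max_ratio=4.0` the same pool can promise up to 400 GB, which is the N:1 case in the table.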
The Infrastructure Play: Filesystems are turning into a platform with programming interfaces, data routers, load balancers, autonomic functions, analyzers and parsers.
Filesystem 1.0: Most functional blocks are in kernel space. Data and control have the same path. Block-level semantics are exposed to applications.
Filesystem 2.0: Most functional blocks are in user space. Data and control have separate paths. Block-level semantics are hidden from the application.
[Diagram: Filesystem 1.0 vs. Filesystem 2.0, each split into kernel space and user space. Component labels: Block Driver, Disk, Object Cache, File Driver, Buffer, FileSys, iNode Cache, VFS, Process/Socket, Tools, ClusterFS, Name Server, MetaData Server, Client]
Filesystem 2.0: A content-centric filesystem in which the primitive is no longer a block but a file.
Content Addressable Storage
The next generation of NAS devices implements a CAS filesystem. The focus shifts from a block device driver to a file driver, which manages the “chunks” of a file across multiple underlying block devices.
NAS with CAS does not use FC disks; it prefers SAS/SATA disks for their low cost. It replaces the backplane with a high-performance, RDMA-capable interconnect.
It provides an API for applications. When an application queries the metadata server for a file, it is given a fileID instead of an iNode with block-level addresses. It uses file addresses to locate a file, which creates the need for a file data router.
The actual data can be stored on an existing node-level filesystem or a cluster filesystem.
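The lookup path described above (name to fileID at the metadata server, then fileID to chunks through a file data router) can be sketched as follows. This is a minimal illustration, not a real CAS implementation: all class names are hypothetical, deriving the fileID from a SHA-256 hash of the content is an assumption in the spirit of content addressing, and round-robin chunk placement is chosen only for simplicity.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; tiny on purpose so the example is easy to follow


class StoreNode:
    """Stands in for a node-level filesystem that holds file chunks."""

    def __init__(self):
        self.chunks = {}


class MetadataServer:
    """Maps a file name to a fileID; it never exposes block addresses."""

    def __init__(self):
        self.names = {}

    def register(self, name, file_id):
        self.names[name] = file_id

    def lookup(self, name):
        return self.names[name]


class FileDataRouter:
    """Maps a fileID to the ordered (node index, chunk key) pairs holding its data."""

    def __init__(self, nodes):
        self.nodes = nodes
        self.layout = {}

    def write(self, data: bytes) -> str:
        # Assumption: the fileID is derived from the content itself.
        file_id = hashlib.sha256(data).hexdigest()
        placements = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk_no = offset // CHUNK_SIZE
            node_idx = chunk_no % len(self.nodes)  # round-robin placement
            key = f"{file_id}:{chunk_no}"
            self.nodes[node_idx].chunks[key] = data[offset:offset + CHUNK_SIZE]
            placements.append((node_idx, key))
        self.layout[file_id] = placements
        return file_id

    def read(self, file_id: str) -> bytes:
        # Reassemble the file from its chunks in their original order.
        return b"".join(
            self.nodes[node_idx].chunks[key]
            for node_idx, key in self.layout[file_id]
        )
```

An application would call `MetadataServer.lookup(name)` to obtain the fileID and then `FileDataRouter.read(file_id)` to fetch the data, without ever seeing which node or block holds each chunk.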
BI Play: Fixed content is being warehoused in the enterprise, and information from this content is being analyzed and delivered in many ways to the client. Client’s opportunity lies in adding support for fixed content across the BI pipeline.
[Diagram: BI pipeline with fixed content support. Labels: Mining Tools, Analytics, Delivery, Warehousing, Visualization, Web Service, Reports, Real Time, Portal, Supply Chain, Customer Relationship, Financials, Sales Force, Human Resources, Image/Video Query, Search, Media Search, Report Generators, OLAP, Router, ETL, Metadata Extraction, Scatter/Gather, Data Storage, Workflow Integration, Fixed Content Support]
BI Modules I: The BI pipeline needs to be enhanced to support the 54% of enterprise data that is fixed content.
Existing warehousing techniques depend on the ETL (Extract, Transform & Load) methodology, i.e. changing the format of the data to make it amenable to further processing.
Fixed content warehousing needs tools such as transcoders, metadata annotation tools, caching, Variable Bit Rate (VBR) encoding, etc.
Fixed content needs tools such as search, which covers both content and metadata. Semantic analysis is likely to end up in this domain once it is well specified by industry bodies.
Mashup is a tool that will probably end up in the BI domain once it is standardized. The current XHR-based mashup is for web browsers only.
Fixed content analytics requires recognition and analysis of all media types: text, voice and video. Streaming video is a challenge.
The visualization back end is likely to end up in this domain. Visualization transcodes data based on the access device.
An end user can ask for data to be delivered as a real-time stream, a downloadable file, a visual image or a web service.
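As a rough illustration of the search requirement above (one query that covers both content and metadata), the sketch below builds a small in-memory inverted index. It assumes text has already been extracted from each item and that metadata arrives as a flat dictionary. The class and field names are hypothetical and do not correspond to any product named in this document.

```python
from collections import defaultdict


class FixedContentIndex:
    """Toy inverted index over extracted text and metadata values."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of item ids

    def add(self, item_id: str, text: str, metadata: dict) -> None:
        """Index an item's extracted text together with its metadata values."""
        terms = text.lower().split()
        for value in metadata.values():
            terms.extend(str(value).lower().split())
        for term in terms:
            self.postings[term].add(item_id)

    def search(self, query: str) -> set:
        """Return ids of items that contain every term in the query."""
        terms = query.lower().split()
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result
```

A video clip with no extractable text is still findable through its metadata (for example a title or media type), which is why the index treats metadata values the same way as content terms.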
Content Management Play: Digitization and automation are redefining the upstream portion of the value chain. SOA is increasingly being used as the integration technology to drive the C&P framework.
[Diagram: content value chain. Labels: Film, DVD, Camera, File, Music, Broadcast, Cinema, Cable/Sat, Print, Tape, Internet, Storage, Distribution Channels, Content, Streaming, Download, Repackaging, Metadata Tagging/Cataloging, iSCSI, Block, File, Search, Adaptation for devices, Integration with User Interface, Workflow Orchestration, Processing, Repository, Distribution]
The Product: Enhancing existing storage products to make them metadata-aware and global-namespace-aware will enable Business to enter the fixed content market.
Target CSPs with an enhanced FS for unstructured data.
The FS expects nodes to share memory. Enable support for COTS servers.
Separate the name server, head node and store nodes.
Target the enterprise unstructured ILM market with xFrame.
Enhance xFrame to include unstructured data ILM by moving time-sensitive data into near storage and tagging stale data for archival.
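The ILM behavior described in the last bullet can be sketched as a simple policy pass over file records. This is not xFrame's interface, which is not described here; the record fields, tier names and day thresholds are all hypothetical. The sketch also assumes that "time-sensitive" means recently accessed and that "near storage" means a faster tier close to the application.

```python
from dataclasses import dataclass, field

DAY = 86400  # seconds in a day


@dataclass
class FileRecord:
    path: str
    last_access: float                  # epoch seconds of the last access
    tier: str = "primary"               # current storage tier
    tags: set = field(default_factory=set)


def apply_ilm_policy(files, now, hot_within_days=7, stale_after_days=365):
    """Move recently used files to near storage and tag stale files for archival."""
    for f in files:
        idle_days = (now - f.last_access) / DAY
        if idle_days <= hot_within_days:
            f.tier = "near"             # time-sensitive: keep close to the application
        elif idle_days >= stale_after_days:
            f.tags.add("archive")       # stale: mark for a later archival job
    return files
```

Tagging stale files rather than moving them immediately keeps the policy pass cheap and leaves the actual archival to a separate job, which matches the "tag for archival" wording above.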
GTM Strategy: Enterprises, CSPs and SMBs form the three segments of the fixed content market.
To target the segment | Make alliances with and/or make enhancements to | With the following offering
- Enterprises | Fixed content asset management software companies | Enhanced FS on cluster with enhanced xFrame
- CSPs | Integrate with CSPs' proprietary metadata servers and tools | Server and storage
- SMBs | Create a CSP-In-A-Box solution | CSP-In-A-Box