Using Advanced Analytics for Value-based Healthcare Delivery


Promoting Value-based Healthcare Delivery

The fundamental principles of the Affordable Care Act recognize that the volume-based, fee-for-service payment model is unsustainable and that a value-based healthcare delivery system is essential. With the emergence of Accountable Care Organizations (ACOs), providers are incentivized to implement payment reforms and participate in shared savings programs that seek to balance quality of care, access to care and cost of care.

Our healthcare analytics payment model uses predictive analytics to assist ACOs in patient attribution, budget development, benchmarking and performance monitoring to maximize incentives through shared savings and quality improvements.



  • The fundamental challenge is balancing three aspects: cost of care, access to care, and the quality of care provided. This is known as the iron triangle of healthcare. Unfortunately, only two of these three areas can be optimized at once. For example, if a nation chose to provide high-quality care to all, then costs must be high as well. If instead a healthcare system was designed to be low cost but deliver high-quality care, access to care would need to be limited. Finally, if a country wanted a low-cost healthcare system with universal access, the quality delivered would suffer.
    The US Healthcare Crisis: The healthcare crisis in America is particularly startling when you realize that the United States does the worst on all three aspects. We have the highest cost per capita, do not provide healthcare to all, and our health quality outcomes are the worst of the industrialized nations. It is expected that by 2016, healthcare costs will account for one of every five dollars spent, double the amount spent a decade earlier, and likely the same problems facing the country today will remain unchanged: limited access, less than optimal care, and high costs. Nevertheless, within our country there are many health plans, employers, doctors, and other organizations trying to reform our healthcare system. Time will tell whether they will succeed. Whatever healthcare reform looks like in our country, the solution will be uniquely American, as innovative solutions to problems often are.
  • The current payment structure results in redundant testing, medical errors, and over-utilization that maximizes providers’ fees and reimbursements; focuses on the volume of services provided and revenue generated, not quality and outcomes; and incentivizes multiple tests and procedures regardless of necessity, quality, and efficiency.
  • Our model embeds the business rules and algorithms of the Medicare Shared Savings Program (MSSP) Accountable Care Organization (ACO). Our application includes the following features and capabilities:
    - ACO benchmarks and budget based on a historical cost baseline, trend estimates and risk adjustments
    - Performance monitoring of key measures and metrics related to cost, utilization and quality
    - Predictive modeling to determine the proper mix of inputs to maximize payment incentives
    - Dynamic dashboards and visualizations to perform trade-off analysis and scenario planning
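As a rough illustration of the shared-savings idea the model supports (the actual MSSP rules are far more involved), a payout against a benchmark can be sketched as follows. The 50% sharing rate, the quality gate, and the threshold values are illustrative assumptions, not details of the model above:

```python
def shared_savings(benchmark: float, actual_cost: float,
                   quality_score: float, min_quality: float = 0.7,
                   sharing_rate: float = 0.5) -> float:
    """Hypothetical shared-savings payout: an ACO that spends less than
    its benchmark keeps a share of the savings, but only if it also
    meets a minimum quality score (thresholds are illustrative)."""
    savings = benchmark - actual_cost
    if savings <= 0 or quality_score < min_quality:
        return 0.0
    return savings * sharing_rate

# An ACO with a $10M benchmark, $9.2M actual spend and a 0.85 quality score
payout = shared_savings(10_000_000, 9_200_000, 0.85)
print(payout)  # → 400000.0
```

This is why the trade-off analysis matters: the same spending reduction yields nothing if the quality measures are missed.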
  • Two questions: How many seconds would it take for one person to find the four Jacks in a deck of 52 cards? How long would it take 52 people, each holding one card? Why is this important? Accelerating time to value. MPP serves as the basis for “next generation” database management software designed to run on a shared-nothing massively parallel processing (MPP) platform, with features such as row- and/or column-based storage, compression, and in-database analytics. These database systems alter the data management landscape in terms of response times at almost any scale, enabling analytic offload. Apache Hadoop, which will follow, is based on MPP concepts and has emerged from humble beginnings to worldwide adoption, infusing data centers with new infrastructure concepts and generating new business opportunities by placing parallel processing into the hands of the average programmer. (http://whatis.techtarget.com/definition/MPP-massively-parallel-processing)
    In some implementations, up to 200 or more processors can work on the same application. An MPP system is also known as a "loosely coupled" or "shared nothing" system. Many vendor solutions, both appliance-based and software-only, are deployed on commodity hardware. MPP architecture offers linear scalability on commodity technology at a competitive price point. Analytical applications with frequent full table scans and complex algorithms such as regression analysis, joins, sorting, and aggregations can saturate network bandwidth in a shared-everything architecture. In an MPP architecture, each compute node (usually a standard server) shares the workload equally in parallel and is able to utilize the full IO bandwidth of locally attached disk storage. The ability to take queries that once ran in hours and reduce them to minutes opens up the opportunity to search for value in massive data volumes quickly and iteratively.
    Parallelized Data Loading: The MPP architecture is designed to compartmentalize individual node processing and is therefore ideal for moving load processes from dedicated servers to the MPP database. Rather than extract, transform, and load, processes are shifted to extract, load, and then transform, taking advantage of parallelism. The same is true for advanced analytics: the more functions that can be pushed down into the database, the faster analysis will complete. Reducing network traffic between the database and the analytical application frees up IO and bandwidth for additional processing and data loads.
    Intel has packed just shy of a billion transistors into the 216 square millimeters of silicon that compose its latest chip, each one far, far thinner than a sliver of human hair: >50 nm feature sizes using optical lithography; 13.5 nm wavelength using extreme UV lithography.
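The card-finding analogy above can be sketched with Python's standard multiprocessing module: the deck is partitioned across workers, and each worker scans only its own shard, mirroring the shared-nothing division of work (a toy illustration of the idea, not an MPP database):

```python
from multiprocessing import Pool

# A standard 52-card deck as (rank, suit) pairs.
DECK = [(rank, suit) for suit in "SHDC"
        for rank in ["A"] + [str(n) for n in range(2, 11)] + list("JQK")]

def find_jacks(shard):
    """Each worker scans only its own partition of the deck."""
    return [card for card in shard if card[0] == "J"]

def parallel_find(deck, n_workers=4):
    # Split the deck into roughly equal shards, one per worker.
    shards = [deck[i::n_workers] for i in range(n_workers)]
    with Pool(n_workers) as pool:
        results = pool.map(find_jacks, shards)
    # Merge the partial results, as an MPP coordinator would.
    return [card for part in results for card in part]

if __name__ == "__main__":
    print(sorted(parallel_find(DECK)))  # the four Jacks, one per suit
```

Because no worker ever touches another worker's shard, adding workers shrinks each shard and the scan time with it, which is the essence of the linear-scalability claim.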
  • Hadoop is an Apache open source project that provides a framework for the distributed processing of large data sets across clusters of computers, each offering local computation and storage. It is based on the Google File System and MapReduce papers. Hadoop scales out to large clusters of servers (nodes) using the Hadoop Distributed File System (HDFS) to manage huge data sets and spread them across the servers. Hadoop’s distributed architecture as a Big Data platform allows MapReduce programs to run in parallel across tens to thousands of nodes. MapReduce is a general-purpose execution engine that handles the complexities of parallel programming for a wide variety of applications:
    - The Map phase, in which computation or analysis is applied to a set of input key/value pairs to produce a set of intermediate key/value pairs
    - The Reduce phase, in which the sets of values associated with the intermediate key/value pairs output by the Map phase are combined (that is, “reduced”) to produce the results
    More on MapReduce later. We have seen that Hadoop also augments data warehouse environments. Hadoop is becoming a critical part of many modern information technology (IT) departments. It is being used for a growing range of requirements, including analytics, data storage, data processing, and shared compute resources. As Hadoop’s significance grows, it is important that it be treated, and managed, as a component of your larger IT organization. Hadoop is no longer relegated to research projects, and should be managed as your agency would manage any other large component of its IT infrastructure.
    A multi-node Hadoop cluster: A small Hadoop cluster will include a single master and multiple slave nodes. The master node consists of a JobTracker, TaskTracker, NameNode and DataNode. A slave node acts as both a DataNode and TaskTracker, though it is possible to have data-only worker nodes and compute-only worker nodes.
    These are normally used only in nonstandard applications.[13] Hadoop requires Java Runtime Environment (JRE) 1.6 or higher. In a larger cluster, HDFS is managed through a dedicated NameNode server that hosts the file system index; similarly, a standalone JobTracker server can manage job scheduling.
    Hadoop—The Foundation for Change: Hadoop has the potential to reach beyond Big Data to catalyze new levels of business productivity and transformation. As the foundation for change in business, Hadoop represents an unprecedented opportunity to improve how organizations get the most value from large amounts of data. Businesses that rely on Hadoop as the core of their infrastructure can not only run analytics on top of vast amounts of data, but can also go beyond analytics to build applications that are meaningful and tightly coupled with the data layer. Consumer Internet companies have reaped the benefits of this approach, and EMC believes more traditional enterprises will adopt the same model as they evolve and transform their businesses. Hadoop has rapidly emerged as the preferred solution for Big Data analytics applications that grapple with vast repositories of unstructured data. It is flexible, scalable, inexpensive, and fault-tolerant, and enjoys rapid adoption and a rich ecosystem surrounded by massive investment. However, customers face high hurdles to broadly adopting Hadoop as their singular data repository due to a lack of useful interfaces and high-level tooling for business intelligence and data mining—components that are critical to data analytics and building a data-driven enterprise. As the world's first true SQL processing for Hadoop, Pivotal HD addresses these challenges.
    The Hadoop Ecosystem: The Hadoop family of products includes the Hadoop Distributed File System (HDFS), MapReduce, Pig, Hive, HBase, Mahout, Lucene, Oozie, Flume, Cassandra, YARN, Ambari, Avro, Chukwa, and Zookeeper.
    Pivotal HD includes HDFS, MapReduce, Hive, Mahout, Pig, HBase, YARN, Zookeeper, Sqoop and Flume.
    HDFS: A distributed file system that partitions large files across multiple machines for high-throughput access to data.
    Data Layer
    Flume: Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data into Hadoop. It has a simple and flexible architecture based on streaming data flows: agents are deployed throughout one's IT infrastructure – inside web servers, application servers and mobile devices, for example – to collect data and integrate it into Hadoop.
    Sqoop: Sqoop is a connectivity tool for moving data from non-Hadoop data stores – such as relational databases and data warehouses – into Hadoop. It allows users to specify the target location inside Hadoop and instruct Sqoop to move data from Oracle, Teradata or other relational databases to the target.
    HCatalog: HCatalog is a centralized metadata management and sharing service for Apache Hadoop. It allows a unified view of all data in Hadoop clusters and lets diverse tools, including Pig and Hive, process any data elements without needing to know where in the cluster the data is physically stored.
    Workload Management Layer
    MapReduce: A programming framework for distributed batch processing of large data sets spread across multiple servers. MapReduce – typically used, for example, to analyze web logs on hundreds or thousands of web application servers without moving the data into a data warehouse – is not a database system but a parallel and distributed programming model for analyzing massive data sets (“big data”).
    One elegant aspect of MapReduce is its simplicity, due mostly to its dependence on two basic operations applied to sets or lists of key/value pairs:
    - The Map phase, in which computation or analysis is applied to a set of input key/value pairs to produce a set of intermediate key/value pairs, and
    - The Reduce phase, in which the sets of values associated with the intermediate key/value pairs output by the Map phase are combined (that is, “reduced”) to produce the results.
    MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. A MapReduce job usually splits the input data set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them and re-executing failed tasks. Typically the compute nodes and the storage nodes are the same; that is, the MapReduce framework and the Hadoop Distributed File System (see the HDFS Architecture Guide) run on the same set of nodes. This configuration allows the framework to schedule tasks on the nodes where the data is already present, resulting in very high aggregate bandwidth across the cluster.
    ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. Each time they are implemented, a lot of work goes into fixing the bugs and race conditions that are inevitable.
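The Map and Reduce phases described above can be illustrated with a tiny single-process word count in Python (a sketch of the programming model only; a real Hadoop job would distribute these steps across many nodes):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(line):
    """Map: emit an intermediate (word, 1) pair for every word in a record."""
    return [(word, 1) for word in line.split()]

def reduce_phase(word, counts):
    """Reduce: combine all values that share an intermediate key."""
    return (word, sum(counts))

def word_count(lines):
    # Map every input record, then flatten the intermediate pairs.
    intermediate = [pair for line in lines for pair in map_phase(line)]
    # Shuffle/sort: group intermediate pairs by key, as the framework does.
    intermediate.sort(key=itemgetter(0))
    return [reduce_phase(word, (c for _, c in group))
            for word, group in groupby(intermediate, key=itemgetter(0))]

print(word_count(["big data", "big clusters"]))
# → [('big', 2), ('clusters', 1), ('data', 1)]
```

The sort-and-group step in the middle is exactly what Hadoop's shuffle phase does between mappers and reducers, which is why user code only ever has to supply the two functions.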
    Because of the difficulty of implementing these kinds of services, applications initially tend to skimp on them, which makes them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed.
    Oozie: Oozie is a workflow scheduler system to coordinate and manage Apache Hadoop jobs. It is integrated with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box (such as Java MapReduce, streaming MapReduce, Pig, Hive, Sqoop and Distcp) as well as system-specific jobs (such as Java programs and shell scripts). Oozie lets users define a series of jobs written in multiple languages – such as MapReduce, Pig and Hive – and then intelligently link them to one another. Users can specify, for example, that a particular query is only to be initiated after specified previous jobs on which it relies for data have completed.
    Mahout: Mahout is a scalable machine learning and data mining library. It takes the most popular data mining algorithms for clustering, regression testing and statistical modeling and implements them using the MapReduce model. Mahout also provides Java libraries for common math operations (focused on linear algebra and statistics) and primitive Java collections. Mahout is a work in progress; the number of implemented algorithms has grown quickly,[3] but various algorithms are still missing. While Mahout's core algorithms for clustering, classification and batch-based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm, it does not restrict contributions to Hadoop-based implementations.
    Application Layer
    Pig: Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating those programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets. At present, Pig's infrastructure layer consists of a compiler that produces sequences of MapReduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin.[1] Pig Latin abstracts the programming from the Java MapReduce idiom into a notation that makes MapReduce programming high-level, similar to what SQL is for RDBMS systems. Pig Latin can be extended using UDFs (User Defined Functions), which the user can write in Java, Python, JavaScript, Ruby or Groovy [2] and then call directly from the language. Pig is a high-level data-flow language for expressing Map/Reduce programs for analyzing large HDFS-distributed data sets. It was originally [3] developed at Yahoo Research around 2006 to give researchers an ad-hoc way of creating and executing map-reduce jobs on very large data sets. In 2007,[4] it was moved into the Apache Software Foundation.[5]
    HBase: Apache HBase™ is the Hadoop database, a distributed, scalable, big data store. Use Apache HBase when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables – billions of rows by millions of columns – atop clusters of commodity hardware.
    Apache HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable (A Distributed Storage System for Structured Data, by Chang et al.). Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS. HBase is a non-relational database that allows low-latency, quick lookups in Hadoop. It adds transactional capabilities to Hadoop, allowing users to conduct updates, inserts and deletes. eBay and Facebook use HBase heavily.
    Hive: Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL. At the same time, this language allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express the logic in HiveQL. Hive is a Hadoop-based data-warehousing-like framework originally developed by Facebook. Queries written in HiveQL are converted to MapReduce, which allows SQL programmers with no MapReduce experience to use the warehouse and makes it easier to integrate with business intelligence and visualization tools such as MicroStrategy, Tableau, Revolution Analytics, etc.
    HiveQL programs are converted into Map/Reduce programs.
    Cluster Sizing: The sizing guide for HDFS is very simple: each file has a default replication factor of 3, and you need to leave approximately 25% of the disk space for intermediate shuffle files. So you need 4x the raw size of the data you will store in HDFS. However, files are rarely stored uncompressed and, depending on the file content and the compression algorithm, on average we have seen a compression ratio of up to 10-20 for text files stored in HDFS. So the actual raw disk space required is only about 30-50% of the original uncompressed size. Compression also helps in moving the data between different systems, e.g. Teradata and Hadoop.
    Memory: Memory demand for a master node is based on the NameNode data structures, which grow with the storage capacity of your cluster. We found 1 GB per petabyte of storage to be a good guideline for master node memory. You then need to add your OS overhead, etc. We have found that with Intel Sandy Bridge processors, 32 GB is more than enough memory for a master node.
    Cluster Design Tradeoffs: We classify clusters as small (around 2-3 racks), medium (4-10 racks) and large (above 10 racks). What we have been covering so far are design guidelines, and part of the design process is understanding how to bend those guidelines to meet your goals. In the case of small, medium and large clusters, things get progressively more stringent and sensitive when you bend the guidelines. For a small cluster, the smaller number of slave nodes allows you greater flexibility in your decisions, though there are a few guidelines, like isolation, that you don't want to violate. When you get to a medium-sized cluster, the number of nodes increases your design sensitivity. You also now have enough hardware that the physical plant issues of cooling and power become more important. Your interconnects also become more important.
    At the large scale, things become really sensitive, and you have to be careful because a mistake here could result in a design that will fail. Our experience at Hortonworks has allowed us to develop expertise in this area, and we strongly recommend you work with us if you want to build Internet-scale clusters.
    A typical slave node for Hadoop:
    - Mid-range processor
    - 4 to 32 GB memory
    - 1 GbE network connection to each node, with a 10 GbE top-of-rack switch
    - A dedicated switching infrastructure to avoid Hadoop saturating the network
    - 4 to 12 drives per machine, non-RAID
    - Each node has 8 cores, 16 GB RAM and 1.4 TB storage
    Facebook: We use Hadoop to store copies of internal log and dimension data sources and use it as a source for reporting/analytics and machine learning. Currently we have two major clusters: a 1100-machine cluster with 8800 cores and about 12 PB raw storage, and a 300-machine cluster with 2400 cores and about 3 PB raw storage. Each (commodity) node has 8 cores and 12 TB of storage.
    Yahoo!: Yahoo now manages more than 42,000 Hadoop nodes (2011) – more than 100,000 cores in >40,000 nodes running Hadoop. Our biggest cluster: 4500 nodes (2x4-CPU boxes with 4x1 TB disk and 16 GB RAM).
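The HDFS sizing arithmetic above (3x replication plus roughly 25% headroom for shuffle files, offset by compression) can be sketched as a small calculation; the defaults simply encode the rules of thumb stated in the text:

```python
def hdfs_raw_disk_needed(uncompressed_tb: float,
                         replication: int = 3,
                         shuffle_headroom: float = 0.25,
                         compression_ratio: float = 10.0) -> float:
    """Estimate raw disk (TB) needed to store a data set in HDFS:
    compressed size, times the replication factor, divided by the
    usable fraction of disk left after reserving shuffle space."""
    compressed = uncompressed_tb / compression_ratio
    return compressed * replication / (1 - shuffle_headroom)

# 100 TB of uncompressed text at a 10:1 compression ratio
print(round(hdfs_raw_disk_needed(100), 1))  # → 40.0 TB of raw disk
```

At a 10:1 compression ratio this lands at 40% of the original uncompressed size, consistent with the 30-50% range quoted above.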
  • MapReduce is a general-purpose execution engine that handles the complexities of parallel programming for a wide variety of applications:
    - The Map phase, in which computation or analysis is applied to a set of input key/value pairs to produce a set of intermediate key/value pairs
    - The Reduce phase, in which the sets of values associated with the intermediate key/value pairs output by the Map phase are combined (that is, “reduced”) to produce the results
    YARN: Apache Hadoop NextGen MapReduce (YARN). MapReduce has undergone a complete overhaul in hadoop-0.23, and we now have what we call MapReduce 2.0 (MRv2), or YARN. The fundamental idea of MRv2 is to split the two major functions of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An application is either a single job in the classical sense of MapReduce jobs or a DAG of jobs. The ResourceManager and the per-node slave, the NodeManager (NM), form the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The per-application ApplicationMaster is, in effect, a framework-specific library tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks. As folks are aware, HDFS is the data storage layer for Hadoop and MapReduce was the data-processing layer. However, the MapReduce algorithm by itself isn't sufficient for the very wide variety of use cases we see Hadoop being employed to solve. With YARN, Hadoop now has a generic resource-management and distributed application framework whereby one can implement multiple data processing applications customized for the task at hand. Hadoop MapReduce is now one such application for YARN, and I see several others from my vantage point – in the future you will see MPI, graph processing, simple services, etc., all co-existing with MapReduce applications in a Hadoop YARN cluster.
  • There are many categories of NoSQL designs – key/value, graph, document-oriented – and well-known technologies include BigTable, HBase, Cassandra, Couchbase, MongoDB and SimpleDB.
    If Not Tables, Then What? Instead of using structured tables to store multiple related attributes in a row, NoSQL databases use the concept of a key/value store. Quite simply, there is no schema for the database. It simply stores values for each provided key, distributes them across the database and then allows their efficient retrieval. The lack of a schema prevents complex queries and essentially prevents the use of NoSQL as a transactional database environment. There are four main types of NoSQL databases:
    - The basic key/value store performs nothing other than the function described above – taking a binary data object, associating it with a key, and storing it in the database for later retrieval.
    - Columnar databases are a hybrid between NoSQL and relational databases. They provide some row-and-column structure, but do not have the strict rules of relational databases.
    - Document stores go slightly beyond this by imposing a little more structure on the binary object. The objects must be documents, encoded in some recognizable format, such as XML or PDF, but there are no requirements about the structure or content of the document. Each document is stored as the value portion of a key/value store and may be accompanied by metadata embedded in the document itself.
    - Graph databases store information in multi-attribute tuples that reflect relationships in a different way. For example, a graph database might be used to store the "friend" relationships of a social network, with a record merely consisting of two friends who share a relationship.
    NoSQL Architecture: The core of the NoSQL database is the hash function – a mathematical algorithm that takes a variable-length input and produces a consistent, fixed-length output.
    The key of each key/value pair fed to a NoSQL database is hashed, and the hash value is used to direct the pair to a particular NoSQL database server, where the record is stored for later retrieval. When an application wishes to retrieve a key/value pair, it provides the database with the key. The key is hashed again to determine the appropriate server where the data would be stored (if the key exists in the database), and the database engine then retrieves the key/value pair from that server. As you read the description of this process, you may find yourself wondering, “How does the user or application perform more advanced queries, such as finding all of the keys that have a particular value or sorting data by a value?” And there's the rub – NoSQL databases simply do not support this type of functionality. They are designed for the rapid, efficient storage of key/value pairs where the application only needs a place to stash data, later retrieving it by the key, and only by the key. If you need to perform other queries, NoSQL is not the appropriate platform for your use.
    Redundancy and Scalability in NoSQL: The simplistic architecture of NoSQL databases is a major benefit when it comes to redundancy and scalability. To add redundancy to a database, administrators simply add duplicate nodes and configure replication between a primary node and its counterpart. Scalability is simply a matter of adding additional nodes. When those nodes are added, the NoSQL engine adjusts the hash function to assign records to the new nodes in a balanced fashion.
    Developers looking to build big Web applications needed to add more and more processing nodes to keep up with almost boundless demand for computing power.
    Relational databases came up short, so software engineers at Google, Amazon, Facebook and Yahoo devised non-SQL solutions, laying the groundwork for big data analytics and expanded cloud computing services. Since 2009, a cavalry charge of NoSQL software startups has entered the void with commercial products.
    The reality: NoSQL is also known as "not only SQL," because some NoSQL databases do support SQL elements. But most don’t share key traits of relational databases like atomicity and consistency, so though NoSQL may help keep the auction chant going on eBay, it might break any bank using it for transaction processing. And with a ready pool of skilled developers, the incumbent relational database will hold on in most business applications.
    Data integrity: The ACID properties (atomicity, consistency, isolation, durability) guarantee that database transactions are processed with integrity. Hadoop and NoSQL stores are not DBMSs, so they are not ACID compliant and are therefore not appropriate where inserts and updates are required.
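The key-to-server routing described above can be sketched in a few lines of Python. The node names are hypothetical, and the modulo placement is the simplest possible scheme (real systems typically use consistent hashing so that adding a node relocates fewer keys):

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical server names

def node_for_key(key: str, nodes=NODES) -> str:
    """Hash the key and use the digest to pick the server that owns it.
    The same key always hashes to the same node, so a later read finds
    the record written earlier without any central index."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return nodes[digest % len(nodes)]

# Each "server" is just a dict here; a write lands on the key's node.
store = {node: {} for node in NODES}
store[node_for_key("user:42")]["user:42"] = {"name": "Ada"}

# A later read re-hashes the key to locate the right server.
assert store[node_for_key("user:42")]["user:42"]["name"] == "Ada"
```

Note what this design cannot do: there is no way to ask "which keys map to a value containing Ada?" without scanning every node, which is exactly the query limitation described above.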
    The advantages of NoSQL data stores are:
    - Elastic scaling: they scale up transparently by adding a new node, and they are usually designed with low-cost hardware in mind.
    - NoSQL data stores can handle Big Data easily.
    - NoSQL databases are designed for less management – automatic repair, data distribution, and simpler data models – so there is no need to have a DBA on site.
    - NoSQL databases use clusters of cheap servers to manage exploding data and transaction volumes, and are therefore cheap in comparison to the high license costs of RDBMS systems.
    - Flexible data models: in key/value stores and document databases, schema changes don't have to be managed as one complicated change unit, which lets applications iterate faster.
    (http://catmousavi.wordpress.com/2012/03/29/what-is-big-data-what-are-nosql-databases-what-is-hadoop-pig-hive/ ; http://nosql-database.org/)
    ACID (atomicity, consistency, isolation, and durability) is an acronym and mnemonic device for learning and remembering the four primary attributes ensured to any transaction by a transaction manager (also called a transaction monitor). These attributes are:
    - Atomicity: In a transaction involving two or more discrete pieces of information, either all of the pieces are committed or none are.
    - Consistency: A transaction either creates a new and valid state of data or, if any failure occurs, returns all data to its state before the transaction was started.
    - Isolation: A transaction in process and not yet committed must remain isolated from any other transaction.
    - Durability: Committed data is saved by the system such that, even in the event of a failure and system restart, the data is available in its correct state.
    A well-known problem with NoSQL databases in general is that they do not support the ACID principles held dear by traditional RDBMS DBAs. The Register is reporting that this may soon change with FoundationDB.
    Atomicity ensures that a transaction is saved or undone, but never exists halfway between the two states. Consistency ensures that only valid data can be stored. Isolation prevents one transaction from interfering with another. Finally, durability ensures that committed transactions endure and are protected from loss. ACID provides principles governing how changes are applied to a database. In a very simplified way, it states (my own version):
    (A) When you do something to change a database, the change should work or fail as a whole.
    (C) The database should remain consistent (this is a pretty broad topic).
    (I) If other things are going on at the same time, they shouldn't be able to see things mid-update.
    (D) If the system blows up (hardware or software), the database needs to be able to pick itself back up; and if it says it finished applying an update, it needs to be certain.
    Stated more formally:
    - Atomicity: Either all tasks within a transaction are performed or none of them are. This is the all-or-none principle. If one element of a transaction fails, the entire transaction fails.
    - Consistency: The transaction must meet all protocols or rules defined by the system at all times. The transaction does not violate those protocols, and the database must remain in a consistent state at the beginning and end of a transaction; there are never any half-completed transactions.
    - Isolation: No transaction has access to any other transaction that is in an intermediate or unfinished state. Thus, each transaction is independent unto itself.
Isolation is required for both performance and consistency of transactions within a database.
• Durability: once the transaction is complete, it persists as complete and cannot be undone; it will survive system failure, power loss, and other types of system breakdown.

There are of course many facets to those definitions, and to the actual ACID guarantees of each particular database, but overall in the RDBMS world ACID is overlord, and without ACID reliability is uncertain.

BASE Introduces Itself and Takes a Bow

Luckily for the world of distributed computing systems, their engineers are clever. How do the vast data systems of the world, such as Google's BigTable, Amazon's Dynamo, and Facebook's Cassandra (to name only three of many), deal with a loss of consistency and still maintain system reliability? The answer, while certainly not simple, was actually a matter of chemistry, or pH: BASE (Basically Available, Soft state, Eventual consistency). In a system where BASE is the prime requirement for reliability, the activity/potential (p) of the data (H) changes; it essentially slows down. On the pH scale, a BASE system is closer to soapy water (12) or maybe the Great Salt Lake (10). This is not to claim that billions of transactions are no longer happening rapidly; they still are. Rather, the constraints on those transactions have changed: they are applied at different times, with different rules. In an ACID system, the data fizzes and bubbles and is perpetually active; in a BASE system, the bubbles are still there, much like bath water, popping, gurgling, and spinning, but not with the same vigor required by ACID. Here is why:

• Basically Available: this constraint states that the system does guarantee the availability of the data, in the sense of the CAP theorem; there will be a response to any request.
But that response could still be "failure" to obtain the requested data, or the data may be in an inconsistent or changing state, much like waiting for a check to clear in your bank account.
• Soft state: the state of the system can change over time, so even during periods without input there may be changes going on due to eventual consistency; thus the state of the system is always "soft."
• Eventual consistency: the system will eventually become consistent once it stops receiving input. The data will propagate everywhere it should sooner or later, but the system continues to receive input and does not check the consistency of every transaction before moving on to the next one. Werner Vogels' article "Eventually Consistent – Revisited" covers this topic in much greater detail.

Conclusion – Moving Forward

The new pH of database transaction processing has allowed for more efficient horizontal scaling at cost-effective levels; checking the consistency of every single transaction at every moment of every change adds gargantuan costs to a system that has literally trillions of transactions occurring, and the computing requirements are even more astronomical. Eventual consistency gave organizations such as Yahoo!, Google, Twitter, and Amazon, plus thousands (if not millions) more, the ability to interact with customers across the globe, continuously, with the necessary availability and partition tolerance, while keeping their costs down, their systems up, and their customers happy. Of course they would all like to have complete consistency all the time, but as Dan Pritchett discusses in his article "BASE: An ACID Alternative," there have to be trade-offs, and eventual consistency allowed for the effective development of systems that could deal with the exponential increase of data due to social networking, cloud computing, and other big data projects.

Why NoSQL Is Effective for Mobile Devices

NoSQL databases are designed to handle the dynamic needs of mobile applications.
NoSQL databases do not use fixed schemas. So, in the example used above, adding new characters does not require developers to make drastic changes to the database; the developer would just be adding to the database rather than altering an existing schema.

I mentioned the different use cases that mobile applications must address. This is another issue that NoSQL databases help fix. One of the best examples of NoSQL databases handling the complex use cases of mobile users is Foursquare. Because Foursquare is location based, the results users get from queries, and even the options available to them, differ based on location. The geospatial capabilities of an open source NoSQL database such as MongoDB make it possible for developers to easily add location-aware features.

Another issue with mobile applications that NoSQL addresses is the need for constant updates. After an application has been released, maintenance becomes a major concern. Because NoSQL is document based, fixing certain types of bugs and other problems doesn't require a complete overhaul of the database, because the changes made by developers don't necessarily affect every other aspect of the application.

Finally, NoSQL is well known for its scalability. Unlike relational databases, NoSQL databases scale outward rather than vertically. This is important because as the application's user base grows, so will the amount of data being stored in the database. It's important to have a growth strategy in place prior to developing an application, because worrying about data constraints after the application has been released will result in downtime for maintenance and upset users.

REST stands for Representational State Transfer, and it was proposed in Roy Fielding's doctoral dissertation. It uses the four HTTP methods GET, POST, PUT, and DELETE to execute different operations.
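To make the verb-to-operation mapping concrete, here is a toy in-process dispatcher over a hypothetical /accounts resource. No real HTTP is involved, and the routes, payloads, and account data are all invented for illustration.

```python
# Toy REST-style router: each (HTTP verb, URI) pair selects one operation
# on an "accounts" resource. All names and records here are hypothetical.
accounts = {1: {"owner": "alice", "balance": 100}}
next_id = 2

def handle(method, path, body=None):
    """Dispatch one (verb, URI) pair to the matching resource operation."""
    global next_id
    parts = path.strip("/").split("/")          # "/accounts/1" -> ["accounts", "1"]
    if parts[0] != "accounts":
        return 404, None
    if method == "GET" and len(parts) == 1:     # GET /accounts       -> list all
        return 200, accounts
    if method == "GET" and len(parts) == 2:     # GET /accounts/<id>  -> read one
        acct = accounts.get(int(parts[1]))
        return (200, acct) if acct else (404, None)
    if method == "POST" and len(parts) == 1:    # POST /accounts      -> create
        accounts[next_id] = body
        next_id += 1
        return 201, next_id - 1
    if method == "PUT" and len(parts) == 2:     # PUT /accounts/<id>  -> replace
        accounts[int(parts[1])] = body
        return 200, body
    if method == "DELETE" and len(parts) == 2:  # DELETE /accounts/<id>
        accounts.pop(int(parts[1]), None)
        return 204, None
    return 405, None

status, new_id = handle("POST", "/accounts", {"owner": "bob", "balance": 5})
```

A GET on /accounts/2 now returns bob's record, and a DELETE on the same URI removes it: the URI names the resource and the verb alone selects the operation, in contrast with SOAP-style named commands.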
REST's use of standard verbs contrasts with SOAP (Simple Object Access Protocol), for example, which creates new arbitrary commands (verbs) like getAccounts() or applyDiscount().

A REST API is a set of operations that can be invoked by means of any of the four verbs, using the URI itself as the parameters for your operations. For example, you may have a method to query all your accounts which can be called from /accounts/all/; this invokes an HTTP GET, and the "all" parameter tells your application that it shall return all accounts.

A RESTful web API (also called a RESTful web service) is a web API implemented using HTTP and REST principles. It is a collection of resources, with four defined aspects:
• the base URI for the web API, such as http://example.com/resources/
• the Internet media type of the data supported by the web API; this is often JSON but can be any other valid Internet media type, provided that it is a valid hypertext standard
• the set of operations supported by the web API using HTTP methods (e.g., GET, PUT, POST, or DELETE)
• the API must be hypertext driven

Good examples of horizontal scaling are Cassandra and MongoDB (scale out: add nodes). Vertical scaling is typical of relational data and has limitations (scale up: add CPU/RAM to an existing machine).

NoSQL versus relational columnar databases – is NoSQL right for you?

Relational columnar databases such as Sybase IQ continue to use a relational model and are accessed via traditional SQL. The physical storage structure is very different from non-relational NoSQL columnar stores, which store data as rows whose structure may vary and which are organized by the developer into families of columns according to the application use case. Relational columnar databases, on the other hand, require a fixed schema with each column physically distinct from the others, which makes it impossible to declaratively optimize retrievals by organizing columns into logical units or families.
Because a NoSQL database retrieval can specify one or more column families while ignoring others, NoSQL databases can offer a significant advantage when performing individual row queries. However, NoSQL databases cannot meet the performance characteristics of relational columnar databases when it comes to retrieving aggregated results from groups of underlying records.

This distinction is a litmus test when deciding between NoSQL and traditional SQL databases. NoSQL databases are not as flexible but are exceptional at speedily returning individual rows from a query. Traditional SQL databases, on the other hand, forfeit some storage capacity and scalability but provide extra flexibility with a standard, more familiar SQL interface.

Since relational databases must adhere to a schema, they typically need to reserve space even for unused columns. NoSQL databases have a dense per-row schema and so tend to be better at optimizing the storage of sparse data, although relational databases often use sophisticated storage-optimization techniques to mitigate this perceived shortcoming.

Most importantly, relational columnar databases are generally intended for the read-only access found in conjunction with data warehouses, which provide data that was loaded collectively from conventional data stores. This can be contrasted with NoSQL columnar tables, which can handle a much higher rate of updates.
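The row-plus-column-family layout described above can be sketched with plain Python structures; the row keys, family names, and values are made up for illustration. The point is that a read materializes only the families it names, and sparse rows simply omit families they don't have.

```python
# Toy wide-column layout: each row key maps to named column families,
# and a read touches only the families it asks for. All names invented.
table = {
    "user:42": {
        "profile": {"name": "Ada", "email": "ada@example.com"},
        "activity": {"last_login": "2013-05-01", "logins": 17},
    },
    "user:43": {
        "profile": {"name": "Grace"},   # sparse row: no 'activity' family yet
    },
}

def get_row(key, families):
    """Return only the requested column families for one row."""
    row = table.get(key, {})
    return {fam: row[fam] for fam in families if fam in row}

# Only the 'profile' family is read; 'activity' is never touched.
print(get_row("user:42", ["profile"]))
```

Because the sparse row for "user:43" stores no placeholder for the missing family, nothing is reserved for unused columns, which is the storage advantage described above.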
  • An in-memory database (IMDB) relies on main memory for computer data storage, in contrast with database management systems that employ a disk storage mechanism. Main memory databases are faster than disk-optimized databases because the internal optimization algorithms are simpler and execute fewer CPU instructions. Accessing data in memory eliminates seek time when querying the data, which provides faster and more predictable performance than disk.[1][2]

In applications where response time is critical, such as telecommunications network equipment and mobile advertising networks, main memory databases are often used.[3] IMDBs have gained a lot of traction, especially in the data analytics space, starting in the mid-2000s, mainly due to cheaper RAM.[4][5] With the introduction of NVDIMM technology,[6] in-memory databases can now run at full speed and maintain data in the event of power failure. In-memory allows analytics that are unconstrained by hardware and software limitations.

The NVDIMM (non-volatile dual in-line memory module) is a mixed memory subsystem that combines the speed and endurance of DRAM with the non-volatile data retention properties of NAND flash. NVDIMMs using DRAM and NAND technology can deliver high-speed, low-latency "non-volatile / persistent" memory with unlimited read/write activity that can sustain itself through a host power failure or a system crash. The NVDIMM can be viewed as the first commercially viable "storage class memory" for the enterprise computing market.

(In digital electronics, a NAND gate (Negated AND or NOT AND) is a logic gate which produces an output that is false only if all its inputs are true. A LOW (0) output results only if both inputs to the gate are HIGH (1); if one or both inputs are LOW (0), a HIGH (1) output results. It is made using transistors. The NAND gate is significant because any Boolean function can be implemented by using a combination of NAND gates.
This property is called functional completeness.)

With pressure growing by the day to make their products quicker to deploy and easier to use, business intelligence (BI) and data warehouse vendors are increasingly turning to in-memory technology in place of traditional disk-based storage to speed up implementations and extend self-service capabilities. Traditional BI technology loads data onto disk, often in the form of intricately modeled tables and multidimensional cubes, which can take weeks or months to develop; queries are then made against the tables and cubes on disk. In-memory technology removes these steps: data is loaded into random access memory and queried in the application or database itself. This greatly increases query speed and lessens the amount of data modeling needed, experts agree, meaning that in-memory BI apps can be up and running significantly faster than disk-based tools.

Caching on disk is not as efficient as in-memory: caching is the process whereby on-disk databases keep frequently accessed records in memory for faster access. However, caching only speeds up retrieval of information, or "database reads." Any database write – that is, an update to a record or creation of a new record – must still be written through the cache to disk. So the performance benefit only applies to a subset of database tasks.
In addition, managing the cache is itself a process that requires substantial memory and CPU resources, so even a "cache hit" underperforms an in-memory database.

In-memory technology is emerging now thanks both to increased customer demand for fast and flexible operational BI and data analysis capabilities and to technological innovation, specifically the emergence of 64-bit processors. 64-bit processors, which began to replace 32-bit processors in personal computers earlier this decade, significantly increased the amount of data that can be stored in memory and ultimately helped reduce the price of memory, which traditionally had been much more expensive than disk, spurring its use in enterprise applications.

http://whatis.techtarget.com/definition/in-memory-database

An in-memory database (IMDB, also known as a main memory database or MMDB) is a database whose data is stored in main memory to facilitate faster response times. Source data is loaded into system memory in a compressed, non-relational format. In-memory databases streamline the work involved in processing queries. An IMDB is one type of analytic database, which is a read-only system that stores historical data on metrics for business intelligence/business analytics (BI/BA) applications, typically as part of a data warehouse or data mart. These systems allow users to run queries and reports on the information contained, which is regularly updated to incorporate recent transaction data from an organization's operational systems.

In addition to providing extremely fast query response times, in-memory analytics can reduce or eliminate the need for data indexing and for storing pre-aggregated data in OLAP cubes or aggregate tables. This capacity reduces IT costs and allows faster implementation of BI/BA applications. Three developments in recent years have made in-memory analytics increasingly feasible: 64-bit computing, multi-core servers, and lower RAM prices.
In-memory analytics is an approach to querying data when it resides in a computer's random access memory (RAM), as opposed to querying data that is stored on physical disks. This results in vastly shortened query response times, allowing business intelligence (BI) and analytic applications to support faster business decisions.

As the cost of RAM declines, in-memory analytics is becoming feasible for many businesses. BI and analytic applications have long supported caching data in RAM, but older 32-bit operating systems provided only 4 GB of addressable memory. Newer 64-bit operating systems, with up to 1 terabyte (TB) of addressable memory (and perhaps more in the future), have made it possible to cache large volumes of data – potentially an entire data warehouse or data mart – in a computer's RAM. As noted above, this reduces IT costs and allows faster implementation of BI and analytic applications. It is anticipated that as BI and analytic applications embrace in-memory analytics, traditional data warehouses may eventually be used only for data that is not queried frequently.
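SQLite's ":memory:" mode gives a minimal, concrete taste of the idea: the whole database lives in RAM, and ad hoc aggregates are computed on demand rather than read from pre-built cubes or aggregate tables. The table and rows below are invented for illustration.

```python
import sqlite3

# An in-memory database: every page lives in RAM, so queries never seek on disk.
mem = sqlite3.connect(":memory:")
mem.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
mem.executemany("INSERT INTO sales VALUES (?, ?)",
                [("east", 100), ("west", 250), ("east", 50)])

# With the data resident in memory, aggregates are cheap to compute on demand
# instead of being pre-aggregated into OLAP cubes or summary tables.
rows = mem.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('east', 150), ('west', 250)]
```

A production in-memory analytic engine adds compression, columnar layouts, and parallelism on top of this basic idea, but the elimination of disk access on the query path is the same.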
  • http://www.dwbiconcepts.com/data-warehousing/18-dwbi-basic-concepts/102-nosql-database-tutorial.html

Optimize the environment based on the analytical workload. The type of database to select depends on the characteristics of the data and the mix of workloads (transaction/batch). For high-velocity capture and analysis, you need a "scale up" approach, i.e., vertical scaling using in-memory databases. For high-variety datasets, you need the ability to distribute processing and leverage parallelism. As illustrated, the relational database has limited ability to scale up and out. As data volume increases (with velocity and/or variety), you can see that both in-memory and MPP are necessary. Our sponsor, SAP, offers solutions that embed in-memory and MPP: HANA is an in-memory, appliance-based database, while Sybase IQ is an on-disk, software-only (commodity hardware) database. Both use columnar structures to optimize performance, with a compression rate of at least a factor of 5x that of row-oriented relational databases.

An analytic database, also called an analytical database, is a read-only system that stores historical data on business metrics such as sales performance and inventory levels. Business analysts, corporate executives, and other workers can run queries and reports against an analytic database. The information is updated on a regular basis to incorporate recent transaction data from an organization's operational systems. An analytic database is specifically designed to support business intelligence (BI) and analytic applications, typically as part of a data warehouse or data mart. This differentiates it from an operational, transactional, or OLTP database, which is used for transaction processing – i.e., order entry and other "run the business" applications.
Databases that do transaction processing can also be used to support data warehouses and BI applications, but analytic database vendors claim that their products offer performance and scalability advantages over conventional relational database software. There currently are five main types of analytic databases on the market:
• Columnar databases, which organize data by columns instead of rows – thus reducing the number of data elements that typically have to be read by the database engine while processing queries.
• Data warehouse appliances, which combine the database with hardware and BI tools in an integrated platform that's tuned for analytical workloads and designed to be easy to install and operate.
• In-memory databases, which load the source data into system memory in a compressed, non-relational format in an attempt to streamline the work involved in processing queries.
• Massively parallel processing (MPP) databases, which spread data across a cluster of servers, enabling the systems to share the query processing workload.
• Online analytical processing (OLAP) databases, which store multidimensional "cubes" of aggregated data for analyzing information based on multiple data attributes.
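The first item, columnar organization, can be illustrated with plain Python structures; the records are invented. Summing one attribute over a row store touches every field of every record, while the column store reads exactly one array.

```python
# The same tiny table stored two ways. Both give the same answer, but the
# column store reads far fewer data elements for a single-column aggregate.
row_store = [
    {"id": 1, "name": "a", "price": 10, "qty": 2},
    {"id": 2, "name": "b", "price": 20, "qty": 1},
]
column_store = {
    "id": [1, 2], "name": ["a", "b"], "price": [10, 20], "qty": [2, 1],
}

total_row = sum(r["price"] for r in row_store)   # scans whole records
total_col = sum(column_store["price"])           # scans the one needed column
print(total_row, total_col)  # 30 30
```

With two rows the difference is invisible, but with billions of wide rows, reading one column instead of every column is exactly the reduction in "data elements that have to be read" described above.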
  • The left side is your relational brain, modeled around many tables, each with rows, columns, attributes, and keys. The right side is your more casual brain, mixing formal and distributed data structures that scale horizontally (maximizing in-memory); it is used for exploring large amounts of data, when performance and real-time behavior are more important than consistency.

Query: trying to find a needle in a haystack.
Discovery: trying to find a specific needle in a needle stack.

But discovery problems are being worked on every day. Intelligence analysts are trying to discover new threats. Medical researchers are trying to discover new drugs. Financial analysts are trying to discover new trading strategies. Across every industry, discovery problems abound – and solving these discovery problems generates incredible value to organizations, whether it's discovering a new threat, a new drug, a new trading strategy, etc.
  • We have all heard about the explosive growth in data. The more important story is the software and hardware advancements that allow users to explore this data.

Data Growth Chart
1 Bit = Binary Digit
8 Bits = 1 Byte
1000 Bytes = 1 Kilobyte
1000 Kilobytes = 1 Megabyte
1000 Megabytes = 1 Gigabyte
1000 Gigabytes = 1 Terabyte
1000 Terabytes = 1 Petabyte
1000 Petabytes = 1 Exabyte
1000 Exabytes = 1 Zettabyte
1000 Zettabytes = 1 Yottabyte
1000 Yottabytes = 1 Brontobyte
1000 Brontobytes = 1 Geopbyte

Wikipedia definition that was crafted by melding together commentary from analysts at The 451, IDC, Monash Research, a TDWI alumnus, and a Gartner expert: "Big Data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process within a tolerable elapsed time. Big Data sizes are a constantly moving target, currently ranging from a few dozen terabytes to many petabytes in a single data set."

RDBMS store gigabytes of data (for transactional data).
Data warehouses store terabytes of information (for analysis).
Big data repositories hold petabytes of data and growing.
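The chart above uses decimal, power-of-1000 prefixes, which translate directly into a small conversion helper (a sketch; the unit list stops at the standard SI-named units):

```python
# Each unit in the chart is 1000x the previous one (decimal prefixes).
UNITS = ["Byte", "Kilobyte", "Megabyte", "Gigabyte", "Terabyte",
         "Petabyte", "Exabyte", "Zettabyte", "Yottabyte"]

def human_size(n_bytes):
    """Express a byte count in the largest applicable unit from the chart."""
    i = 0
    while n_bytes >= 1000 and i < len(UNITS) - 1:
        n_bytes /= 1000.0
        i += 1
    return f"{n_bytes:g} {UNITS[i]}"

print(human_size(3_500_000_000_000))  # 3.5 Terabyte
```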
  • Answers to questions from attendees
  • The overall goal is to move from exploratory ideation and prototyping to high-value business intelligence that a program or policy organization can act upon for the benefit of customers and taxpayers.

The typical path to actionable business intelligence begins with several iterations of customer-driven design-thinking sessions, using observations, questions, and hypotheses to explore the customers' complex problems and their context (e.g., the customers' customer or the regulatory environment). The initial exploratory session is focused on defining the problem, understanding the impact on customers and users, and rapidly creating and testing alternative solutions on paper.

During subsequent sessions, customers and analysts work together to mature the prototype by aggregating, filtering, and correlating data using cloud-based software tools. It is important to confirm (reality test) any insights resulting from exploratory activities before making policy or program changes. The entire process takes a small, diverse team from 60 to 120 days.
  • Distributed DW architecture. The issue in a multi-workload environment is whether a single-platform data warehouse can be designed and optimized such that all workloads run optimally, even when concurrent. More DW teams are concluding that a multi-platform data warehouse environment is more cost-effective and flexible. Plus, some workloads receive better optimization when moved to a platform beside the data warehouse. In reaction, many organizations now maintain a core DW platform for traditional workloads but offload other workloads to other platforms. For example, data and processing for SQL-based analytics are regularly offloaded to DW appliances and columnar DBMSs. A few teams offload workloads for big data and advanced analytics to HDFS, discovery platforms, MapReduce, and similar platforms. The result is a strong trend toward distributed DW architectures, where many areas of the logical DW architecture are physically deployed on standalone platforms instead of the core DW platform.

Big data requires a new generation of scalable technologies designed to extract meaning from very large volumes of disparate, multi-structured data by enabling high-velocity capture, discovery, and analysis.

Source of second graphic: http://www.saama.com/blog/bid/78289/Why-large-enterprises-and-EDW-owners-suddenly-care-about-BigData
http://www.cloudera.com/content/dam/cloudera/Resources/PDF/Hadoop_and_the_Data_Warehouse_Whitepaper.pdf

Complex Hadoop jobs can use the data warehouse as a data source, simultaneously leveraging the massively parallel capabilities of two systems. Any MapReduce program can issue SQL statements to the data warehouse. In one context, a MapReduce program is "just another program," and the data warehouse is "just another database." Now imagine 100 MapReduce programs concurrently accessing 100 data warehouse nodes in parallel. Both raw processing and the data warehouse scale to meet any big data challenge.
Inevitably, visionary companies will take this step to achieve competitive advantages.

Promising Uses of Hadoop that Impact DW Architectures

I see a handful of areas in data warehouse architectures where HDFS and other Hadoop products have the potential to play positive roles:

Data staging. A lot of data processing occurs in a DW's staging area, to prepare source data for specific uses (reporting, analytics, OLAP) and for loading into specific databases (DWs, marts, appliances). Much of this processing is done by homegrown or tool-based solutions for extract, transform, and load (ETL). Imagine staging and processing a wide variety of data on HDFS. Users who prefer to hand-code most of their ETL solutions will most likely feel at home in code-intense environments like Apache MapReduce, and they may be able to refactor existing code to run there. For users who prefer to build their ETL solutions atop a vendor tool, the community of vendors for ETL and other data management tools is rolling out new interfaces and functions for the entire Hadoop product family.

Note that I'm assuming that (whether you use Hadoop or not) you should physically locate your data staging area(s) on standalone systems outside the core data warehouse, if you haven't already. That way, you preserve the core DW's capacity for what it does best: squeaky clean, well-modeled data (with an audit trail via metadata and master data) for standard reports, dashboards, performance management, and OLAP. In this scenario, the standalone data staging area(s) offload most of the management of big data, the archiving of source data, and much of the data processing for ETL, data quality, and so on.

Data archiving. When organizations embrace forms of advanced analytics that require detailed source data, they amass large volumes of source data, which taxes the areas of the DW architecture where source data is stored. Imagine managing detailed source data as an archive on HDFS.
You probably already do archiving with your data staging area, though you probably don't call it archiving. If you think of it as an archive, maybe you'll adopt the best practices of archiving, especially information lifecycle management (ILM), which I feel is valuable but woefully absent from most DWs today. Archiving is yet another thing the staging area in a modern DW architecture must do, thus another reason to offload the staging area from the core DW platform. Traditionally, enterprises had three options when it came to archiving data: leave it within a relational database, move it to tape or optical disk, or delete it. Hadoop's scalability and low cost enable organizations to keep far more data in a readily accessible online environment. An online archive can greatly expand applications in business intelligence, advanced analytics, data exploration, auditing, security, and risk management.

Multi-structured data. Relatively few organizations are getting BI value from semi- and unstructured data, despite years of wishing for it. Imagine HDFS as a special place within your DW environment for managing and processing semi-structured and unstructured data. Another way to put it: imagine not stretching your RDBMS-based DW platform to handle data types that it's not all that good with. One of Hadoop's strongest complements to a DW is its handling of semi- and unstructured data. But don't go thinking that Hadoop is only for unstructured data: HDFS handles the full range of data, including structured forms, too. In fact, Hadoop can manage just about any data you can store in a file and copy into HDFS.

Processing flexibility. Given its ability to manage diverse multi-structured data, as just described, Hadoop's NoSQL approach is a natural framework for manipulating non-traditional data types. Note that these data types are often free of schema or metadata, which makes them challenging for SQL-based relational DBMSs.
Hadoop supports a variety of programming languages (Java, R, C), thus providing more capabilities than SQL alone can offer. In addition, Hadoop enables the growing practice of "late binding": instead of transforming data as it's ingested by Hadoop (the way you often do with ETL for data warehousing), which imposes an a priori model on the data, structure is applied at runtime. This, in turn, enables the open-ended data exploration and discovery analytics that many users are looking for today.

Advanced analytics. Imagine HDFS as a data stage, archive, or twenty-first-century operational data store that manages and processes big data for advanced forms of analytics, especially those based on MapReduce, data mining, statistical analysis, and natural language processing (NLP). There's much to say about this; in a future blog I'll drill into how advanced analytics is one of the strongest influences on data warehouse architectures today, whether Hadoop is in use or not.

Analyze and Store Approach (ELT?)

The analyze and store approach analyzes data as it flows through business processes, across networks, and between systems. The analytical results can then be published to interactive dashboards and/or published into a data store (such as a data warehouse) for user access, historical reporting, and additional analysis. This approach can also be used to filter and aggregate big data before it is brought into a data warehouse.

There are two main ways of implementing the analyze and store approach:
• Embedding the analytical processing in business processes. This technique works well when implementing business process management and service-oriented technologies because the analytical processing can be called as a service from the process workflow. This technique is particularly useful for monitoring and analyzing business processes and activities in close to real time – action times of a few seconds or minutes are possible here.
The process analytics created can also be published to an operational dashboard or stored in a data warehouse for subsequent use.
• Analyzing streaming data as it flows across networks and between systems. This technique is used to analyze data from a variety of different (possibly unrelated) data sources where the volumes are too high for the store and analyze approach, where sub-second action times are required, and/or where there is a need to analyze the data streams for patterns and relationships. To date, many vendors have focused on analyzing event streams (from trading systems, for example) using the services of a complex event processing (CEP) engine, but this style of processing is evolving to support a wider variety of streaming technologies and data, creating stream analytics from many types of streaming data such as event, video, and GPS data.

The benefits of the analyze and store approach are fast action times and lower data storage overheads, because the raw data does not have to be gathered and consolidated before it can be analyzed.

Using HiveQL to create a load-ready file for a relational database.
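The "late binding" idea from the Hadoop discussion above can be sketched without Hadoop at all: keep raw records untyped and apply structure only when a query runs. The log lines below are invented; the point is that no up-front schema is declared, and records that don't fit simply drop out at read time.

```python
import json

# Schema on read: raw records are stored as-is, and structure is imposed
# only when a query runs. The log lines here are made up for illustration.
raw_log = [
    '{"user": "alice", "event": "login", "ts": 1}',
    '{"user": "bob", "event": "click", "page": "/home", "ts": 2}',  # extra field
    'not json at all',                                              # malformed line
]

def query(raw, predicate):
    """Parse each raw line at read time, skipping records that don't parse."""
    for line in raw:
        try:
            record = json.loads(line)
        except ValueError:
            continue          # structure applied at runtime: bad rows just drop out
        if predicate(record):
            yield record

logins = list(query(raw_log, lambda r: r.get("event") == "login"))
print(logins)  # [{'user': 'alice', 'event': 'login', 'ts': 1}]
```

Contrast this with classic ETL, where the malformed line and the extra "page" field would have to be resolved at load time, before any query could run; here each query decides for itself what shape it expects.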
  • From the EDW to the multi-platform unified data architecture. A consequence of the workload-centric approach (coupled with a reassessment of DW economics) is a trend away from the single-platform monolith of the enterprise data warehouse (EDW) and toward a physically distributed unified data architecture (UDA).[2] A modern UDA is a logical design that assumes deployment onto multiple platform types, ranging from the traditional warehouse (and its satellite systems for marts and ODSs) to new platforms such as DW appliances, columnar DBMSs, NoSQL databases, MapReduce tools, and even a file system on steroids such as HDFS. The multi-platform approach of UDA adds more complexity to the DW environment, yet that complexity is being addressed by vendor R&D to abstract it away and take advantage of the various capability and cost options. Moving data around is inevitable in a multi-platform UDA, so there needs to be a well-defined data integration architecture as well. Even so, an assumption behind UDA is that data structures and their deployment platforms will integrate on the fly (due to the exploratory nature of analytics) in a loosely coupled fashion, so the architecture should define data standards, preferred interfaces, and shared business rules to give loose coupling consistent usage.

Data virtualization is an umbrella term used to describe any approach to data management that allows an application to retrieve and manipulate data without requiring technical details about the data, such as how it is formatted or where it is physically located.

You are probably familiar with the concept of data virtualization if you store photos on the social networking site Facebook. When you upload a photo to Facebook from your desktop computer, you must provide the upload tool with information about the location of the photo – the photo's file path. Once it has been uploaded to Facebook, however, you can retrieve the photo without having to know its new file path.
In fact, you will have absolutely no idea where Facebook is storing your photo, because the Facebook software has an abstraction layer that hides that technical information. This abstraction layer is what some vendors mean when they use the term data virtualization.

The term can be confusing because some vendors use the labels data virtualization and data federation interchangeably. They do, however, mean slightly different things. The goal of data federation technology is to aggregate heterogeneous data from disparate sources and view it in a consistent manner from a single point of access. The term data virtualization, however, simply means that the technical information about the data has been hidden. Strictly speaking, it does not imply that the data is heterogeneous or that it can be viewed from a single point of access.
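A minimal sketch of the abstraction-layer idea follows; the class and method names are hypothetical, not any vendor's API. Callers name the photo logically, and the layer alone knows (and hides) the physical placement.

```python
# Sketch of a data-virtualization layer: callers fetch a photo by logical
# name and never learn where it is physically stored. All names invented.
class PhotoStore:
    def __init__(self):
        self._placements = {}  # logical name -> (backend shard, stored bytes)

    def upload(self, name, data):
        # The layer decides placement; here, a fake shard picked by hash.
        shard = f"shard-{hash(name) % 4}"
        self._placements[name] = (shard, data)

    def fetch(self, name):
        # Callers use the logical name only; the shard never leaks out.
        _shard, data = self._placements[name]
        return data

store = PhotoStore()
store.upload("vacation.jpg", b"...bytes...")
print(store.fetch("vacation.jpg"))  # b'...bytes...'
```

Note that this sketch virtualizes a single homogeneous store; a federation layer would additionally merge heterogeneous backends behind the same single point of access, which is exactly the distinction drawn above.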