Aginity Big Data Research Lab



  1. Introducing Aginity’s “Big Data” Research Lab. Launched March 2009.
  2. Background: Google changed everything. What makes Google great isn’t the user interface, the word processor, or even Gmail, although these are great tools. What made Google great was its massive database of searches and indexes to content, which allows it to understand what you are searching for even better than you do yourself. Google is a database company. It processes more data every day than almost any other company in the world. And unlike other big-data companies, most of Google’s data is unstructured. To pull this off, Google invented a new class of database that performs analytics on the fly, “in-database,” over largely unstructured data on large clusters of off-the-shelf computers. From this work was launched a new class of data warehouse that we believe will change the world.
  3. What Was Our Goal? We wanted to see what could be built with the framework Google invented, for under $10,000 in hardware cost and $15,000 per terabyte for the data warehouse software. Our goal was to build a 10-terabyte, always-on MPP data warehouse using desktop-class commodity hardware, an open-source operating system, and the leading MPP database software on the planet. This is a technology sandbox in which we are seeing how close we can get to the $2 million data warehouse of five years ago for $10,000 to $20,000. Obviously this is not a production-class system, but it is a good illustration of the power of the latest software-only “Big Data” systems and of Aginity’s mastery of those systems.
  4. What Is an MPP Data Warehouse? MPP, or Massively Parallel Processing, is a class of architectures aimed specifically at the processing requirements of very large databases. MPP architecture has been accepted as the only way to go at the high end of the data warehousing world. (“Degrees of Massively Parallel Processing,” John O'Brien, InfoManagement Direct, February 26, 2009)
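The shared-nothing idea behind MPP can be sketched on a single machine: each "node" (here, a worker process) owns its own partition of the rows, computes a partial aggregate locally, and only the small partial results travel to a coordinator. This is a minimal illustrative sketch, not the lab's actual setup; the function names and data are made up.

```python
# Hypothetical sketch of shared-nothing MPP aggregation.
# Each worker process plays the role of one database node.
from multiprocessing import Pool

def partial_sum(partition):
    # Runs on one "node": aggregate only the rows this node owns.
    return sum(partition)

def mpp_sum(rows, nodes=3):
    # Partition rows across nodes (round-robin for simplicity;
    # a real MPP engine would hash-partition on a distribution key).
    partitions = [rows[i::nodes] for i in range(nodes)]
    with Pool(nodes) as pool:
        partials = pool.map(partial_sum, partitions)
    # The coordinator merges the small per-node partial results.
    return sum(partials)

if __name__ == "__main__":
    print(mpp_sum(list(range(1, 101))))  # 5050
```

The design point is that the full data set never moves: each node scans only its local slice, so query time scales down roughly with the number of nodes.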
  5. What Is MapReduce? MapReduce was invented by Google; it is a programming model and an associated implementation for processing and generating large data sets. The core ideas of MapReduce are:
     • MapReduce isn’t about data management, at least not primarily. It’s about parallelism.
     • In principle, any alphanumeric data at all can be stuffed into tables. But in high-dimensional scenarios those tables are super-sparse, and that is when MapReduce can offer big advantages by bypassing relational databases. Examples of such scenarios are found in CRM and relationship analytics.
     • MapReduce offers dramatic performance gains in analytic application areas that still need great performance speed-up.
     • On its own, MapReduce can do a lot of important work in data manipulation and analysis. Integrating it with SQL should just increase its applicability and power.
     • At its core, most data analysis is really pretty simple: it boils down to arithmetic, Boolean logic, sorting, and not a lot else. MapReduce can handle a significant fraction of that.
     • MapReduce isn’t needed for tabular data management; that has been efficiently parallelized in other ways. But if you want to build non-tabular structures such as text indexes or graphs, MapReduce turns out to be a big help.
     (Source: DBMS2)
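The last bullet, building non-tabular structures such as text indexes, can be sketched with the MapReduce pattern in a few lines. This is a single-machine illustration with made-up names, not production code: map emits (word, doc_id) pairs, and the grouping step builds the posting lists of an inverted index.

```python
# Illustrative sketch: building an inverted text index MapReduce-style.
from collections import defaultdict

def map_doc(doc_id, text):
    # map: emit an intermediate (word, doc_id) pair per distinct word.
    for word in set(text.lower().split()):
        yield (word, doc_id)

def build_index(docs):
    # shuffle/reduce: group document ids by word into posting lists.
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for word, d in map_doc(doc_id, text):
            index[word].append(d)
    return {word: sorted(ids) for word, ids in index.items()}

docs = {1: "big data lab", 2: "big clusters"}
print(build_index(docs)["big"])  # [1, 2]
```

In a real MapReduce run the map calls would execute in parallel across the cluster and the framework would do the grouping; the output structure, a word-to-documents mapping, is not tabular at all, which is the point of the bullet.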
  6. What Are We Testing?
     • A very large 5 TB database with a 2 TB fact table
     • The ability to do “on-the-fly” analytics at sub-second speed without creating cubes or any form of pre-aggregation
     • Very large, complex queries that span nodes
     • The benefits of using the MapReduce indexing model
     • In-database analytics
     • Fault tolerance at scale: what happens if we unplug one of the nodes during a complex process?
  7. How Much MPP Power Can $5,682.10 Buy in 2009? At least 10 terabytes. We constructed a 9-box server farm using off-the-shelf components. Our Chief Architect, Ted Westerheide, personally oversaw the construction of a 10-terabyte enterprise-wide “data production” system about 10 years ago. The cost at that time? $2.2 million. Here’s the story of how we built similar capabilities for our lab for US$5,682.10. (Photos: the system then, our lab, real-world blade servers)
  8. The Hardware Parts List and Cost: $5,682.10
  9. The Databases We Are Testing. Think of these as “The Big Three.” All matter to us and all are in our lab. Databases such as the ones we work with cost about $15,000 per terabyte per year to operate.
  10. The Foundation. The databases run on SUSE, Novell’s open-source Linux.
  11. About 11 hours to assemble the boxes.
  12. MapReduce. “MapReduce: Simplified Data Processing on Large Clusters,” Google Research (complete article here). MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real-world tasks are expressible in this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our [Google’s] implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day…. Over the past five years, the authors and many others at Google have implemented hundreds of special-purpose computations that process large amounts of raw data, such as crawled documents, web request logs, etc., to compute various kinds of derived data… (continued in paper)
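The model the paper describes, a map function emitting intermediate key/value pairs and a reduce function merging values that share a key, can be shown with the canonical word-count example. This is a single-machine sketch of the programming model only; a real engine shards both phases across a cluster and handles failures, as the excerpt explains.

```python
# Minimal single-machine sketch of the MapReduce model (word count).
from collections import defaultdict

def map_phase(document):
    # map: emit an intermediate (key, value) pair per word.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Group intermediate values by key (done by the framework).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # reduce: merge all values associated with the same intermediate key.
    return (key, sum(values))

def word_count(documents):
    pairs = [p for doc in documents for p in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

print(word_count(["big data", "big clusters"]))
# {'big': 2, 'data': 1, 'clusters': 1}
```

Because map calls are independent and reduce calls touch disjoint keys, both phases parallelize trivially, which is what lets the run-time system spread the work over thousands of machines.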
  13. In-Database Analytics. “In-Database Analytics: A Passing Lane for Complex Analysis,” Seth Grimes, Intelligent Enterprise, December 15, 2008. What once took one company three to four weeks now takes four to eight hours thanks to in-database computation. Here's what Netezza, Teradata, Greenplum and Aster Data Systems are doing to make it happen. A next-generation computational approach is earning front-line operational relevance for data warehouses, long a resource appropriate solely for back-office, strategic data analyses. Emerging in-database analytics exploits the programmability and parallel-processing capabilities of database engines from vendors Teradata, Netezza, Greenplum, and Aster Data Systems. The programmability lets application developers move calculations into the data warehouse, avoiding data movement that slows response time. Coupled with performance and scalability advances that stem from database platforms with parallelized, shared-nothing (MPP) architectures, database-embedded calculations respond to growing demand for high-throughput, operational analytics for needs such as fraud detection, credit scoring, and risk management. Data-warehouse appliance vendor Netezza released its in-database analytics capabilities last May, and in September the company announced five partner-developed applications that rely on in-database computations to accelerate analytics. “Netezza's [on-stream programmability] enabled us to create applications that were not possible before,” says Netezza partner Arun Gollapudi, CEO of Systech Solutions.
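The core idea, moving the calculation into the database instead of moving the data out, can be contrasted in a few lines. This sketch uses sqlite3 purely as a stand-in for the MPP engines named above; the table and column names are invented for illustration.

```python
# Contrast: pull-all-rows-then-compute vs. compute-in-the-database.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txn (account TEXT, amount REAL)")
conn.executemany("INSERT INTO txn VALUES (?, ?)",
                 [("a", 10.0), ("a", 5.0), ("b", 7.5)])

# Client-side analytics: every row travels to the application first.
rows = conn.execute("SELECT account, amount FROM txn").fetchall()
client_totals = {}
for account, amount in rows:
    client_totals[account] = client_totals.get(account, 0.0) + amount

# In-database analytics: the engine aggregates; only results travel.
db_totals = dict(conn.execute(
    "SELECT account, SUM(amount) FROM txn GROUP BY account"))

print(db_totals == client_totals)  # True
```

With a 2 TB fact table, the first approach ships terabytes over the wire; the second ships a handful of result rows, which is where the weeks-to-hours speed-ups quoted above come from, especially when the aggregation itself runs in parallel across MPP nodes.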
  14. Massively Parallel Processing (MPP). “Degrees of Massively Parallel Processing,” John O'Brien, InfoManagement Direct, February 26, 2009. The concept of linear growth is obsolete. In the closing decades of the 20th century, we got used to the rapid pace of change, but the shape of that change was still one of incremental growth. Now we’re contending with a breakneck speed of change and exponential growth almost everywhere we look, especially with the information we generate. As documented in “Richard Winter’s Top Ten” report from 2005, the very largest databases in the world at that time are dwarfed by today’s databases. The fact that the entire Library of Congress’s holdings comprised 20 terabytes of data was breathtaking. Today, some telecommunications, energy, and financial companies can generate that much data in a month. Even midsized organizations are coping with data sets that will soon outgrow the Library of Congress. MPP is a class of architectures aimed specifically at the processing requirements of very large databases. MPP architecture has been accepted as the only way to go at the high end of the data warehousing world. If it’s so well suited to the very large data warehouses, why hasn’t everyone adopted it? The answer lies in its previous complexity. Engineering an MPP system is difficult and remains the purview of organizations and specialized vendors that have a deep layer of dedicated R&D resources. These specialized vendors are bringing solutions to the market that shield the user from the complexity of implementing their own MPP systems. These solutions take a variety of forms, such as custom-built deployments, software/hardware configurations, and all-in-one appliances.