
Aginity "Big Data" Research Lab


Published in: Technology

  1. Introducing Aginity’s “Big Data” Research Lab
  2. What Was The Goal?
     • To build a 10-terabyte database application using desktop-class commodity hardware, an open source operating system, and the leading database systems on the planet. This is not a production-ready system. It is an internal proving ground, a sandbox if you will, in which we are having a ton of fun.
     • We wanted to see what could be built for under $10,000 in hardware cost and $15,000 per terabyte for the data warehouse software.
  3. How much MPP power can $5,682.10 buy in 2009?
     • At least 10 terabytes. In February 2009, Aginity launched our “Big Data” lab. We constructed a 9-box server farm using off-the-shelf components. Our Chief Intelligence Officer, Ted Westerheide, personally oversaw the construction of a 10-terabyte enterprise-wide “data production” system about 10 years ago. The cost at that time? $2.2 million. Here’s the story of how we built similar capabilities for our lab for $5,682.10 in U.S. dollars!
     • Photo captions: Then; And now
  4. The Hardware Parts List and Cost: $5,682.10
  5. The Databases We Are Testing
     • Think of these as “The Big Three”. All matter to us and all are in our lab. Databases such as the ones we work with cost about $15,000 per terabyte per year to operate.
  6. The Foundation
     • The databases are running on SUSE, Novell’s open source Linux.
  7. Some assembly required (about 11 hours)
     • Photo captions: A very ordinary box; A partial shipment arrives; Pieces parts; Fine-tuning the OS; Few tools required; The assembly line; Snapping the front panel in; Ready to boot; The right driver helps; 2/3rds complete; All 9 yards, so to speak
  8. What Are We Doing With This Fun Gear?
     • We have various “bake-offs” in process. Our March/April focus includes MapReduce, in-database analytics, and MPP; details below.
  9. MapReduce
     • MapReduce: Simplified Data Processing on Large Clusters (Google Research; complete article here)
     • MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real-world tasks are expressible in this model, as shown in the paper.
     • Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
     • Our [Google] implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.
     • Over the past five years, the authors and many others at Google have implemented hundreds of special-purpose computations that process large amounts of raw data, such as crawled documents, web request logs, etc., to compute various kinds of derived data, such as inverted indices, various representations of the graph structure of web documents, summaries of the number of p…
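The map/shuffle/reduce phases described above can be sketched in a few lines of Python. This is a toy single-machine word count, not the Google implementation; the function names (`map_fn`, `reduce_fn`, `map_reduce`) are illustrative only:

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # Map: emit an intermediate (word, 1) pair for every word in the document.
    for word in text.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    # Reduce: merge all intermediate values that share the same key.
    return word, sum(counts)

def map_reduce(inputs, map_fn, reduce_fn):
    # Shuffle: group intermediate pairs by key. A real framework does this
    # across thousands of machines; here it is a single in-memory dict.
    groups = defaultdict(list)
    for key, value in inputs:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

docs = [("d1", "big data big lab"), ("d2", "data lab")]
print(map_reduce(docs, map_fn, reduce_fn))
# {'big': 2, 'data': 2, 'lab': 2}
```

The framework's value is that everything outside `map_fn` and `reduce_fn` (partitioning, scheduling, failure handling) is handled by the runtime, which is why the paper stresses that non-experts can use it.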
  10. In-Database Analytics
     • In-Database Analytics: A Passing Lane for Complex Analysis (Seth Grimes, Intelligent Enterprise, December 15, 2008). What once took one company three to four weeks now takes four to eight hours thanks to in-database computation. Here's what Netezza, Teradata, Greenplum and Aster Data Systems are doing to make it happen.
     • A next-generation computational approach is earning front-line operational relevance for data warehouses, long a resource appropriate solely for back-office, strategic data analyses. Emerging in-database analytics exploits the programmability and parallel-processing capabilities of database engines from vendors Teradata, Netezza, Greenplum, and Aster Data Systems. The programmability lets application developers move calculations into the data warehouse, avoiding data movement that slows response time. Coupled with performance and scalability advances that stem from database platforms with parallelized, shared-nothing (MPP) architectures, database-embedded calculations respond to growing demand for high-throughput, operational analytics for needs such as fraud detection, credit scoring, and risk management.
     • Data-warehouse appliance vendor Netezza released its in-database analytics capabilities last May, and in September the company announced five partner-developed applications that rely on in-database computations to accelerate analytics. "Netezza's [on-stream programmability] enabled us to create applications that were not possible before," says Netezza partner Arun Gollapudi, CEO of Systech Solutions.
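The core idea, moving the calculation to the data instead of pulling rows out, can be illustrated with Python's built-in sqlite3. This is a stand-in only; none of the vendors above expose this API, and the `risk_score` function is a hypothetical fraud-scoring rule invented for the example:

```python
import sqlite3

# Toy in-memory "warehouse" with a transactions table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txn (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO txn VALUES (?, ?)",
                 [(1, 120.0), (2, 4500.0), (3, 80.0)])

def risk_score(amount):
    # Hypothetical scoring rule: flag any transaction over $1,000.
    return 1.0 if amount > 1000 else 0.0

# Register the function inside the database engine, analogous in spirit
# to the database-embedded UDFs the article describes.
conn.create_function("risk_score", 1, risk_score)

# The calculation runs where the data lives; only the small
# result set crosses the wire back to the application.
high_risk = conn.execute(
    "SELECT id FROM txn WHERE risk_score(amount) = 1.0").fetchall()
print(high_risk)  # [(2,)]
```

The alternative, fetching every row to the client and scoring it there, is exactly the data movement the article says slows response time; on a 10-terabyte table the difference is dramatic.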
  11. Massively Parallel Processing (MPP)
     • Degrees of Massively Parallel Processing (John O'Brien, InfoManagement Direct, February 26, 2009)
     • The concept of linear growth is obsolete. In the closing decades of the 20th century, we got used to the rapid pace of change, but the shape of that change was still one of incremental growth. Now we're contending with a breakneck speed of change and exponential growth almost everywhere we look, especially with the information we generate. As documented in "Richard Winter's Top Ten" report from 2005, the very largest databases in the world then are dwarfed by today's databases.
     • The fact that the entire Library of Congress's holdings comprised 20 terabytes of data was breathtaking. Today, some telecommunications, energy and financial companies can generate that much data in a month. Even midsized organizations are coping with data sets that will soon outgrow the Library of Congress.
     • MPP is a class of architectures aimed specifically at addressing the processing requirements of very large databases. MPP architecture has been accepted as the only way to go at the high end of the data warehousing world. If it's so well suited to very large data warehouses, why hasn't everyone adopted it? The answer lies in its previous complexity. Engineering an MPP system is difficult and remains the purview of organizations and specialized vendors that have a deep layer of dedicated R&D resources. These specialized vendors are bringing solutions to market that shield the user from the complexity of implementing their own MPP systems. These solutions take a variety of forms, such as custom-built deployments, software/hardware configurations and all-in-one appliances.
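The shared-nothing pattern at the heart of MPP, partition the data, let each node scan only its own shard, then combine small partial results, can be sketched on one machine. This is an illustrative sketch only (the names `node_scan` and `mpp_total` are invented here), and a thread pool merely stands in for what would be separate machines with their own CPU, memory, and disk:

```python
from multiprocessing.dummy import Pool  # thread pool stands in for real nodes

def node_scan(shard):
    # Each "node" aggregates only its own shard. In a real MPP system this
    # runs on a separate machine; no memory or disk is shared between nodes.
    return sum(row["amount"] for row in shard)

def mpp_total(rows, nodes=4):
    # Partition the rows across the nodes, scan the shards in parallel,
    # then combine the small per-node partials into the final answer.
    shards = [rows[i::nodes] for i in range(nodes)]
    with Pool(nodes) as pool:
        partials = pool.map(node_scan, shards)
    return sum(partials)

rows = [{"amount": float(i)} for i in range(1, 101)]
print(mpp_total(rows))  # 5050.0
```

The hard parts the article alludes to, even data distribution, node failures, and redistributing rows between nodes for joins, are exactly what the sketch omits and what the specialized vendors' appliances handle for the user.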