Boston Hadoop Meetup, April 26 2012

Daniel Abadi's presentation at the Boston Hadoop Meetup, held on April 26, 2012.

Published in: Technology, Business

  1. The Proliferation of Database Systems and the Data Silo Problem – @daniel_abadi, Yale University / Hadapt, April 26th, 2012
  2. In The Old Days … [diagram label: Database]
  3. In The Old Days … [diagram labels: Database (three instances), ETL Tools, Data Warehouse, Data Integration Tools, External Data, MDM Tools, Data Governance Tools]
  4. One Size Does Not Fit All: Transactional Databases – Single-digit millisecond latencies and high throughput – Store data in rows – Heavy on flash and main memory – Indexing is very important – High availability extremely important
  5. One Size Does Not Fit All: Analytical Databases – Single-digit second latencies (and higher) – Store data in columns – Scale out commodity hardware – Still need magnetic disk – Indexing less important – High availability less important
  6. One Size Does Not Fit All: Streaming Databases – Continuous queries – Data flows through the system – Network latencies are paramount – Drop data to deal with load
  7. Therefore, in my PhD years alone … – Aurora and Borealis projects became StreamBase – C-Store project became Vertica – H-Store project became VoltDB
  8. Right Tool for the Job
  9. What We Have Now … [diagram labels, as far as recoverable: Analytical Datamart, Transactional Database, OLAP DBMS, Hadoop, Reporting and Dashboarding, High Performance Transactional DBMS, Data Warehouse, Streaming DBMS, Column-Store Analytical DBMS, Web DBMS (like MySQL), Web Logs, NoSQL, NewSQL]
  10. What We Have Now … [same diagram labels as slide 9]
  11. What We Have Now … [same diagram labels as slide 9]
  12. What This Leads To… – Very little data provenance – Data silos – Non-identical data copies – Not even close to a single version of the truth
  13. A Potential Way Towards a Solution [diagram labels, as far as recoverable: Hadoop, Data Analysis DBMS (Hive, Hadapt), Data Streaming (Hstreaming, Flume), NoSQL & Simple Xacts & Short Request Processing (HBase, Brisk)]
  14. What This Has Potential to Enable – Fewer data silos – Increased data provenance – Reduced systems management overhead – Better resource utilization and management
  15. But we still need – Hadoop-based data integration tools – MDM and data governance tools for Hadoop – Data provenance tracking across Hadoop projects