Hadoop and Netezza - Co-existence or Competition?

Hadoop is rapidly emerging as a viable platform for big data analytics. Thanks to early adoption by organizations like Yahoo and Facebook, and an active open source community, the platform has seen significant innovation. With support for relational constructs and a SQL-like query interface, many experts believe that Hadoop will eventually subsume some data warehousing tasks. Although Hadoop and parallel databases share some architectural similarities, they are designed to solve different problems. This presentation introduces the Hadoop architecture, its salient differences from Netezza, and typical use cases. It covers the common co-existence deployment models that Netezza's customers have put into practice to benefit from both technologies, as well as Netezza's current support for Hadoop and its future strategy.

Published in: Technology

  1. Hadoop and Netezza - Co-existence or competition? Krishnan Parasuraman, CTO - Digital Media, Netezza (@kparasuraman). Tweet about EnzeeUniverse using #enzee11
  2. The Buzz
  3. (no text on this slide)
  4. Fuelling the debate
  5. A brief history of wannabe RDBMS killers
  6. Open source distributed storage and processing engine
     - Manage complex data, relational and non-relational, in a single repository
     - Fault-tolerant distributed processing
     - Self-healing, distributed storage
     - Abstraction for parallel computing
     - Store source data forever and analyze it as and when needed
     - Commodity hardware: inexpensive storage
     - Process at source: eliminate data movement
     - Oozie: workflow
     - Sqoop: integration
     - ZooKeeper: service coordination
     - Flume, Chukwa, Scribe: data collection
  7. Hadoop: Origin and evolution
     - Milestones (in the order they appear on the slide): Apache Hadoop project; Google MapReduce paper; Apache HBase project; Apache Lucene subproject; Netezza Hadoop Connector and MapReduce support; Google GFS paper; Yahoo 10K core cluster; Google Bigtable paper
     - Timeline axis: 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011
     - Phases: Early research; Open source dev momentum; Initial success stories; Commercialization
  8. Common perceptions: Cloud; Large volumes; Ad-hoc queries; Low cost; Complex analytics; Unstructured
  9. Parallel data warehouse systems (architecture diagram): SQL; Hosts (host controllers); Network fabric; Massively parallel compute nodes, each with FPGA, CPU, and memory; Storage units
  10. Hadoop (architecture diagram): Map Reduce; Master node (Job Tracker, Name Node); Network fabric; Parallel compute nodes, each with a Task Tracker and a Data Node; Storage units
  11. The similarities
      - Massive parallelism
      - Execute code and algorithms next to data
      - Scalable
      - Highly available
  12. The differences
      - Schema on read: data loading is fast
      - Data loading = file copy ("Look Ma, no ETL")
      - Batch mode data access; not intended for real-time access
      - Doesn't support random access
      - No joins, no query engine, no types, no SQL
  13. Where does it work well?
      1. Queryable archive: moving computation is cheaper than moving data
      2. Exploratory analysis: relationships are not defined yet; an ETL process can't be put in place; the schema is still evolving
      3. Complex data: parallel ETL in Java
  14. Imperatives for co-existence
      - Fast data loading: flexible schema until we figure out what we want to do
      - Expressiveness of SQL coupled with the flexibility of procedural code, i.e. MapReduce
      - Low cost of storing and analyzing not-so-hot data
      - Parse and analyze complex data such as video and images
      - Data: point of origination (files, structured and unstructured sources)
  15. Netezza-Hadoop: Co-existence use cases
      - Create context (classification, text mining)
      - Analyze unstructured data
      - Analyze, report
      - Parse, aggregate semi-structured data
      - Active archival
      - Long running queries
      - Analyze, report structured data
  16. Pattern 1: Data ingestion (diagram): Raw weblogs; Hadoop cluster (NameNode, JobTracker, DataNodes, TaskTrackers); Netezza environment; flow steps 1 to 4
  17. Pattern 2: Low cost storage and dynamic provisioning (diagram): Amazon cloud; Amazon S3; Elastic MapReduce; flow steps 1 to 3
  18. Pattern 3: Queryable archive (diagram): Data sources; flow steps 1 and 2
  19. Pattern 4: Support low interaction partners (diagram): Data sources; flow steps 1 to 3
  20. Netezza and Hadoop integration
      - Hadoop/HDFS integration: move data back and forth between Netezza and the Hadoop cluster
      - Use Hadoop for ingesting/parsing web logs and for offline analytics
      - High speed data loader (bidirectional); weblogs
  21. Summary: Leveraging the best of both worlds
      1. Hadoop is not a replacement for a parallel data warehouse
      2. Hadoop and Netezza are complementary technologies
      3. Don't let the hype drive the need
      4. We have only solved the integration problem
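
The "Schema on read" and "Data loading = file copy" points on the differences slide can be illustrated with a small Python sketch. This is not Hadoop or Netezza code: the tab-separated record layout and the names `RAW_LINES`, `PageView`, `load`, and `read_with_schema` are all assumptions made up for the example. It only shows the idea that loading copies data untouched, while structure and types are imposed later, at read time.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional

# Hypothetical raw records, kept exactly as they arrived (tab-separated text).
RAW_LINES = [
    "2011-06-20\t/home\t200\t512",
    "2011-06-20\t/cart\t500\t-",
    "malformed line without tabs",
]


@dataclass
class PageView:
    day: str
    url: str
    status: int
    bytes_sent: Optional[int]


def load(raw_lines: List[str]) -> List[str]:
    """Schema-on-read style load: copy the data as-is, no parsing or validation."""
    return list(raw_lines)


def read_with_schema(stored: List[str]) -> Iterator[PageView]:
    """Apply the schema only when reading; skip lines that do not fit it."""
    for line in stored:
        fields = line.split("\t")
        if len(fields) != 4 or not fields[2].isdigit():
            continue  # bad records surface at read time, not at load time
        day, url, status, size = fields
        yield PageView(day, url, int(status), int(size) if size.isdigit() else None)


stored = load(RAW_LINES)                      # fast: nothing is interpreted
views = list(read_with_schema(stored))        # structure is imposed here
errors = [v.url for v in views if v.status >= 500]
```

A schema-on-write system such as a data warehouse would instead reject or transform the malformed line during loading, which is why loading there is slower but every later query can rely on types.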

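
The "Pattern 1: Data ingestion" slide shows raw weblogs being parsed on the Hadoop cluster before results move to the Netezza environment. The sketch below imitates the map, shuffle, and reduce phases in a single Python process to show the shape of such a job. It does not use the Hadoop API; the simplified log layout and the names `WEBLOGS`, `map_phase`, `shuffle`, `reduce_phase`, and `run_job` are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical weblog lines in a simplified "ip method url status" layout.
WEBLOGS = [
    "10.0.0.1 GET /home 200",
    "10.0.0.2 GET /cart 200",
    "10.0.0.1 GET /home 304",
    "garbage",
    "10.0.0.3 POST /cart 500",
]


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: parse one raw line and emit (url, 1); drop lines that do not parse."""
    parts = line.split()
    if len(parts) == 4:
        yield parts[2], 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group emitted values by key, as a framework does between phases."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse all values for one key into a single aggregate row."""
    return key, sum(values)


def run_job(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Run map, shuffle, and reduce in sequence and return sorted (url, hits) rows."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return sorted(reduce_phase(k, v) for k, v in grouped.items())


hits_per_url = run_job(WEBLOGS)
```

In the co-existence pattern, compact rows like `hits_per_url` are what would be loaded into the warehouse for reporting, while the bulky raw logs stay on the cluster. On a real cluster the map and reduce functions run in parallel on the nodes that hold the data.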