Hadoop is rapidly emerging as a viable platform for big data analytics. Thanks to early adoption by organizations like Yahoo and Facebook, and an active open source community, we have seen significant innovation around this platform. With support for relational constructs and a SQL-like query interface, many experts believe that Hadoop will eventually subsume some data warehousing tasks. Even though Hadoop and parallel databases have some architectural similarities, they are designed to solve different problems. In this presentation, you will be introduced to the Hadoop architecture, its salient differences from Netezza, and typical use cases. You will learn about common co-existence deployment models that Netezza's customers have put into practice to draw on the benefits of both technologies. You will also come to understand Netezza's current support for Hadoop and its future strategy.
Hadoop and Netezza - Co-existence or Competition?
1. Hadoop and Netezza Co-existence or competition? Krishnan Parasuraman, CTO - Digital Media, Netezza @kparasuraman Tweet about EnzeeUniverse using #enzee11
6. Open Source Distributed Storage and Processing Engine: Manage complex data (relational and non-relational) in a single repository; fault-tolerant distributed processing; self-healing, distributed storage; abstraction for parallel computing. Store source data forever and analyze as and when needed; commodity hardware (inexpensive storage); process at source (eliminate data movement). Oozie: workflow; Sqoop: integration; Zookeeper: service coordination; Flume, Chukwa, Scribe: data collection.
7. Hadoop: Origin and evolution. Timeline, 2003 to 2011. Phases: Early research; Open source dev momentum; Initial success stories; Commercialization. Milestones shown (not in date order): Google: GFS paper; Google: MapReduce paper; Google: Bigtable paper; Apache: Lucene subproject; Apache: Hadoop project; Apache: HBase project; Yahoo: 10K core cluster; Netezza: Hadoop Connector, MapReduce support.
8. Common Perceptions: Cloud; Large Volumes; Ad-hoc queries; Low cost; Complex Analytics; Unstructured.
9. Parallel data warehouse systems. Diagram labels: SQL; Host controllers (Hosts); Network fabric; Massively parallel compute nodes (each with FPGA, CPU, Memory); Storage Units.
10. Hadoop. Diagram labels: Map Reduce; Master Node (Job Tracker, Name Node); Network fabric; Parallel compute nodes (each with Task Tracker, Data Node); Storage Units.
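The Map Reduce flow named on the Hadoop slide can be illustrated with a minimal, single-process sketch. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample splits are assumptions made for illustration, and the word-count task is simply a familiar stand-in. Each inner list plays the role of a block held by one data node.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map: emit (key, value) pairs for one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values collected for one key."""
    return key, sum(values)


def run_job(splits):
    """Run map over every record of every split, then shuffle and reduce."""
    mapped = chain.from_iterable(
        map_phase(record) for split in splits for record in split
    )
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Hypothetical input: two splits standing in for blocks on two data nodes.
splits = [["hadoop stores data", "hadoop runs jobs"], ["netezza runs sql"]]
counts = run_job(splits)
```

In a real cluster the map calls would run in parallel on the nodes that hold each split, which is what the task tracker and data node pairing in the diagram provides; here everything runs sequentially in one process.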
11. The similarities: Massive parallelism; Execute code & algorithms next to data; Scalable; Highly Available. Diagram labels (as on the Hadoop slide): Map Reduce; Job Tracker, Name Node; Task Tracker, Data Node.
12. The differences: Schema on Read: data loading is fast (Data Loading = File copy; "Look Ma, No ETL"); Batch mode data access, not intended for real time access; Doesn't support Random Access; No joins, no query engine, no types, no SQL. Diagram labels (as on the Hadoop slide): Map Reduce; Job Tracker, Name Node; Task Tracker, Data Node.
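The "schema on read" point can be made concrete with a small sketch: the raw text is kept exactly as it arrived (loading is only a copy), and a schema is applied when the data is read. The sample data, the `SCHEMA` definition, and the `read_with_schema` helper are all hypothetical, chosen only to illustrate the idea.

```python
import csv
import io

# Raw data is stored as it arrived; no validation happens at load time.
RAW_FILE = "1,alice,42.5\n2,bob,not_a_number\n3,carol,17\n"

# Hypothetical schema, decided only at read time: (column name, parser).
SCHEMA = [("id", int), ("name", str), ("amount", float)]


def read_with_schema(raw_text, schema):
    """Apply the schema while reading; rows that do not fit are skipped."""
    for fields in csv.reader(io.StringIO(raw_text)):
        if len(fields) != len(schema):
            continue  # wrong number of columns for this schema
        try:
            yield {
                name: parse(value)
                for (name, parse), value in zip(schema, fields)
            }
        except ValueError:
            continue  # a value did not match the declared type


rows = list(read_with_schema(RAW_FILE, SCHEMA))
total = sum(row["amount"] for row in rows)
```

A different schema could be applied to the same stored bytes later without reloading anything, which is the trade-off the slide contrasts with a warehouse's schema-on-write loading.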
13. Where does it work well? 1. Queryable Archive: Moving computation is cheaper than moving data. 2. Exploratory analysis: Relationships not defined yet; Can't put in a process for ETL; Evolving schema. 3. Complex data: Parallel ETL in Java.
16. Low cost of storing and analyzing not-so-hot data
17. Parse and analyze complex data such as video and images. Data: Point of origination. Files, Structured & Unstructured sources.
18. Netezza-Hadoop: Co-existence use cases: Create context (classification, text mining); Analyze unstructured data; Analyze, report; Parse, aggregate semi-structured data; Active archival; Long running queries; Analyze, report structured data.
24. Use Hadoop for ingesting/parsing web logs, offline analytics. High speed data loader (bidirectional). Weblogs.
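As a rough illustration of the parsing step in this use case, the sketch below turns raw web log lines into structured rows and aggregates them into a small result that could then be loaded into a warehouse. It assumes the logs follow the Common Log Format; the regular expression, the helper names (`parse_line`, `hits_per_path`), and the sample lines are assumptions, not part of the presentation.

```python
import re
from collections import Counter

# Assumed input format: Common Log Format, one request per line.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Return a structured row for one log line, or None if it is malformed."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    row = match.groupdict()
    row["status"] = int(row["status"])
    row["size"] = 0 if row["size"] == "-" else int(row["size"])
    return row


def hits_per_path(lines):
    """Aggregate parsed rows into (path, hit count) pairs, sorted by path."""
    counts = Counter()
    for line in lines:
        row = parse_line(line)
        if row is not None:
            counts[row["path"]] += 1
    return sorted(counts.items())


# Hypothetical sample lines, including one that does not parse.
SAMPLE = [
    '10.0.0.1 - - [10/Oct/2011:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326',
    '10.0.0.2 - - [10/Oct/2011:13:55:40 -0700] "GET /index.html HTTP/1.1" 304 -',
    '10.0.0.1 - - [10/Oct/2011:13:56:01 -0700] "POST /login HTTP/1.1" 200 512',
    'malformed line',
]
hits = hits_per_path(SAMPLE)
```

The aggregated rows are far smaller than the raw logs, which is why it can make sense to parse on the Hadoop side and move only the result across the loader.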
25. Summary: Leveraging the best of both worlds. 1. Hadoop is not a replacement for a parallel data warehouse. 2. Hadoop and Netezza are complementary technologies. 3. Don't let the hype drive the need. 4. We have only solved the integration problem.