Hadoop and Netezza - Co-existence or Competition?

Krishnan Parasuraman, CTO - Digital Media, Netezza
@kparasuraman
Tweet about Enzee Universe using #enzee11

A brief history of wannabe RDBMS killers

Open Source Distributed Storage and Processing Engine
Manage complex data – relational and non relational – in a single repository
Fault tolerant distributed processing
Self healing, distributed storage
Abstraction for parallel computing
Store source data forever and analyze as and when needed
Commodity hardware – inexpensive storage
Process at source – eliminate data movement
Oozie: Workflow
Sqoop: Integration
Zookeeper: Service coordination
Flume, Chukwa, Scribe: Data collection

Hadoop: Origin and evolution
[Timeline figure, 2003 to 2011]
Phases: Early Research; Open source dev momentum; Initial success stories; Commercialization
Milestones shown: Apache: Hadoop project; Google: MapReduce paper; Apache: HBase project; Apache: Lucene subproject; Netezza: Hadoop Connector, MapReduce support; Google: GFS paper; Yahoo: 10K core cluster; Google: Bigtable paper

Common Perceptions
Cloud
Large Volumes
Ad-hoc queries
Low cost
Complex Analytics
Unstructured

Parallel data warehouse systems
[Architecture diagram. Labels: SQL; Host controllers (Hosts); Network fabric; Massively parallel compute nodes (FPGA, CPU and Memory per node); Storage Units]

Hadoop
[Architecture diagram. Labels: Map Reduce; Master Node (Job Tracker, Name Node); Network fabric; Parallel compute nodes (Task Tracker and Data Node per node); Storage Units]

The similarities
Massive parallelism
Execute code & algorithms next to data
Scalable
Highly Available
[Diagram labels: Map Reduce; Job Tracker; Name Node; Task Tracker and Data Node on each compute node]

The differences
Schema on Read – Data loading is fast
Data Loading = File copy. Look Ma, No ETL
Batch mode data access
Not intended for real time access
Doesn’t support Random Access
No joins, no query engine, no types, no SQL
[Diagram labels: Map Reduce; Job Tracker; Name Node; Task Tracker and Data Node on each compute node]
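The "schema on read" point can be illustrated with a minimal Python sketch. It is not Hadoop code: the sample lines, the `SCHEMA` list and the `read_with_schema` helper are all hypothetical names chosen for illustration. Loading amounts to keeping the raw lines as they arrived, and the structure is applied only when the data is read.

```python
import csv

# Hypothetical raw data. "Loading" is just keeping the lines as they arrived.
RAW_LINES = [
    "2011-06-20,alice,/home,200",
    "2011-06-20,bob,/cart,500",
    "malformed line without enough fields",
    "2011-06-21,carol,/home,n/a",
]

# Hypothetical schema, chosen at query time rather than at load time.
SCHEMA = [("day", str), ("user", str), ("path", str), ("status", int)]


def read_with_schema(lines, schema):
    """Yield one dict per line that fits the schema; skip lines that do not."""
    for row in csv.reader(lines):
        if len(row) != len(schema):
            continue  # wrong field count is only discovered at read time
        try:
            yield {name: cast(value) for (name, cast), value in zip(schema, row)}
        except ValueError:
            continue  # a field could not be converted to the declared type


records = list(read_with_schema(RAW_LINES, SCHEMA))
server_errors = [r for r in records if r["status"] >= 500]
```

The trade-off shown here is that bad records cost nothing at load time but must be handled by every reader, which is the opposite of a warehouse that validates on load.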

Where does it work well?
1. Queryable Archive: Moving computation is cheaper than moving data
2. Exploratory analysis: Relationships not defined yet; Can’t put in a process for ETL; Evolving schema
3. Complex data: Parallel ETL in Java

Imperatives for co-existence

Fast data loading - flexible schema till we figure out what we want to do

Expressiveness of SQL coupled with the flexibility of procedural code, i.e. MapReduce

Low cost of storing and analyzing not-so-hot data

Parse and analyze complex data such as video and images

[Diagram labels: Data: Point of origination; Files, Structured & Unstructured sources]

Netezza-Hadoop: Co-existence use cases
Create context (classification, text mining)
Analyze unstructured data
Analyze, report
Parse, aggregate semi-structured data
Active archival
Long running queries
Analyze, report structured data

Pattern 1: Data ingestion
[Diagram labels: Raw Weblogs; Hadoop Cluster (NameNode, JobTracker; DataNode and TaskTracker on each worker); Netezza Environment; steps numbered 1 to 4]
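The slide does not spell out the four steps, so the following is only a sketch of one plausible reading: the Hadoop side parses and aggregates raw weblogs, and the result is written as delimited rows that a warehouse loader could ingest. It simulates the map, sort and reduce phases in plain Python in a single process. The Common Log Format input, the sample lines and every function name are assumptions made for illustration, not part of the original deck.

```python
import re
from itertools import groupby

# Assumed input format: Apache Common Log Format.
LOG_PATTERN = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3})'
)

# Hypothetical sample of raw weblog lines.
SAMPLE_LINES = [
    '10.0.0.1 - - [20/Jun/2011:10:00:00 -0400] "GET /home HTTP/1.1" 200 512',
    '10.0.0.2 - - [20/Jun/2011:10:00:01 -0400] "GET /home HTTP/1.1" 200 512',
    '10.0.0.3 - - [20/Jun/2011:10:00:02 -0400] "POST /cart HTTP/1.1" 500 0',
    'not a log line',
]


def map_line(line):
    """Map phase: emit ((path, status), 1) for each parseable line."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return  # unparseable lines are dropped
    _host, _timestamp, _method, path, status = match.groups()
    yield (path, int(status)), 1


def reduce_counts(key, values):
    """Reduce phase: sum the counts emitted for one key."""
    return key, sum(values)


def run_job(lines):
    """Run map, then sort by key (standing in for the shuffle), then reduce."""
    pairs = sorted(pair for line in lines for pair in map_line(line))
    return [
        reduce_counts(key, (count for _key, count in group))
        for key, group in groupby(pairs, key=lambda pair: pair[0])
    ]


def to_delimited(rows):
    """Format aggregates as pipe-delimited rows for a bulk loader."""
    return ["|".join([path, str(status), str(count)])
            for (path, status), count in rows]


aggregates = run_job(SAMPLE_LINES)
load_rows = to_delimited(aggregates)
```

On a real cluster the map and reduce functions would run on the TaskTrackers next to the data, and only the much smaller aggregated output would move to the warehouse.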

Pattern 2: Low cost storage and dynamic provisioning
[Diagram labels: Amazon Cloud; Amazon S3; Elastic MapReduce; steps numbered 1 to 3]

Pattern 3: Queryable archive
[Diagram labels: Data Sources; steps numbered 1 and 2]
