Hadoop and Netezza - Co-existence or competition?
Krishnan Parasuraman, CTO - Digital Media, Netezza
@kparasuraman
Tweet about EnzeeUniverse using #enzee11
The Buzz
Fuelling the debate
A brief history of wannabe RDBMS killers
Open Source Distributed Storage and Processing Engine
  • Manage complex data, relational and non-relational, in a single repository
  • Fault-tolerant distributed processing
  • Self-healing, distributed storage
  • Abstraction for parallel computing
  • Store source data forever and analyze as and when needed
  • Commodity hardware: inexpensive storage
  • Process at source: eliminate data movement
  • Oozie: workflow
  • Sqoop: integration
  • ZooKeeper: service coordination
  • Flume, Chukwa, Scribe: data collection
Hadoop: Origin and evolution
  [Timeline, 2003 to 2011. The placement of each milestone on the timeline did not survive extraction.]
  • Milestones: Google: GFS paper; Google: MapReduce paper; Google: Bigtable paper; Apache: Lucene subproject; Apache: Hadoop project; Apache: HBase project; Yahoo: 10K core cluster; Netezza: Hadoop Connector, MapReduce support
  • Phases: Early research; Open source dev momentum; Initial success stories; Commercialization
Common Perceptions
  • Cloud
  • Large volumes
  • Ad-hoc queries
  • Low cost
  • Complex analytics
  • Unstructured
Parallel data warehouse systems
  [Diagram labels: SQL; hosts and host controllers; network fabric; massively parallel compute nodes, each with FPGA, CPU and memory; storage units]
Hadoop
  [Diagram labels: MapReduce; master node with JobTracker and NameNode; network fabric; parallel compute nodes, each with a TaskTracker and a DataNode; storage units]
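The division of labour in the diagram above (map tasks running where the data sits, then a grouped reduce) can be illustrated with a minimal in-process sketch. This is not Hadoop code: the function names `map_phase`, `shuffle`, `reduce_phase` and `run_job` are hypothetical, and each inner list in `splits` merely stands in for a block held by one DataNode.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(line: str) -> Iterator[tuple[str, int]]:
    """Map: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by key."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: collapse the values of one key into a single count."""
    return key, sum(values)


def run_job(splits: list[list[str]]) -> dict[str, int]:
    """Run map over every split, then shuffle and reduce the results.

    Assumption: a real cluster would run one map task per split on the
    node that stores it; here everything runs sequentially in one process.
    """
    emitted = [
        pair
        for split in splits
        for line in split
        for pair in map_phase(line)
    ]
    return dict(reduce_phase(k, v) for k, v in shuffle(emitted).items())


# Two hypothetical splits, as if stored on two different data nodes.
splits = [
    ["hadoop stores data", "netezza stores data"],
    ["hadoop processes data"],
]
counts = run_job(splits)
```

Only the map step needs to see the raw lines, which is why it can be pushed out to the nodes that hold them; the reduce step sees only the much smaller grouped output.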
The similarities
  • Massive parallelism
  • Execute code & algorithms next to data
  • Scalable
  • Highly available
  [Diagram labels: MapReduce; JobTracker and NameNode; a TaskTracker and a DataNode on each node]
The differences
  • Schema on read: data loading is fast
  • Batch mode data access
  • Not intended for real-time access
  • Doesn't support random access
  • No joins, no query engine, no types, no SQL
  • Data loading = file copy
  • Look Ma, no ETL
  [Diagram labels: MapReduce; JobTracker and NameNode; a TaskTracker and a DataNode on each node]
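The "schema on read" and "data loading = file copy" points can be made concrete with a small sketch. Everything here is illustrative: the tab-separated layout, the `load` and `read_with_schema` helpers and the three-column schema are assumptions, not anything specified on the slide.

```python
import io

# Hypothetical raw input: tab-separated, including one line that fits no schema.
RAW = "2011-06-20\t/home\t200\n2011-06-20\t/cart\t500\nmalformed line\n"


def load(raw_text):
    """'Loading' is only a copy: no parsing, no validation, no types."""
    return io.StringIO(raw_text)


def read_with_schema(stored, schema):
    """Apply a schema of (name, caster) pairs at read time.

    Lines that do not fit the schema are skipped instead of being
    rejected at load time, which is the trade-off of schema on read.
    """
    stored.seek(0)
    for line in stored:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != len(schema):
            continue
        try:
            yield {name: cast(value) for (name, cast), value in zip(schema, fields)}
        except ValueError:
            continue


stored = load(RAW)
schema = [("day", str), ("path", str), ("status", int)]
rows = list(read_with_schema(stored, schema))
errors = [row["path"] for row in rows if row["status"] >= 500]
```

A different schema can be applied to the same stored bytes later without reloading anything, at the cost of paying the parsing work on every read.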
Where does it work well?
  1. Queryable archive: moving computation is cheaper than moving data
  2. Exploratory analysis: relationships not defined yet; can't put in a process for ETL; evolving schema
  3. Complex data: parallel ETL in Java
Imperatives for co-existence
  • Fast data loading: flexible schema till we figure out what we want to do
  • Expressiveness of SQL coupled with the flexibility of procedural code, i.e. MapReduce
  • Low cost of storing and analyzing not-so-hot data
  • Parse and analyze complex data such as video and images
  [Diagram labels: Data: point of origination; files, structured & unstructured sources]
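The second imperative, SQL paired with procedural code, might look like the following in miniature. This is a sketch under loud assumptions: an in-memory `sqlite3` database stands in for the warehouse, a plain Python function stands in for the procedural (MapReduce-style) step, and the `classify_agent` rules and sample hits are invented for illustration.

```python
import sqlite3


def classify_agent(user_agent):
    """Procedural step: a crude, hypothetical user-agent classifier.

    Logic like this is awkward to express in SQL but trivial in code.
    """
    ua = user_agent.lower()
    if "bot" in ua or "spider" in ua:
        return "crawler"
    if "mobile" in ua:
        return "mobile"
    return "desktop"


# Hypothetical raw hits: (path, user agent string).
raw_hits = [
    ("/home", "Mozilla/5.0 (iPhone) Mobile Safari"),
    ("/home", "Googlebot/2.1"),
    ("/cart", "Mozilla/5.0 (Windows NT 6.1)"),
    ("/cart", "Mozilla/5.0 (Linux; Android) Mobile"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (path TEXT, segment TEXT)")

# Procedural code derives the structured column before the data reaches SQL.
conn.executemany(
    "INSERT INTO hits VALUES (?, ?)",
    [(path, classify_agent(ua)) for path, ua in raw_hits],
)

# SQL then handles the declarative part: grouping and counting.
summary = conn.execute(
    "SELECT segment, COUNT(*) FROM hits GROUP BY segment ORDER BY segment"
).fetchall()
```

The split mirrors the slide's argument: procedural code creates structure from messy input, and SQL answers questions over that structure.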
Netezza-Hadoop: Co-existence use cases
  • Create context (classification, text mining)
  • Analyze unstructured data
  • Analyze, report
  • Parse, aggregate semi-structured data
  • Active archival
  • Long running queries
  • Analyze, report structured data
Pattern 1: Data ingestion
  [Diagram labels: raw weblogs; Hadoop cluster with NameNode, JobTracker and three DataNode/TaskTracker nodes; Netezza environment; steps numbered 1 to 4]
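The slide does not spell out its four steps, so the following is only one plausible reading of the pattern: raw weblogs are parsed and aggregated on the Hadoop side, and the compact result is written as delimited rows that a warehouse bulk loader could ingest. The log layout (Common Log Format), the regular expression, the pipe delimiter and the helper names are all assumptions.

```python
import re
from collections import Counter

# Assumed input layout: Common Log Format, for example
# 10.0.0.1 - - [20/Jun/2011:10:15:32 -0400] "GET /home HTTP/1.1" 200 512
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)


def parse_line(line):
    """Return (day, path, status) for a well-formed line, else None."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    day = match.group("ts").split(":", 1)[0]  # keep only dd/Mon/yyyy
    return day, match.group("path"), int(match.group("status"))


def aggregate(lines):
    """Count hits per (day, path), silently dropping unparseable lines."""
    hits = Counter()
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None:
            day, path, _status = parsed
            hits[(day, path)] += 1
    return hits


def to_load_rows(hits, delimiter="|"):
    """Render aggregates as delimited rows, sorted for stable output."""
    return [
        delimiter.join([day, path, str(count)])
        for (day, path), count in sorted(hits.items())
    ]


raw_weblogs = [
    '10.0.0.1 - - [20/Jun/2011:10:15:32 -0400] "GET /home HTTP/1.1" 200 512',
    '10.0.0.2 - - [20/Jun/2011:10:16:01 -0400] "GET /home HTTP/1.1" 200 512',
    '10.0.0.3 - - [20/Jun/2011:10:17:45 -0400] "GET /cart HTTP/1.1" 500 0',
    "garbage that is not a log line",
]
load_rows = to_load_rows(aggregate(raw_weblogs))
```

In this reading, only the small aggregated file crosses from the cluster to the warehouse, while the bulky raw logs stay on inexpensive storage.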
Pattern 2: Low cost storage and dynamic provisioning
  [Diagram labels: Amazon Cloud; Amazon S3; Elastic MapReduce; steps numbered 1 to 3]
Pattern 3: Queryable archive
  [Diagram labels: data sources; steps numbered 1 and 2]
