Hadoop and Netezza - Co-existence or Competition?

Hadoop is rapidly emerging as a viable platform for big data analytics. Thanks to early adoption by organizations like Yahoo and Facebook, and an active open source community, we have seen significant innovation around this platform. With support for relational constructs and a SQL-like query interface, many experts believe that Hadoop will eventually subsume some data warehousing tasks. Even though Hadoop and parallel databases have some architectural similarities, they are designed to solve different problems. In this presentation, you will be introduced to the Hadoop architecture, its salient differences from Netezza, and typical use cases. You will learn about common co-existence deployment models that Netezza's customers have put into practice to benefit from both technologies. You will also learn about Netezza's current support for Hadoop and its future strategy.

Published in: Technology


Transcript

  • 1. Hadoop and Netezza
    Co-existence or competition?
    Krishnan Parasuraman, CTO - Digital Media, Netezza
    @kparasuraman
    Tweet about EnzeeUniverse using #enzee11
  • 2. The Buzz
  • 3. 3
  • 4. Fuelling the debate
  • 5. A brief history of wannabe RDBMS killers
  • 6. Open Source Distributed Storage and Processing Engine
    Manage complex data – relational and non-relational – in a single repository
    Fault tolerant distributed processing
    Self healing, distributed storage
    Abstraction for parallel computing
    Store source data forever and analyze as and when needed
    Commodity hardware – inexpensive storage
    Process at source – eliminate data movement
    Oozie: Workflow
    Sqoop: Integration
    Zookeeper: Service coordination
    Flume, Chukwa, Scribe: Data collection
  • 7. Hadoop: Origin and evolution
    Timeline: 2003 to 2011
    Phases: Early Research; Open source dev momentum; Initial success stories; Commercialization
    Milestones:
    Google: GFS paper
    Google: MapReduce paper
    Google: Bigtable paper
    Apache: Lucene subproject
    Apache: Hadoop project
    Apache: HBase project
    Yahoo: 10K core cluster
    Netezza: Hadoop Connector, MapReduce support
  • 8. Common Perceptions
    Cloud
    Large Volumes
    Ad-hoc queries
    Low cost
    Complex Analytics
    Unstructured
  • 9. Parallel data warehouse systems
    SQL
    Hosts: Host controllers
    Network fabric
    Massively parallel compute nodes (each with FPGA, CPU, Memory)
    Storage Units
  • 10. Hadoop
    Map Reduce
    Master Node: Job Tracker, Name Node
    Network fabric
    Parallel compute nodes (each with Task Tracker, Data Node)
    Storage Units
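
The Map Reduce model named on this slide can be illustrated with a small single-process sketch. It is not Hadoop code: there is no Job Tracker, Task Tracker or HDFS here, and the names `map_reduce`, `word_mapper` and `sum_reducer` are hypothetical helpers chosen for this example. It only shows the map, shuffle and reduce steps that a real cluster would spread across its compute nodes.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run one map -> shuffle -> reduce pass in a single process."""
    groups = defaultdict(list)
    for record in records:
        # Map phase: each record yields zero or more (key, value) pairs.
        for key, value in mapper(record):
            # Shuffle phase: collect all values that share a key.
            groups[key].append(value)
    # Reduce phase: collapse each key's values into one result.
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    """Emit (word, 1) for every whitespace-separated word in a line."""
    for word in line.lower().split():
        yield word, 1


def sum_reducer(key, values):
    """Add up the counts emitted for one word."""
    return sum(values)


lines = [
    "Hadoop stores data",
    "Netezza queries data",
    "hadoop processes data",
]
counts = map_reduce(lines, word_mapper, sum_reducer)
print(counts["data"], counts["hadoop"])
```

On a cluster the mapper calls would run on the nodes that hold each block of input, which is what "execute code next to data" refers to; the sketch keeps everything in one loop purely for readability.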
  • 11. The similarities
    Massive parallelism
    Execute code & algorithms next to data
    Scalable
    Highly Available
  • 12. The differences
    Schema on Read – Data loading is fast
    Data Loading = File copy ("Look Ma, No ETL")
    Batch mode data access
    Not intended for real time access
    Doesn’t support Random Access
    No joins, no query engine, no types, no SQL
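
"Schema on Read" can be made concrete with a short sketch. The raw text below stands in for a file that was simply copied into storage; nothing is parsed or typed at load time. The sample rows, the column names and the helper `read_with_schema` are all assumptions made for this illustration, not part of the presentation.

```python
import csv
import io

# Stand-in for a file copied into storage unchanged (loading = file copy).
RAW_FILE = (
    "2011-06-01,alice,/home,200\n"
    "2011-06-01,bob,/search,404\n"
    "2011-06-02,alice,/cart,200\n"
)


def read_with_schema(raw_text, schema):
    """Apply (column name, converter) pairs to each raw row at read time."""
    for row in csv.reader(io.StringIO(raw_text)):
        # zip stops at the shorter input, so a schema may cover only
        # the leading columns it cares about.
        yield {name: convert(value) for (name, convert), value in zip(schema, row)}


# One reader interprets the bytes as typed page views ...
page_view_schema = [("day", str), ("user", str), ("path", str), ("status", int)]
errors = [r for r in read_with_schema(RAW_FILE, page_view_schema) if r["status"] >= 400]

# ... while another reads the same bytes with a narrower schema.
visit_schema = [("day", str), ("user", str)]
visits = list(read_with_schema(RAW_FILE, visit_schema))

print(errors)
print(visits[0])
```

The trade-off is the one the slide lists: loading is fast because nothing is validated, but every read pays the parsing cost and there are no types or indexes until a reader imposes them.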
  • 13. Where does it work well?
    1. Queryable Archive: Moving computation is cheaper than moving data
    2. Exploratory analysis: Relationships not defined yet; Can’t put in a process for ETL; Evolving schema
    3. Complex data: Parallel ETL in Java
  • 14. Imperatives for co-existence
    • Fast data loading - flexible schema till we figure out what we want to do
    • Expressiveness of SQL coupled with flexibility of procedural code, i.e. MapReduce
    • Low cost of storing and analyzing not-so-hot data
    • Parse and analyze complex data such as video and images
    Data: Point of origination
    Files, Structured & Unstructured sources
  • 18. Netezza-Hadoop: Co-existence use cases
    Create context (classification, text mining)
    Analyze unstructured data
    Analyze, report
    Parse, aggregate semi-structured data
    Active archival
    Long running queries
    Analyze, report structured data
  • 19. Pattern 1: Data ingestion
    Raw Weblogs
    Hadoop Cluster: NameNode, JobTracker; three worker nodes, each with DataNode and TaskTracker
    Netezza Environment
    Steps numbered 1 to 4 in the diagram
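
The parsing and aggregation half of this ingestion pattern might look roughly like the sketch below. It assumes the raw weblogs are in Apache Common Log Format, which the slide does not state, and it stops at producing pipe-delimited rows; the actual transfer into Netezza is not shown. `LOG_PATTERN`, `parse_line` and `aggregate_hits` are hypothetical names for this example.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed input format: Apache Common Log Format.
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)$'
)


def parse_line(line):
    """Return (day, path) for a well-formed log line, else None."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    ts = datetime.strptime(match["ts"], "%d/%b/%Y:%H:%M:%S %z")
    return ts.date().isoformat(), match["path"]


def aggregate_hits(lines):
    """Count hits per (day, path) and report how many lines were skipped."""
    hits = Counter()
    skipped = 0
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            skipped += 1
        else:
            hits[parsed] += 1
    # Pipe-delimited rows, sorted so the output is deterministic.
    rows = [f"{day}|{path}|{count}" for (day, path), count in sorted(hits.items())]
    return rows, skipped


raw_weblogs = [
    '10.0.0.1 - - [10/Oct/2011:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.2 - - [10/Oct/2011:13:56:01 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.1 - - [11/Oct/2011:09:12:44 -0700] "GET /cart HTTP/1.0" 500 512',
    "not a log line",
]
rows, skipped = aggregate_hits(raw_weblogs)
print(rows, skipped)
```

In a real deployment the parse step would run as a map task and the counting as a reduce task, so that only the compact aggregated rows travel to the warehouse.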
  • 20. Pattern 2: Low cost storage and dynamic provisioning
    Amazon Cloud: Amazon S3, Elastic MapReduce
    Steps numbered 1 to 3 in the diagram
  • 21. Pattern 3: Queryable archive
    Data Sources
    Steps numbered 1 to 2 in the diagram
  • 22. Pattern 4: Support low interaction partners
    Data Sources
    Steps numbered 1 to 3 in the diagram
  • 23. Netezza and Hadoop integration
    Hadoop/HDFS integration
    • Move data back and forth between Netezza and Hadoop cluster
    • Use Hadoop for ingesting/parsing web logs, offline analytics
    High speed data loader (bidirectional)
    weblogs
  • 25. Summary: Leveraging best of both worlds
    1. Hadoop is not a replacement for a parallel data warehouse
    2. Hadoop and Netezza are complementary technologies
    3. Don’t let the hype drive the need
    4. We have only solved the integration problem