Hadoop and Netezza - Co-existence or Competition?

Hadoop is rapidly emerging as a viable platform for big data analytics. Thanks to early adoption by organizations like Yahoo and Facebook, and an active open source community, we have seen significant innovation around this platform. With support for relational constructs and a SQL-like query interface, many experts believe that Hadoop will subsume some data warehousing tasks at some point in the future. Even though Hadoop and parallel databases have some architectural similarities, they are designed to solve different problems. In this presentation, you will be introduced to the Hadoop architecture, its salient differences from Netezza, and typical use cases. You will learn about common co-existence deployment models that Netezza's customers have put into practice to benefit from both technologies. You will also learn about Netezza's current support for Hadoop and its future strategy.

Usage Rights

© All Rights Reserved

  • Will columnar storage be implemented inside SPU in the future?
Hadoop and Netezza - Co-existence or Competition? Presentation Transcript

  • 1. Hadoop and Netezza
    Co-existence or competition?
    Krishnan Parasuraman, CTO - Digital Media, Netezza
    @kparasuraman
    Tweet about EnzeeUniverse using #enzee11
  • 2. The Buzz
  • 3.
  • 4. Fuelling the debate
  • 5. A brief history of wannabe RDBMS killers
  • 6. Open Source Distributed Storage and Processing Engine
    Manage complex data (relational and non-relational) in a single repository
    Fault-tolerant distributed processing
    Self-healing, distributed storage
    Abstraction for parallel computing
    Store source data forever and analyze as and when needed
    Commodity hardware: inexpensive storage
    Process at source: eliminate data movement
    Oozie: workflow
    Sqoop: integration
    Zookeeper: service coordination
    Flume, Chukwa, Scribe: data collection
  • 7. Hadoop: Origin and evolution
    [Timeline diagram, 2003 to 2011]
    Phases: Early research; Open source dev momentum; Initial success stories; Commercialization
    Milestones shown:
    Google: GFS paper
    Google: MapReduce paper
    Google: Bigtable paper
    Apache: Lucene subproject
    Apache: Hadoop project
    Apache: HBase project
    Yahoo: 10K core cluster
    Netezza: Hadoop Connector, MapReduce support
  • 8. Common Perceptions
    Cloud
    Large Volumes
    Ad-hoc queries
    Low cost
    Complex Analytics
    Unstructured
  • 9. Parallel data warehouse systems
    [Architecture diagram: SQL; host controllers; hosts; network fabric; massively parallel compute nodes, each with FPGA, CPU and memory; storage units]
  • 10. Hadoop
    [Architecture diagram: MapReduce; master node (JobTracker, NameNode); network fabric; parallel compute nodes, each running a TaskTracker and a DataNode; storage units]
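
The division of labor in the architecture above (map tasks running next to the data, then a grouped reduce) can be illustrated with a minimal single-process sketch. It is not Hadoop's API: the function names, the two-field log format, and the `splits` sample are all assumptions made for illustration, with each inner list standing in for the block of data held by one DataNode.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit (path, 1) for one hypothetical 'client path' log line."""
    parts = line.split()
    if len(parts) >= 2:
        yield parts[1], 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def run_job(splits):
    """Run map over every line of every split, then shuffle and reduce."""
    mapped = chain.from_iterable(
        map_phase(line) for split in splits for line in split
    )
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Each inner list stands in for the data local to one compute node.
splits = [
    ["10.0.0.1 /home", "10.0.0.2 /cart"],
    ["10.0.0.1 /home", "10.0.0.3 /home"],
]
hits = run_job(splits)
print(hits)  # {'/home': 3, '/cart': 1}
```

In a real cluster the map calls would run in parallel on the nodes that hold each split, and only the grouped intermediate pairs would cross the network.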
  • 11. The similarities
    Massive parallelism
    Execute code & algorithms next to data
    Scalable
    Highly available
    [Diagram: MapReduce; JobTracker, NameNode; TaskTracker and DataNode on each compute node]
  • 12. The differences
    Schema on read: data loading is fast
    Data loading = file copy ("Look Ma, no ETL")
    Batch-mode data access; not intended for real-time access
    Doesn't support random access
    No joins, no query engine, no types, no SQL
    [Diagram: MapReduce; JobTracker, NameNode; TaskTracker and DataNode on each compute node]
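
To make "schema on read" and "data loading = file copy" concrete, here is a small sketch under stated assumptions: `RAW_STORE` is a hypothetical in-memory stand-in for files copied into HDFS, and the `schema` mapping is an invented convention, not anything from Hadoop or Netezza. Loading does no parsing at all; types are applied only when the data is read.

```python
import csv
import io

RAW_STORE = []  # hypothetical stand-in for raw files copied into HDFS


def load(raw_text):
    """Loading is just a copy: no parsing, no validation, no types."""
    RAW_STORE.append(raw_text)


def read_with_schema(schema):
    """Apply a schema at read time.

    schema maps a column name to (position, cast function).
    Rows that do not fit the schema are skipped when read, not when loaded.
    """
    for blob in RAW_STORE:
        for row in csv.reader(io.StringIO(blob)):
            try:
                yield {
                    name: cast(row[pos])
                    for name, (pos, cast) in schema.items()
                }
            except (IndexError, ValueError):
                continue


# The second row has a non-numeric status; it still loads without complaint.
load("u1,/home,200\nu2,/cart,oops\nu3,/home,404\n")

schema = {"user": (0, str), "status": (2, int)}
rows = list(read_with_schema(schema))
print(rows)
```

A warehouse with schema on write would reject or repair the second row at load time; here the cost is deferred to every reader, which is the trade-off the slide points at.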
  • 13. Where does it work well?
    1. Queryable Archive: Moving computation is cheaper than moving data
    2. Exploratory analysis: Relationships not defined yet; Can’t put in a process for ETL; Evolving schema
    3. Complex data: Parallel ETL in Java
  • 14. Imperatives for co-existence
    • Fast data loading: flexible schema till we figure out what we want to do
    • Expressibility of SQL coupled with flexibility of procedural code, i.e. MapReduce
    • Low cost of storing and analyzing not-so-hot data
    • Parse and analyze complex data such as video and images
    Data: Point of origination
    Files, Structured & Unstructured sources
  • 18. Netezza-Hadoop: Co-existence use cases
    Create context (classification, text mining)
    Analyze unstructured data
    Analyze, report
    Parse, aggregate semi-structured data
    Active archival
    Long running queries
    Analyze, report structured data
  • 19. Pattern 1: Data ingestion
    [Diagram, steps 1 to 4: raw weblogs; Hadoop cluster (NameNode, JobTracker; three DataNode/TaskTracker nodes); Netezza environment]
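
The parsing step in an ingestion pattern like this one might look roughly like the sketch below. It assumes the raw weblogs are in Apache Common Log Format and that the warehouse side accepts pipe-delimited text; the slide states neither, and the function names are hypothetical. In practice this logic would run inside a map task rather than a single process.

```python
import re
from datetime import datetime

# Assumed input: Apache Common Log Format, one request per line.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Turn one raw log line into a typed tuple, or None if it is malformed."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    ts = datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
    size = 0 if match.group("size") == "-" else int(match.group("size"))
    return (
        match.group("host"),
        ts.strftime("%Y-%m-%d %H:%M:%S"),
        match.group("method"),
        match.group("path"),
        int(match.group("status")),
        size,
    )


def to_load_file(lines, delimiter="|"):
    """Render parsed rows as delimited text for a bulk loader; skip bad lines."""
    parsed = (parse_line(line) for line in lines)
    return "\n".join(
        delimiter.join(str(field) for field in row)
        for row in parsed
        if row is not None
    )


sample = [
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326',
    "this line is not a valid log entry",
]
load_text = to_load_file(sample)
print(load_text)
```

Malformed lines are dropped silently here to keep the sketch short; a production job would more likely count or quarantine them.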
  • 20. Pattern 2: Low cost storage and dynamic provisioning
    [Diagram, steps 1 to 3: Amazon cloud with Amazon S3 and Elastic MapReduce]
  • 21. Pattern 3: Queryable archive
    [Diagram, steps 1 and 2: data sources]
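
The slide gives only the pattern's name, so the following is one possible reading rather than the presenter's design: recent ("hot") records stay in the warehouse while older ones move to a cheaper archive that remains queryable. The 90-day window, the `event_date` field, and the function name are all assumptions made for illustration.

```python
from datetime import date, timedelta


def split_by_age(records, today, hot_days=90):
    """Partition records into (hot, cold) around an assumed age cutoff.

    hot  -> would stay in the warehouse for interactive queries
    cold -> would move to the archive tier for batch queries
    """
    cutoff = today - timedelta(days=hot_days)
    hot, cold = [], []
    for record in records:
        target = hot if record["event_date"] >= cutoff else cold
        target.append(record)
    return hot, cold


records = [
    {"id": 1, "event_date": date(2011, 5, 20)},
    {"id": 2, "event_date": date(2010, 1, 15)},
    {"id": 3, "event_date": date(2011, 4, 30)},
]
hot, cold = split_by_age(records, today=date(2011, 6, 1))
print([r["id"] for r in hot], [r["id"] for r in cold])
```

The cutoff would normally follow query patterns rather than a fixed number of days, which is why it is exposed as a parameter.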
  • 22. Pattern 4: Support low interaction partners
    [Diagram, steps 1 to 3: data sources]
  • 23. Netezza and Hadoop integration
    Hadoop/HDFS integration
    • Move data back and forth between Netezza and Hadoop cluster
    • Use Hadoop for ingesting/parsing web logs, offline analytics
    High-speed data loader (bidirectional)
    weblogs
  • 25. Summary: Leveraging best of both worlds
    1. Hadoop is not a replacement for a parallel data warehouse
    2. Hadoop and Netezza are complementary technologies
    3. Don’t let the hype drive the need
    4. We have only solved the integration problem