Webinar: The Future of Hadoop

The Future of Hadoop 

Doug Cutting | A Founder of Apache Hadoop
Jeff Hammerbacher | Chief Scientist, Cloudera

Welcome to the webinar!

Audio/Telephone: +1 (215) 383-1016
Access Code: 421-634-457
Audio Pin: Shown after joining the Webinar
Hadoop, HBase, Pig, Hive, Bigtop, Avro, Flume & Whirr are trademarks of the Apache Software Foundation

Housekeeping
▪ All lines are on mute

▪ Ask questions at any time using the Questions panel on GoToMeeting

▪ Slides and recording will be available on www.cloudera.com/events

©2011 Cloudera, Inc. All Rights Reserved.

Presentation Outline
▪ 1. Context
▪ 2. Apache Bigtop
▪ 3. Apache Hadoop Core
▪ 4. Apache HBase, Hive, and Pig
▪ 5. Other components

▪ Questions and Discussion


Context
Data
▪ 1.8 ZB will be created and replicated in 2011
▪ Up 9x in the last five years
▪ More than 90% of this data is unstructured
▪ Enterprises have some liability for 80% of this data
▪ Enterprises will spend $4T on managing data in 2011

▪ Source: IDC Digital Universe Report 2011


Context
Hadoop
▪ Apache Hadoop and related software are designed for this world

▪ Volume
▪ Commodity hardware and open source software lower cost and increase capacity

▪ Velocity
▪ Data ingest speed aided by append-only and schema-on-read design

▪ Variety
▪ Multiple tools to structure, process, and access data
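
To make the "append-only and schema-on-read" point concrete, here is a minimal sketch in plain Python. It is not Hadoop code: the names `raw_log`, `ingest`, and `read_with_schema` are hypothetical, and an in-memory list stands in for an append-only file. Writes store raw lines untouched, and structure is applied only when a reader asks for specific fields.

```python
import json

# Hypothetical stand-in for an append-only file: raw lines, no schema enforced on write.
raw_log = []

def ingest(line):
    """Append-only write: store the raw line without parsing or validating it."""
    raw_log.append(line)

def read_with_schema(fields):
    """Schema-on-read: parse each raw line at read time and keep only the requested fields."""
    for line in raw_log:
        try:
            record = json.loads(line)
        except ValueError:
            continue  # skip lines that do not parse as JSON
        if not isinstance(record, dict):
            continue  # skip JSON values that are not objects
        yield {name: record.get(name) for name in fields}

ingest('{"user": "ann", "bytes": 120}')
ingest('{"user": "bob", "bytes": 75, "agent": "curl"}')
ingest('not json at all')

rows = list(read_with_schema(["user", "bytes"]))
```

Ingest stays cheap because nothing is validated up front; a different reader could apply a different set of fields to the same raw lines.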


Context
HDFS and MapReduce
▪ Apache Hadoop = HDFS + MapReduce
▪ Similar to kernel of an operating system
▪ Referred to as “Hadoop Core”

▪ Related components are often deployed with Hadoop
▪ For example: HBase, Hive, Pig, Oozie, Flume, Sqoop
▪ Together, these components form a “Hadoop Stack”
▪ Not all components must be deployed
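
For readers new to the MapReduce half of Hadoop Core, the programming model can be sketched in a few lines of plain Python. This is an in-memory illustration only, not the Hadoop API; the function names `map_phase`, `shuffle`, `reduce_phase`, and `run_job` are hypothetical.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values collected for one key."""
    return key, sum(values)

def run_job(lines):
    """Run map, shuffle, and reduce in sequence over a list of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

counts = run_job(["Hadoop stores data", "Hadoop processes data"])
```

In a real cluster the map and reduce calls run on many machines and the input lines come from HDFS; the sketch keeps only the data flow.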

Context
Bigtop
▪ What standards should all components follow?

▪ How can we ensure all components of the stack work together?

▪ How can we find the right version of each component?

▪ How can we make it easy to install an additional component?

Apache Bigtop
▪ Now incubating at Apache
▪ Hadoop ecosystem-wide project, including:
▪ Interoperability testing of components
▪ Packaging of compatible versions of components
▪ Like Fedora, Debian, or CentOS for the Hadoop ecosystem
▪ Releases are not a single artifact
▪ Rather a set of interdependent, compatible components


Apache Bigtop
▪ Current components
▪ Hadoop
▪ HBase
▪ Hive
▪ Pig
▪ Oozie
▪ Sqoop
▪ Flume
▪ ZooKeeper
▪ Whirr

Apache Bigtop
▪ Outputs
▪ Source
▪ RPM
▪ Deb
▪ Tests
▪ Integration
▪ Package
▪ Smoke
▪ Release 0.1.0 under vote now!

Apache Hadoop Core
▪ Current stable releases based on branches from 0.20
▪ Upcoming release: 0.22
▪ Includes both security and new implementation of append
▪ Not expected to be run at scale or commercially supported
▪ Nearly ready for vote

▪ Build and dependency management moved to Maven
▪ Branch to happen soon

HDFS
▪ Robustness
▪ HDFS-1073: Checkpointing of image and edits log

▪ Availability
▪ HDFS-1623: High availability

▪ Performance
▪ HDFS-941: Faster random reads
▪ HDFS-2080: Faster checksums


HDFS
▪ Scalability
▪ HDFS-1052: Federation of the NameNode

▪ Source of diagram: http://www.hortonworks.com/an-introduction-to-hdfs-federation/
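
Federation splits the file system namespace across several independent NameNodes. One way a client might route a path to the right NameNode is a mount table keyed by path prefix. The sketch below is a toy illustration under that assumption; the table contents, the NameNode names, and the `resolve` function are hypothetical and not taken from HDFS-1052.

```python
# Hypothetical mount table: path prefix -> NameNode that owns that part of the namespace.
MOUNT_TABLE = {
    "/user": "namenode-1",
    "/logs": "namenode-2",
    "/tmp": "namenode-3",
}

def resolve(path):
    """Return the NameNode for path, choosing the longest matching mount prefix."""
    matches = [
        prefix for prefix in MOUNT_TABLE
        if path == prefix or path.startswith(prefix + "/")
    ]
    if not matches:
        raise KeyError("no namespace mounted for " + path)
    return MOUNT_TABLE[max(matches, key=len)]

owner = resolve("/user/ann/data.csv")
```

Because each NameNode holds only its own slice of the namespace, adding a NameNode adds metadata capacity without changing the others.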

MapReduce
▪ Modularity
▪ MAPREDUCE-279: MapReduce 2.0
▪ Break JobTracker into ResourceManager and ApplicationMaster
▪ Replace TaskTracker with NodeManager

▪ Source of diagram: http://www.odbms.org/download/dean-keynote-ladis2009.pdf
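
The split of responsibilities in MapReduce 2.0 can be sketched as a toy model: a cluster-wide ResourceManager that only hands out capacity on NodeManagers, and a per-job ApplicationMaster that decides which task runs where. The classes below borrow the component names from the slide, but the methods, the slot counts, and the task names are assumptions for illustration, not the MAPREDUCE-279 interfaces.

```python
class NodeManager:
    """Per-node agent: tracks how many containers this node can still run."""
    def __init__(self, name, slots):
        self.name = name
        self.free = slots

class ResourceManager:
    """Cluster-wide: hands out containers and knows nothing about job logic."""
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self):
        """Return the name of a node with a free slot, or None if the cluster is full."""
        for node in self.nodes:
            if node.free > 0:
                node.free -= 1
                return node.name
        return None

class ApplicationMaster:
    """Per-job: requests containers and tracks only this job's tasks."""
    def __init__(self, resource_manager, tasks):
        self.resource_manager = resource_manager
        self.pending = list(tasks)
        self.placement = {}

    def schedule(self):
        """Place pending tasks until the ResourceManager has no capacity left."""
        while self.pending:
            node_name = self.resource_manager.allocate()
            if node_name is None:
                break
            self.placement[self.pending.pop(0)] = node_name
        return self.placement

rm = ResourceManager([NodeManager("node-a", 2), NodeManager("node-b", 1)])
am = ApplicationMaster(rm, ["map-0", "map-1", "reduce-0", "reduce-1"])
placement = am.schedule()
```

Since job-specific logic lives in the ApplicationMaster, frameworks other than MapReduce can in principle reuse the same ResourceManager.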

MapReduce
▪ Potential New Frameworks
▪ MAPREDUCE-2719: Distributed shell
▪ MAPREDUCE-2720: Distributed Java commands
▪ MPI: Communication-intensive parallelism
▪ Fast scans and aggregations
▪ OpenDremel
▪ Bulk Synchronous Parallel
▪ Giraph, Golden Orb, Hama, et al.
▪ Actor Model (streaming)
▪ S4, Akka, Storm, et al.
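
As an illustration of the Bulk Synchronous Parallel style listed above, the sketch below propagates the maximum value through a small graph in supersteps: each active vertex sends messages, a barrier follows, then every vertex computes. It is plain single-process Python, not the API of Giraph, Hama, or any other framework; the function `bsp_max` and the sample graph are hypothetical.

```python
def bsp_max(graph, values):
    """Propagate the maximum value through an undirected graph, one superstep at a time."""
    values = dict(values)
    active = set(graph)
    supersteps = 0
    while active:
        # Communication: every active vertex sends its value to its neighbours.
        inbox = {vertex: [] for vertex in graph}
        for vertex in active:
            for neighbour in graph[vertex]:
                inbox[neighbour].append(values[vertex])
        # Barrier: all messages are delivered before any vertex computes.
        # Computation: a vertex stays active only if its value increased.
        active = set()
        for vertex, messages in inbox.items():
            best = max(messages, default=values[vertex])
            if best > values[vertex]:
                values[vertex] = best
                active.add(vertex)
        supersteps += 1
    return values, supersteps

graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
result, steps = bsp_max(graph, {"a": 3, "b": 6, "c": 1})
```

The loop ends when a superstep changes no vertex, which is the usual halting condition in this model.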

Apache HBase
▪ Upcoming release: 0.92.0
▪ Server-side triggers
▪ HBASE-2000: Coprocessors
▪ Availability
▪ HBASE-1730/4213: Online schema changes
▪ Performance
▪ HBASE-3857: HFile 2.0

▪ HBase book in September!


Apache Hive
▪ Data transfer
▪ HIVE-306: INSERT INTO
▪ HIVE-1918: EXPORT/IMPORT
▪ Indexes
▪ HIVE-1644: Automatically use indexes
▪ HIVE-1803: Bitmap indexes
▪ Data formats
▪ HIVE-895: Avro support
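
The idea behind a bitmap index can be sketched in plain Python, using one integer per distinct column value as a bit set over row numbers. This is a conceptual illustration only and makes no claim about how HIVE-1803 stores or queries its indexes; the helper names and sample rows are hypothetical.

```python
def build_bitmap_index(rows, column):
    """Map each distinct value of column to an int whose bit i is set when row i has that value."""
    index = {}
    for i, row in enumerate(rows):
        value = row[column]
        index[value] = index.get(value, 0) | (1 << i)
    return index

def rows_from_bitmap(bitmap):
    """Return the row numbers whose bits are set in bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

rows = [
    {"country": "US", "device": "web"},
    {"country": "DE", "device": "mobile"},
    {"country": "US", "device": "mobile"},
    {"country": "US", "device": "web"},
]
by_country = build_bitmap_index(rows, "country")
by_device = build_bitmap_index(rows, "device")

# A bitwise AND answers: country = 'US' AND device = 'web'.
matches = rows_from_bitmap(by_country["US"] & by_device["web"])
```

Combining predicates becomes a bitwise operation over compact bitmaps, so only the matching rows need to be read afterwards.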


Apache Pig
▪ Recent release: 0.9
▪ Scripting
▪ PIG-1479: Embedding Pig in Python
▪ PIG-1793: Macro expansion
▪ Debugging
▪ PIG-1712: ILLUSTRATE rework
▪ Data formats
▪ PIG-1748: Avro support


Other Components
▪ Apache Incubator
▪ Sqoop, Flume, and Oozie now incubating
▪ Whirr graduated to a top-level Apache project
▪ Apache Avro
▪ Interoperability with Protocol Buffers and Thrift
▪ Column-oriented file format
▪ Python MapReduce implementation
▪ Apache ZooKeeper
▪ Multi-update
▪ Kerberos authentication of clients


Q&A
Visit www.hadoopworld.com
• November 8-9, 2011 in New York City
• Early bird discount ends September 5, 2011

Enter Today: www.facebook.com/cloudera
• Click the “Be a Cloudera Hero for Apache Hadoop” tab
• Share what you think Apache Hadoop can do for you
• Win a personal hackathon with Doug Cutting in San Francisco, CA
