Eric Baldeschwieler Keynote from Storage Developers Conference
Eric Baldeschwieler's keynote presentation from Storage Developer's Conference 2011. Eric is co-founder and CEO of Hortonworks.

  • Introduce self, background. Thank SDC, attendees, etc.
  • Tell inception story: plan to differentiate Yahoo, recruit talent, ensure that Y! was not built on a legacy private system from YST
  • Look for Analyst/Gartner data on growing cost…
  • The data nodes do not have RAID, just JBOD
  • Arun to develop
  • Our commitment to Apache has already changed the market! Ultimately, contributing the code that matters and making it work is the currency in open source
  • Transcript

    • 1. Apache Hadoop Today & Tomorrow
      • Eric Baldeschwieler, CEO
      • Hortonworks, Inc.
        • twitter: @jeric14 (@hortonworks)
        • www.hortonworks.com
    • 2. Agenda
      • Brief Overview of Apache Hadoop
      • Where Apache Hadoop is Used
      • Apache Hadoop Core
        • Hadoop Distributed File System (HDFS)
        • Map/Reduce
      • Where Apache Hadoop Is Going
      • Q&A
    • 3. What is Apache Hadoop?
      • A set of open source projects, owned by the Apache Software Foundation, that transforms commodity computers and networks into a distributed service
      • HDFS – Stores petabytes of data reliably
      • Map-Reduce – Allows huge distributed computations
      • Key Attributes
      • Reliable and Redundant – Doesn’t slow down or lose data even as hardware fails
      • Simple and Flexible APIs – Our rocket scientists use it directly!
      • Very powerful – Harnesses huge clusters, supports best of breed analytics
      • Batch processing centric – Hence its great simplicity and speed, not a fit for all use cases
    • 4. What is it used for?
      • Internet scale data
        • Web logs – Years of logs at many TB/day
        • Web Search – All the web pages on earth
        • Social data – All message traffic on Facebook
      • Cutting edge analytics
        • Machine learning, data mining…
      • Enterprise apps
        • Network instrumentation, Mobile logs
        • Video and Audio processing
        • Text mining
      • And lots more!
    • 5. Apache Hadoop Projects
      • Core Apache Hadoop
        • Object storage: HDFS (Hadoop Distributed File System)
        • Computation: MapReduce (Distributed Programming Framework)
        • HMS (Management)
      • Related Apache Projects
        • Programming languages: Hive (SQL), Pig (Data Flow)
        • Table storage: HBase (Columnar Storage), HCatalog (Meta Data)
        • Zookeeper (Coordination)
    • 6. Where Hadoop is Used
    • 7. Everywhere! [Adoption timeline, 2006–2010; source: The Datagraph Blog]
    • 8. HADOOP @ YAHOO! – 40K+ servers, 170 PB storage, 5M+ monthly jobs, 1000+ active users (© Yahoo 2011)
    • 9. CASE STUDY: YAHOO! HOMEPAGE (© Yahoo 2011)
      • Personalized for each visitor (recommended links, news interests, top searches)
      • Result: twice the engagement
        • +160% clicks vs. one size fits all
        • +79% clicks vs. randomly selected
        • +43% clicks vs. editor selected
    • 10. CASE STUDY: YAHOO! HOMEPAGE (© Yahoo 2011)
      • Serving maps (users → interests) rebuilt every 5 minutes on the production Hadoop cluster
      • Categorization models rebuilt weekly on the science Hadoop cluster
        • » Identify user interests using categorization models
        • » Machine learning to build ever better categorization models
      • Serving systems build customized home pages with the latest data (thousands / second) for engaged users
    • 11. CASE STUDY: YAHOO! MAIL (© Yahoo 2011)
      • Enabling quick response in the spam arms race
      • 450M mailboxes, 5B+ deliveries/day
      • Antispam models retrained every few hours on Hadoop (science → production)
      • Result: 40% less spam than Hotmail and 55% less spam than Gmail
    • 12. A Brief History of Apache Hadoop
      • 2006 – present: Yahoo! and early adopters scale and productize Hadoop
      • 2008 – present: Other Internet companies add tools / frameworks, enhance Hadoop
      • 2010 – present: Service providers provide training, support, hosting
      • Nascent / 2011: Wide enterprise adoption funds further development, enhancements
    • 13. Traditional Enterprise Architecture: Data Silos + ETL
      • Data sources (serving, logs, social media, sensor data, text systems, …) feed serving applications and unstructured systems
      • Traditional ETL & message buses move data into EDW, data marts, and BI / analytics (traditional data warehouses, BI & analytics)
    • 14. Hadoop Enterprise Architecture: Connecting All of Your Big Data
      • Apache Hadoop sits between the data sources (serving, logs, social media, sensor data, text systems, …) and the traditional EDW / data mart / BI stack, doing EsTsL (s = Store) and custom analytics alongside traditional ETL & message buses
    • 15. Hadoop Enterprise Architecture: Connecting All of Your Big Data (continued)
      • Same architecture, annotated:
        • 80-90% of data produced today is unstructured
        • Gartner predicts 800% data growth over next 5 years
    • 16. What is Driving Adoption?
      • Business drivers
        • Identified high value projects that require use of more data
        • Belief that there is great ROI in mastering big data
      • Financial drivers
        • Growing cost of data systems as proportion of IT spend
        • Cost advantage of commodity hardware + open source
          • Enables departmental-level big data strategies
      • Technical drivers
        • Existing solutions failing under growing requirements
          • 3Vs - Volume, velocity, variety
        • Proliferation of unstructured data
    • 17. Big Data Platforms [bubble chart: cost per TB vs. adoption; size of bubble = cost effectiveness of solution] Source:
    • 18. Apache Hadoop Core
    • 19. Overview
      • Frameworks share commodity hardware
        • Storage - HDFS
        • Processing - MapReduce
      • Typical cluster hardware (racks of 1-2U servers, each rack with its own switch):
        • 20-40 nodes / rack
        • 16 cores, 48G RAM, 6-12 * 2TB disk per node
        • 1-2 GigE to node
        • 2 * 10GigE uplinks per rack switch
    • 20. Map/Reduce
      • Map/Reduce is a distributed computing programming model
      • It works like a Unix pipeline:
        • cat input | grep | sort | uniq -c > output
        • Input | Map | Shuffle & Sort | Reduce | Output
      • Strengths:
        • Easy to use! Developer just writes a couple of functions
        • Moves compute to data
          • Schedules work on HDFS node with data if possible
        • Scans through data, reducing seeks
        • Automatic reliability and re-execution on failure
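The Unix-pipeline analogy above can be sketched in a few lines of plain Python. This is a toy model of the programming model only, not the Hadoop API; `map_fn`, `reduce_fn`, and `map_reduce` are illustrative names:

```python
from itertools import groupby
from operator import itemgetter

# Map phase: emit a (word, 1) pair for every word in every input line.
def map_fn(line):
    for word in line.split():
        yield (word, 1)

# Reduce phase: collapse one key's group of values into a single result.
def reduce_fn(word, counts):
    return (word, sum(counts))

def map_reduce(lines):
    # Map
    pairs = [kv for line in lines for kv in map_fn(line)]
    # Shuffle & sort: bring all values for the same key together
    pairs.sort(key=itemgetter(0))
    # Reduce
    return [reduce_fn(key, (v for _, v in group))
            for key, group in groupby(pairs, key=itemgetter(0))]

print(map_reduce(["the cat sat", "the cat ran"]))
# [('cat', 2), ('ran', 1), ('sat', 1), ('the', 2)]
```

In real Hadoop the map and reduce functions run in parallel across the cluster and the shuffle & sort happens over the network, but the data flow is exactly this pipeline.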
    • 21. HDFS: Scalable, Reliable, Manageable
      • Scale IO, Storage, CPU
      • Add commodity servers & JBODs
      • 4K nodes in cluster, 80
      • Fault Tolerant & Easy management
        • Built in redundancy
        • Tolerate disk and node failures
        • Automatically manage addition/removal of nodes
        • One operator per 8K nodes!!
      • Storage server used for computation
        • Move computation to data
      • Not a SAN
        • But high-bandwidth network access to data via Ethernet
      • Immutable file system
        • Read, Write, sync/flush
          • No random writes
      [Network diagram: core switches connecting the rack switches]
    • 22. HDFS Use Cases
      • Petabytes of unstructured data for parallel, distributed analytics processing using commodity hardware
      • Solve problems that cannot be solved using traditional systems, at lower cost
        • Large storage capacity ( >100PB raw)
        • Large IO/Computation bandwidth (>4K servers)
          • > 4 Terabit bandwidth to disk! (conservatively)
        • Scale by adding commodity hardware
        • Cost per GB ~= $1.5, includes MapReduce cluster
    • 23. HDFS Architecture
      • Namenode (namespace state):
        • Hierarchical namespace: File Name → Block IDs
        • Block map: Block ID → Block Locations
        • Persistent namespace metadata & journal (on NFS)
      • Datanodes (block storage): Block ID → Data, stored on JBOD
        • Send heartbeats & block reports to the Namenode
        • Horizontally scale IO and storage
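The Namenode's two mappings can be sketched as ordinary dictionaries. This is a toy model for illustration, not HDFS code; the paths, block IDs, and datanode names are invented:

```python
# Toy model of the Namenode's metadata (illustrative only).
REPLICATION = 3  # HDFS's default replication factor

namespace = {              # file name -> ordered list of block IDs
    "/logs/day1.gz": ["b1", "b2"],
}
block_map = {              # block ID -> set of datanodes holding a replica
    "b1": {"dn1", "dn3", "dn4"},
    "b2": {"dn2", "dn3"},          # under-replicated: only 2 copies
}

def locate(path):
    """Resolve a path to its block locations, as a client open() would."""
    return [(b, sorted(block_map[b])) for b in namespace[path]]

def under_replicated():
    """Blocks the Namenode would schedule for re-replication."""
    return [b for b, nodes in block_map.items() if len(nodes) < REPLICATION]

print(locate("/logs/day1.gz"))
print(under_replicated())   # ['b2']
```

The real Namenode keeps these structures in memory and rebuilds the block map from the Datanodes' block reports, which is why only the namespace and journal need persistent storage.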
    • 24. Client Read & Write Directly from Closest Server
      • Read: client 1) opens the file at the Namenode, 2) reads blocks directly from the Datanodes
      • Write: client 1) creates the file at the Namenode, 2) writes blocks directly to the Datanodes
      • End-to-end checksum
    • 25. Actively Maintain Data Reliability
      • Periodically check block checksums
      • On a bad/lost block replica: 1) Namenode schedules replication, 2) a good Datanode copies the block, 3) the receiving Datanode reports blockReceived
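The periodic checksum scan can be illustrated with a short sketch. MD5 stands in here for HDFS's per-chunk CRC32, and the block data and IDs are invented for the example:

```python
import hashlib

def checksum(data: bytes) -> str:
    # HDFS checksums fixed-size chunks with CRC32; MD5 just illustrates the idea.
    return hashlib.md5(data).hexdigest()

# Each Datanode stores a block alongside the checksum computed at write time.
stored = {
    "b1": (b"block one data", checksum(b"block one data")),
    "b2": (b"block two data", checksum(b"what was originally written")),  # bit rot
}

def scan():
    """Periodic scan: recompute checksums and report corrupt replicas."""
    return [bid for bid, (data, saved) in stored.items()
            if checksum(data) != saved]

print(scan())   # ['b2'] -> the Namenode re-replicates b2 from a good copy
```

Because every replica is checked against a checksum written end-to-end by the client, silent disk corruption is detected and repaired from the surviving replicas rather than propagated.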
    • 26. HBase
      • The Hadoop ecosystem database, based on Google's BigTable
      • Goal: Hosting of very large tables (billions of rows X millions of columns) on commodity hardware.
        • Multidimensional sorted Map
          • Table => Row => Column => Version => Value
        • Distributed column-oriented store
        • Scale – Sharding etc. done automatically
          • No SQL, CRUD etc.
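The multidimensional sorted map (Table => Row => Column => Version => Value) can be sketched with nested dictionaries. This is a toy illustration, not the HBase API; the row and column names are invented, and it omits the on-disk sorting of row keys that real HBase relies on for range scans:

```python
from collections import defaultdict

# Toy model of HBase's data model:
#   row key -> column -> version (timestamp) -> value
table = defaultdict(lambda: defaultdict(dict))

def put(row, column, value, ts):
    # Writes never overwrite: each put adds a new version of the cell.
    table[row][column][ts] = value

def get(row, column):
    """Return the newest version of a cell, as HBase does by default."""
    versions = table[row][column]
    return versions[max(versions)]

put("user#42", "info:name", "Ada", ts=1)
put("user#42", "info:name", "Ada L.", ts=2)
print(get("user#42", "info:name"))   # 'Ada L.'
```

Keeping every cell versioned by timestamp is what lets HBase serve both the latest value cheaply and historical values on demand, without update-in-place.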
    • 27. What’s Next
    • 28. About Hortonworks – Basics
      • Founded – July 1st, 2011
        • 22 Architects & committers from Yahoo!
      • Mission – Architect the future of Big Data
        • Revolutionize and commoditize the storage and processing of Big Data via open source
      • Vision – Half of the world's data will be stored in Hadoop within five years
    • 29. Game Plan
      • Support the growth of a huge Apache Hadoop ecosystem
        • Invest in ease of use, management, and other enterprise features
        • Define APIs for ISVs, OEMs and others to integrate with Apache Hadoop
        • Continue to invest in advancing the Hadoop core, remain the experts
        • Contribute all of our work to Apache
      • Profit by providing training & support to the Hadoop community
    • 30. Lines of Code Contributed to Apache Hadoop
    • 31. Apache Hadoop Roadmap
      • Phase 1 – Making Apache Hadoop Accessible
      • Release the most stable version of Hadoop ever (Hadoop 0.20.205)
        • Frequent sustaining releases
      • Release directly usable code via Apache (RPMs, .debs…)
      • Improve project integration (HBase support)
      2011
      • Phase 2 – Next-Generation Apache Hadoop
      • Address key product gaps (HA, Management…)
      • Enable partner innovation via open APIs
      • Enable community innovation via modular architecture
      2012 (Alphas in Q4 2011)
    • 32. Next-Generation Hadoop
      • Core
        • HDFS Federation – Scale out and innovation via new APIs
          • Will run on 6000 node clusters with 24TB disk / node = 144PB in next release
        • Next Gen MapReduce – Support for MPI and many other programming models
        • HA (no SPOF) and Wire compatibility
      • Data - HCatalog 0.3
        • Pig, Hive, MapReduce and Streaming as clients
        • HDFS and HBase as storage systems
        • Performance and storage improvements
      • Management & Ease of use
        • Ambari – An Apache Hadoop Management & Monitoring System
        • Stack installation and centralized config management
        • REST and GUI for user & administrator tasks
    • 33. Thank You!
      • Questions
        • Twitter: @jeric14 (@hortonworks)
        • www.hortonworks.com
