Hadoop: today and tomorrow


Presentation on where Hadoop is today, and where it is going, given at the London Hadoop Users Group, April 2012

Published in: Technology

  • Picking out what is really new in this release, as opposed to merges and stability work, WebHDFS is the interesting one. Set one config option and the DNs and NNs become web servers (using the chosen auth mechanism), offering read and write access to the data. This is integral to the cluster: you ask the NN for data, which triggers a 307 redirect to a DN holding the data, which serves it up. The redirect is handled transparently by any HTTP client set up to follow redirects.
  • This is what we're going to be shipping based on Hadoop 1.0, a packaging of the core Hadoop stack with management tooling
  • There's a set of NoSQL databases running on or near Hadoop. Apache HBase is the key one: look at the Facebook papers on FB chat to see how this works in the field. Cassandra is not directly dependent on Hadoop, but you can run Pig and Hive queries against its data, and it implements the HDFS filesystem API, so you can host TTs on the same nodes as your Cassandra data and get data-local work. Accumulo gets a mention as it is in incubation, donated to the ASF by the NSA. Apparently it has good security on access to keys and values, which shows that some orgs put security ahead of other features in NoSQL-land, and that government orgs are starting to play in this space and contribute code back.
  • Don't write at the Java level if you can help it; both Pig and Hive are a lot more productive. SQL houses should play with Hive. Pig is very good for experimentation, and its ability to call User Defined Functions lets you re-use tuned Java libraries such as LinkedIn's DataFu.
  • Lots of ways to get data in. Most are focused on streaming from other servers in the same datacentre, such as web servers, and collecting the logs. Scribe is designed to scale up well, with the option of discarding data under heavy load. Kafka is from LinkedIn: nice code which can hook up behind log4j.
  • If you are doing anything w/ social networks, connecting events, locations together etc., the graph layer should be of interest: it's up and coming as the next layer in the stack. There are two projects in the Apache incubator. Hama: a graph layer with a big driver being a telco. Giraph: ex-Y!, and LinkedIn are using this. There's a workshop after Berlin Buzzwords on "beyond MR" that I'm co-organising; Giraph will be one of the topics there (along w/ YARN and Stratosphere).
  • This is the architecture of HDFS HA, skipping some of the details and the roadmap of when features come out. It is Active/Standby HA, not shared-write (much, much harder). Failover is initially manual, moving to automated. Failover controllers monitor NN health, and heartbeat to ZK so that others in the ZK farm can detect failures. DNs report to both NNs, but only listen to one.
  • Hadoop 2 splits the JT's work in two: the Resource Manager, which manages allocation of resources on servers, and the JT itself, which now becomes one of the possible “Application Masters” that can be deployed in a cluster. Breaking this up allows you to: run different JTs for different users and different versions of the MR APIs (Facebook do this in their clusters with a static striping of TTs today); run other topology-aware applications.
  • The NoSQL business plan is a key issue here: politics and marketing, not technology. DB business pricing always put an upper financial limit on big data. Oracle liked to own the customer data (and had loyal DBA support). The move to vertical solutions promised best hardware and discounting opportunities, but removed flexibility ('the IBM model'). Hadoop challenges this: generic servers with many HDDs, open source software. They will need to add something to Hadoop/HDFS that stops you moving away or getting support from others. Looking at the hardware, that could either be very low latency IPC (benefits?) or something integrating SSD into the system (preheating SSD caching of queued job data, …?). Closing on a brighter note, my colleagues and I have tales of terror from playing w/ JVM options on a big cluster, as you can be confident of reaching all corner cases within a short period of time. If Oracle start using Hadoop as a driver for JVM performance and qualification, and return those tweaks to OpenJDK, we all benefit.
  • Last but very much not least, there's growing integration of Hadoop in the OSS world. At the app level, Spring has a Spring Data for Hadoop project in beta, which lets you integrate HDFS, MR and Pig jobs, as well as Cascading, within a Spring application. You can do workflows here and really integrate with enterprise apps, especially if you use Spring already. Cascading, the Hadoop workflow language, has moved to an Apache License, to remove worries about GPL contamination of your code. Also of interest is the fact that the Linux vendors are taking Hadoop seriously, which can only improve testing and stability of Hadoop. Finally, off this sheet: an R connector for Hadoop. The statisticians get integration from their world, R, to the new datasets.
  • Facebook, Prineville: 45 PB, one single cluster. Yahoo!: 180 PB. It means that Hadoop installations are becoming the largest known storage and compute systems on the planet. It's unlikely anyone in this audience's storage or B/W requirements will be as big, but for those in the audience who want them to become as big, Hadoop makes it possible both technically and financially.
  • The other thing it means is this: nothing else has the momentum and the support. People may say "ours is better", but that's like saying Solaris was better than Linux, or the 68K was better than the Intel 8086. Better doesn't win. More valuable does, and because of its growing support, layers above, adoption and the ecosystem, Hadoop has the edge. This isn't an excuse to get complacent: Spring killed Java EE, even though EJB once had everything going for it.
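
The WebHDFS note above describes asking the NN for a file and being redirected with a 307 to a DN that serves the bytes. A minimal Python sketch of that client side follows. The helper names `webhdfs_url` and `read_file` are hypothetical, the host `nnode` and port 50070 are taken from the WebHDFS slide, and the sketch assumes an unsecured cluster with `dfs.webhdfs.enabled=true`; it is an illustration, not the talk's own code.

```python
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def webhdfs_url(namenode: str, path: str, op: str = "OPEN",
                port: int = 50070, **params: str) -> str:
    """Build a URL of the form http://<nn>:<port>/webhdfs/v1/<path>?op=<OP>.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    query = urlencode({"op": op, **params})
    # quote() leaves "/" intact, so the HDFS path keeps its directory structure.
    return f"http://{namenode}:{port}/webhdfs/v1/{quote(path.lstrip('/'))}?{query}"


def read_file(namenode: str, path: str) -> bytes:
    """Fetch a file over HTTP.

    urlopen follows the NN's 307 redirect to a DN automatically for GET
    requests, so the caller never sees the second hop.
    """
    with urlopen(webhdfs_url(namenode, path)) as response:
        return response.read()
```

Calling `read_file("nnode", "/results/part-r-00000.csv")` against a live cluster would issue the same request as the `GET` command on the WebHDFS slide; only the URL builder can be exercised without a cluster.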

    1. Hadoop: Today and Tomorrow. Steve Loughran, Hortonworks. stevel at hortonworks.com, @steveloughran. London, April 2012. © Hortonworks Inc. 2012
    2. About me: • HP Labs: – Deployment, cloud infrastructure, Hadoop-in-Cloud • Apache: member and committer – Ant (author, Ant in Action), Axis 2 – Hadoop – Dynamic deployments – Diagnostics on failures – Cloud infrastructure integration • Joined Hortonworks in 2012 – UK based: R&D + customer engagement
    3. About Hortonworks: From developing and running the world's largest Hadoop clusters to advancing open source Apache Hadoop for the broader market. Hadoop at Yahoo!: 40K+ servers, 170 PB storage, 5M+ monthly jobs, 1000+ active users. 2011 HDP, training & support
    4. Where is Hadoop? • Today: Hadoop 1.x – Status & Roadmap • Tomorrow: Hadoop 2.x – YARN – HDFS HA • Enterprise integration
    5. Releases slowed with Hadoop take-up. [Timeline: 0.20.0, 0.20.1, 0.20.2, 0.21.0, 0.20.20{3,4,5}.0] • 64 releases • Branches from the last 2.5 years: – 0.20.{0,1,2}: stable release without security – 0.20.2xx.y: stable release with security – 0.21.0: released, unstable, deprecated – 0.22.0: orphan, unstable, lack of community
    6. Now: two release branches, one dev. Hadoop 1.x • Stable, used in production systems • The one to use today. Hadoop 2.0 • The successor • Not quite ready for use. Hadoop 2.x "trunk" • Where features & fixes first go in • If you want to help, start here
    7. Today: Hadoop 1.x • A stable Hadoop release from the ASF – Merges various Hadoop 0.20.* branches (security, HBase support, …) – A stable branch for patching and back-porting • Highlights: – Security – HBase support (“append” operation) – WebHDFS – “new” MapReduce APIs complete & usable – Distribution packaging includes RPM files
    8. WebHDFS: fast direct HTTP access. ~:$ GET http://nnode:50070/webhdfs/v1/results/part-r-00000.csv?op=open [output:] GATE4,eb8bd736445f415e18886ba037f84829,55000,2007-01-14,14:01:54, GATE4,ec58edcce1049fa665446dc1fa690638,8030803000,2007-01-14,13:52:31, GATE4,b6f07ce00f09035a6683c5e93e3c04b8,30000,2007-01-28,12:41:11, GATE4,a1bc345b756090854e9dd0011087c6c0,30000,2007-01-28,12:59:33, ... Potential uses: • Out-of-cluster access to HDFS • Cross-cluster, cross-version HDFS access • Native filesystem clients. Enabled with dfs.webhdfs.enabled=true
    9. Hortonworks Data Platform HDP1: based on Hadoop 1.0, adds – HCatalog for table and schema management – Open APIs for metadata, data movement, app & job management – Consumable “standard Hadoop” stack: Hadoop 1.0.x core (HDFS, MapReduce); Pig 0.9.x data flow programming language; Hive 0.8.x SQL-like language; HBase 0.92.x column table datastore; HCatalog 0.3.x table and schema management; ZooKeeper 3.4.x coordinator
    10. Post-SQL KVS & Column Tables: Project Voldemort
    11. Analysis tooling maturing: Pig, DataFu
    12. Ingress: Kafka, Fluentd, facebook / scribe
    13. Keep an eye on the graph layer: Apache Giraph, Hama. Workshop: Beyond MapReduce
    14. Tomorrow: Hadoop 2.0 • HDFS Federation – Clear separation of Namespace and Block Storage – Snapshots – Improved scalability and isolation • HDFS HA – Active/Standby failover of Namenodes • Next Generation MapReduce architecture (aka YARN) – New architecture enables other application types to plug in – Resource Manager a foundation for HA and fault tolerance • Performance! In beta 2012
    15. HDFS HA [Diagram: three ZK nodes receive heartbeats from two FailoverControllers, one Active and one Standby; each controller monitors the health of its NN, OS and HW and sends it commands. DNs send block reports to both the Active and Standby NN. DN fencing: update commands are accepted from one NN only.]
    16. YARN: foundation of a datacentre OS [Diagram: Clients, a Resource Manager, and Node Managers hosting Containers and App Masters; arrows labelled MapReduce Status, Job Submission, Node Status, Resource Request.] Multiple topology-aware applications in a single cluster
    17. Microsoft embraces Hadoop: Good for enterprises & developers. Great for end users!
    18. Oracle accepts NoSQL. May 2011: “Dont be risking your data on NoSQL databases.” Sept 2011: “Oracle NoSQL Database provides network-accessible multi-terabyte distributed key/value pair storage with predictable latency.” • Oracle need compatible SQL & NoSQL business plans • & to justify high-end servers over “commodity” x86 boxes • Could drive Hadoop-centric JVM development
    19. Open Source “Enterprise” Tooling. Application Layer • Spring Data for Hadoop in Beta • Cascading → Apache 2.0 License. OS Layer • RedHat building Hadoop story • Canonical assisting Hadoop packaging
    20. What does all this mean?
    21. facebook: 45 PB, Yahoo! 180+ PB
    22. Hadoop has the momentum • Platform: stable version & evolving version • Tooling & layers: ecosystem • Commercial training and support • Adoption by enterprise vendors
    23. Hadoop is the Big Data Platform
    24. Get involved with the Apache project! • Join the -user mailing lists – common-user@hadoop.apache.org – hdfs-user@hadoop.apache.org – mapreduce-user@hadoop.apache.org • File bug reports in JIRA • Contribute to the documentation • Add: patches, tests, features, …
    25. Questions? hortonworks.com
    26. hortonworks.com
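
The HDFS HA slide and its note describe failover controllers that monitor NN health and heartbeat to ZK, with DNs reporting to both NNs but taking commands from only one. The toy Python simulation below sketches that control flow. Every class and function name here is hypothetical, `LockService` is an in-memory stand-in for ZooKeeper rather than a real client, and letting the DN consult the lock directly is a simplification made for brevity; it is not how HDFS implements fencing.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class NameNode:
    name: str
    healthy: bool = True


@dataclass
class LockService:
    """In-memory stand-in for the ZK ensemble: records who holds the active lock."""
    active: Optional[str] = None


@dataclass
class FailoverController:
    """Monitors one NN and heartbeats its health to the lock service."""
    nn: NameNode
    zk: LockService

    def heartbeat(self) -> None:
        if not self.nn.healthy:
            # An unhealthy NN gives up the lock so its peer can take over.
            if self.zk.active == self.nn.name:
                self.zk.active = None
        elif self.zk.active is None:
            # A healthy NN claims the lock only when nobody holds it.
            self.zk.active = self.nn.name


@dataclass
class DataNode:
    zk: LockService
    reports_sent: Dict[str, int] = field(default_factory=dict)

    def send_block_reports(self, namenodes: List[NameNode]) -> None:
        # Block reports go to both NNs so the standby stays up to date.
        for nn in namenodes:
            self.reports_sent[nn.name] = self.reports_sent.get(nn.name, 0) + 1

    def accepts_command_from(self, nn: NameNode) -> bool:
        # Fencing (simplified): only the current active NN may issue commands.
        return self.zk.active == nn.name


def failover_demo():
    """Run one failover and return what was observed at each step."""
    zk = LockService()
    nn1, nn2 = NameNode("nn1"), NameNode("nn2")
    controllers = [FailoverController(nn1, zk), FailoverController(nn2, zk)]
    dn = DataNode(zk)

    for controller in controllers:
        controller.heartbeat()
    dn.send_block_reports([nn1, nn2])
    active_before = zk.active

    nn1.healthy = False  # simulate a failure of the active NN
    for controller in controllers:
        controller.heartbeat()

    return (active_before, zk.active,
            dn.accepts_command_from(nn1), dn.accepts_command_from(nn2),
            dn.reports_sent)
```

In this sketch the standby becomes active within one heartbeat round, which corresponds to the automated failover the note says comes later; the initial manual failover would replace the second heartbeat round with an operator action.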