  1. 1. Hadoop Talk
  2. 2. Brief background on me: Phil has over 16 years of experience in data-centric system development. His work has flowed from simulation and video-game-like systems, to high-performance computing (HPC), to traditional database (Oracle, SQL Server, Postgres, MySQL) and CRM (warehouse/analytical) systems, and most recently to the Hadoop stack. Recently, as an employee at TripAdvisor, he led the research into Hadoop/Hive that resulted in the successful migration from the traditional RDBMS platform to a system based on Hadoop/Hive and integrated with MS SQL Server/SSAS. Currently, he’s focused on the Hadoop stack and is creating a solution that integrates Hadoop into a more traditional enterprise environment.
  3. 3. Agenda: To make you as excited about Hadoop as I am. What is Hadoop (high-level)? What have we actually done with it? How does “it” (HDFS, M/R, Hive, and HBase) work? Future of Hadoop.
  4. 4. What is Hadoop?
  5. 5. Q: What is Hadoop? A#1 - The thing that powers Yahoo, FB, and others. Yahoo has >25k Hadoop nodes… wow…
  6. 6. Q: What is Hadoop? A#2 - Last year’s revolution (sort of). The Linux/Hadoop vs Closed-Source “conflict” is a false one, IMO, and I’ll explain why as we go on.
  7. 7. Q: What is Hadoop? A#3 - The revolution of 5+ years ago.
  8. 8. “Success has many fathers.” And you can look them up, because it’s FOSS! People are fighting to contribute, and to get credit… be a contributor… (http://hortonworks.com/reality-check-contributions-to-apache-hadoop/)
  9. 9. Q: What is Hadoop? A#4 - The wave everyone is riding. Nearly all the big players (and many smaller ones) are on board…
  10. 10. In fact, beware of this: http://nosql.mypopescu.com/post/2955078419/origin-of-nosql
  11. 11. What have we actually done with it?
  12. 12. Hadoop projects performed by BlueMetal Architects
      Hadoop at a Web 2.0 company (prior to BMA)
        – Ported a traditional 30TB warehouse to Hive
        – Big transform jobs in Hive
          • E.g. joins 50M rows to 12B rows
        – Big Data jobs, e.g. Social Graph processing with many “Cartesians” to empower emails
      Hadoop in HealthCare (at BMA)
        – Applied HBase as part of a new system
        – Feeds data (via WS) to:
          • E.D.
          • Patient Web Portal
          • Other HealthCare affiliates
      Note: Both projects include Hadoop as part of larger systems.
  13. 13. Warehouse Goals
      Use the right tool for the right job
        – Hadoop (M/R, Hive) is a batch system
          • Inherently high-latency
        – RDBMS (& other tools) are still needed
      Empower users
        – Minimize complexity
          • Eliminate joins (almost)
          • Eliminate “dimensions” (maybe)
        – Expose *all* data
        – Provide low-latency options
        – Provide self-service options
  14. 14. A strategy for MASSIVE processing: Best tool for the job. This is what we implemented and, it turns out, is also what Yahoo has done. Yahoo’s SSAS cube is the largest in the world (14TB/quarter, 3B rows/day).
  15. 15. Focus back to Hadoop …
  16. 16. High-level descriptions are good, but not enough. How does it work? (From: http://blog.nahurst.com/visual-guide-to-nosql-systems)
  17. 17. Here we go…
  18. 18. Map-Reduce (M/R) example. Note: this job is not optimized. Take-home message: “Simple API - Mappers read the input and emit K/V pairs. Framework sends Reducers K/V pairs partitioned and ordered* by Key” (From: http://www.infosun.fim.uni-passau.de/cl/MapReduceFoundation/)
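A minimal sketch of that take-home message, written as a toy single-process word count in Python. This is not the Hadoop API: `run_mapreduce` is a hypothetical helper that stands in for the framework, and the input lines are made up for illustration.

```python
from collections import defaultdict


def mapper(line):
    """Map step: read one input record and emit (key, value) pairs."""
    for word in line.lower().split():
        yield word, 1


def reducer(key, values):
    """Reduce step: receive one key together with all of its values."""
    return key, sum(values)


def run_mapreduce(records, mapper, reducer):
    """Toy stand-in for the framework (hypothetical helper, one process)."""
    groups = defaultdict(list)
    for record in records:                      # map phase
        for key, value in mapper(record):
            groups[key].append(value)           # group values by key
    # Reducers are handed their keys in sorted order.
    return [reducer(key, groups[key]) for key in sorted(groups)]


# Illustrative input only.
lines = ["the quick brown fox", "the lazy dog", "the fox"]
word_counts = run_mapreduce(lines, mapper, reducer)
```

The user writes only `mapper` and `reducer`; grouping and ordering by key belong to the framework, which is what keeps the API simple.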
  19. 19. Hadoop M/R with some details. Note: Partition, Combine and Shuffle (From: http://www.lecturemaker.com/2011/02/rhipe/)
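The three steps named on this slide can be sketched in a few lines of Python. This is an illustration under assumptions, not Hadoop code: two reducers are assumed, `zlib.crc32` stands in for the key hash a real partitioner would use, and all function names are hypothetical.

```python
import zlib
from collections import Counter, defaultdict

NUM_REDUCERS = 2  # assumed cluster setting, for illustration only


def partition(key, num_reducers=NUM_REDUCERS):
    """Partition: pick the reducer for a key. The hash must be stable so
    that every mapper sends a given key to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def map_with_combiner(split):
    """Map one input split, then combine locally: counts are pre-summed
    per key before anything crosses the network."""
    return Counter(split.lower().split())


def shuffle(map_outputs):
    """Shuffle: route every combined (key, count) to its reducer's bucket."""
    buckets = [defaultdict(list) for _ in range(NUM_REDUCERS)]
    for output in map_outputs:
        for key, count in output.items():
            buckets[partition(key)][key].append(count)
    return buckets


def reduce_bucket(bucket):
    """Each reducer sees its keys in sorted order and sums partial counts."""
    return [(key, sum(bucket[key])) for key in sorted(bucket)]


# Illustrative input: two splits, as if processed by two mappers.
splits = ["the fox the dog", "the fox"]
map_outputs = [map_with_combiner(s) for s in splits]
results = [reduce_bucket(b) for b in shuffle(map_outputs)]
```

The combiner is the optimization here: the first mapper ships one pair for "the" with a count of 2 instead of two separate pairs.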
  20. 20. Hadoop M/R Primer. Let’s discuss HDFS (blocks, replication) and how that helps “data local tasks”. (From: Yahoo)
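A rough sketch of how blocks and replication make data-local tasks possible. The 64 MB block size and replication factor of 3 are assumed values (both are configurable), the round-robin placement is a deliberate simplification of real HDFS placement, and every name below is hypothetical.

```python
import math

BLOCK_SIZE_MB = 64   # assumed block size
REPLICATION = 3      # assumed replication factor


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """A file is stored as fixed-size blocks; the last one may be partial."""
    return math.ceil(file_size_mb / block_size_mb)


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Simplified round-robin placement: each block lands on `replication`
    distinct nodes (requires replication <= len(nodes))."""
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def pick_node_for_task(block, placement, free_nodes):
    """Prefer a free node that already holds a replica (data-local task);
    otherwise fall back to any free node and read over the network."""
    for node in placement[block]:
        if node in free_nodes:
            return node, True
    return sorted(free_nodes)[0], False


nodes = ["node1", "node2", "node3", "node4"]
num_blocks = split_into_blocks(1000)        # a 1000 MB file
placement = place_replicas(num_blocks, nodes)
node, is_local = pick_node_for_task(0, placement, free_nodes={"node2", "node4"})
```

With three replicas per block, the scheduler has three chances to find a free node that already holds the data, which is why replication helps throughput as well as durability.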
  21. 21. Hadoop Terasort Job Profile - or “hey, I thought it was just M/R” (from http://developer.yahoo.com/blogs/hadoop/posts/2009/05/hadoop_sorts_a_petabyte_in_162/)
  22. 22. Why Hadoop? Because you don’t want to handle this… This is actually a profile of a job running on an old version of Hadoop, but jobs with many failures look similar. This also shows improvement in Hadoop. (From: http://developer.yahoo.com/blogs/hadoop/posts/2009/05/hadoop_sorts_a_petabyte_in_162/)
  23. 23. Hadoop M/R executive summary: Distributed storage system, with distributed processing capability, on commodity hardware (or in the cloud). Moves the computation to the data! That, in turn, saves network bandwidth, which is the limiting factor in distributed apps. The same code can run on data of any size. The cluster is scaled with the data, not the code.
  24. 24. Hadoop Stack Key Components (http://hortonworks.com/technology/hortonworksdataplatform/). HCatalog is a recent project that allows Pig scripts to use Hive metadata/schemas. Hadoop is not just about non/semi-structured data!
  25. 25. Hive
      = HDFS + Metadata + HQL -> (efficient) M/R + more
      = RDBMS
        - low-latency (usually)
        - (row-level) updates
        - other (e.g. constraints)
        + HUGE scalability
        + POWERFUL distributed processing
  26. 26. Common RDBMS warehouse query
      select top 10 t.*
      from (
        select ip_address, count(*) as cnt
        from f_pageviews pv
        join d_ipaddress ip on (pv.ip_key = ip.id)
        where date_key = 2992
        group by ip_address
      ) t
      order by cnt desc
      – wait a few minutes
      – time is usually 1-4x nominal time depending on load
      – … assumes the job can succeed at all!
  27. 27. Hive Version… The luxury of Hadoop space/power means dimensional processing might not be required.
      NOTE: Hive does support “column-oriented” storage, which is very efficient.
      select t.*
      from (
        select ip_address, count(*) as cnt
        from f_lookback
        where ds = '2011-03-11'
        group by ip_address
      ) t
      order by cnt desc
      limit 10
      – BUT – runtime is trickier
      Time to run your job = HQL parse + M/R Job Submit + [ wait in the queue for availability ] + M/R Job Runtime
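To make the "HQL -> M/R" step concrete, here is a hedged Python sketch of the kind of two-stage plan such a query could turn into: one stage that filters and counts per `ip_address`, and one that orders and limits. It does not show Hive's actual planner output; the rows and function names are invented for illustration.

```python
import heapq
from itertools import groupby
from operator import itemgetter

# Hypothetical rows standing in for the f_lookback table.
rows = [
    {"ds": "2011-03-11", "ip_address": "10.0.0.1"},
    {"ds": "2011-03-11", "ip_address": "10.0.0.3"},
    {"ds": "2011-03-11", "ip_address": "10.0.0.1"},
    {"ds": "2011-03-11", "ip_address": "10.0.0.2"},
    {"ds": "2011-03-11", "ip_address": "10.0.0.3"},
    {"ds": "2011-03-11", "ip_address": "10.0.0.1"},
    {"ds": "2011-03-10", "ip_address": "10.0.0.2"},  # other partition, filtered out
]


def stage1_count_by_ip(rows, ds):
    """Stage 1 (group by): map emits (ip, 1) for rows in the wanted
    partition, the sort stands in for the shuffle, reduce sums per key."""
    pairs = [(r["ip_address"], 1) for r in rows if r["ds"] == ds]
    pairs.sort(key=itemgetter(0))
    return [
        (ip, sum(count for _, count in group))
        for ip, group in groupby(pairs, key=itemgetter(0))
    ]


def stage2_top_n(counts, n):
    """Stage 2 (order by cnt desc limit n): keep the n largest counts."""
    return heapq.nlargest(n, counts, key=itemgetter(1))


top = stage2_top_n(stage1_count_by_ip(rows, "2011-03-11"), 10)
```

Each stage pays its own submit and queue cost, which is one reason the runtime formula on the slide has more terms than the query text suggests.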
  28. 28. What else can Hadoop do? FB: Invented Cassandra but went with HBase for their new messaging system. Does that mean HBase is “better”? No, it’s about using the right tool for the job. http://www.facebook.com/note.php?note_id=454991608919 That’s to hold 135B messages per month! http://highscalability.com/blog/2010/11/16/facebooks-new-real-time-messaging-system-hbase-to-store-135.html Scale is relative (to your hardware and load), but when you want a consistent “OLTP” solution that doesn’t require redesign to scale, consider HBase.
  29. 29. HBase Architecture. Not shown: HMaster (HM), ZooKeeper (ZK) and HDFS (From: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html)
  30. 30. HBase: a more detailed view (http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html)
  31. 31. HBase: one way to look at it. A BigTable Implementation: memcached + LSM + framework (From: http://java.dzone.com/news/bigtable-model-cassandra-and)
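The "memcached + LSM" view can be sketched as a toy log-structured merge store in Python: writes land in an in-memory buffer, the buffer is flushed to immutable sorted files, and reads check the newest data first. This is an illustration only, with hypothetical names; it leaves out the write-ahead log, block cache, and everything distributed that real HBase adds.

```python
import bisect


class TinyLSM:
    """Toy LSM store: in-memory buffer plus immutable sorted files."""

    def __init__(self, flush_threshold=2):
        self.memstore = {}        # recent writes, held in memory
        self.store_files = []     # sorted lists of (key, value), newest last
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        """Writes only touch memory; a full buffer triggers a flush."""
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the buffer out as one sorted, immutable file."""
        if self.memstore:
            self.store_files.append(sorted(self.memstore.items()))
            self.memstore = {}

    def get(self, key):
        """Check memory first, then files from newest to oldest."""
        if key in self.memstore:
            return self.memstore[key]
        for store_file in reversed(self.store_files):
            keys = [k for k, _ in store_file]
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return store_file[i][1]
        return None

    def compact(self):
        """Merge all files into one; newer values overwrite older ones."""
        merged = {}
        for store_file in self.store_files:   # oldest first, so newest wins
            merged.update(store_file)
        self.store_files = [sorted(merged.items())] if merged else []


store = TinyLSM(flush_threshold=2)
store.put("row1", "a")
store.put("row2", "b")      # buffer is full: first flush
store.put("row1", "c")      # newer value for row1
store.put("row3", "d")      # second flush
latest_row1 = store.get("row1")
store.compact()
```

Writes never modify a file in place, which is what makes this layout a good fit for an append-oriented file system like HDFS.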
  32. 32. HBase: Hadoop BigTable. Not just a CRUD back-end: … coprocessors, versioned cells, range scans, optimization (e.g. selective compression) via column families, etc. The most important of these is distributed processing.
  33. 33. Hadoop in (pre*) action: Hadoop indexed “THE DATA” for Watson http://developer.yahoo.com/blogs/hadoop/posts/2011/02/i%E2%80%99ll-take-hadoop-for-400-alex/ *Runtime processing used Apache JMS + UIMA.
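Two of the features listed above, versioned cells and range scans, can be illustrated with a small in-memory Python sketch. It is not the HBase client API: `TinyTable`, its methods, the `msg:body` column name, and the row-key scheme are all assumptions made up for the example.

```python
import bisect


class TinyTable:
    """Toy table: sorted row keys, cells keep several timestamped versions."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.row_keys = []   # kept sorted so range scans are cheap
        self.cells = {}      # (row, column) -> [(timestamp, value)], newest first

    def put(self, row, column, value, timestamp):
        i = bisect.bisect_left(self.row_keys, row)
        if i == len(self.row_keys) or self.row_keys[i] != row:
            self.row_keys.insert(i, row)
        versions = self.cells.setdefault((row, column), [])
        versions.append((timestamp, value))
        versions.sort(reverse=True)          # newest version first
        del versions[self.max_versions:]     # drop versions beyond the limit

    def get(self, row, column, timestamp=None):
        """Return the newest value, or the newest one at or before `timestamp`."""
        for ts, value in self.cells.get((row, column), []):
            if timestamp is None or ts <= timestamp:
                return value
        return None

    def scan(self, start_row, stop_row):
        """Range scan: row keys in [start_row, stop_row), in sorted order."""
        lo = bisect.bisect_left(self.row_keys, start_row)
        hi = bisect.bisect_left(self.row_keys, stop_row)
        return self.row_keys[lo:hi]


table = TinyTable()
table.put("user1#msg1", "msg:body", "hello", timestamp=1)
table.put("user1#msg1", "msg:body", "hello (edited)", timestamp=2)
table.put("user1#msg2", "msg:body", "hi", timestamp=3)
table.put("user2#msg1", "msg:body", "yo", timestamp=4)
user1_rows = table.scan("user1#", "user1$")   # '$' sorts right after '#'
```

Because rows are stored in key order, putting the user id at the front of the row key turns "all messages for one user" into a single contiguous scan.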
  34. 34. Future of Hadoop
  35. 35. Overlapping Ecosystems: Hadoop (usage and contributions) will be “shared” between FOSS and Closed Source communities. Image from: http://cyhshonorsbio.wikispaces.com/The+Chemistry+of+Life
  36. 36. False Conflicts, with Solutions: Sodium (explosive) + Chlorine (poison) => Salt (vital). From http://strangetimes.lastsuperpower.net/?p=1663 Closed Source + Open Source => Free + Enterprise + Support + Integration. Visit: http://en.wikipedia.org/wiki/Business_models_for_open_source_software#Hybrid
  37. 37. IMO, an important message from a brilliant man: Anant Jhingran, Hadoop Summit 2011, IBM Watson & Big Data with Q&A http://www.youtube.com/watch?v=IVS__xF3Byg Add value by fostering the ecosystem. Do not fragment Hadoop (as Unix did). There is room for folks from many areas to contribute and benefit.
  38. 38. Hadoop “option” (MapR) that plays nicely
  39. 39. MS embraced Hadoop despite having developed technology similar to NextGen Hadoop. Wow. Hadoop release on Azure is 3/12. BlueMetal Architects is part of the MS TAP program for Hadoop on Azure. Please contact us as we’ll be blogging about it.
  40. 40. Hadoop NextGen: NN-HA, performance gains, more
  41. 41. Hadoop NextGen: A Brave New (!?) World. Hadoop “NextGen” will support more than M/R, e.g. “Apache Giraph”. BUT, the diagram is from MS Dryad blogs. Graph processing will also be “big”.
  42. 42. Hadoop >> (un)structured data store. Why do this (except ad-hoc)…? RDBMS and Hadoop have strengths; use them, don’t negate both. See the above Warehouse Architecture diagram… (From: http://nosql.mypopescu.com/post/344388408/hadoop-and-oracle-parallel-processing)
  43. 43. Q&A
  45. 45. Additional Slides
