Hadoop Talk
Brief background on me

    Phil has over 16 years of experience in data-centric system
    development. His work has moved from simulation and
    video-game-like systems, to high-performance computing (HPC), to
    traditional database (Oracle, SQL Server, Postgres, MySQL)
    and CRM (warehouse/analytical) systems, and most recently to
    the Hadoop stack. As an employee at TripAdvisor, he
    led the research into Hadoop/Hive that resulted in the
    successful migration from a traditional RDBMS platform to a
    system based on Hadoop/Hive and integrated with
    MS SQL Server/SSAS. He is currently focused on the Hadoop
    stack and is building a solution that integrates
    Hadoop into a more traditional enterprise environment.
Agenda

    To make you as excited about Hadoop as I am

    What is Hadoop (high-level)?

    What have we actually done with it?

    How does “it” (HDFS, M/R, Hive, and HBase) work?

    Future of Hadoop
What is Hadoop?
Q: What is Hadoop?
   A#1 - The thing that empowers
      Yahoo, FB, and others
         Yahoo has >25k Hadoop nodes…wow…
Q: What is Hadoop?
   A#2 - Last year’s revolution (sort of)
The Linux/Hadoop vs Closed-Source “conflict” is a false one, IMO, and I’ll explain why as we go on
Q: What is Hadoop?
   A#3 – The revolution of 5+ years ago
“Success has many fathers”
And you can look them up, because it’s FOSS!
People are fighting to contribute, and to get credit… be a contributor…
(http://hortonworks.com/reality-check-contributions-to-apache-hadoop/)
Q: What is Hadoop?
   A#4 – The wave everyone is riding

 Nearly all the big players (and many smaller ones) are on board…
In fact, beware of this
  http://nosql.mypopescu.com/post/2955078419/origin-of-nosql
What have we actually done with it?
Hadoop projects performed by BlueMetal Architects

    Hadoop at a Web 2.0 company (prior to BMA)
      Ported a traditional 30TB warehouse to Hive
      Big transform jobs in Hive
        e.g. joins of 50M rows to 12B rows
        Big Data jobs, e.g. social graph processing with
        many “Cartesians” to power emails
    Hadoop in HealthCare (at BMA)
      Applied HBase as part of a new system
      Feeds data (via WS) to:
        E.D.
        Patient Web Portal
        Other HealthCare affiliates
  Note: Both projects include Hadoop as part of larger systems.
Warehouse Goals





   Use the right tool for the right job
  – Hadoop (M/R, Hive) is a batch system
   • Inherently high-latency
  – RDBMS (& other tools) are still needed

   Empower users
  – Minimize complexity
   • Eliminate joins (almost)
   • Eliminate “dimensions” (maybe)
  – Expose *all* data
  – Provide low-latency options
  – Provide self-service options
A strategy for MASSIVE processing:
Best tool for the job
This is what we implemented and, it turns out, is also what Yahoo has done.
Yahoo’s SSAS cube is the largest in the world (14TB/quarter, 3B rows/day)
Focus back to Hadoop …
High-level descriptions are good,
but not enough. How does it work?
    (From: http://blog.nahurst.com/visual-guide-to-nosql-systems)
Here we go…
Map-Reduce (M/R) example
Note: this job is not optimized
Take home message: “Simple API - Mappers read the
input and emit K/V pairs. Framework sends Reducers
K/V pairs partitioned and ordered* by Key”
    (From: http://www.infosun.fim.uni-passau.de/cl/MapReduceFoundation/)
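The take-home message above can be sketched in a few lines of Python. This is only an in-memory illustration of the programming model, not Hadoop's API: the names `mapper`, `reducer`, and `run_job` are hypothetical, and word count is an assumed example job.

```python
from collections import defaultdict


def mapper(line):
    """Read one input record and emit (key, value) pairs."""
    for word in line.split():
        yield word.lower(), 1


def reducer(key, values):
    """Receive one key together with all of its values."""
    return key, sum(values)


def run_job(lines):
    # Stand-in for the framework: group every emitted value under its key.
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    # Reducers are handed keys in sorted order.
    return [reducer(key, groups[key]) for key in sorted(groups)]


if __name__ == "__main__":
    print(run_job(["Hadoop moves code", "code moves to data"]))
    # [('code', 2), ('data', 1), ('hadoop', 1), ('moves', 2), ('to', 1)]
```

The user writes only `mapper` and `reducer`; everything inside `run_job` is what the framework does on the user's behalf, across many machines.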
Hadoop M/R with some details:
Note: Partition, Combine and Shuffle
                (From: http://www.lecturemaker.com/2011/02/rhipe/)
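A rough sketch of the three steps named on this slide, again as plain in-memory Python rather than Hadoop code. The function names are hypothetical, and the CRC32-based partitioner is an assumption chosen only because it is stable between runs.

```python
import zlib
from collections import Counter, defaultdict


def partition(key, num_reducers):
    """Pick the reducer for a key; the same key always maps to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def combine(pairs):
    """Pre-aggregate one mapper's output locally, before it crosses the network."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())


def shuffle(mapper_outputs, num_reducers):
    """Route combined pairs to their reducer and sort each reducer's input by key."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in mapper_outputs:
        for key, value in combine(pairs):
            buckets[partition(key, num_reducers)][key].append(value)
    return [sorted(bucket.items()) for bucket in buckets]
```

The combiner is an optimization: in this sketch a mapper that emitted `("a", 1)` twice ships a single `("a", 2)` instead, which is the kind of network saving the slide is pointing at.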
Hadoop M/R Primer
Let’s discuss HDFS: (blocks, replication) and how that helps “data local tasks”
(From: Yahoo)
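To make blocks and replication concrete, here is a small arithmetic sketch. The 64 MB block size and replication factor of 3 are assumed values (both are configurable per cluster), and `hdfs_footprint` is a hypothetical helper, not part of any Hadoop API.

```python
def hdfs_footprint(file_bytes, block_bytes=64 * 1024**2, replication=3):
    """Return (blocks, block_replicas, raw_bytes) for one file.

    block_bytes and replication are assumptions for illustration only.
    """
    # Ceiling division: the last block may be only partly filled.
    blocks = (file_bytes + block_bytes - 1) // block_bytes
    block_replicas = blocks * replication
    # A partly filled last block is not padded, so raw usage is
    # simply the file size times the replication factor.
    raw_bytes = file_bytes * replication
    return blocks, block_replicas, raw_bytes


if __name__ == "__main__":
    one_gib = 1024**3
    print(hdfs_footprint(one_gib))  # (16, 48, 3221225472)
```

Under these assumptions a 1 GiB file becomes 16 blocks stored as 48 replicas on different nodes, so the scheduler has several candidate nodes per block on which a map task can read its input from local disk.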
Hadoop Terasort Job Profile
- or “hey, I thought it was just M/R”
                (From: http://developer.yahoo.com/blogs/hadoop/posts/2009/05/hadoop_sorts_a_petabyte_in_162/)
Why Hadoop?
Because you don’t want to handle this…
This is actually a profile of a job running on an old version of Hadoop, but jobs
with many failures look similar. This also shows improvement in Hadoop.
                (From: http://developer.yahoo.com/blogs/hadoop/posts/2009/05/hadoop_sorts_a_petabyte_in_162/)
Hadoop M/R executive summary

Distributed storage system, with distributed processing
capability, on commodity hardware (or in the cloud).

Moves the computation to the data!
That, in turn, saves network bandwidth, which is the limiting factor in
distributed apps.

The same code can run on data of any size. The cluster is
scaled with the data, not the code.
Hadoop Stack Key Components
(http://hortonworks.com/technology/hortonworksdataplatform/)
HCatalog is a recent project that allows Pig scripts to use Hive metadata/schemas.
                Hadoop is not just about unstructured or semi-structured data!
Hive
= HDFS
+ Metadata
+ HQL-> (efficient) M/R
+ more
= RDBMS
- low-latency (usually)
- (row-level) updates
- other (e.g. constraints)
+ HUGE scalability
+ POWERFUL distributed processing
Common RDBMS warehouse query




select top 10
  t.*
from (
  select ip_address, count(*) as cnt
  from f_pageviews pv
  join d_ipaddress ip on (pv.ip_key = ip.id)
  where date_key = 2992
  group by ip_address
)t
order by cnt desc

– wait a few minutes
– time is usually 1-4x nominal time depending on load
– … assumes the job can succeed at all!
Hive Version…
The luxury of Hadoop space/power means dimensional processing might not be
required.
NOTE: Hive does support “column-oriented” storage, which is very efficient.


select t.*
from (
  select ip_address, count(*) as cnt
  from f_lookback
  where ds = '2011-03-11'
  group by ip_address
) t
order by cnt desc
limit 10

– BUT – runtime is trickier
Time to run your job = HQL parse + M/R Job Submit + [ wait
in the queue for availability ] + M/R Job Runtime
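One way to picture the work behind the Hive query above is as two stages: a filter-and-count pass, then a top-N pass. The Python below is only a conceptual sketch of those semantics on in-memory rows; it is not the plan Hive actually generates, and the function names are hypothetical.

```python
import heapq
from collections import Counter


def count_by_ip(rows, ds):
    """Stage 1: keep one date partition and count rows per ip_address."""
    counts = Counter()
    for row in rows:
        if row["ds"] == ds:
            counts[row["ip_address"]] += 1
    return counts


def top_n(counts, n=10):
    """Stage 2: order by count descending and keep the first n."""
    return heapq.nlargest(n, counts.items(), key=lambda kv: kv[1])
```

Each stage costs at least one pass over its input, and on a shared cluster each pass also pays the submit and queue time listed in the formula above.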
What else can Hadoop do?



   FB: Invented Cassandra but went with HBase for their new messaging system.
   Does that mean HBase is “better”? – no, it’s about using the right tool for the job.
   http://www.facebook.com/note.php?note_id=454991608919



That’s to hold 135B messages per month !
http://highscalability.com/blog/2010/11/16/facebooks-new-real-time-messaging-system-hbase-to-store-135.html


Scale is relative (to your hardware and load),
but when you want a consistent “OLTP” solution that doesn’t require redesign to scale,
consider HBase.
HBase Architecture
Not shown: HMaster (HM), ZooKeeper (ZK), and HDFS
      (From: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html)
HBase: a more detailed view
 (http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html)
HBase: one way to look at it
A BigTable Implementation: memcached + LSM + framework
     (From: http://java.dzone.com/news/bigtable-model-cassandra-and)
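The “memcached + LSM” view can be sketched as a toy log-structured store: writes land in an in-memory map, which is flushed to immutable sorted files, and reads check the newest data first. `TinyLSM` and its `flush_threshold` are hypothetical and leave out most of what a real store needs (write-ahead log, compaction, deletes, distribution).

```python
import bisect


class TinyLSM:
    """Toy log-structured store: in-memory writes, sorted immutable flushes."""

    def __init__(self, flush_threshold=2):
        self.memstore = {}        # recent writes, held in memory
        self.store_files = []     # flushed sorted lists of (key, value), oldest first
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the memstore out as one sorted, never-modified 'file'."""
        if self.memstore:
            self.store_files.append(sorted(self.memstore.items()))
            self.memstore = {}

    def get(self, key):
        # Newest data wins: memory first, then files from newest to oldest.
        if key in self.memstore:
            return self.memstore[key]
        for store_file in reversed(self.store_files):
            i = bisect.bisect_left(store_file, (key,))
            if i < len(store_file) and store_file[i][0] == key:
                return store_file[i][1]
        return None
```

Writes never modify an existing file, which is why this layout fits on top of an append-oriented file system such as HDFS.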
HBase: Hadoop BigTable
Not just a CRUD back-end:
…coprocessors, versioned cells, range scans, optimization (e.g.
selective compression) via column families, etc.
       The most important of these is distributed processing.
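Two of the features listed above, versioned cells and range scans, can be illustrated with a small in-memory model. This is a sketch under assumptions, not the HBase client API: `VersionedTable`, its method names, and the `max_versions=3` setting are all hypothetical.

```python
import bisect


class VersionedTable:
    """Toy table: sorted row keys, several timestamped versions per cell."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions  # assumed retention setting
        self.row_keys = []                # kept sorted so range scans are cheap
        self.cells = {}                   # (row, column) -> [(timestamp, value)], newest first

    def put(self, row, column, timestamp, value):
        i = bisect.bisect_left(self.row_keys, row)
        if i == len(self.row_keys) or self.row_keys[i] != row:
            self.row_keys.insert(i, row)
        versions = self.cells.setdefault((row, column), [])
        versions.append((timestamp, value))
        versions.sort(reverse=True)
        del versions[self.max_versions:]  # drop versions beyond the retention limit

    def get(self, row, column, versions=1):
        """Return up to `versions` values for a cell, newest first."""
        return [value for _, value in self.cells.get((row, column), [])[:versions]]

    def scan(self, start_row, stop_row):
        """Return row keys in [start_row, stop_row): start inclusive, stop exclusive."""
        lo = bisect.bisect_left(self.row_keys, start_row)
        hi = bisect.bisect_left(self.row_keys, stop_row)
        return self.row_keys[lo:hi]
```

Because rows are kept in key order, a scan over a key prefix touches one contiguous slice, which is why row-key design matters so much in this model.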
Hadoop in (pre*) action
                    Hadoop indexed “THE DATA” for Watson
  http://developer.yahoo.com/blogs/hadoop/posts/2011/02/i%E2%80%99ll-take-hadoop-for-400-alex/

                        *Runtime processing used Apache JMS + UIMA.
Future of Hadoop
Overlapping Ecosystems

Hadoop (usage and contributions) will be
“shared” between FOSS and Closed Source
communities.




        Image from: http://cyhshonorsbio.wikispaces.com/The+Chemistry+of+Life
False Conflicts, with Solutions
             Sodium (explosive) + Chlorine (poison) =>
                           Salt (vital)




                                        From http://strangetimes.lastsuperpower.net/?p=1663




Closed Source + Open Source =>
Free + Enterprise + Support
+ Integration
Visit: http://en.wikipedia.org/wiki/Business_models_for_open_source_software#Hybrid
IMO, an important message from a
brilliant man
    Anant Jhingran Hadoop Summit 2011 IBM Watson & Big Data with Q&A

 http://www.youtube.com/watch?v=IVS__xF3Byg


Add value by fostering the ecosystem.
Do not fragment Hadoop (as Unix did).
There is room for folks from many areas to contribute and benefit.
Hadoop “option” (MapR) that plays nicely
MS embraced Hadoop despite having developed
technology similar to NextGen Hadoop. Wow.
Hadoop release on Azure is 3/12.
 BlueMetal Architects is part of the MS TAP program for Hadoop on Azure. Please
                      contact us as we’ll be blogging about it.
Hadoop NextGen:
 NN-HA, performance gains, more
Hadoop NextGen:
A Brave New (!?) world
Hadoop “nextGen” will support more than M/R, e.g. “Apache Giraph”
BUT, the diagram is from MS Dryad blogs. Graph processing will also be “big”.
Hadoop >> (un)structured data store.
Why do this (except ad-hoc)…?
RDBMS and Hadoop each have strengths; use them rather than negating both.
See the above Warehouse Architecture diagram…
       (From: http://nosql.mypopescu.com/post/344388408/hadoop-and-oracle-parallel-processing)
Q&A
Useful/Supporting Links
Bing crawls the web for Yahoo (for US, Canada, and some other countries)
http://www.ehow.com/info_8208930_isnt-yahoo-crawling-website.html
World’s largest SSAS Cube: 14TB/quarter, 3B rows/day
http://jobs.climber.com/jobs/Media-Communication/-CA-US/MS-SQL-SSAS-SSIS-Engineer/22735283

http://hadoop.apache.org/

http://www.docstoc.com/docs/66356954/Advanced-HBase

https://ccp.cloudera.com/display/SUPPORT/Hadoop+Tutorial

http://wiki.apache.org/hadoop/WordCount

https://blogs.apache.org/foundation/entry/apache_innovation_bolsters_ibm_s
Additional Slides
Fun Links
http://www.youtube.com/watch?v=tIrBVjVfjNY
