What is Hadoop? Oct 17 2013


A brief introduction to Hadoop for the Georgian Partners CTO Conference. It outlines the origins of open source Apache Hadoop and how Hortonworks fits into this picture. There is also a brief introduction to YARN, the new resource negotiation layer.


Transcript

  • 1. Adam Muise – Hortonworks WELCOME TO HADOOP
  • 2. Who am I?
  • 3. Why are we here?
  • 4. Data
  • 5. “Big Data” is the marketing term of the decade
  • 6. What lurks behind the marketing and hype is a legitimate movement forward in dealing with data
  • 7. You need to deal with Data
  • 8. Put it away, delete it, tweet it, compress it, shred it, wikileak-it, put it in a database, put it in SAN/NAS, put it in the cloud, hide it in tape…
  • 9. Let’s talk challenges…
  • 10. Volume Volume Volume Volume
  • 11-13. [The word “Volume” repeated in growing numbers on each successive slide]
  • 14. Storage, Management, Processing all become challenges with Data at Volume
  • 15. Traditional technologies adopt a divide, drop, and conquer approach
  • 16. The solution? [Diagram: data split across OLTP, EDW, Another EDW, Yet Another EDW, and Analytical DB]
  • 17. Ummm…you dropped something [Diagram: the same OLTP, EDW, Another EDW, Yet Another EDW, and Analytical DB systems, with many more “Data” items]
  • 18. Analyzing the data usually raises more interesting questions…
  • 19. …which leads to more data
  • 20. Wait, you’ve seen this before. [Diagram: Data … Sausage Factory … Data]
  • 21. Your data silos are lonely places. [Diagram labels: EDW, Accounts, Customers, Web Properties]
  • 22. … Data likes to be together. [Diagram labels: EDW, Accounts, Customers, Web Properties]
  • 23. New types of data don’t quite fit your pristine view of the world [Diagram labels: Logs, CDR/SIP, My Little Data Empire, with “?” marks among the data]
  • 24. To resolve this, some people take hints from Lord of the Rings…
  • 25. …and create One-Schema-To-Rule-Them-All… [Diagram labels: EDW, Schema, Data]
  • 26. …but that has its problems too. [Diagram labels: ETL, EDW, Schema, Data]
  • 27. So what is the answer?
  • 28. Enter the Hadoop. ……… http://www.fabulouslybroke.com/2011/05/ninja-elephants-and-other-awesome-stories/
  • 29. Hadoop was created because Big IT never cut it for the Internet Properties like Google, Yahoo, Facebook, Twitter, LinkedIn
  • 30. Traditional architecture didn’t scale enough… [Diagram labels: App, DB, SAN]
  • 31. Traditional architectures cost too much at that volume… [Diagram labels: $upercomputing, $pecial Hardware, $/TB]
  • 32. So what is the answer?
  • 33. If you could design a system that would handle this, what would it look like?
  • 34. It would probably need a highly resilient, self-healing, cost-efficient, distributed file system… [Diagram: a grid of Storage nodes]
  • 35. It would probably need a completely parallel processing framework that took tasks to the data… [Diagram: Processing paired with each Storage node]
  • 36. It would probably run on commodity hardware, virtualized machines, and common OS platforms [Diagram: Processing paired with each Storage node]
  • 37. It would probably be open source so innovation could happen as quickly as possible
  • 38. It would need a critical mass of users
  • 39. {Processing + Storage} = {MapReduce/YARN + HDFS}
  • 40. HDFS stores data in blocks and replicates those blocks [Diagram: block1, block2, and block3 each appear three times across the Storage nodes]
  • 41. If a block fails then HDFS always has the other copies and heals itself [Diagram: one copy is marked X; block1, block2, and block3 still each appear three times]
  • 42. MapReduce is a programming paradigm that is completely parallel [Diagram labels: Data, Mapper, Reducer, Data]
  • 43. MapReduce has three phases: Map, Sort/Shuffle, Reduce [Diagram labels: Key, Value pairs; Mapper; Reducer]
  • 44. MapReduce applies to a lot of data processing problems [Diagram labels: Data, Mapper, Reducer, Data]
  • 45. Introducing YARN
  • 46. YARN = Yet Another Resource Negotiator
  • 47. YARN abstracts resource management so you can run more than just MapReduce [Diagram labels: MapReduce V2, MapReduce V?, STORM, Giraph, Tez, MPI, HBase, … and more; YARN; HDFS2]
  • 48. YARN turns Hadoop into a smart phone: An App Ecosystem hortonworks.com/yarn/
  • 49. Check out the book too… Preview at: hortonworks.com/yarn/
  • 50. YARN is an essential part of a balanced breakfast in Hadoop 2.0. Oct 15 2013: Apache Community releases Hadoop 2.2.0. Halloween 2013: Hortonworks releases HDP 2.0 GA
  • 51. [Picture]
  • 52. Hadoop has other open source projects…
  • 53. Hive = {SQL -> MapReduce} SQL-IN-HADOOP
  • 54. Pig = {PigLatin -> MapReduce}
  • 55. HCatalog = {metadata* for MapReduce, Hive, Pig, HBase, etc.} *metadata = tables, columns, partitions, types
  • 56. Oozie = Job::{Task, Task, if Task, then Task, final Task}
  • 57. Falcon [Diagram labels: Feed, Hadoop, Replication, Hadoop DR]
  • 58. Flume [Diagram labels: Files, JMS, Weblogs, Events, Flume, Hadoop]
  • 59. Sqoop [Diagram labels: DB, Sqoop, Hadoop]
  • 60. Ambari = {install, manage, monitor}
  • 61. HBase = {real-time, distributed-map, big-tables}
  • 62. Storm = {Complex Event Processing, Near-Real-Time, Provisioned by YARN}
  • 63. Apache Hadoop: Storm, HDFS, YARN, Pig, MapReduce, HCatalog, Hive, HBase, Ambari, Sqoop, Falcon, Flume
  • 64. Hortonworks Data Platform: Storm, Pig, HDFS, YARN, MapReduce, HCatalog, Hive, HBase, Ambari, Sqoop, Falcon, Flume
  • 65. What else are we working on? hortonworks.com/labs/
  • 66. Hadoop is the new Data Operating System for the Enterprise
  • 67. There is NO second place. Hortonworks …the Bull Elephant of Hadoop Innovation. © Hortonworks Inc. 2012: DO NOT SHARE. CONTAINS HORTONWORKS CONFIDENTIAL & PROPRIETARY INFORMATION
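
Slides 40 and 41 say that HDFS stores data in blocks, replicates those blocks, and heals itself when a copy is lost. The sketch below is a toy in-memory model of that idea, not the HDFS API: the function names (`place_blocks`, `heal`, `replica_count`), the round-robin placement, and the node names are all assumptions made for illustration. The replication factor of 3 matches the three copies per block shown in the slide diagram.

```python
import itertools

REPLICATION = 3  # assumption: three copies per block, as drawn on slide 40


def place_blocks(blocks, nodes, replication=REPLICATION):
    """Toy placement: put each block on `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {node: set() for node in nodes}
    ring = itertools.cycle(nodes)
    for block in blocks:
        # Consecutive picks from the ring are distinct because replication <= len(nodes).
        for _ in range(replication):
            placement[next(ring)].add(block)
    return placement


def replica_count(placement, block):
    """Number of nodes currently holding a copy of `block`."""
    return sum(block in held for held in placement.values())


def heal(placement, failed_node, replication=REPLICATION):
    """Drop a failed node, then re-copy its blocks onto the least-loaded survivors."""
    lost = placement.pop(failed_node)
    for block in sorted(lost):
        missing = max(0, replication - replica_count(placement, block))
        candidates = [n for n, held in placement.items() if block not in held]
        candidates.sort(key=lambda n: (len(placement[n]), n))
        for target in candidates[:missing]:
            placement[target].add(block)
    return placement


if __name__ == "__main__":
    cluster = place_blocks(["block1", "block2", "block3"],
                           ["node1", "node2", "node3", "node4"])
    heal(cluster, "node1")
    for node, held in sorted(cluster.items()):
        print(node, sorted(held))
```

A real cluster makes these decisions centrally and takes rack layout into account; the point here is only that losing one node never loses a block as long as other replicas exist, and that the missing copies can be recreated from the survivors.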
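
Slide 43 names the three MapReduce phases: Map, Sort/Shuffle, Reduce. The following is a minimal single-process sketch of those phases using word counting as the example problem. It does not use Hadoop's API; the helper names (`map_phase`, `shuffle_sort`, `reduce_phase`, `run_job`) and the word-count mapper and reducer are hypothetical and chosen only to mirror the key/value flow in the slide diagram.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Map: apply the mapper to every input record, emitting (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle_sort(pairs):
    """Sort/Shuffle: group all values by key and order the groups by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(grouped, reducer):
    """Reduce: collapse each key's list of values into one result."""
    return [reducer(key, values) for key, values in grouped]


def run_job(records, mapper, reducer):
    """Chain the three phases in order."""
    return reduce_phase(shuffle_sort(map_phase(records, mapper)), reducer)


def word_count_mapper(line):
    """Emit (word, 1) for every word in a line of text."""
    for word in line.lower().split():
        yield word, 1


def word_count_reducer(word, counts):
    """Sum the per-occurrence counts for one word."""
    return word, sum(counts)


if __name__ == "__main__":
    lines = ["big data", "Big elephants"]
    print(run_job(lines, word_count_mapper, word_count_reducer))
```

In Hadoop the mappers and reducers run as separate tasks on different machines, close to the data blocks they read; in this sketch everything runs in one process, so it shows the data flow but none of the parallelism.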