What is Hadoop? Oct 17 2013

What is Hadoop brief intro for Georgian Partners CTO Conference. This outlines the origins of Open Source Apache Hadoop and how Hortonworks fits into this picture. There is also a brief introduction to YARN, the new resource negotiation layer.

Published in: Technology, Business

Transcript of "What is Hadoop? Oct 17 2013"

  1. Adam Muise – Hortonworks: WELCOME TO HADOOP
  2. Who am I?
  3. Why are we here?
  4. Data
  5. “Big Data” is the marketing term of the decade
  6. What lurks behind the marketing and hype is a legitimate movement forward in dealing with data
  7. You need to deal with Data
  8. Put it away, delete it, tweet it, compress it, shred it, wikileak-it, put it in a database, put it in SAN/NAS, put it in the cloud, hide it in tape…
  9. Let’s talk challenges…
  10. Volume [slide graphic: “Volume” repeated a handful of times]
  11. [slide graphic: “Volume” repeated across the slide]
  12. [slide graphic: “Volume” nearly filling the slide]
  13. [slide graphic: “Volume” filling the slide completely]
  14. Storage, Management, Processing all become challenges with Data at Volume
  15. Traditional technologies adopt a divide, drop, and conquer approach
  16. The solution? [slide graphic: data scattered across an EDW, Another EDW, Yet Another EDW, an Analytical DB, and OLTP]
  17. Ummm… you dropped something [slide graphic: the same silos, with data spilling out everywhere]
  18. Analyzing the data usually raises more interesting questions…
  19. …which leads to more data
  20. Wait, you’ve seen this before. [slide graphic: a “Sausage Factory” turning data into ever more data]
  21. Your data silos are lonely places. [slide graphic: EDW, Accounts, Customers, and Web Properties silos, each full of data]
  22. … Data likes to be together. [slide graphic: the same silos, with their data pooled]
  23. New types of data don’t quite fit your pristine view of the world [slide graphic: Logs and CDR/SIP data meeting question marks at the border of “My Little Data Empire”]
  24. To resolve this, some people take hints from Lord Of The Rings…
  25. …and create One-Schema-To-Rule-Them-All… [slide graphic: a single EDW schema over all the data]
  26. …but that has its problems too. [slide graphic: rows of ETL jobs funneling data into the one-schema EDW]
  27. So what is the answer?
  28. Enter the Hadoop. http://www.fabulouslybroke.com/2011/05/ninja-elephants-and-other-awesome-stories/
  29. Hadoop was created because Big IT never cut it for Internet properties like Google, Yahoo, Facebook, Twitter, and LinkedIn
  30. Traditional architecture didn’t scale enough… [slide graphic: stacks of App, DB, and SAN tiers]
  31. Traditional architectures cost too much at that volume… [slide graphic: $upercomputing, $pecial Hardware, $/TB]
  32. So what is the answer?
  33. If you could design a system that would handle this, what would it look like?
  34. It would probably need a highly resilient, self-healing, cost-efficient, distributed file system… [slide graphic: a grid of Storage nodes]
  35. It would probably need a completely parallel processing framework that took tasks to the data… [slide graphic: Processing co-located with Storage on every node]
  36. It would probably run on commodity hardware, virtualized machines, and common OS platforms [same Processing/Storage graphic]
  37. It would probably be open source so innovation could happen as quickly as possible
  38. It would need a critical mass of users
  39. {Processing + Storage} = {MapReduce/YARN + HDFS}
  40. HDFS stores data in blocks and replicates those blocks [slide graphic: block1, block2, block3 each replicated across three Processing/Storage nodes]
  41. If a block fails then HDFS always has the other copies and heals itself [slide graphic: one replica marked with an X; the surviving copies remain]
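Slides 40 and 41 can be sketched as a toy model of block placement and re-replication. This is plain Python for illustration only; the `Cluster` class, `REPLICATION` constant, and placement policy are assumptions, not HDFS APIs (real HDFS is rack-aware and far more sophisticated):

```python
REPLICATION = 3  # HDFS's default replication factor

class Cluster:
    """Toy model of HDFS-style block placement and self-healing."""

    def __init__(self, nodes):
        self.nodes = {n: set() for n in nodes}

    def store(self, block):
        # Place copies on the least-loaded nodes (real HDFS also considers racks).
        targets = sorted(self.nodes, key=lambda n: len(self.nodes[n]))[:REPLICATION]
        for n in targets:
            self.nodes[n].add(block)

    def replicas(self, block):
        return [n for n, blocks in self.nodes.items() if block in blocks]

    def fail_node(self, dead):
        lost = self.nodes.pop(dead)
        # Self-healing: re-replicate any block that fell below the target count.
        for block in lost:
            while len(self.replicas(block)) < min(REPLICATION, len(self.nodes)):
                spare = min((n for n in self.nodes if block not in self.nodes[n]),
                            key=lambda n: len(self.nodes[n]))
                self.nodes[spare].add(block)

c = Cluster(["node1", "node2", "node3", "node4"])
for b in ["block1", "block2", "block3"]:
    c.store(b)
c.fail_node("node1")
# Every block still has 3 live replicas after the node failure.
print(all(len(c.replicas(b)) == 3 for b in ["block1", "block2", "block3"]))  # True
```

The point of the sketch is the `fail_node` path: losing a node never loses data, because surviving replicas are copied to other nodes until the target count is restored.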
  42. MapReduce is a programming paradigm that is completely parallel [slide graphic: data flowing through Mappers into Reducers]
  43. MapReduce has three phases: Map, Sort/Shuffle, Reduce [slide graphic: Key,Value pairs emitted by Mappers, grouped in the shuffle, and combined by Reducers]
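The three phases on slide 43 can be shown with an in-memory word count. This is a sketch of the paradigm in plain Python, not the Hadoop API; in a real job the framework runs the shuffle across the cluster:

```python
from collections import defaultdict
from itertools import chain

docs = ["big data big hype", "big data big value"]

# Map: emit (key, value) pairs from each input record.
def mapper(line):
    for word in line.split():
        yield (word, 1)

# Sort/Shuffle: group all values by key (Hadoop does this between phases).
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce: fold each key's values into a single result.
def reducer(key, values):
    return (key, sum(values))

pairs = chain.from_iterable(mapper(d) for d in docs)
result = dict(reducer(k, vs) for k, vs in shuffle(pairs).items())
print(result)  # {'big': 4, 'data': 2, 'hype': 1, 'value': 1}
```

Because each `mapper` call touches only one record and each `reducer` call only one key's group, both phases can run on many machines at once, which is what "completely parallel" means on slide 42.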
  44. MapReduce applies to a lot of data processing problems [same Mapper/Reducer graphic]
  45. Introducing YARN
  46. YARN = Yet Another Resource Negotiator
  47. YARN abstracts resource management so you can run more than just MapReduce [slide graphic: MapReduce V2, MapReduce V?, STORM, Giraph, Tez, MPI, HBase, and more, all running on YARN over HDFS2]
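The idea behind slide 47, one negotiator handing shared cluster resources to many frameworks, can be sketched as a toy model. The `ResourceNegotiator` class and memory-only accounting are illustrative assumptions, not YARN's actual ResourceManager API:

```python
class ResourceNegotiator:
    """Toy YARN: applications request containers; grants come from a shared pool."""

    def __init__(self, total_memory_mb):
        self.free = total_memory_mb
        self.granted = {}

    def request(self, app, containers, mb_each):
        need = containers * mb_each
        if need <= self.free:
            self.free -= need
            self.granted[app] = self.granted.get(app, 0) + containers
            return True
        return False  # the app must wait or ask for less

    def release(self, app, containers, mb_each):
        self.granted[app] -= containers
        self.free += containers * mb_each

yarn = ResourceNegotiator(total_memory_mb=8192)
print(yarn.request("MapReduce v2", 4, 1024))  # True: 4 GB granted
print(yarn.request("Storm", 4, 1024))         # True: the remaining 4 GB granted
print(yarn.request("Giraph", 1, 1024))        # False: the cluster is full
yarn.release("MapReduce v2", 4, 1024)
print(yarn.request("Giraph", 1, 1024))        # True: resources were freed
```

The design point is that no framework owns the cluster: MapReduce, Storm, and Giraph all go through the same negotiator, which is what lets Hadoop 2.0 run "more than just MapReduce."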
  48. YARN turns Hadoop into a smart phone: An App Ecosystem. hortonworks.com/yarn/
  49. Check out the book too… Preview at: hortonworks.com/yarn/
  50. YARN is an essential part of a balanced breakfast in Hadoop 2.0. Oct 15 2013: Apache Community releases Hadoop 2.2.0. Halloween 2013: Hortonworks releases HDP 2.0 GA
  51. [image slide]
  52. Hadoop has other open source projects…
  53. Hive = {SQL -> MapReduce} SQL-IN-HADOOP
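The Hive equation on slide 53, SQL compiled down to MapReduce, can be illustrated by hand-translating a GROUP BY into map and reduce steps. This is plain Python showing the idea, not HiveQL or Hive's actual query planner; the `orders` data and query are invented for the example:

```python
from collections import defaultdict

# Imagine the Hive query:
#   SELECT country, SUM(sales) FROM orders GROUP BY country
orders = [("CA", 100), ("US", 250), ("CA", 50), ("US", 25)]

# Hive compiles the GROUP BY column into the map output key...
mapped = [(country, sales) for country, sales in orders]

# ...the shuffle groups rows by that key...
groups = defaultdict(list)
for country, sales in mapped:
    groups[country].append(sales)

# ...and the SUM() aggregate becomes the reduce function.
result = {country: sum(sales) for country, sales in groups.items()}
print(result)  # {'CA': 150, 'US': 275}
```

This is why Hive can offer SQL over files in HDFS: familiar relational operations map cleanly onto the Map, Shuffle, and Reduce phases from slide 43.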
  54. Pig = {PigLatin -> MapReduce}
  55. HCatalog = {metadata* for MapReduce, Hive, Pig, HBase, etc} *metadata = tables, columns, partitions, types
  56. Oozie = Job::{Task, Task, if Task, then Task, final Task}
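Slide 56's notation, a job as a sequence of tasks with a conditional branch, can be sketched as a toy workflow runner. This is illustrative Python, not Oozie's XML workflow language; the task names (`ingest`, `clean`, `aggregate`, `notify`) are invented:

```python
def run_workflow(tasks, context):
    """Run tasks in order; a (condition, task) pair models 'if Task then Task'."""
    log = []
    for step in tasks:
        if isinstance(step, tuple):
            condition, task = step
            if not condition(context):
                log.append(f"skipped {task.__name__}")
                continue
            step = task
        context = step(context)
        log.append(f"ran {step.__name__}")
    return context, log

def ingest(ctx):    return {**ctx, "rows": 1000}
def clean(ctx):     return {**ctx, "rows": ctx["rows"] - 10}
def aggregate(ctx): return {**ctx, "summary": True}
def notify(ctx):    return ctx

# Job::{Task, Task, if Task, then Task, final Task}
job = [ingest, clean, (lambda ctx: ctx["rows"] > 0, aggregate), notify]
context, log = run_workflow(job, {})
print(log)  # ['ran ingest', 'ran clean', 'ran aggregate', 'ran notify']
```

Real Oozie adds what the toy omits: scheduling, retries, and the ability for each task to be a full MapReduce, Pig, or Hive job rather than a Python function.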
  57. Falcon [slide graphic: feeds flowing into Hadoop, with replication to a DR Hadoop cluster]
  58. Flume [slide graphic: Files, JMS, Weblogs, and Events flowing through Flume agents into Hadoop]
  59. Sqoop [slide graphic: Sqoop moving data between databases and Hadoop in both directions]
  60. Ambari = {install, manage, monitor}
  61. HBase = {real-time, distributed-map, big-tables}
  62. Storm = {Complex Event Processing, Near-Real-Time, Provisioned by YARN}
  63. Apache Hadoop [slide graphic: HDFS and YARN at the base, with MapReduce, Pig, Hive, HCatalog, HBase, Storm, Ambari, Sqoop, Falcon, and Flume above]
  64. Hortonworks Data Platform [slide graphic: the same stack packaged as HDP]
  65. What else are we working on? hortonworks.com/labs/
  66. Hadoop is the new Data Operating System for the Enterprise
  67. There is NO second place. Hortonworks …the Bull Elephant of Hadoop Innovation. © Hortonworks Inc. 2012: DO NOT SHARE. CONTAINS HORTONWORKS CONFIDENTIAL & PROPRIETARY INFORMATION