
2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture


An introduction to Hadoop's core components as well as the core Hadoop use case: the Data Lake. This deck was delivered at Big Data Congress 2014 in Saint John, NB on Feb 24.

Published in: Technology

Transcript of "2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture"

  1. Adam Muise – Solution Architect, Hortonworks. ELEPHANT AT THE DOOR: MODERN DATA ARCHITECTURE
  2. Who am I?
  3. Who is [logo]?
  4. 100% Open Source – Democratized Access to Data. The leaders of Hadoop's development. We do Hadoop. Drive Innovation in the platform – We lead the roadmap. Community driven, Enterprise Focused.
  5. We do Hadoop successfully. Support, Training, Professional Services.
  6. What is Hadoop? What is everyone talking about?
  7. Data
  8. "Big Data" is the marketing term of the decade in IT
  9. What lurks behind the hype is the democratization of Data.
  10. You need data.
  11. Data fuels analytics. Analytics fuels business decisions.
  12. So we save the data because we think we need it, but often we really don't know what to do with it.
  13. We put away data, delete it, tweet it, compress it, shred it, wikileak it, put it in a database, put it in SAN/NAS, put it in the cloud, hide it in tape…
  14. You need value from your data. You need to make decisions from your data.
  15. So what are the problems with Big Data?
  16. Let's talk challenges…
  17. Volume Volume Volume Volume
  18. [Slide: the word "Volume" repeated across the slide]
  19. [Slide: the word "Volume" repeated, filling more of the slide]
  20. [Slide: the word "Volume" repeated, filling the whole slide]
  21. Storage, Management, Processing all become challenges with Data at Volume
  22. Traditional technologies adopt a divide, drop, and conquer approach
  23. The solution? [Diagram: Data spread across EDW, OLTP, Analytical DB, Another EDW, Yet Another EDW]
  24. Ummm…you dropped something. [Diagram: EDW, OLTP, Analytical DB, Another EDW, Yet Another EDW, and many more Data items]
  25. Analyzing the data usually raises more interesting questions…
  26. …which leads to more data
  27. Wait, you've seen this before. [Diagram: Data items around an "Analytics Sausage Factory"]
  28. Data begets Data.
  29. What keeps us from our Data?
  30. "Prices, Stupid passwords, and Boring Statistics." - Hans Rosling http://www.youtube.com/watch?v=hVimVzgtD6w
  31. Your data silos are lonely places. [Diagram: Data in EDW, Accounts, Customers, Web Properties]
  32. …Data likes to be together. [Diagram: Data from EDW, Accounts, Customers, Web Properties]
  33. Data likes to socialize too. [Diagram: Data from EDW, Accounts, Customers, Web Properties, CDR, Machine Data, Facebook, Twitter, Weather Data]
  34. New types of data don't quite fit into your pristine view of the world. [Diagram: Logs, Machine Data, and "My Little Data Empire", with question marks]
  35. To resolve this, some people take hints from Lord Of The Rings...
  36. …and create One-Schema-To-Rule-Them-All… [Diagram: an EDW holding Data under a single Schema]
  37. …but that has its problems too. [Diagram: Data, several ETL steps, and an EDW with a single Schema]
  38. What if the data was processed and stored centrally? What if you didn't need to force it into a single schema? We call it a Data Lake. [Diagram: Data Sources, a Data Lake holding Data under several Schemas, Process steps, an EDW, BI & Analytics]
  39. A Data Lake Architecture enables:
      - Landing data without forcing a single schema
      - Landing a variety and large volume of data efficiently
      - Retaining data for a long period of time with a very low $/TB
      - A platform to feed other Analytical DBs
      - A platform to execute next gen data analytics and processing applications (SAS, Informatica, Graph Analytics, Machine Learning, SAP, etc…)
  40. In most cases, more data is better. Work with the population, not just a sample.
  41. Your view of a client today. Town/City; Middle Income Band; Female; Age: 25-30; Male; Product Category Preferences
  42. Your view with more data. GPS coordinates; Looking to start a business; Walking into Starbucks right now…; Spent 25 minutes looking at tea cozies; Unhappy with his cell phone plan; $65-68k per year; Pregnant; Tea Party; Hippie; A depressed Toronto Maple Leafs fan; Gene Expression for Risk Taker; Male; Female; Age: 27 but feels old; Product recommendations; Thinking about a new house; Products left in basket indicate drunk Amazon shopper
  43. Pick up all of that data that was prohibitively expensive to store and use.
  44. Why do viewer surveys…
  45. …when raw data can tell you what button on the remote was pressed during what commercial for the entire viewer population?
  46. Why make separate risk assessments in separate data silos…
  47. …when you can do a risk assessment on the entire data footprint of the client?
  48. To approach these use cases you need an affordable platform that stores, processes, and analyzes the data.
  49. So what is the answer?
  50. Enter the Hadoop. ……… http://www.fabulouslybroke.com/2011/05/ninja-elephants-and-other-awesome-stories/
  51. Hadoop was created because traditional technologies never cut it for the Internet properties like Google, Yahoo, Facebook, Twitter, and LinkedIn
  52. Traditional architecture didn't scale enough… [Diagram: groups of Apps, DBs, and SANs]
  53. Databases can become bloated and useless
  54. Traditional architectures cost too much at that volume… $upercomputing; $/TB; $pecial Hardware
  55. So what is the answer?
  56. If you could design a system that would handle this, what would it look like?
  57. It would probably need a highly resilient, self-healing, cost-efficient, distributed file system… [Diagram: a grid of Storage nodes]
  58. It would probably need a completely parallel processing framework that took tasks to the data… [Diagram: Processing paired with Storage on each node]
  59. It would probably run on commodity hardware, virtualized machines, and common OS platforms [Diagram: Processing paired with Storage on each node]
  60. It would probably be open source so innovation could happen as quickly as possible
  61. It would need a critical mass of users
  62. {Processing + Storage} = {MapReduce/Tez/YARN + HDFS}
  63. HDFS stores data in blocks and replicates those blocks [Diagram: block1, block2, and block3, each stored three times across Processing/Storage nodes]
  64. If a block fails then HDFS always has the other copies and heals itself [Diagram: a failed block replica marked X, with a new copy of block3 on another node]
  65. MapReduce is a programming paradigm that is completely parallel [Diagram: Data, Mappers, Reducers, Data]
  66. MapReduce has three phases: Map, Sort/Shuffle, Reduce [Diagram: Key, Value pairs, Mappers, Reducers]
  67. MapReduce applies to a lot of data processing problems [Diagram: Data, Mappers, Reducers, Data]
  68. MapReduce goes a long way, but not all data processing and analytics are solved the same way
  69. Sometimes your data application needs parallel processing and inter-process communication [Diagram: Data items and Process nodes]
  70. …like Complex Event Processing in Apache Storm
  71. Sometimes your machine learning data application needs to process in memory and iterate [Diagram: Data items and Process nodes]
  72. …like Machine Learning in Spark
  73. Introducing Tez
  74. Tez is a YARN application, like MapReduce is a YARN application
  75. Tez is the Lego set for your data application
  76. Tez provides a layer for abstract tasks; these could be mappers, reducers, customized stream processes, in-memory structures, etc.
  77. Tez can chain tasks together into one job to get Map – Reduce – Reduce jobs suitable for things like Hive SQL projections, group by, and order by [Diagram: Data, TezMap tasks, chained TezReduce tasks, Data]
  78. Tez can provide long-running containers for applications like Hive to side-step batch process startups you would have with MapReduce
  79. Introducing YARN
  80. YARN: Yeah, we did that too. hortonworks.com/yarn/
  81. YARN = Yet Another Resource Negotiator
  82. Resource Manager + Node Managers = YARN [Diagram: a Resource Manager with a Scheduler; Node Managers hosting Containers and AppMasters; Pig, Storm, MapReduce]
  83. YARN abstracts resource management so you can run more than just MapReduce [Diagram: MapReduce V2, MapReduce V?, STORM, Giraph, Tez, MPI, HBase, Spark, … and more, on YARN and HDFS2]
  84. Hadoop has other open source projects…
  85. Hive = {SQL -> Tez || MapReduce} SQL-IN-HADOOP
  86. Pig = {PigLatin -> Tez || MapReduce}
  87. HCatalog = {metadata* for MapReduce, Hive, Pig, HBase} *metadata = tables, columns, partitions, types
  88. Oozie = Job::{Task, Task, if Task, then Task, final Task}
  89. Falcon [Diagram: Data Sets and Hadoop clusters; labels: Late Data Arrival, Archival, Lineage, Audit, Retention Policy, Replication, Monitoring, Process Management]
  90. Knox [Diagram: REST Clients, Knox Gateway, Hadoop Clusters, Enterprise LDAP]
  91. Flume [Diagram: Files, JMS, Weblogs, Events, Flume, Hadoop]
  92. Sqoop [Diagram: DBs, Sqoop, Hadoop]
  93. Ambari = {install, manage, monitor}
  94. HBase = {real-time, distributed-map, big-tables}
  95. Storm = {Complex Event Processing, Near-Real-Time, Provisioned by YARN}
  96. Apache Hadoop: Tez, Storm, YARN, Pig, HDFS, MapReduce, HCatalog, Hive, HBase, Ambari, Knox, Sqoop, Falcon, Flume
  97. Hortonworks Data Platform: Storm, Tez, Pig, YARN, HDFS, MapReduce, HCatalog, Hive, HBase, Ambari, Knox, Sqoop, Falcon, Flume
  98. What else are we working on? hortonworks.com/labs/
  99. Hadoop is the new Modern Data Architecture for the Enterprise
  100. There is NO second place. Hortonworks …the Bull Elephant of Hadoop Innovation. © Hortonworks Inc. 2012: DO NOT SHARE. CONTAINS HORTONWORKS CONFIDENTIAL & PROPRIETARY INFORMATION
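
The three MapReduce phases named in slide 66 (Map, Sort/Shuffle, Reduce) can be sketched in a few lines of plain Python. This is a single-process illustration only, not Hadoop's API: the function names `map_phase`, `shuffle_phase`, `reduce_phase`, and `word_count` are hypothetical, and word counting is an assumed example problem that does not appear in the deck.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit one (key, value) pair per word in a single input record."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Sort/Shuffle: group every value under its key, keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # Each line is mapped independently, which is what makes this step parallelizable.
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped)


if __name__ == "__main__":
    sample = ["data begets data", "more data is better"]
    print(word_count(sample))
```

On a real cluster the map calls would run on the nodes that hold the input blocks and the grouped keys would be partitioned across reducers; here everything runs in one process so the data flow between the phases is easy to follow.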