Big Data Training Slides

  1. BIG DATA TRAINING
     Ranga Vadlamudi
     March 2014
  2. What is Big Data
     • Volume: Large amounts of data at rest
     • Velocity: Milliseconds to seconds to respond
     • Variety: Data in many forms (structured, unstructured, multimedia, text, etc.)
     • Veracity: Data in doubt
  3. • 30 billion pieces of content a month
     • 1 petabyte of content every day
     • 2 billion videos watched every day
     • 3 billion people will be online
     • Sharing 8 zettabytes of data
  4. CAP THEOREM
     (Consistency, Availability, Partition tolerance)
  5. Big Data Solutions
     Big Data
     • Real Time Querying
     • Batch Querying
     • Mining & Analytics
     • Machine Learning
     • Storage
  6. Technology
  7. Background
     • Underlying technology invented by Google
     • Google Big-Table & Google File System
     • Doug Cutting created NUTCH and Hadoop was spun off at Yahoo
     • Yahoo played a key role in developing Hadoop for enterprise applications
  8. Hadoop
     • Is a framework
     • Built on commodity hardware
     • Implements computational paradigm called Map-Reduce
     • Provides a distributed file system called HDFS to store data
     • Node failures are automatically handled
  9. Data Becomes Bottleneck
     • Getting data to processors is expensive
     • Typical disk data transfer rate: 75 MB/sec
     • 100 GB data transfer: 22 mins approx.
     • New approach is needed
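
The 22-minute figure on slide 9 can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name `transfer_minutes` and the `disks` parameter are assumptions, not part of the slides, and it assumes decimal units (1 GB = 1000 MB) and a perfectly even split of the data across disks.

```python
def transfer_minutes(size_gb: float, rate_mb_per_s: float, disks: int = 1) -> float:
    """Minutes needed to read size_gb at rate_mb_per_s per disk.

    Assumes the data is split evenly across `disks` that are read in parallel.
    """
    size_mb = size_gb * 1000  # decimal units: 1 GB = 1000 MB
    seconds = size_mb / (rate_mb_per_s * disks)
    return seconds / 60


single = transfer_minutes(100, 75)               # one disk, as on the slide
parallel = transfer_minutes(100, 75, disks=100)  # hypothetical 100-disk cluster

print(f"1 disk: {single:.1f} min")           # 1 disk: 22.2 min
print(f"100 disks: {parallel * 60:.1f} s")   # 100 disks: 13.3 s
```

Under these assumptions, spreading the same 100 GB over 100 disks cuts the read time from about 22 minutes to about 13 seconds, which is the motivation for a distributed approach.
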
  10. Hadoop Solves
     • Problems where you have a lot of data
     • Mixture of complex and structured data
     • Speeds up computations by distribution
     • Mantra: take computation to the data, don't bring data to computation
  11. Hadoop Distributions
  12. Hadoop Architecture
     • Master-slave philosophy
     • Designed to run on a large number of machines
     • Machines don't share memory or disk
     • Rack them up and run Hadoop on each machine
  13. Hadoop Architecture
     • Data is divided and spread across servers
     • Hadoop keeps track of where the data is
     • Hadoop replicates data into multiple copies to avoid a single point of failure
     • MapReduce is a programming model to process large sets of data in parallel
     • Map the operation out to all servers
     • Shuffle the results
     • Reduce the results back into one result set
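
The map, shuffle, and reduce steps on slide 13 can be sketched in plain Python with a word count. This is a single-process illustration of the programming model, not Hadoop's API: the function names and the sample chunks are assumptions made for the example, and each string in `chunks` stands in for the data held by one server.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of data."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle_phase(mapped):
    """Shuffle: group the values emitted by all mappers by key."""
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks):
    mapped = [map_phase(chunk) for chunk in chunks]  # one map call per chunk
    return reduce_phase(shuffle_phase(mapped))


# Hypothetical input: each string plays the role of one server's data.
chunks = ["big data training", "big data at rest", "data in doubt"]
counts = word_count(chunks)
print(counts["data"])  # 3
print(counts["big"])   # 2
```

In a real cluster the map calls would run on the machines that already hold each chunk, and only the much smaller (key, value) pairs would cross the network during the shuffle.
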
  14. Hadoop Components
  15. HDFS
     (Hadoop Distributed File System)
  16. HDFS
     • Distributed file system
     • Highly fault tolerant
     • HDFS instance can span across many servers
     • Holds large datasets, from terabytes to petabytes
     • Moving computation is cheaper than moving data
     • Large block sizes (128 MB, for example)
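
To make the block-size and replication bullets concrete, the sketch below splits a file into 128 MB blocks and assigns each block to several servers. It is a simplified model, not HDFS code: the function names, the node names, the replication factor of 3, and the round-robin placement are all assumptions for illustration, and real HDFS placement also takes rack layout into account.

```python
import math


def block_count(file_size_mb: float, block_size_mb: int = 128) -> int:
    """Number of blocks needed to store a file of the given size."""
    return math.ceil(file_size_mb / block_size_mb)


def place_replicas(n_blocks: int, nodes: list, replication: int = 3) -> dict:
    """Assign every block to `replication` distinct nodes, round-robin.

    Illustrative placement only; it ignores racks, load, and free space.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(n_blocks)
    }


nodes = ["node0", "node1", "node2", "node3"]  # hypothetical servers
n_blocks = block_count(1024)                  # a 1024 MB file
placement = place_replicas(n_blocks, nodes)

print(n_blocks)      # 8
print(placement[1])  # ['node1', 'node2', 'node3']
```

Because every block lives on three different servers in this model, losing any single server still leaves two copies of each block available.
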
  17. HDFS Layout
  18. Cloudera Manager
     • Management software to manage Hadoop ecosystem
     • Helps install, manage and maintain a cluster
     • Resource consumption tracking
     • Proactive health checks
     • Alerting
     • Config changes
  19. Cloudera Capabilities
  20. Demo Cloudera
     Demo Cassandra
     Demo MongoDB
  21. Questions?
