© Copyright 2010-2013 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
  
Introduction to Data Science with Hadoop
Glynn Durham, Senior Instructor, Cloudera
glynn@cloudera.com
  
I will cover:
– Hadoop, Hadoop ecosystem
– HDFS
– MapReduce
– Sqoop
– Flume
– Hive
– Pig
– Mahout
– Machine learning
– Data science using Hadoop

Terms

with a few extras:
– YARN
– HBase
– Impala
– Oozie
– data products
  
Hadoop is:
• a platform for big data
• several Apache Software Foundation (ASF) projects
• free open source software
Major parts:
• Hadoop Core
• Hadoop ecosystem
Hadoop
  
Hadoop Core Main Features: File System and Batch Programming
  
Hadoop Core consists of:
HDFS
– (Hadoop Distributed File System), for storage
MapReduce
– for batch programming
Hadoop Core
  
HDFS Writes
  
HDFS Reads
  
HDFS is good at:
– storing enormous files
– storing a lot of data reliably
– throughput on sequential writes
– throughput on sequential reads of a file or part of a file
HDFS is not good at:
– high speed random reads of parts of a file
HDFS cannot:
– update any part of a file once written*
  * but you can always write a new file, and/or delete, move, and rename files and directories
HDFS Strengths and Weaknesses
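
The write-once rule above can be illustrated with a small sketch. It assumes the local filesystem as a stand-in for HDFS, and the names `rewrite_file` and `demo` are hypothetical. The "update" never touches bytes in place: it reads the old file sequentially, writes a new file sequentially, and renames the new file over the old one.

```python
import os
import tempfile


def rewrite_file(path, transform):
    """Produce an 'updated' file without modifying any bytes in place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as new_file, open(path) as old_file:
        for line in old_file:                # sequential read of the old file
            new_file.write(transform(line))  # sequential write of the new file
    os.replace(tmp_path, path)               # rename the new file over the old name


def demo():
    """Create a small file, 'update' it by rewriting, and return the result."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "events.txt")
        with open(path, "w") as f:
            f.write("alice,3\nbob,5\n")
        rewrite_file(path, lambda line: line.upper())
        with open(path) as f:
            return f.read()
```

Calling `demo()` returns the upper-cased contents. The same pattern (write new, then rename) is how the footnote's workaround would typically be applied.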
  
MapReduce: Programming with Simple Functions
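
As a rough illustration of the "simple functions" idea, here is a single-process word count sketch. The names `map_fn`, `reduce_fn`, and `run_job` are assumptions for this example, and `run_job` only imitates what the framework does across many machines: apply the map function, group values by key, then apply the reduce function.

```python
from collections import defaultdict


def map_fn(line):
    """Map: one input record in, zero or more (key, value) pairs out."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(key, values):
    """Reduce: one key and all of its values in, one result out."""
    return key, sum(values)


def run_job(records, mapper, reducer):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, groups[key]) for key in sorted(groups)]


lines = ["the quick brown fox", "the lazy dog", "the fox"]
word_counts = run_job(lines, map_fn, reduce_fn)
```

The programmer writes only `map_fn` and `reduce_fn`; everything inside `run_job` is the part a framework would take over.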
  
MapReduce Chains
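
A chain can be sketched as one job's output records becoming the next job's input. The example below is an in-memory toy under that assumption; `run_job` and the four map/reduce functions are hypothetical names. The first job counts words and the second groups words by their count.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Single-process stand-in for one MapReduce job."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, groups[key]) for key in sorted(groups)]


# Job 1: count words.
def count_map(line):
    for word in line.split():
        yield word, 1


def count_reduce(word, ones):
    return word, sum(ones)


# Job 2: consume job 1's (word, count) records and group words by count.
def invert_map(pair):
    word, count = pair
    yield count, word


def invert_reduce(count, words):
    return count, sorted(words)


lines = ["a b a", "c b a"]
stage1 = run_job(lines, count_map, count_reduce)
stage2 = run_job(stage1, invert_map, invert_reduce)
```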
  
MapReduce at Scale
  
MapReduce in Hadoop
  
MapReduce is good at:
– processing enormous amounts of data
– scaling out as you add more machines
– continuing to completion, even when some machines die
MapReduce is not good at:
– running any algorithm you can think up
– algorithms that require shared state overall*
  * but maybe you can get clever with your algorithm design
MapReduce cannot:
– run in real time: MapReduce jobs are batch jobs
MapReduce Strengths and Weaknesses
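
One way to picture "continuing to completion, even when some machines die" is that each task is independent, so a failed task can simply be run again. The sketch below is a simplified model of that idea, not Hadoop's actual scheduler; `run_with_retries` and `flaky_worker` are hypothetical names, and the failure is simulated.

```python
def run_with_retries(tasks, worker, max_attempts=3):
    """Run every task to completion, re-running a task when its worker fails."""
    results = {}
    attempts = {}
    for task_id, task_input in tasks.items():
        for attempt in range(1, max_attempts + 1):
            attempts[task_id] = attempt
            try:
                results[task_id] = worker(task_id, task_input, attempt)
                break
            except RuntimeError:
                continue  # the "machine died"; schedule the same task again
        else:
            raise RuntimeError(f"task {task_id} failed {max_attempts} times")
    return results, attempts


def flaky_worker(task_id, numbers, attempt):
    # Simulated failure: the machine running split-1 dies on the first attempt.
    if task_id == "split-1" and attempt == 1:
        raise RuntimeError("worker lost")
    return sum(numbers)


tasks = {"split-0": [1, 2, 3], "split-1": [4, 5], "split-2": [6]}
results, attempts = run_with_retries(tasks, flaky_worker)
```

Re-running is only safe because no task depends on state shared with another task, which is also why algorithms that need shared state fit this model poorly.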
  
Detour: YARN, Yet Another Resource Negotiator (near future)
  
The Hadoop Ecosystem consists of other projects that round out Hadoop Core to make it a useful platform:
– Sqoop, for RDBMS integration
– Flume, for event ingestion
– Hive, for "SQL"-like high-level programming
– Pig, another high-level programming paradigm
– Mahout, a Java library for machine learning in Hadoop
Plus:
– HBase, a "NoSQL" database system
– Oozie, a workflow manager for Hadoop actions
– ....
Hadoop Ecosystem
  
Sqoop: RDBMS to Hadoop and Back
  
Flume: Ingesting Continuing Event Data
  
Detour: General File Input/Output
  
Java MapReduce API
MapReduce revisited: How to write MapReduce programs?
• The most expressive technique possible
• The most work, by far
• (Can be easier with Hadoop Streaming: a way to use streaming programming such as shell scripting or Python)
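
To give a feel for the Hadoop Streaming style mentioned above, here is a hedged Python sketch. With Streaming, the mapper and reducer are ordinary programs that read text lines and write tab-separated key/value lines, and the reducer receives its input sorted by key. In this example the function names and the local `simulate` helper are assumptions; a real job would run the mapper and reducer as separate scripts reading standard input.

```python
import itertools


def mapper(lines):
    """Streaming-style mapper: text lines in, 'key<TAB>value' lines out."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Streaming-style reducer: expects input lines already sorted by key."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for word, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def simulate(lines):
    """Local stand-in for the cluster: map, sort by key, then reduce."""
    mapped = sorted(mapper(lines), key=lambda out: out.split("\t", 1)[0])
    return list(reducer(mapped))


output = simulate(["pig hive pig", "hive pig"])
```

Because the reducer only sees a sorted stream, it has to detect key boundaries itself, which is what `itertools.groupby` does here.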
  
Hive: MapReduce as "SQL"
• Familiar language and programming paradigm
• Provides interface to many SQL-compliant tools
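
The idea of SQL running as MapReduce can be sketched by hand for one simple query. The table `page_views`, its rows, and the function names below are all hypothetical, and the plan Hive would actually generate may differ (for example, it may pre-aggregate on the map side). The grouping column becomes the map output key and the aggregate becomes the reduce function.

```python
from collections import defaultdict

# Hypothetical rows of a table page_views(user_id, country).
page_views = [
    ("u1", "US"), ("u2", "DE"), ("u3", "US"),
    ("u4", "FR"), ("u5", "DE"), ("u6", "US"),
]

# The query being modeled:
#   SELECT country, COUNT(*) FROM page_views GROUP BY country;


def map_row(row):
    """Map: emit the GROUP BY column as the key and 1 as the value."""
    _user_id, country = row
    return country, 1


def reduce_group(country, ones):
    """Reduce: COUNT(*) is the sum of the ones for each key."""
    return country, sum(ones)


def group_by_count(rows):
    """Single-process stand-in for the shuffle between map and reduce."""
    shuffled = defaultdict(list)
    for row in rows:
        key, value = map_row(row)
        shuffled[key].append(value)
    return dict(reduce_group(k, v) for k, v in shuffled.items())


counts = group_by_count(page_views)
```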
  
Detour: Impala, High Speed Analytics in Hadoop
• 5 to 30 times faster than Hive queries (sometimes hundreds of times faster!)
• Cloudera exclusive offering, but Apache licensed, so it's free and open source
  
Impala Does Not Use MapReduce
  
Detour: HBase, A NoSQL Database System
  
HBase is a NoSQL database system:
– programmers create and use database tables
– high volume, high performance access to individual cells
– much weaker query language than SQL
– lacks ACID-compliant transactions
HBase is not strictly needed to do "data science"
– a resource hog; competes with analytical programs
– often deployed on its own separate cluster
– may be part of your organization's data storage and delivery, so you may need to get or put data into an HBase system*
  * (or other NoSQL system)
Detour: A bit more about HBase
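
"Access to individual cells" can be pictured with a toy in-memory model. This is not the HBase client API; `ToyTable` and its methods are hypothetical. It only mirrors the general shape of the data model, in which a cell is addressed by row key and column (written here as `family:qualifier`) and holds timestamped versions.

```python
import time
from collections import defaultdict


class ToyTable:
    """In-memory toy: row key -> column ('family:qualifier') -> versions."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row_key, column, value, timestamp=None):
        """Store a new version of one cell; older versions are kept."""
        ts = time.time() if timestamp is None else timestamp
        self._rows[row_key][column].append((ts, value))

    def get(self, row_key, column):
        """Return the newest value stored in one cell, or None if absent."""
        versions = self._rows.get(row_key, {}).get(column, [])
        if not versions:
            return None
        return max(versions, key=lambda version: version[0])[1]


table = ToyTable()
table.put("user42", "profile:email", "old@example.com", timestamp=1)
table.put("user42", "profile:email", "new@example.com", timestamp=2)
latest = table.get("user42", "profile:email")
missing = table.get("user42", "profile:phone")
```

Note what the toy lacks: there is no join, no `WHERE` clause, and no multi-row transaction, which matches the slide's points about a weaker query language and missing ACID transactions.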
  
Pig: Another Language for MapReduce
  
Mahout is:
• a collection of algorithms, mainly focused on "the three C's" of machine learning
• written in Java
• largely implemented over Hadoop MapReduce
• invocable from the command line
• extensible, with the Java API
Mahout is not:
• a turnkey solution for doing machine learning
• always user-friendly
Mahout: Machine Learning in MapReduce
  
"The three C's" of machine learning:
• Classification
• Clustering
• Collaborative filtering (recommenders)
Machine Learning
  
Supervised Machine Learning: Classification
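
As a minimal sketch of supervised classification, the example below learns from labeled examples and then predicts a label for a new point. It uses a nearest-centroid rule chosen purely for brevity; that choice, the function names, and the data are assumptions of this example and say nothing about which algorithms Mahout provides.

```python
def train(examples):
    """examples: list of (features, label). Returns label -> mean feature vector."""
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(model, features):
    """Return the label whose centroid is closest to the given features."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: sq_dist(model[label]))


labeled = [
    ((1.0, 1.0), "ham"), ((2.0, 1.0), "ham"),
    ((8.0, 9.0), "spam"), ((9.0, 8.0), "spam"),
]
model = train(labeled)
prediction = predict(model, (7.5, 8.0))
```

The "supervised" part is the labels in `labeled`: the model is only as good as the examples it was trained on.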
  
Machine Learning: Clustering
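
Clustering groups unlabeled points by similarity. The sketch below is a small k-means loop on one-dimensional data, offered as an illustration only; the function name, starting centers, and data are assumptions. Each pass assigns every point to its nearest center and then moves each center to the mean of its points.

```python
def kmeans_1d(points, centers, iterations=10):
    """Lloyd's algorithm on numbers: assign to nearest center, then re-average."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster received no points.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, clusters = kmeans_1d(points, centers=[0.0, 5.0])
```

Unlike classification, no labels are supplied; the two groups emerge from the data and the chosen number of centers.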
  
Machine Learning: Collaborative Filtering for Recommenders
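
A recommender built on collaborative filtering suggests items based on what other users chose together. The sketch below uses a simple item co-occurrence count, which is one of several possible approaches; the user names, items, and function names are hypothetical.

```python
from collections import Counter
from itertools import permutations


def cooccurrence(baskets):
    """Count how often each ordered pair of distinct items shares a user."""
    counts = Counter()
    for items in baskets.values():
        counts.update(permutations(sorted(set(items)), 2))
    return counts


def recommend(user, baskets, top_n=2):
    """Score unseen items by co-occurrence with the user's own items."""
    counts = cooccurrence(baskets)
    seen = set(baskets[user])
    scores = Counter()
    for (a, b), n in counts.items():
        if a in seen and b not in seen:
            scores[b] += n
    # Sort by score descending, then by name so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


baskets = {
    "ann": ["hive", "pig", "sqoop"],
    "bob": ["hive", "pig"],
    "cat": ["hive", "flume"],
    "dan": ["hive"],
}
suggestions = recommend("dan", baskets)
```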
  
Simple Enterprise Deployment: Hadoop as ETL Appliance
  
Simple workflow within Hadoop:
1. Clear out staging directory in HDFS
2. Sqoop import from OLTP tables
3. Hive (or Pig) script to transform data
4. Sqoop export to data warehouse
Detour: Oozie, Workflow within Hadoop
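
The four steps above can be modeled as an ordered list of actions that stops at the first failure. This is only a Python model of the control flow: Oozie itself defines workflows in XML, and the step names and placeholder actions here are hypothetical.

```python
def run_workflow(steps):
    """Run named steps in order; stop at the first failure."""
    completed = []
    for name, action in steps:
        try:
            action()
        except Exception as error:
            return completed, f"{name} failed: {error}"
        completed.append(name)
    return completed, "succeeded"


log = []

# Placeholder actions; real ones would call HDFS, Sqoop, and Hive.
steps = [
    ("clear-staging", lambda: log.append("staging directory emptied")),
    ("sqoop-import", lambda: log.append("OLTP tables imported")),
    ("hive-transform", lambda: log.append("data transformed")),
    ("sqoop-export", lambda: log.append("warehouse loaded")),
]
completed, status = run_workflow(steps)
```

Stopping on failure matters here because each step depends on the output of the one before it: exporting to the warehouse after a failed transform would load stale or partial data.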
  
Hadoop: The Bigger Picture
  
A data scientist will:
1. Identify internal and external data for potential use (general data wrangling tools).
2. Help build ingestion pipelines to obtain data for use (Flume, Sqoop, other).
3. Examine, clean, and anonymize ingested data (Hive, Impala, Pig, Hadoop Streaming).
4. Shape data into useful formats (Hive, Pig).
5. Explore data sets to gain understanding of problems, trends, reality (Impala, Hive, Pig, statistical programming).
6. Build predictive models using statistical programming, machine learning (Mahout).
7. Contribute to data products: products in the organization that are built in large part from the data itself (Mahout, Sqoop export, general file export).
8. Conduct experiments with data products, quantifying benefits and/or tradeoffs of system changes (Flume, Sqoop, statistical tests).
9. Communicate results and insights to stakeholders (visualization*).
Data Science with Hadoop
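
Step 3 in the list above mentions anonymizing ingested data. One possible approach, sketched here under assumptions, is to replace an identifying field with a keyed hash so that records for the same person still join while the raw identifier is dropped. The field name, the key, and the function names are hypothetical, and real anonymization usually needs more than this single measure.

```python
import hashlib
import hmac


def pseudonymize(value, secret_key):
    """Replace an identifier with a keyed SHA-256 hash (truncated for readability)."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def anonymize_rows(rows, secret_key, id_field="email"):
    """Return copies of the rows with the identifying field pseudonymized."""
    return [
        {**row, id_field: pseudonymize(row[id_field], secret_key)}
        for row in rows
    ]


rows = [
    {"email": "ann@example.com", "purchases": 3},
    {"email": "bob@example.com", "purchases": 1},
    {"email": "ann@example.com", "purchases": 2},
]
clean = anonymize_rows(rows, secret_key=b"hypothetical-secret")
```

The same input always maps to the same token, so counts and joins per user still work on `clean` without exposing the original address.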
  
Visualization: Needs Visualization Software
  
Thank you!
Questions? Contributions?
Glynn Durham, Senior Instructor, Cloudera
glynn@cloudera.com
  

More Related Content

What's hot

Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystemsunera pathan
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to HadoopGiovanna Roda
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo pptPhil Young
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component rebeccatho
 
Introduction to Big Data & Hadoop
Introduction to Big Data & HadoopIntroduction to Big Data & Hadoop
Introduction to Big Data & HadoopEdureka!
 
Big data overview of apache hadoop
Big data overview of apache hadoopBig data overview of apache hadoop
Big data overview of apache hadoopveeracynixit
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce introGeoff Hendrey
 
Big Data Performance and Capacity Management
Big Data Performance and Capacity ManagementBig Data Performance and Capacity Management
Big Data Performance and Capacity Managementrightsize
 
Hadoop Developer
Hadoop DeveloperHadoop Developer
Hadoop DeveloperEdureka!
 
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...Sumeet Singh
 
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar ReportAtul Kushwaha
 
Hadoop installation, Configuration, and Mapreduce program
Hadoop installation, Configuration, and Mapreduce programHadoop installation, Configuration, and Mapreduce program
Hadoop installation, Configuration, and Mapreduce programPraveen Kumar Donta
 

What's hot (19)

Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystem
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo ppt
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component
 
Big data overview by Edgars
Big data overview by EdgarsBig data overview by Edgars
Big data overview by Edgars
 
Introduction to Big Data & Hadoop
Introduction to Big Data & HadoopIntroduction to Big Data & Hadoop
Introduction to Big Data & Hadoop
 
Big data overview of apache hadoop
Big data overview of apache hadoopBig data overview of apache hadoop
Big data overview of apache hadoop
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce intro
 
Big Data Performance and Capacity Management
Big Data Performance and Capacity ManagementBig Data Performance and Capacity Management
Big Data Performance and Capacity Management
 
Hadoop overview
Hadoop overviewHadoop overview
Hadoop overview
 
Hadoop Developer
Hadoop DeveloperHadoop Developer
Hadoop Developer
 
Big data and Hadoop
Big data and HadoopBig data and Hadoop
Big data and Hadoop
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
 
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar Report
 
An Introduction to the World of Hadoop
An Introduction to the World of HadoopAn Introduction to the World of Hadoop
An Introduction to the World of Hadoop
 
Hadoop seminar
Hadoop seminarHadoop seminar
Hadoop seminar
 
Hadoop installation, Configuration, and Mapreduce program
Hadoop installation, Configuration, and Mapreduce programHadoop installation, Configuration, and Mapreduce program
Hadoop installation, Configuration, and Mapreduce program
 

Viewers also liked

Introduction to data science and candidate data science projects
Introduction to data science and candidate data science projectsIntroduction to data science and candidate data science projects
Introduction to data science and candidate data science projectsJay (Jianqiang) Wang
 
Introduction to Data Science: A Practical Approach to Big Data Analytics
Introduction to Data Science: A Practical Approach to Big Data AnalyticsIntroduction to Data Science: A Practical Approach to Big Data Analytics
Introduction to Data Science: A Practical Approach to Big Data AnalyticsIvan Khvostishkov
 
Introduction to data science intro,ch(1,2,3)
Introduction to data science intro,ch(1,2,3)Introduction to data science intro,ch(1,2,3)
Introduction to data science intro,ch(1,2,3)heba_ahmad
 
Big Data Science: Intro and Benefits
Big Data Science: Intro and BenefitsBig Data Science: Intro and Benefits
Big Data Science: Intro and BenefitsChandan Rajah
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data scienceVignesh Prajapati
 
Introduction to Data Science with H2O- Mountain View
Introduction to Data Science with H2O- Mountain ViewIntroduction to Data Science with H2O- Mountain View
Introduction to Data Science with H2O- Mountain ViewSri Ambati
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceCaserta
 
Introduction to (Big) Data Science
Introduction to (Big) Data ScienceIntroduction to (Big) Data Science
Introduction to (Big) Data ScienceInfoFarm
 
Demystifying Data Science with an introduction to Machine Learning
Demystifying Data Science with an introduction to Machine LearningDemystifying Data Science with an introduction to Machine Learning
Demystifying Data Science with an introduction to Machine LearningJulian Bright
 
Introduction to Data Science and Large-scale Machine Learning
Introduction to Data Science and Large-scale Machine LearningIntroduction to Data Science and Large-scale Machine Learning
Introduction to Data Science and Large-scale Machine LearningNik Spirin
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceLivePerson
 

Viewers also liked (14)

Introduction to data science and candidate data science projects
Introduction to data science and candidate data science projectsIntroduction to data science and candidate data science projects
Introduction to data science and candidate data science projects
 
Introduction to Data Science: A Practical Approach to Big Data Analytics
Introduction to Data Science: A Practical Approach to Big Data AnalyticsIntroduction to Data Science: A Practical Approach to Big Data Analytics
Introduction to Data Science: A Practical Approach to Big Data Analytics
 
Introduction to data science intro,ch(1,2,3)
Introduction to data science intro,ch(1,2,3)Introduction to data science intro,ch(1,2,3)
Introduction to data science intro,ch(1,2,3)
 
Big Data Science: Intro and Benefits
Big Data Science: Intro and BenefitsBig Data Science: Intro and Benefits
Big Data Science: Intro and Benefits
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
 
Data Science
Data ScienceData Science
Data Science
 
Introduction to Data Science with H2O- Mountain View
Introduction to Data Science with H2O- Mountain ViewIntroduction to Data Science with H2O- Mountain View
Introduction to Data Science with H2O- Mountain View
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Introduction to (Big) Data Science
Introduction to (Big) Data ScienceIntroduction to (Big) Data Science
Introduction to (Big) Data Science
 
Demystifying Data Science with an introduction to Machine Learning
Demystifying Data Science with an introduction to Machine LearningDemystifying Data Science with an introduction to Machine Learning
Demystifying Data Science with an introduction to Machine Learning
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Intro to Data Science Concepts
Intro to Data Science ConceptsIntro to Data Science Concepts
Intro to Data Science Concepts
 
Introduction to Data Science and Large-scale Machine Learning
Introduction to Data Science and Large-scale Machine LearningIntroduction to Data Science and Large-scale Machine Learning
Introduction to Data Science and Large-scale Machine Learning
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 

Similar to Introduction to Data Science with Hadoop

Hadoop and Hive in Enterprises
Hadoop and Hive in EnterprisesHadoop and Hive in Enterprises
Hadoop and Hive in Enterprisesmarkgrover
 
SAS on Your (Apache) Cluster, Serving your Data (Analysts)
SAS on Your (Apache) Cluster, Serving your Data (Analysts)SAS on Your (Apache) Cluster, Serving your Data (Analysts)
SAS on Your (Apache) Cluster, Serving your Data (Analysts)DataWorks Summit
 
Big Data and Hadoop Basics
Big Data and Hadoop BasicsBig Data and Hadoop Basics
Big Data and Hadoop BasicsSonal Tiwari
 
Overview of big data & hadoop v1
Overview of big data & hadoop   v1Overview of big data & hadoop   v1
Overview of big data & hadoop v1Thanh Nguyen
 
Introduction To Hadoop Administration - SpringPeople
Introduction To Hadoop Administration - SpringPeopleIntroduction To Hadoop Administration - SpringPeople
Introduction To Hadoop Administration - SpringPeopleSpringPeople
 
Hadoop 3 (2017 hadoop taiwan workshop)
Hadoop 3 (2017 hadoop taiwan workshop)Hadoop 3 (2017 hadoop taiwan workshop)
Hadoop 3 (2017 hadoop taiwan workshop)Wei-Chiu Chuang
 
Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Ins...
Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Ins...Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Ins...
Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Ins...EMC
 
Overview of big data & hadoop version 1 - Tony Nguyen
Overview of big data & hadoop   version 1 - Tony NguyenOverview of big data & hadoop   version 1 - Tony Nguyen
Overview of big data & hadoop version 1 - Tony NguyenThanh Nguyen
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Thanh Nguyen
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impalahuguk
 
Spark,Hadoop,Presto Comparition
Spark,Hadoop,Presto ComparitionSpark,Hadoop,Presto Comparition
Spark,Hadoop,Presto ComparitionSandish Kumar H N
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaSwiss Big Data User Group
 
Hadoop cluster configuration
Hadoop cluster configurationHadoop cluster configuration
Hadoop cluster configurationprabakaranbrick
 
Big data overview of apache hadoop
Big data overview of apache hadoopBig data overview of apache hadoop
Big data overview of apache hadoopveeracynixit
 
Accelerating Hadoop, Spark, and Memcached with HPC Technologies
Accelerating Hadoop, Spark, and Memcached with HPC TechnologiesAccelerating Hadoop, Spark, and Memcached with HPC Technologies
Accelerating Hadoop, Spark, and Memcached with HPC Technologiesinside-BigData.com
 
Big data and hadoop product page
Big data and hadoop product pageBig data and hadoop product page
Big data and hadoop product pageJanu Jahnavi
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introductionChirag Ahuja
 
Sept 17 2013 - THUG - HBase a Technical Introduction
Sept 17 2013 - THUG - HBase a Technical IntroductionSept 17 2013 - THUG - HBase a Technical Introduction
Sept 17 2013 - THUG - HBase a Technical IntroductionAdam Muise
 
Big Data and Hadoop Guide
Big Data and Hadoop GuideBig Data and Hadoop Guide
Big Data and Hadoop GuideSimplilearn
 

Similar to Introduction to Data Science with Hadoop (20)

Hadoop and Hive in Enterprises
Hadoop and Hive in EnterprisesHadoop and Hive in Enterprises
Hadoop and Hive in Enterprises
 
SAS on Your (Apache) Cluster, Serving your Data (Analysts)
SAS on Your (Apache) Cluster, Serving your Data (Analysts)SAS on Your (Apache) Cluster, Serving your Data (Analysts)
SAS on Your (Apache) Cluster, Serving your Data (Analysts)
 
Big Data and Hadoop Basics
Big Data and Hadoop BasicsBig Data and Hadoop Basics
Big Data and Hadoop Basics
 
Overview of big data & hadoop v1
Overview of big data & hadoop   v1Overview of big data & hadoop   v1
Overview of big data & hadoop v1
 
Introduction To Hadoop Administration - SpringPeople
Introduction To Hadoop Administration - SpringPeopleIntroduction To Hadoop Administration - SpringPeople
Introduction To Hadoop Administration - SpringPeople
 
Hadoop 3 (2017 hadoop taiwan workshop)
Hadoop 3 (2017 hadoop taiwan workshop)Hadoop 3 (2017 hadoop taiwan workshop)
Hadoop 3 (2017 hadoop taiwan workshop)
 
Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Ins...
Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Ins...Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Ins...
Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Ins...
 
Overview of big data & hadoop version 1 - Tony Nguyen
Overview of big data & hadoop   version 1 - Tony NguyenOverview of big data & hadoop   version 1 - Tony Nguyen
Overview of big data & hadoop version 1 - Tony Nguyen
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Spark,Hadoop,Presto Comparition
Spark,Hadoop,Presto ComparitionSpark,Hadoop,Presto Comparition
Spark,Hadoop,Presto Comparition
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Hadoop cluster configuration
Hadoop cluster configurationHadoop cluster configuration
Hadoop cluster configuration
 
Big data overview of apache hadoop
Big data overview of apache hadoopBig data overview of apache hadoop
Big data overview of apache hadoop
 
Accelerating Hadoop, Spark, and Memcached with HPC Technologies
Accelerating Hadoop, Spark, and Memcached with HPC TechnologiesAccelerating Hadoop, Spark, and Memcached with HPC Technologies
Accelerating Hadoop, Spark, and Memcached with HPC Technologies
 
Bigdata and Hadoop Introduction
Bigdata and Hadoop IntroductionBigdata and Hadoop Introduction
Bigdata and Hadoop Introduction
 
Big data and hadoop product page
Big data and hadoop product pageBig data and hadoop product page
Big data and hadoop product page
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
Sept 17 2013 - THUG - HBase a Technical Introduction
Sept 17 2013 - THUG - HBase a Technical IntroductionSept 17 2013 - THUG - HBase a Technical Introduction
Sept 17 2013 - THUG - HBase a Technical Introduction
 
Big Data and Hadoop Guide
Big Data and Hadoop GuideBig Data and Hadoop Guide
Big Data and Hadoop Guide
 

Introduction to Data Science with Hadoop

  • 1. Introduction to Data Science with Hadoop. Glynn Durham, Senior Instructor, Cloudera, glynn@cloudera.com. © Copyright 2010-2013 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
  • 2. Terms. I will cover: Hadoop, Hadoop ecosystem, HDFS, MapReduce, Sqoop, Flume, Hive, Pig, Mahout, machine learning, data science using Hadoop; with a few extras: YARN, HBase, Impala, Oozie, data products.
  • 3. Hadoop
      Hadoop is:
      – a platform for big data
      – several Apache Software Foundation (ASF) projects
      – free open source software
      Major parts:
      – Hadoop Core
      – Hadoop ecosystem
  • 4. Hadoop Core Main Features: File System and Batch Programming
  • 5. Hadoop Core
      Hadoop Core consists of:
      – HDFS (Hadoop Distributed File System), for storage
      – MapReduce, for batch programming
  • 6. HDFS Writes
  • 7. HDFS Reads
  • 8. HDFS Strengths and Weaknesses
      HDFS is good at:
      – storing enormous files
      – storing a lot of data reliably
      – throughput on sequential writes
      – throughput on sequential reads of a file or part of a file
      HDFS is not good at:
      – high speed random reads of parts of a file
      HDFS cannot:
      – update any part of a file once written*
      – * but you can always write a new file, and/or delete, move, and rename files and directories
  • 9. MapReduce: Programming with Simple Functions
  • 10. MapReduce Chains
  • 11. MapReduce at Scale
  • 12. MapReduce in Hadoop
  • 13. MapReduce Strengths and Weaknesses
      MapReduce is good at:
      – processing enormous amounts of data
      – scaling out as you add more machines
      – continuing to completion, even when some machines die
      MapReduce is not good at:
      – running any algorithm you can think up
      – algorithms that require shared state overall*
      – * but maybe you can get clever with your algorithm design
      MapReduce cannot:
      – run in real time: MapReduce jobs are batch jobs
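The "simple functions" idea from slides 9 to 13 can be sketched in plain Python. This is only an in-memory illustration under assumptions: the names `map_words`, `shuffle`, `reduce_counts`, and `word_count` are made up for this sketch and are not part of any Hadoop API, and word counting is chosen here as a convenient example rather than taken from the slides.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all values by key, between the map and reduce steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a small in-memory input."""
    mapped = [pair for line in lines for pair in map_words(line)]
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["Hadoop stores data", "Hadoop processes data"]))
```

Because each map call sees one line and each reduce call sees one key, neither function needs shared state, which is the property that lets the work be spread across machines.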
  • 14. Detour: YARN, Yet Another Resource Negotiator (near future)
  • 15. Hadoop Ecosystem
      The Hadoop Ecosystem consists of other projects that round out Hadoop Core to make it a useful platform:
      – Sqoop, for RDBMS integration
      – Flume, for event ingestion
      – Hive, for "SQL"-like high-level programming
      – Pig, another high-level programming paradigm
      – Mahout, a Java library for machine learning in Hadoop
      Plus:
      – HBase, a "NoSQL" database system
      – Oozie, a workflow manager for Hadoop actions
      – ....
  • 16. Sqoop: RDBMS to Hadoop and Back
  • 17. Flume: Ingesting Continuing Event Data
  • 18. Detour: General File Input/Output
  • 19. MapReduce revisited: How to write MapReduce programs?
      Java MapReduce API:
      – The most expressive technique possible
      – The most work, by far
      – (Can be easier with Hadoop Streaming: a way to use streaming programming such as shell scripting or Python)
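As a rough sketch of what the Hadoop Streaming route on slide 19 might look like in Python: a streaming mapper and reducer read lines on standard input and write tab-separated key/value lines on standard output, and the reducer receives its input sorted by key. The word-count task, the function names, and the "map"/"reduce" command-line argument are assumptions made for this illustration, not anything prescribed by the slides.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit one tab-separated 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the counts per word; the input must already be sorted by key."""
    records = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(records, key=lambda record: record[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical convention: the first argument selects the step to run.
    step = mapper if sys.argv[1] == "map" else reducer
    for output_line in step(sys.stdin):
        print(output_line)
```

Saved under a hypothetical name such as `wc.py`, the two steps can be tried locally with a shell pipeline like `cat input.txt | python wc.py map | sort | python wc.py reduce`, where `sort` stands in for the shuffle that Hadoop performs between the two steps.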
  • 20. Hive: MapReduce as "SQL"
      – Familiar language and programming paradigm
      – Provides interface to many SQL-compliant tools
  • 21. Detour: Impala, High Speed Analytics in Hadoop
      – 5 to 30 times faster than Hive queries (sometimes hundreds of times faster!)
      – Cloudera exclusive offering, but Apache licensed, so it's free and open source
  • 22. Impala Does Not Use MapReduce
  • 23. Detour: HBase, A NoSQL Database System
  • 24. Detour: A bit more about HBase
      HBase is a NoSQL database system:
      – programmers create and use database tables
      – high volume, high performance access to individual cells
      – much weaker query language than SQL
      – lacks ACID-compliant transactions
      HBase is not strictly needed to do "data science":
      – a resource hog; competes with analytical programs
      – often deployed on its own separate cluster
      – may be part of your organization's data storage and delivery, so you may need to get or put data into an HBase system*
      – * (or other NoSQL system)
  • 25. Pig: Another Language for MapReduce
  • 26. Mahout: Machine Learning in MapReduce
      Mahout is:
      – a collection of algorithms, mainly focused on "the three C's" of machine learning
      – written in Java
      – largely implemented over Hadoop MapReduce
      – invocable from the command line
      – extensible, with the Java API
      Mahout is not:
      – a turnkey solution for doing machine learning
      – always user-friendly
  • 27. Machine Learning
      "The three C's" of machine learning:
      – Classification
      – Clustering
      – Collaborative filtering (recommenders)
  • 28. Supervised Machine Learning: Classification
  • 29. Machine Learning: Clustering
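To make the clustering slide concrete, here is a minimal sketch of k-means, a common clustering algorithm, with each iteration phrased as a map step and a reduce step. The choice of k-means and the function names are assumptions made for this illustration; it is not Mahout's code and it runs in memory on a single machine.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def kmeans_iteration(points, centroids):
    """One pass: assign each point to its nearest centroid, then average each group."""
    groups = defaultdict(list)
    for point in points:  # map: emit (centroid index, point)
        groups[nearest(point, centroids)].append(point)
    updated = list(centroids)
    for index, members in groups.items():  # reduce: mean of the points per key
        updated[index] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return updated


def kmeans(points, centroids, iterations=10):
    """Repeat the map/reduce pass a fixed number of times."""
    for _ in range(iterations):
        centroids = kmeans_iteration(points, centroids)
    return centroids


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
    print(kmeans(data, [(0.0, 0.0), (10.0, 10.0)]))
```

Each iteration depends on the centroids produced by the previous one, so on a cluster this kind of algorithm becomes a chain of batch jobs rather than a single job.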
  • 30. Machine Learning: Collaborative Filtering for Recommenders
  • 31. Simple Enterprise Deployment: Hadoop as ETL Appliance
  • 32. Detour: Oozie, Workflow within Hadoop
      Simple workflow within Hadoop:
      1. Clear out staging directory in HDFS
      2. Sqoop import from OLTP tables
      3. Hive (or Pig) script to transform data
      4. Sqoop export to data warehouse
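The four-step workflow above is essentially "run each action in order and stop if one fails". The sketch below shows only that control flow. Real Oozie workflows are defined in XML, not Python, and every name here (`run_workflow` and the placeholder actions) is hypothetical; the placeholders do not call Sqoop, Hive, or HDFS.

```python
def run_workflow(steps):
    """Run (name, action) steps in order; stop at the first action that returns False.

    Returns the list of completed step names and the name of the failed step
    (or None if every step succeeded).
    """
    completed = []
    for name, action in steps:
        if not action():
            return completed, name
        completed.append(name)
    return completed, None


def placeholder_action():
    """Stand-in for a real command; it does nothing and reports success."""
    return True


# The step names mirror the slide; the actions are placeholders only.
ETL_STEPS = [
    ("clear staging directory", placeholder_action),
    ("sqoop import", placeholder_action),
    ("hive transform", placeholder_action),
    ("sqoop export", placeholder_action),
]

if __name__ == "__main__":
    done, failed = run_workflow(ETL_STEPS)
    print("completed:", done, "failed:", failed)
```

Stopping at the first failure matters in this workflow because exporting to the data warehouse after a failed import or transform would push stale or partial data.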
  • 33. Hadoop: The Bigger Picture
  • 34. Data Science with Hadoop
      A data scientist will:
      1. Identify internal and external data for potential use (general data wrangling tools).
      2. Help build ingestion pipelines to obtain data for use (Flume, Sqoop, other).
      3. Examine, clean, and anonymize ingested data (Hive, Impala, Pig, Hadoop Streaming).
      4. Shape data into useful formats (Hive, Pig).
      5. Explore data sets to gain understanding of problems, trends, reality (Impala, Hive, Pig, statistical programming).
      6. Build predictive models using statistical programming, machine learning (Mahout).
      7. Contribute to data products: products in the organization that are built in large part from the data itself (Mahout, Sqoop export, general file export).
      8. Conduct experiments with data products, quantifying benefits and/or tradeoffs of system changes (Flume, Sqoop, statistical tests).
      9. Communicate results and insights to stakeholders (visualization*).
  • 35. Visualization: Needs Visualization Software
  • 36. Thank you! Questions? Contributions? Glynn Durham, Senior Instructor, Cloudera, glynn@cloudera.com