SlideShare a Scribd company logo
1 of 24
Download to read offline
DO	
  NOT	
  USE	
  PUBLICLY	
  
PRIOR	
  TO	
  10/23/12	
  

Building	
  ApplicaCons	
  on	
  Hadoop	
  
Headline	
  Goes	
  Here	
  
Mark	
  Grover	
  
Speaker	
  Name	
  or	
  Subhead	
  Goes	
  Here	
  
SoFware	
  Engineer,	
  Cloudera	
  
@mark_grover	
  
Jfokus	
  2014	
  (February	
  4th,	
  2014)	
  
	
  

1

©2014 Cloudera, Inc. All Rights
Reserved.
Agenda	
  
•  Brief	
  intro	
  to	
  Hadoop	
  and	
  the	
  ecosystem	
  
•  Developing	
  apps	
  on	
  Hadoop	
  
•  What’s	
  the	
  current	
  problem?	
  
•  How	
  are	
  we	
  fixing	
  it?	
  

2

©2014 Cloudera, Inc. All Rights
Reserved.
What	
  is	
  Apache	
  Hadoop?	
  
Apache Hadoop	
  is	
  an	
  open	
  source	
  
pla_orm	
  for	
  data	
  storage	
  and	
  processing	
  
that	
  is…	
  
ü  Scalable	
  
ü  Fault	
  tolerant	
  
ü  Distributed	
  
Has	
  the	
  Flexibility	
  to	
  Store	
  and	
  
Mine	
  Any	
  Type	
  of	
  Data	
  

	
  
§  Ask	
  quesCons	
  across	
  structured	
  and	
  
unstructured	
  data	
  that	
  were	
  previously	
  
impossible	
  to	
  ask	
  or	
  solve	
  
§  Not	
  bound	
  by	
  a	
  single	
  schema	
  
3

CORE	
  HADOOP	
  SYSTEM	
  COMPONENTS	
  
Hadoop	
  Distributed	
  
File	
  System	
  (HDFS)	
  
	
  
Self-­‐Healing,	
  High	
  
Bandwidth	
  Clustered	
  
Storage	
  

Excels	
  at	
  
Processing	
  Complex	
  Data	
  

	
  
	
  
MapReduce	
  
	
  

Distributed	
  CompuCng	
  
Framework	
  

Scales	
  
Economically	
  

	
  
§  Scale-­‐out	
  architecture	
  divides	
  workloads	
  
across	
  mulCple	
  nodes	
  

	
  
§  Can	
  be	
  deployed	
  on	
  commodity	
  
hardware	
  

§  Flexible	
  file	
  system	
  eliminates	
  ETL	
  
bo^lenecks	
  

§  Open	
  source	
  pla_orm	
  guards	
  against	
  
vendor	
  lock	
  

©2014 Cloudera, Inc. All Rights
Reserved.
Developing	
  apps	
  on	
  Hadoop	
  
Kite	
  SDK	
  

4

©2014 Cloudera, Inc. All Rights
Reserved.
“[I]t’s	
  not	
  enough	
  to	
  just	
  build	
  a	
  scalable	
  
and	
  stable	
  system;	
  the	
  system	
  also	
  has	
  to	
  
be	
  easy	
  enough	
  for	
  thousands	
  of	
  internal	
  
developers	
  of	
  all	
  types	
  and	
  all	
  skill	
  levels	
  to	
  
use.”	
  

2

5
h^p://gigaom.com/data/how-­‐disney-­‐built-­‐a-­‐big-­‐data-­‐pla_orm-­‐on-­‐a-­‐startup-­‐budget/	
  
Hadoop	
  is	
  incredibly	
  powerful	
  

6

©2014 Cloudera, Inc. All Rights
Reserved.
Hadoop	
  is	
  incredibly	
  flexible	
  

7

©2014 Cloudera, Inc. All Rights
Reserved.
Hadoop	
  is	
  incredibly	
  low-­‐level	
  

8

©2014 Cloudera, Inc. All Rights
Reserved.
Hadoop	
  is	
  incredibly	
  complex	
  

9

©2014 Cloudera, Inc. All Rights
Reserved.
A	
  typical	
  system	
  (zoom	
  100:1)	
  

10

©2014 Cloudera, Inc. All Rights
Reserved.
A	
  typical	
  system	
  (zoom	
  10:1)	
  

11

©2014 Cloudera, Inc. All Rights
Reserved.
A	
  typical	
  system	
  (zoom	
  5:1)	
  

12

©2014 Cloudera, Inc. All Rights
Reserved.
What	
  you	
  actually	
  care	
  about	
  
•  Gelng	
  data	
  from	
  A	
  to	
  B	
  
•  Using	
  it	
  later	
  

13

©2014 Cloudera, Inc. All Rights
Reserved.
Infrastructure	
  details	
  
•  SerializaCon,	
  file	
  formats,	
  and	
  compression	
  
•  Metadata	
  capture	
  and	
  maintenance	
  
•  Dataset	
  organizaCon	
  and	
  parCConing	
  
•  Durability	
  and	
  delivery	
  guarantees	
  
•  Well-­‐defined	
  failure	
  semanCcs	
  
•  Performance	
  and	
  health	
  instrumentaCon	
  

14

©2014 Cloudera, Inc. All Rights
Reserved.
Kite	
  SDK	
  
•  Make	
  Hadoop	
  accessible	
  to	
  the	
  enterprise	
  developer	
  
•  Address	
  the	
  most	
  common	
  cases	
  
•  Codify	
  expert	
  pa^erns	
  and	
  pracCces	
  for	
  building	
  data-­‐oriented	
  

systems	
  and	
  applicaCons.	
  
•  Let	
  developers	
  focus	
  on	
  business	
  logic,	
  not	
  plumbing	
  or	
  
infrastructure.	
  
•  Provide	
  smart	
  defaults	
  for	
  pla_orm	
  choices.	
  
•  Support	
  piecemeal	
  adopCon	
  via	
  loosely-­‐coupled	
  modules	
  
15

©2014 Cloudera, Inc. All Rights
Reserved.
Kite	
  SDK	
  
•  An	
  open	
  source	
  set	
  of	
  libraries,	
  guides,	
  and	
  examples	
  for	
  

building	
  data-­‐oriented	
  systems	
  and	
  applicaCons	
  
•  Provides	
  higher	
  level	
  APIs	
  atop	
  exisCng	
  components	
  of	
  CDH	
  
•  Supports	
  piecemeal	
  adopCon	
  via	
  loosely	
  coupled	
  modules	
  

16

©2014 Cloudera, Inc. All Rights
Reserved.
Kite	
  SDK	
  
• 

Data	
  –	
  logical	
  abstracCons	
  of	
  records,	
  datasets	
  and	
  repositories	
  with	
  implementaCons	
  for	
  
HDFS	
  and	
  HBase	
  (upcoming)	
  
• 
• 
• 
• 
• 
• 

APIs	
  to	
  drasCcally	
  simplify	
  working	
  with	
  datasets	
  in	
  Hadoop	
  filesystems.	
  The	
  Data	
  module:	
  
	
  handles	
  automaCc	
  serializaCon	
  and	
  deserializaCon	
  of	
  Java	
  POJOs	
  as	
  well	
  as	
  Avro	
  Records.	
  
AutomaCc	
  compression.	
  
File	
  and	
  directory	
  layout	
  and	
  management.	
  
AutomaCc	
  parCConing	
  based	
  on	
  configurable	
  funcCons.	
  
A	
  metadata	
  provider	
  plugin	
  interface	
  to	
  integrate	
  with	
  centralized	
  metadata	
  management	
  
systems.	
  	
  

• 
• 
• 

17

Morphlines	
  –	
  declaraCve	
  ETL	
  stream	
  processing	
  library	
  	
  
Maven	
  Plugin	
  –	
  tools	
  for	
  working	
  with	
  datasets	
  and	
  running	
  jobs	
  
TODO:	
  Add	
  more!!!	
  

©2014 Cloudera, Inc. All Rights
Reserved.
Co-­‐authoring	
  O’Reilly	
  book	
  
•  Titled	
  ‘Hadoop	
  ApplicaCon	
  Architectures’	
  
•  How	
  to	
  build	
  end-­‐to-­‐end	
  soluCons	
  using	
  	
  

Apache	
  Hadoop	
  and	
  related	
  tools	
  
•  Updates	
  on	
  Twi^er:	
  @hadooparchbook	
  
•  h^p://www.hadooparchitecturebook.com/	
  

18

©2014 Cloudera, Inc. All Rights
Reserved.
Code	
  
DatasetRepository repo = new FileSystemDatasetRepository.Builder()	
.fileSystem(FileSystem.get(new Configuration()))	
.directory(new Path(“/data”))	
.get();	
	
Dataset events = repo.create(“events”,	
new DatasetDescriptor.Builder()	
.schema(new File(“event.avsc”))	
.partitionStrategy(	
new PartitionStrategy.Builder().hash(“userId”, 53).get()	
).get()	
);	
	
DatasetWriter<GenericRecord> writer = events.getWriter();	
writer.open();	
writer.write(	
new GenericRecordBuilder(schema)	
.set(“userId”, 1)	
.set(“timeStamp”, System.currentTimeMillis())	
.build()	
);	
writer.close();	

Data	
  

15

19

/data	
/events	
/.metadata	
/schema.avsc	
/descriptor.properties	
/userId=0	
/10000000.avro	
/10000001.avro	
/userId=1	
/20000000.avro	
/userId=2	
/30000000.avro
Kite	
  SDK	
  Morphlines	
  Module	
  
Pluggable,	
  configuraCon-­‐driven	
  data	
  transform	
  library	
  
Born	
  out	
  of	
  Cloudera	
  Search,	
  but	
  general	
  purpose	
  
Configure	
  record	
  transform	
  stages	
  in	
  a	
  container	
  library	
  
Use	
  the	
  library	
  in	
  Flume,	
  MapReduce	
  jobs,	
  Storm,	
  and	
  other	
  Java	
  
applicaCons	
  

14

20
Other	
  Modules	
  
Maven	
  plugin	
  
Package,	
  deploy,	
  and	
  execute	
  “apps”	
  
Execute	
  dataset	
  operaCons	
  

Examples	
  
POJO,	
  generic,	
  and	
  generated	
  enCty	
  ingest	
  
Dataset	
  administraCve	
  operaCons	
  
Crunch	
  and	
  MR	
  integraCon	
  
...	
  
14

21
Future	
  
HBase	
  
Extending	
  data	
  APIs	
  to	
  support	
  random	
  access	
  
Same	
  automaCc	
  serializaCon,	
  schema	
  management,	
  etc.	
  

Higher-­‐order	
  data	
  management	
  
Common	
  tasks	
  
Think	
  background	
  compacCon,	
  conversion,	
  etc.	
  

IntegraCon	
  with	
  exisCng	
  middleware	
  frameworks	
  
Give	
  us	
  all	
  your	
  good	
  ideas	
  (and	
  code)!	
  
14

22
Kite	
  SDK	
  Resources	
  
•  Docs	
  
• 

h^p://kitesdk.org/docs/current/	
  

•  Examples	
  
• 

h^ps://github.com/kite-­‐sdk/kite-­‐examples	
  

•  Source	
  code	
  
• 

h^ps://github.com/kite-­‐sdk/	
  

Binary	
  arCfacts	
  available	
  from	
  Cloudera’s	
  Maven	
  repository	
  
•  Twi^er:	
  @mark_grover	
  
•  Slides	
  at	
  	
  
•  LinkedIn:	
  linkedin.com/in/grovermark	
  
23

©2014 Cloudera, Inc. All Rights
Reserved.
18

24

More Related Content

What's hot

Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014hadooparchbook
 
Impala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for HadoopImpala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for HadoopCloudera, Inc.
 
A brave new world in mutable big data relational storage (Strata NYC 2017)
A brave new world in mutable big data  relational storage (Strata NYC 2017)A brave new world in mutable big data  relational storage (Strata NYC 2017)
A brave new world in mutable big data relational storage (Strata NYC 2017)Todd Lipcon
 
Impala: Real-time Queries in Hadoop
Impala: Real-time Queries in HadoopImpala: Real-time Queries in Hadoop
Impala: Real-time Queries in HadoopCloudera, Inc.
 
Intro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupIntro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupMike Percy
 
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014cdmaxime
 
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Data Con LA
 
Hive analytic workloads hadoop summit san jose 2014
Hive analytic workloads hadoop summit san jose 2014Hive analytic workloads hadoop summit san jose 2014
Hive analytic workloads hadoop summit san jose 2014alanfgates
 
Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014Jonathan Seidman
 
Real-Time Queries in Hadoop w/ Cloudera Impala
Real-Time Queries in Hadoop w/ Cloudera ImpalaReal-Time Queries in Hadoop w/ Cloudera Impala
Real-Time Queries in Hadoop w/ Cloudera ImpalaData Science London
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaSwiss Big Data User Group
 
De-Bugging Hive with Hadoop-in-the-Cloud
De-Bugging Hive with Hadoop-in-the-CloudDe-Bugging Hive with Hadoop-in-the-Cloud
De-Bugging Hive with Hadoop-in-the-CloudDataWorks Summit
 
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv larsgeorge
 
Impala Architecture presentation
Impala Architecture presentationImpala Architecture presentation
Impala Architecture presentationhadooparchbook
 
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...DataWorks Summit/Hadoop Summit
 
Presentations from the Cloudera Impala meetup on Aug 20 2013
Presentations from the Cloudera Impala meetup on Aug 20 2013Presentations from the Cloudera Impala meetup on Aug 20 2013
Presentations from the Cloudera Impala meetup on Aug 20 2013Cloudera, Inc.
 
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...Hadoop / Spark Conference Japan
 

What's hot (20)

Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014
 
Impala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for HadoopImpala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for Hadoop
 
A brave new world in mutable big data relational storage (Strata NYC 2017)
A brave new world in mutable big data  relational storage (Strata NYC 2017)A brave new world in mutable big data  relational storage (Strata NYC 2017)
A brave new world in mutable big data relational storage (Strata NYC 2017)
 
Impala: Real-time Queries in Hadoop
Impala: Real-time Queries in HadoopImpala: Real-time Queries in Hadoop
Impala: Real-time Queries in Hadoop
 
Intro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupIntro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application Meetup
 
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
 
The Impala Cookbook
The Impala CookbookThe Impala Cookbook
The Impala Cookbook
 
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
 
Hive analytic workloads hadoop summit san jose 2014
Hive analytic workloads hadoop summit san jose 2014Hive analytic workloads hadoop summit san jose 2014
Hive analytic workloads hadoop summit san jose 2014
 
Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014
 
Real-Time Queries in Hadoop w/ Cloudera Impala
Real-Time Queries in Hadoop w/ Cloudera ImpalaReal-Time Queries in Hadoop w/ Cloudera Impala
Real-Time Queries in Hadoop w/ Cloudera Impala
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
De-Bugging Hive with Hadoop-in-the-Cloud
De-Bugging Hive with Hadoop-in-the-CloudDe-Bugging Hive with Hadoop-in-the-Cloud
De-Bugging Hive with Hadoop-in-the-Cloud
 
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
 
Impala Architecture presentation
Impala Architecture presentationImpala Architecture presentation
Impala Architecture presentation
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache Kudu
 
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
 
Presentations from the Cloudera Impala meetup on Aug 20 2013
Presentations from the Cloudera Impala meetup on Aug 20 2013Presentations from the Cloudera Impala meetup on Aug 20 2013
Presentations from the Cloudera Impala meetup on Aug 20 2013
 
Cloudera Impala
Cloudera ImpalaCloudera Impala
Cloudera Impala
 
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
 

Similar to Applications on Hadoop

Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemShivaji Dutta
 
Cloudera Big Data Integration Speedpitch at TDWI Munich June 2017
Cloudera Big Data Integration Speedpitch at TDWI Munich June 2017Cloudera Big Data Integration Speedpitch at TDWI Munich June 2017
Cloudera Big Data Integration Speedpitch at TDWI Munich June 2017Stefan Lipp
 
Oracle Unified Information Architeture + Analytics by Example
Oracle Unified Information Architeture + Analytics by ExampleOracle Unified Information Architeture + Analytics by Example
Oracle Unified Information Architeture + Analytics by ExampleHarald Erb
 
Running Hadoop as Service in AltiScale Platform
Running Hadoop as Service in AltiScale PlatformRunning Hadoop as Service in AltiScale Platform
Running Hadoop as Service in AltiScale PlatformInMobi Technology
 
Spark One Platform Webinar
Spark One Platform WebinarSpark One Platform Webinar
Spark One Platform WebinarCloudera, Inc.
 
Analyzing Hadoop Data Using Sparklyr

Analyzing Hadoop Data Using Sparklyr
Analyzing Hadoop Data Using Sparklyr

Analyzing Hadoop Data Using Sparklyr
Cloudera, Inc.
 
Discover.hdp2.2.ambari.final[1]
Discover.hdp2.2.ambari.final[1]Discover.hdp2.2.ambari.final[1]
Discover.hdp2.2.ambari.final[1]Hortonworks
 
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency ObjectivesHadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency ObjectivesCloudera, Inc.
 
Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Ins...
Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Ins...Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Ins...
Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Ins...EMC
 
Unlock Hadoop Success with Cloudera Navigator Optimizer
Unlock Hadoop Success with Cloudera Navigator OptimizerUnlock Hadoop Success with Cloudera Navigator Optimizer
Unlock Hadoop Success with Cloudera Navigator OptimizerCloudera, Inc.
 
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...Innovative Management Services
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Cloudera, Inc.
 
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | EdurekaWhat are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | EdurekaEdureka!
 
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud Stefan Lipp
 
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Part 2: A Visual Dive into Machine Learning and Deep Learning 
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Part 2: A Visual Dive into Machine Learning and Deep Learning 
Cloudera, Inc.
 
Webinar: Productionizing Hadoop: Lessons Learned - 20101208
Webinar: Productionizing Hadoop: Lessons Learned - 20101208Webinar: Productionizing Hadoop: Lessons Learned - 20101208
Webinar: Productionizing Hadoop: Lessons Learned - 20101208Cloudera, Inc.
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impalahuguk
 
Data Science and CDSW
Data Science and CDSWData Science and CDSW
Data Science and CDSWJason Hubbard
 

Similar to Applications on Hadoop (20)

Twitter with hadoop for oow
Twitter with hadoop for oowTwitter with hadoop for oow
Twitter with hadoop for oow
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystem
 
Cloudera Big Data Integration Speedpitch at TDWI Munich June 2017
Cloudera Big Data Integration Speedpitch at TDWI Munich June 2017Cloudera Big Data Integration Speedpitch at TDWI Munich June 2017
Cloudera Big Data Integration Speedpitch at TDWI Munich June 2017
 
Oracle Unified Information Architeture + Analytics by Example
Oracle Unified Information Architeture + Analytics by ExampleOracle Unified Information Architeture + Analytics by Example
Oracle Unified Information Architeture + Analytics by Example
 
Running Hadoop as Service in AltiScale Platform
Running Hadoop as Service in AltiScale PlatformRunning Hadoop as Service in AltiScale Platform
Running Hadoop as Service in AltiScale Platform
 
Spark One Platform Webinar
Spark One Platform WebinarSpark One Platform Webinar
Spark One Platform Webinar
 
Analyzing Hadoop Data Using Sparklyr

Analyzing Hadoop Data Using Sparklyr
Analyzing Hadoop Data Using Sparklyr

Analyzing Hadoop Data Using Sparklyr

 
Hortonworks.bdb
Hortonworks.bdbHortonworks.bdb
Hortonworks.bdb
 
Discover.hdp2.2.ambari.final[1]
Discover.hdp2.2.ambari.final[1]Discover.hdp2.2.ambari.final[1]
Discover.hdp2.2.ambari.final[1]
 
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency ObjectivesHadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
 
Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Ins...
Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Ins...Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Ins...
Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Ins...
 
Unlock Hadoop Success with Cloudera Navigator Optimizer
Unlock Hadoop Success with Cloudera Navigator OptimizerUnlock Hadoop Success with Cloudera Navigator Optimizer
Unlock Hadoop Success with Cloudera Navigator Optimizer
 
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
 
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | EdurekaWhat are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
 
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
 
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Part 2: A Visual Dive into Machine Learning and Deep Learning 
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Part 2: A Visual Dive into Machine Learning and Deep Learning 

 
Webinar: Productionizing Hadoop: Lessons Learned - 20101208
Webinar: Productionizing Hadoop: Lessons Learned - 20101208Webinar: Productionizing Hadoop: Lessons Learned - 20101208
Webinar: Productionizing Hadoop: Lessons Learned - 20101208
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Data Science and CDSW
Data Science and CDSWData Science and CDSW
Data Science and CDSW
 

More from markgrover

From discovering to trusting data
From discovering to trusting dataFrom discovering to trusting data
From discovering to trusting datamarkgrover
 
Amundsen lineage designs - community meeting, Dec 2020
Amundsen lineage designs - community meeting, Dec 2020 Amundsen lineage designs - community meeting, Dec 2020
Amundsen lineage designs - community meeting, Dec 2020 markgrover
 
Amundsen at Brex and Looker integration
Amundsen at Brex and Looker integrationAmundsen at Brex and Looker integration
Amundsen at Brex and Looker integrationmarkgrover
 
REA Group's journey with Data Cataloging and Amundsen
REA Group's journey with Data Cataloging and AmundsenREA Group's journey with Data Cataloging and Amundsen
REA Group's journey with Data Cataloging and Amundsenmarkgrover
 
Amundsen gremlin proxy design
Amundsen gremlin proxy designAmundsen gremlin proxy design
Amundsen gremlin proxy designmarkgrover
 
Amundsen: From discovering to security data
Amundsen: From discovering to security dataAmundsen: From discovering to security data
Amundsen: From discovering to security datamarkgrover
 
Amundsen: From discovering to security data
Amundsen: From discovering to security dataAmundsen: From discovering to security data
Amundsen: From discovering to security datamarkgrover
 
Data Discovery & Trust through Metadata
Data Discovery & Trust through MetadataData Discovery & Trust through Metadata
Data Discovery & Trust through Metadatamarkgrover
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadatamarkgrover
 
The Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futureThe Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futuremarkgrover
 
Disrupting Data Discovery
Disrupting Data DiscoveryDisrupting Data Discovery
Disrupting Data Discoverymarkgrover
 
TensorFlow Extension (TFX) and Apache Beam
TensorFlow Extension (TFX) and Apache BeamTensorFlow Extension (TFX) and Apache Beam
TensorFlow Extension (TFX) and Apache Beammarkgrover
 
Big Data at Speed
Big Data at SpeedBig Data at Speed
Big Data at Speedmarkgrover
 
Near real-time anomaly detection at Lyft
Near real-time anomaly detection at LyftNear real-time anomaly detection at Lyft
Near real-time anomaly detection at Lyftmarkgrover
 
Dogfooding data at Lyft
Dogfooding data at LyftDogfooding data at Lyft
Dogfooding data at Lyftmarkgrover
 
Fighting cybersecurity threats with Apache Spot
Fighting cybersecurity threats with Apache SpotFighting cybersecurity threats with Apache Spot
Fighting cybersecurity threats with Apache Spotmarkgrover
 
Fraud Detection with Hadoop
Fraud Detection with HadoopFraud Detection with Hadoop
Fraud Detection with Hadoopmarkgrover
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsmarkgrover
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsmarkgrover
 
Application architectures with Hadoop and Sessionization in MR
Application architectures with Hadoop and Sessionization in MRApplication architectures with Hadoop and Sessionization in MR
Application architectures with Hadoop and Sessionization in MRmarkgrover
 

More from markgrover (20)

From discovering to trusting data
From discovering to trusting dataFrom discovering to trusting data
From discovering to trusting data
 
Amundsen lineage designs - community meeting, Dec 2020
Amundsen lineage designs - community meeting, Dec 2020 Amundsen lineage designs - community meeting, Dec 2020
Amundsen lineage designs - community meeting, Dec 2020
 
Amundsen at Brex and Looker integration
Amundsen at Brex and Looker integrationAmundsen at Brex and Looker integration
Amundsen at Brex and Looker integration
 
REA Group's journey with Data Cataloging and Amundsen
REA Group's journey with Data Cataloging and AmundsenREA Group's journey with Data Cataloging and Amundsen
REA Group's journey with Data Cataloging and Amundsen
 
Amundsen gremlin proxy design
Amundsen gremlin proxy designAmundsen gremlin proxy design
Amundsen gremlin proxy design
 
Amundsen: From discovering to security data
Amundsen: From discovering to security dataAmundsen: From discovering to security data
Amundsen: From discovering to security data
 
Amundsen: From discovering to security data
Amundsen: From discovering to security dataAmundsen: From discovering to security data
Amundsen: From discovering to security data
 
Data Discovery & Trust through Metadata
Data Discovery & Trust through MetadataData Discovery & Trust through Metadata
Data Discovery & Trust through Metadata
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadata
 
The Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futureThe Lyft data platform: Now and in the future
The Lyft data platform: Now and in the future
 
Disrupting Data Discovery
Disrupting Data DiscoveryDisrupting Data Discovery
Disrupting Data Discovery
 
TensorFlow Extension (TFX) and Apache Beam
TensorFlow Extension (TFX) and Apache BeamTensorFlow Extension (TFX) and Apache Beam
TensorFlow Extension (TFX) and Apache Beam
 
Big Data at Speed
Big Data at SpeedBig Data at Speed
Big Data at Speed
 
Near real-time anomaly detection at Lyft
Near real-time anomaly detection at LyftNear real-time anomaly detection at Lyft
Near real-time anomaly detection at Lyft
 
Dogfooding data at Lyft
Dogfooding data at LyftDogfooding data at Lyft
Dogfooding data at Lyft
 
Fighting cybersecurity threats with Apache Spot
Fighting cybersecurity threats with Apache SpotFighting cybersecurity threats with Apache Spot
Fighting cybersecurity threats with Apache Spot
 
Fraud Detection with Hadoop
Fraud Detection with HadoopFraud Detection with Hadoop
Fraud Detection with Hadoop
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Application architectures with Hadoop and Sessionization in MR
Application architectures with Hadoop and Sessionization in MRApplication architectures with Hadoop and Sessionization in MR
Application architectures with Hadoop and Sessionization in MR
 

Recently uploaded

Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 

Recently uploaded (20)

Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 

Applications on Hadoop

  • 1. DO  NOT  USE  PUBLICLY   PRIOR  TO  10/23/12   Building  ApplicaCons  on  Hadoop   Headline  Goes  Here   Mark  Grover   Speaker  Name  or  Subhead  Goes  Here   SoFware  Engineer,  Cloudera   @mark_grover   Jfokus  2014  (February  4th,  2014)     1 ©2014 Cloudera, Inc. All Rights Reserved.
  • 2. Agenda   •  Brief  intro  to  Hadoop  and  the  ecosystem   •  Developing  apps  on  Hadoop   •  What’s  the  current  problem?   •  How  are  we  fixing  it?   2 ©2014 Cloudera, Inc. All Rights Reserved.
  • 3. What  is  Apache  Hadoop?   Apache Hadoop  is  an  open  source   pla_orm  for  data  storage  and  processing   that  is…   ü  Scalable   ü  Fault  tolerant   ü  Distributed   Has  the  Flexibility  to  Store  and   Mine  Any  Type  of  Data     §  Ask  quesCons  across  structured  and   unstructured  data  that  were  previously   impossible  to  ask  or  solve   §  Not  bound  by  a  single  schema   3 CORE  HADOOP  SYSTEM  COMPONENTS   Hadoop  Distributed   File  System  (HDFS)     Self-­‐Healing,  High   Bandwidth  Clustered   Storage   Excels  at   Processing  Complex  Data       MapReduce     Distributed  CompuCng   Framework   Scales   Economically     §  Scale-­‐out  architecture  divides  workloads   across  mulCple  nodes     §  Can  be  deployed  on  commodity   hardware   §  Flexible  file  system  eliminates  ETL   bo^lenecks   §  Open  source  pla_orm  guards  against   vendor  lock   ©2014 Cloudera, Inc. All Rights Reserved.
  • 4. Developing  apps  on  Hadoop   Kite  SDK   4 ©2014 Cloudera, Inc. All Rights Reserved.
  • 5. “[I]t’s  not  enough  to  just  build  a  scalable   and  stable  system;  the  system  also  has  to   be  easy  enough  for  thousands  of  internal   developers  of  all  types  and  all  skill  levels  to   use.”   2 5 h^p://gigaom.com/data/how-­‐disney-­‐built-­‐a-­‐big-­‐data-­‐pla_orm-­‐on-­‐a-­‐startup-­‐budget/  
  • 6. Hadoop  is  incredibly  powerful   6 ©2014 Cloudera, Inc. All Rights Reserved.
  • 7. Hadoop  is  incredibly  flexible   7 ©2014 Cloudera, Inc. All Rights Reserved.
  • 8. Hadoop  is  incredibly  low-­‐level   8 ©2014 Cloudera, Inc. All Rights Reserved.
  • 9. Hadoop  is  incredibly  complex   9 ©2014 Cloudera, Inc. All Rights Reserved.
  • 10. A  typical  system  (zoom  100:1)   10 ©2014 Cloudera, Inc. All Rights Reserved.
  • 11. A  typical  system  (zoom  10:1)   11 ©2014 Cloudera, Inc. All Rights Reserved.
  • 12. A  typical  system  (zoom  5:1)   12 ©2014 Cloudera, Inc. All Rights Reserved.
  • 13. What  you  actually  care  about   •  Gelng  data  from  A  to  B   •  Using  it  later   13 ©2014 Cloudera, Inc. All Rights Reserved.
  • 14. Infrastructure  details   •  SerializaCon,  file  formats,  and  compression   •  Metadata  capture  and  maintenance   •  Dataset  organizaCon  and  parCConing   •  Durability  and  delivery  guarantees   •  Well-­‐defined  failure  semanCcs   •  Performance  and  health  instrumentaCon   14 ©2014 Cloudera, Inc. All Rights Reserved.
  • 15. Kite  SDK   •  Make  Hadoop  accessible  to  the  enterprise  developer   •  Address  the  most  common  cases   •  Codify  expert  pa^erns  and  pracCces  for  building  data-­‐oriented   systems  and  applicaCons.   •  Let  developers  focus  on  business  logic,  not  plumbing  or   infrastructure.   •  Provide  smart  defaults  for  pla_orm  choices.   •  Support  piecemeal  adopCon  via  loosely-­‐coupled  modules   15 ©2014 Cloudera, Inc. All Rights Reserved.
  • 16. Kite  SDK   •  An  open  source  set  of  libraries,  guides,  and  examples  for   building  data-­‐oriented  systems  and  applicaCons   •  Provides  higher  level  APIs  atop  exisCng  components  of  CDH   •  Supports  piecemeal  adopCon  via  loosely  coupled  modules   16 ©2014 Cloudera, Inc. All Rights Reserved.
  • 17. Kite  SDK   •  Data  –  logical  abstracCons  of  records,  datasets  and  repositories  with  implementaCons  for   HDFS  and  HBase  (upcoming)   •  •  •  •  •  •  APIs  to  drasCcally  simplify  working  with  datasets  in  Hadoop  filesystems.  The  Data  module:    handles  automaCc  serializaCon  and  deserializaCon  of  Java  POJOs  as  well  as  Avro  Records.   AutomaCc  compression.   File  and  directory  layout  and  management.   AutomaCc  parCConing  based  on  configurable  funcCons.   A  metadata  provider  plugin  interface  to  integrate  with  centralized  metadata  management   systems.     •  •  •  17 Morphlines  –  declaraCve  ETL  stream  processing  library     Maven  Plugin  –  tools  for  working  with  datasets  and  running  jobs   TODO:  Add  more!!!   ©2014 Cloudera, Inc. All Rights Reserved.
  • 18. Co-­‐authoring  O’Reilly  book   •  Titled  ‘Hadoop  ApplicaCon  Architectures’   •  How  to  build  end-­‐to-­‐end  soluCons  using     Apache  Hadoop  and  related  tools   •  Updates  on  Twi^er:  @hadooparchbook   •  h^p://www.hadooparchitecturebook.com/   18 ©2014 Cloudera, Inc. All Rights Reserved.
  • 19. Code   DatasetRepository repo = new FileSystemDatasetRepository.Builder() .fileSystem(FileSystem.get(new Configuration())) .directory(new Path(“/data”)) .get(); Dataset events = repo.create(“events”, new DatasetDescriptor.Builder() .schema(new File(“event.avsc”)) .partitionStrategy( new PartitionStrategy.Builder().hash(“userId”, 53).get() ).get() ); DatasetWriter<GenericRecord> writer = events.getWriter(); writer.open(); writer.write( new GenericRecordBuilder(schema) .set(“userId”, 1) .set(“timeStamp”, System.currentTimeMillis()) .build() ); writer.close(); Data   15 19 /data /events /.metadata /schema.avsc /descriptor.properties /userId=0 /10000000.avro /10000001.avro /userId=1 /20000000.avro /userId=2 /30000000.avro
  • 20. Kite  SDK  Morphlines  Module   Pluggable,  configuraCon-­‐driven  data  transform  library   Born  out  of  Cloudera  Search,  but  general  purpose   Configure  record  transform  stages  in  a  container  library   Use  the  library  in  Flume,  MapReduce  jobs,  Storm,  and  other  Java   applicaCons   14 20
  • 21. Other  Modules   Maven  plugin   Package,  deploy,  and  execute  “apps”   Execute  dataset  operaCons   Examples   POJO,  generic,  and  generated  enCty  ingest   Dataset  administraCve  operaCons   Crunch  and  MR  integraCon   ...   14 21
  • 22. Future   HBase   Extending  data  APIs  to  support  random  access   Same  automaCc  serializaCon,  schema  management,  etc.   Higher-­‐order  data  management   Common  tasks   Think  background  compacCon,  conversion,  etc.   IntegraCon  with  exisCng  middleware  frameworks   Give  us  all  your  good  ideas  (and  code)!   14 22
  • 23. Kite  SDK  Resources   •  Docs   •  h^p://kitesdk.org/docs/current/   •  Examples   •  h^ps://github.com/kite-­‐sdk/kite-­‐examples   •  Source  code   •  h^ps://github.com/kite-­‐sdk/   Binary  arCfacts  available  from  Cloudera’s  Maven  repository   •  Twi^er:  @mark_grover   •  Slides  at     •  LinkedIn:  linkedin.com/in/grovermark   23 ©2014 Cloudera, Inc. All Rights Reserved.
  • 24. 18 24