Evolving Hadoop Into An Operational Platform With Data Applications
Nitin Motgi
@nmotgi
PROPRIETARY & CONFIDENTIAL
Agenda
• Introduction to operational data applications
• Challenges with building operational data applications on Hadoop
• Goals and Motivation for CDAP
• Introduction to CDAP and Architecture Overview
• Building Blocks
  • Datasets
  • Programs
  • Application and Application Template
• Use-cases
What are Operational Data Applications?
Applications that use data insights to enhance the customer or user experience, achieve a business objective, or improve a business process.
Examples
• 360-Degree Customer View
• Recommendation Engine
• Predictive Modeling
• Fraud Analysis
• Network Threat Detection
• Telemetry
• Time Series Analysis
• And many more
Challenges
Technology Explosion
[Figure: timeline of the Hadoop ecosystem; each year's stack includes all components from earlier years]
• 2006: Core Hadoop (HDFS, MR)
• 2008: adds HBase, ZooKeeper
• 2009: adds Hive, Pig, Mahout
• 2010: adds Sqoop, Whirr, Avro
• 2011: adds Flume, Bigtop, Oozie, MRUnit, HCatalog
• 2012: adds Spark, Impala, Solr, Kafka
• Present: adds Sentry, Tez, Parquet, YARN, Knox
Challenges
Application Complexity
• Many domains to bridge
• Lots of boilerplate
• Inconsistent APIs
• No reusability
• Lack of developer productivity
Motivation
• Simple yet powerful platform for developers to build applications on Hadoop
• Expose capabilities rather than features
• Make Hadoop accessible to developers with no Hadoop knowledge
Goals
• Unified platform for building solutions on Hadoop
• Simpler application development lifecycle
• Reusable Data and Processing Patterns
• Framework level correctness and consistency
Introduction to Cask Data Application Platform
Cask Data Application Platform
An open source, integrated, distributed and extensible platform for building data applications on Hadoop.
Supports developers, operations, and organizations through the entire enterprise data application lifecycle.
Supports
• Data Lifecycle: Ingest, Explore, Transform, Serve
• Application Lifecycle: Develop, Test, Deploy, Scale
• Enterprise Lifecycle: Secure, Manage, Monitor, Operate
Provides
Datasets, Programs, Runtime Services, Tools & Experience
• Standardized containers providing consistency for diverse processing paradigms
• Services for developers to enable richer apps with less hassle, and for production to enable application and data management
• Libraries to build reusable data access patterns spanning multiple storage technologies
Architecture
[Figure: CASK DATA APPLICATION PLATFORM (CDAP) on Hadoop. CDAP components shown: Event/Data Ingestion, Programs (Batch Programs, Realtime Programs), Datasets, Runtime Services, Tools and User Experience, Adapters, Egress. Cloudera's Enterprise Data Hub components shown: Batch Processing (MapReduce, Hive, Pig), Analytic SQL (Impala), Search Engine (Cloudera Search), Machine Learning (Spark, MapReduce, Mahout), Stream Processing (Spark), 3rd Party Apps (Partners), Data Management. Data application examples: Anomaly Detection, 360° Consumer Profile, Network Analytics, Multi-log Correlation Analytics.]
• Maven Archetype, Testing Framework, Debugging Tools, Monitoring Tool, Web based Application Management
Application Structure
[Figure: unification of Ingest, Explore, Transform and Serve. Labels shown: Streams, ACID Dataset, Realtime - Tigon, MR, Spark, Dataset, Ad-hoc query, JDBC, Query, RPC; all on top of the Dataset API, SPI & Management Services.]
Deployment
CDAP Server
• Services
  • Master
  • Router
  • Auth Server
• Highly Available (HA)
• Installed on edge node(s)
• Supports Kerberos - Impersonation & Perimeter Security
• Manages system services in YARN
System Services (Twill Containers)
• Transactions (Tephra)
• Metrics Aggregation
• Log Aggregation
• Dataset Services
• Metadata Management Service
• Explore Service
• Stream Management Service & more
Building Blocks
• Dataset: Encapsulated data access patterns and data model in a reusable, domain-specific API
• Program: Standardized containers for processing paradigms
• Application: Programmatic abstraction for composing multiple Datasets and Programs that integrates ingestion, exploration, transformation and serving
[Figure: an Application containing Datasets and Programs]
Dataset
Dataset Motivation
RDBMS: Raw Storage Interfaces, Data Modeling, Data Layout, Optimizations and Schema
• Optimizations are pushed closer to storage
• Applications use SQL to access data (store or retrieve)
• Simpler Applications!
Hadoop: Raw Storage
• Modeling, layout and optimizations are embedded within applications
• Hard to scale - lack of reusability
Dataset: Raw Distributed Storage, Model, Layout, Optimizations and optional Schema
• Access through domain specific APIs with optional SQL Interface
• Optimizations are encapsulated within datasets
• Simpler Applications!
Building Blocks - Dataset
• Encapsulates a data access pattern and data model in a reusable, domain-specific API
• Establishes best practices in schema definition
• Abstracts away the underlying storage platform
• Reusable as data storage templates
• Easy sharing of stored data:
  • Between applications
  • Batch and real-time processing
• Integrated with Transactions for consistency
• Integrated testing
• Extensible to create your own solutions
• Transparent integration with
  • Hive metastore
  • MR Input/Output Formats
  • Spark RDDs
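The idea of hiding the storage platform behind a reusable, domain-specific API can be illustrated in plain Java. This is a minimal sketch and not CDAP code: `KeyValueStore`, `InMemoryStore` and `UserProfiles` are hypothetical names invented for the illustration, and the in-memory map only stands in for a real store such as HBase.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical raw storage interface: bytes in, bytes out.
interface KeyValueStore {
    void put(String key, byte[] value);
    byte[] get(String key); // returns null when the key is absent
}

// In-memory stand-in for a real storage platform (assumption for this sketch).
class InMemoryStore implements KeyValueStore {
    private final Map<String, byte[]> data = new HashMap<>();

    @Override
    public void put(String key, byte[] value) {
        data.put(key, value.clone());
    }

    @Override
    public byte[] get(String key) {
        byte[] value = data.get(key);
        return value == null ? null : value.clone();
    }
}

// Domain-specific API: callers work with user names and e-mail addresses,
// while key layout and encoding stay inside this class.
class UserProfiles {
    private static final String KEY_PREFIX = "user:"; // layout decision hidden from callers
    private final KeyValueStore store;

    UserProfiles(KeyValueStore store) {
        this.store = store;
    }

    void saveEmail(String userName, String email) {
        store.put(KEY_PREFIX + userName, email.getBytes(StandardCharsets.UTF_8));
    }

    Optional<String> findEmail(String userName) {
        byte[] raw = store.get(KEY_PREFIX + userName);
        return raw == null
            ? Optional.empty()
            : Optional.of(new String(raw, StandardCharsets.UTF_8));
    }
}

class UserProfilesDemo {
    public static void main(String[] args) {
        UserProfiles profiles = new UserProfiles(new InMemoryStore());
        profiles.saveEmail("alice", "alice@example.com");
        System.out.println(profiles.findEmail("alice").orElse("<none>"));
    }
}
```

Because `UserProfiles` depends only on the `KeyValueStore` interface, a different storage implementation could be swapped in without touching application code, which is the kind of reuse the bullets describe.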
Dataset - Types
• Secondary Indexes
  • Example use case: Entity storage - store customer records indexed by location
• Object Mapping
  • Example use case: Entity storage - easily store User instances for user profiles
• Timeseries Data
  • Example use case: any data organized around a time dimension
• Data Cube
  • Example use case: Retail product sales reports, web analytics
• Partitioned Fileset
  • Example use case: Time partitioned processing of feeds
• And many more
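To make the secondary-index use case concrete (customer records indexed by location), here is a small plain-Java sketch. It is not the CDAP implementation: `IndexedCustomers` and its methods are hypothetical, and two in-memory maps stand in for the primary table and the index table.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical dataset that keeps a secondary index in step with the primary data.
class IndexedCustomers {
    // Primary data: customer id -> location.
    private final Map<String, String> locationById = new HashMap<>();
    // Secondary index: location -> sorted customer ids.
    private final Map<String, Set<String>> idsByLocation = new HashMap<>();

    void put(String customerId, String location) {
        String previous = locationById.put(customerId, location);
        if (previous != null) {
            // The customer moved: drop the stale index entry first.
            Set<String> ids = idsByLocation.get(previous);
            ids.remove(customerId);
            if (ids.isEmpty()) {
                idsByLocation.remove(previous);
            }
        }
        idsByLocation.computeIfAbsent(location, key -> new TreeSet<>()).add(customerId);
    }

    // Lookup by the indexed field, without scanning the primary data.
    List<String> findByLocation(String location) {
        Set<String> ids = idsByLocation.getOrDefault(location, Collections.emptySet());
        return new ArrayList<>(ids);
    }
}

class IndexedCustomersDemo {
    public static void main(String[] args) {
        IndexedCustomers customers = new IndexedCustomers();
        customers.put("c1", "Paris");
        customers.put("c2", "Paris");
        customers.put("c3", "Oslo");
        System.out.println(customers.findByLocation("Paris"));
    }
}
```

The caller never touches the index directly; keeping index maintenance inside the dataset is what lets every application that uses it get the same behavior.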
Dataset - Example
• A Java Library
• Table Dataset
  • First Name, Last Name and Link to Picture in a Table
• Fileset Dataset
  • Pictures in a Fileset
• Instance of Dataset as
  • HBase Table and
  • HDFS Directory
• Access using SQL (Hive)
• Tigon, MR & Spark can access
Dataset - Composite (Embedded Datasets)
public class ContactsDataset extends AbstractDataset {
  // Composite dataset: contact records in a table, pictures in a file set.
  private ObjectMappedTable<Contact> contacts;
  private FileSet pictures;

  public ContactsDataset(DatasetSpecification spec,
                         @EmbeddedDataset("contacts") ObjectMappedTable<Contact> contacts,
                         @EmbeddedDataset("pictures") FileSet pictures) {
    super(spec.getName(), contacts, pictures);
    this.contacts = contacts;
    this.pictures = pictures;
  }

  public void addContact(String nick, Contact contact) {
    contacts.write(nick, contact);
  }

  public Contact getContact(String nick) {
    return contacts.read(nick);
  }
  // continued...
Dataset - Transactional Update
public class ContactsDataset extends AbstractDataset {
  // ...continued
  public void addPhoto(String nick, byte[] photoBytes) throws IOException {
    Contact contact = getContact(nick);
    if (contact.getPicturePath() != null) {
      // delete picture path
    }
    String picturePath = "pic." + nick;
    Location location = pictures.getLocation(picturePath);
    try {
      // Write the picture first, then record its path on the contact.
      ByteStreams.copy(new ByteArrayInputStream(photoBytes), location.getOutputStream());
      contact.setPicturePath(picturePath);
      contacts.write(nick, contact);
    } catch (IOException e) {
      LOG.error("Got exception: ", e);
      // delete path
      throw e;
    }
  }
}
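The slide's comments ("delete path" on failure) hint at an all-or-nothing update across the picture store and the contact table. The following plain-Java sketch shows that compensating-cleanup pattern in isolation. It does not use CDAP or Tephra transactions: `ContactBook`, the two in-memory maps, and the `failNextTableWrite` hook are all assumptions made purely so the failure path can be exercised.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical two-store update with compensation on failure.
class ContactBook {
    private final Map<String, byte[]> pictures = new HashMap<>();     // stands in for the file set
    private final Map<String, String> picturePaths = new HashMap<>(); // stands in for the table
    private boolean failNextTableWrite = false;                       // hook to simulate a failure

    void failNextTableWrite() {
        failNextTableWrite = true;
    }

    void addPhoto(String nick, byte[] photoBytes) {
        String path = "pic." + nick;
        pictures.put(path, photoBytes.clone()); // step 1: store the picture
        try {
            writePath(nick, path);              // step 2: record its path
        } catch (RuntimeException e) {
            pictures.remove(path);              // compensate: undo step 1
            throw e;
        }
    }

    private void writePath(String nick, String path) {
        if (failNextTableWrite) {
            failNextTableWrite = false;
            throw new IllegalStateException("simulated table write failure");
        }
        picturePaths.put(nick, path);
    }

    boolean hasPicture(String path) {
        return pictures.containsKey(path);
    }

    String picturePathOf(String nick) {
        return picturePaths.get(nick);
    }
}

class ContactBookDemo {
    public static void main(String[] args) {
        ContactBook book = new ContactBook();
        book.addPhoto("bob", new byte[] {1, 2, 3});
        System.out.println(book.picturePathOf("bob"));
    }
}
```

With a transactional dataset, the table write would be rolled back by the framework; the manual cleanup here covers only the part that is not transactional in this sketch, namely the stored picture.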
Dataset - Explorable
public class ContactsDataset extends AbstractDataset
                             implements RecordScannable<StructuredRecord> {
  //..
  @Override
  public Type getRecordType() {
    return StructuredRecord.class;
  }

  @Override
  public List<Split> getSplits() {
    // Delegate splitting to the embedded contacts table.
    return contacts.getSplits();
  }

  @Override
  public RecordScanner<StructuredRecord> createSplitRecordScanner(Split split) {
    return contacts.createSplitRecordScanner(split);
  }
}
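The split-and-scan contract used by an explorable dataset can be sketched without any CDAP types. In the example below, `RangeSplit` and `ScannableRecords` are hypothetical stand-ins: the data is divided into index ranges, and each range can be scanned independently, which is what allows a query engine to read splits in parallel.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical split: a contiguous index range [start, end) over the records.
final class RangeSplit {
    final int start;
    final int end;

    RangeSplit(int start, int end) {
        this.start = start;
        this.end = end;
    }
}

// Hypothetical scannable data source with a fixed split size.
class ScannableRecords {
    private final List<String> records;
    private final int splitSize;

    ScannableRecords(List<String> records, int splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        this.records = new ArrayList<>(records);
        this.splitSize = splitSize;
    }

    // Divide the records into ranges of at most splitSize elements.
    List<RangeSplit> getSplits() {
        List<RangeSplit> splits = new ArrayList<>();
        for (int start = 0; start < records.size(); start += splitSize) {
            splits.add(new RangeSplit(start, Math.min(start + splitSize, records.size())));
        }
        return splits;
    }

    // Scan only the records that belong to one split.
    Iterator<String> createSplitScanner(RangeSplit split) {
        return records.subList(split.start, split.end).iterator();
    }
}

class ScannableRecordsDemo {
    public static void main(String[] args) {
        ScannableRecords data = new ScannableRecords(Arrays.asList("a", "b", "c", "d", "e"), 2);
        for (RangeSplit split : data.getSplits()) {
            Iterator<String> scanner = data.createSplitScanner(split);
            while (scanner.hasNext()) {
                System.out.println(scanner.next());
            }
        }
    }
}
```

Scanning every split in order visits each record exactly once, so a consumer can process splits independently and still cover the whole dataset.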
Dataset Example - Usage
public class Contacts extends AbstractApplication {
  @Override
  public void configure() {
    try {
      setName("Contacts");
      setDescription("An application to manage contacts and their pictures");

      createDataset("contacts", ContactsDataset.class);
      // Define programs, other datasets...
    } catch (UnsupportedTypeException e) {
      // cannot happen with Contact
    }
  }
}
Programs
Building Blocks - Programs
• Standardized containers for processing paradigms
• Establishes unified way of extracting logs & metrics
• Compose complex applications - real-time or batch
• Seamless integration with Datasets - simple or composite
• Provides conceptual integrity across different processing paradigms
• Integrated end-to-end testing
• Extensible to add new processing paradigms
• Leverage common services to ease
  • version management
  • deployment
  • management
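One way to picture a "standardized container" is a single runner that executes any kind of program and collects logs and metrics the same way for all of them. The sketch below is plain Java under that assumption; `Program`, `ProgramContainer`, `BatchProgram` and `RealtimeProgram` are hypothetical names and do not correspond to CDAP classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical common contract for every processing paradigm.
interface Program {
    String name();
    int process(List<String> input); // returns the number of records handled
}

// Batch-style program: handles the whole input at once.
class BatchProgram implements Program {
    public String name() { return "batch"; }
    public int process(List<String> input) { return input.size(); }
}

// Realtime-style program: handles events one by one and skips empty ones.
class RealtimeProgram implements Program {
    public String name() { return "realtime"; }
    public int process(List<String> input) {
        int handled = 0;
        for (String event : input) {
            if (!event.isEmpty()) {
                handled++;
            }
        }
        return handled;
    }
}

// The container runs any Program and records logs and metrics uniformly.
class ProgramContainer {
    private final Map<String, Integer> recordsProcessed = new LinkedHashMap<>();
    private final List<String> log = new ArrayList<>();

    void run(Program program, List<String> input) {
        log.add("starting " + program.name());
        int count = program.process(input);
        recordsProcessed.merge(program.name(), count, Integer::sum);
        log.add("finished " + program.name());
    }

    int metric(String programName) {
        return recordsProcessed.getOrDefault(programName, 0);
    }

    List<String> log() {
        return new ArrayList<>(log);
    }
}

class ProgramContainerDemo {
    public static void main(String[] args) {
        ProgramContainer container = new ProgramContainer();
        container.run(new BatchProgram(), Arrays.asList("a", "b", "c"));
        container.run(new RealtimeProgram(), Arrays.asList("x", "", "y"));
        System.out.println(container.log());
    }
}
```

Because logging and metrics live in the container rather than in each program, adding a new processing paradigm only requires another `Program` implementation.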
Application
Building Blocks - Application
Programmatic abstraction for composing a use case by combining Datasets and Programs to perform ingestion, transformation and serving.
public class PurchaseApp extends AbstractApplication {
  @Override
  public void configure() {
    // ...
    addStream(new Stream("purchaseStream"));
    createDataset("frequentCustomers", KeyValueTable.class);
    createDataset("userProfiles", KeyValueTable.class);
    addFlow(new PurchaseFlow());
    addWorkflow(new PurchaseHistoryWorkflow());
    addService(new PurchaseHistoryService());
    addService(UserProfileServiceHandler.SERVICE_NAME, new UserProfileServiceHandler());
    addService(new CatalogLookupService());
    try {
      createDataset("history", PurchaseHistoryStore.class, PurchaseHistoryStore.properties());
      ObjectStores.createObjectStore(getConfigurer(), "purchases", Purchase.class);
    } catch (UnsupportedTypeException e) {
      throw new RuntimeException(e);
    }
  }
}
Building Blocks - Application Template
• Is a use-case blueprint
• Composed using one or more Programs and Datasets
• Supports real-time, batch, or a combination of both
• Highly reusable through configuration & extensible through plugins
• Is an application that is reusable through configuration and extensible through plugins
• Plugins extend the Application Template by implementing an interface expected by the template
• Supported by an end-to-end testing framework
[Figure: an Application Template with a Pluggable Interface; Adapter1, Adapter2 and Adapter3, each with a Plugin; Config1, Config2, Config3]
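The template, plugin and configuration relationship can be sketched in a few lines of plain Java. Everything here is hypothetical and chosen for illustration only: `TransformPlugin` plays the interface expected by the template, `PipelineTemplate` plays the reusable application, and a `Map` with a `"transform"` key plays the per-adapter configuration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical interface the template expects plugins to implement.
interface TransformPlugin {
    String transform(String record);
}

class UpperCasePlugin implements TransformPlugin {
    public String transform(String record) {
        return record.toUpperCase(Locale.ROOT);
    }
}

class TrimPlugin implements TransformPlugin {
    public String transform(String record) {
        return record.trim();
    }
}

// Hypothetical template: the pipeline shape is fixed, the plugin is chosen by configuration.
class PipelineTemplate {
    private final Map<String, TransformPlugin> plugins = new HashMap<>();

    void register(String name, TransformPlugin plugin) {
        plugins.put(name, plugin);
    }

    // Each call corresponds to one adapter: the same template run with a different config.
    List<String> run(Map<String, String> config, List<String> input) {
        String pluginName = config.get("transform");
        TransformPlugin plugin = plugins.get(pluginName);
        if (plugin == null) {
            throw new IllegalArgumentException("Unknown plugin: " + pluginName);
        }
        List<String> output = new ArrayList<>();
        for (String record : input) {
            output.add(plugin.transform(record));
        }
        return output;
    }
}

class PipelineTemplateDemo {
    public static void main(String[] args) {
        PipelineTemplate template = new PipelineTemplate();
        template.register("upper", new UpperCasePlugin());
        template.register("trim", new TrimPlugin());
        Map<String, String> config = Collections.singletonMap("transform", "upper");
        System.out.println(template.run(config, Arrays.asList("a", "b")));
    }
}
```

New behavior is added by registering another plugin, and new adapters are created by supplying another configuration; the template code itself stays unchanged.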
Use-cases
• Scalable and reliable real-time business critical analytics
• Closed Loop Recommendation and Analytics
• Data Ingestion As A Service - Realtime and Batch
• Extendable and Reusable use-case blueprints
• Data As A Service
• Reduce application development and operational complexity
• ETL Automation - Real-time and Batch
Want to Learn More?
Open-source (Apache License v2)
Website:
http://cdap.io
Mailing List:
cdap-user@googlegroups.com
cdap-dev@googlegroups.com
IRC:
#cdap on freenode.net
QUESTIONS?
Want to work on these and other challenges?
http://cask.co/careers/

More Related Content

What's hot

Build Big Data Enterprise Solutions Faster on Azure HDInsight
Build Big Data Enterprise Solutions Faster on Azure HDInsightBuild Big Data Enterprise Solutions Faster on Azure HDInsight
Build Big Data Enterprise Solutions Faster on Azure HDInsightDataWorks Summit/Hadoop Summit
 
YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez Hortonworks
 
Big Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo Clinic
Big Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo ClinicBig Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo Clinic
Big Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo ClinicDataWorks Summit
 
Combine Apache Hadoop and Elasticsearch to Get the Most of Your Big Data
Combine Apache Hadoop and Elasticsearch to Get the Most of Your Big DataCombine Apache Hadoop and Elasticsearch to Get the Most of Your Big Data
Combine Apache Hadoop and Elasticsearch to Get the Most of Your Big DataHortonworks
 
YARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARNYARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARNHortonworks
 
HAWQ: a massively parallel processing SQL engine in hadoop
HAWQ: a massively parallel processing SQL engine in hadoopHAWQ: a massively parallel processing SQL engine in hadoop
HAWQ: a massively parallel processing SQL engine in hadoopBigData Research
 
Impala Unlocks Interactive BI on Hadoop
Impala Unlocks Interactive BI on HadoopImpala Unlocks Interactive BI on Hadoop
Impala Unlocks Interactive BI on HadoopCloudera, Inc.
 
ETL big data with apache hadoop
ETL big data with apache hadoopETL big data with apache hadoop
ETL big data with apache hadoopMaulik Thaker
 
Webinar turbo charging_data_science_hawq_on_hdp_final
Webinar turbo charging_data_science_hawq_on_hdp_finalWebinar turbo charging_data_science_hawq_on_hdp_final
Webinar turbo charging_data_science_hawq_on_hdp_finalHortonworks
 
Scaling Data Science on Big Data
Scaling Data Science on Big DataScaling Data Science on Big Data
Scaling Data Science on Big DataDataWorks Summit
 
ING- CoreIntel- Collect and Process Network Logs Across Data Centers in Real ...
ING- CoreIntel- Collect and Process Network Logs Across Data Centers in Real ...ING- CoreIntel- Collect and Process Network Logs Across Data Centers in Real ...
ING- CoreIntel- Collect and Process Network Logs Across Data Centers in Real ...DataWorks Summit/Hadoop Summit
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 DataWorks Summit
 
Data Lake for the Cloud: Extending your Hadoop Implementation
Data Lake for the Cloud: Extending your Hadoop ImplementationData Lake for the Cloud: Extending your Hadoop Implementation
Data Lake for the Cloud: Extending your Hadoop ImplementationHortonworks
 
Hadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop SummitHadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop SummitDataWorks Summit
 
Format Wars: from VHS and Beta to Avro and Parquet
Format Wars: from VHS and Beta to Avro and ParquetFormat Wars: from VHS and Beta to Avro and Parquet
Format Wars: from VHS and Beta to Avro and ParquetDataWorks Summit
 
Common and unique use cases for Apache Hadoop
Common and unique use cases for Apache HadoopCommon and unique use cases for Apache Hadoop
Common and unique use cases for Apache HadoopBrock Noland
 

What's hot (20)

Build Big Data Enterprise Solutions Faster on Azure HDInsight
Build Big Data Enterprise Solutions Faster on Azure HDInsightBuild Big Data Enterprise Solutions Faster on Azure HDInsight
Build Big Data Enterprise Solutions Faster on Azure HDInsight
 
YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez
 
Filling the Data Lake
Filling the Data LakeFilling the Data Lake
Filling the Data Lake
 
Big Data Ready Enterprise
Big Data Ready Enterprise Big Data Ready Enterprise
Big Data Ready Enterprise
 
Big Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo Clinic
Big Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo ClinicBig Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo Clinic
Big Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo Clinic
 
Combine Apache Hadoop and Elasticsearch to Get the Most of Your Big Data
Combine Apache Hadoop and Elasticsearch to Get the Most of Your Big DataCombine Apache Hadoop and Elasticsearch to Get the Most of Your Big Data
Combine Apache Hadoop and Elasticsearch to Get the Most of Your Big Data
 
YARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARNYARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARN
 
Apache Eagle: Secure Hadoop in Real Time
Apache Eagle: Secure Hadoop in Real TimeApache Eagle: Secure Hadoop in Real Time
Apache Eagle: Secure Hadoop in Real Time
 
HAWQ: a massively parallel processing SQL engine in hadoop
HAWQ: a massively parallel processing SQL engine in hadoopHAWQ: a massively parallel processing SQL engine in hadoop
HAWQ: a massively parallel processing SQL engine in hadoop
 
Impala Unlocks Interactive BI on Hadoop
Impala Unlocks Interactive BI on HadoopImpala Unlocks Interactive BI on Hadoop
Impala Unlocks Interactive BI on Hadoop
 
ETL big data with apache hadoop
ETL big data with apache hadoopETL big data with apache hadoop
ETL big data with apache hadoop
 
Webinar turbo charging_data_science_hawq_on_hdp_final
Webinar turbo charging_data_science_hawq_on_hdp_finalWebinar turbo charging_data_science_hawq_on_hdp_final
Webinar turbo charging_data_science_hawq_on_hdp_final
 
Scaling Data Science on Big Data
Scaling Data Science on Big DataScaling Data Science on Big Data
Scaling Data Science on Big Data
 
ING- CoreIntel- Collect and Process Network Logs Across Data Centers in Real ...
ING- CoreIntel- Collect and Process Network Logs Across Data Centers in Real ...ING- CoreIntel- Collect and Process Network Logs Across Data Centers in Real ...
ING- CoreIntel- Collect and Process Network Logs Across Data Centers in Real ...
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0
 
Data Lake for the Cloud: Extending your Hadoop Implementation
Data Lake for the Cloud: Extending your Hadoop ImplementationData Lake for the Cloud: Extending your Hadoop Implementation
Data Lake for the Cloud: Extending your Hadoop Implementation
 
Hadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop SummitHadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop Summit
 
Format Wars: from VHS and Beta to Avro and Parquet
Format Wars: from VHS and Beta to Avro and ParquetFormat Wars: from VHS and Beta to Avro and Parquet
Format Wars: from VHS and Beta to Avro and Parquet
 
Benefits of Hadoop as Platform as a Service
Benefits of Hadoop as Platform as a ServiceBenefits of Hadoop as Platform as a Service
Benefits of Hadoop as Platform as a Service
 
Common and unique use cases for Apache Hadoop
Common and unique use cases for Apache HadoopCommon and unique use cases for Apache Hadoop
Common and unique use cases for Apache Hadoop
 

Viewers also liked

Architecting a Scalable Hadoop Platform: Top 10 considerations for success
Architecting a Scalable Hadoop Platform: Top 10 considerations for successArchitecting a Scalable Hadoop Platform: Top 10 considerations for success
Architecting a Scalable Hadoop Platform: Top 10 considerations for successDataWorks Summit
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoophadooparchbook
 
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialHadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialhadooparchbook
 
Strata NY 2014 - Architectural considerations for Hadoop applications tutorial
Strata NY 2014 - Architectural considerations for Hadoop applications tutorialStrata NY 2014 - Architectural considerations for Hadoop applications tutorial
Strata NY 2014 - Architectural considerations for Hadoop applications tutorialhadooparchbook
 
Architectural considerations for Hadoop Applications
Architectural considerations for Hadoop ApplicationsArchitectural considerations for Hadoop Applications
Architectural considerations for Hadoop Applicationshadooparchbook
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationshadooparchbook
 
Streaming architecture patterns
Streaming architecture patternsStreaming architecture patterns
Streaming architecture patternshadooparchbook
 
Hadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an exampleHadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an examplehadooparchbook
 
Realistic Synthetic Generation Allows Secure Development
Realistic Synthetic Generation Allows Secure DevelopmentRealistic Synthetic Generation Allows Secure Development
Realistic Synthetic Generation Allows Secure DevelopmentDataWorks Summit
 
Hadoop in Validated Environment - Data Governance Initiative
Hadoop in Validated Environment - Data Governance InitiativeHadoop in Validated Environment - Data Governance Initiative
Hadoop in Validated Environment - Data Governance InitiativeDataWorks Summit
 
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...DataWorks Summit
 
HBase and Drill: How loosley typed SQL is ideal for NoSQL
HBase and Drill: How loosley typed SQL is ideal for NoSQLHBase and Drill: How loosley typed SQL is ideal for NoSQL
HBase and Drill: How loosley typed SQL is ideal for NoSQLDataWorks Summit
 
Running Spark and MapReduce together in Production
Running Spark and MapReduce together in ProductionRunning Spark and MapReduce together in Production
Running Spark and MapReduce together in ProductionDataWorks Summit
 
Inspiring Travel at Airbnb [WIP]
Inspiring Travel at Airbnb [WIP]Inspiring Travel at Airbnb [WIP]
Inspiring Travel at Airbnb [WIP]DataWorks Summit
 
Carpe Datum: Building Big Data Analytical Applications with HP Haven
Carpe Datum: Building Big Data Analytical Applications with HP HavenCarpe Datum: Building Big Data Analytical Applications with HP Haven
Carpe Datum: Building Big Data Analytical Applications with HP HavenDataWorks Summit
 
One Click Hadoop Clusters - Anywhere (Using Docker)
One Click Hadoop Clusters - Anywhere (Using Docker)One Click Hadoop Clusters - Anywhere (Using Docker)
One Click Hadoop Clusters - Anywhere (Using Docker)DataWorks Summit
 
Karta an ETL Framework to process high volume datasets
Karta an ETL Framework to process high volume datasets Karta an ETL Framework to process high volume datasets
Karta an ETL Framework to process high volume datasets DataWorks Summit
 
The Most Valuable Customer on Earth-1298: Comic Book Analysis with Oracel's B...
The Most Valuable Customer on Earth-1298: Comic Book Analysis with Oracel's B...The Most Valuable Customer on Earth-1298: Comic Book Analysis with Oracel's B...
The Most Valuable Customer on Earth-1298: Comic Book Analysis with Oracel's B...DataWorks Summit
 
Practical Distributed Machine Learning Pipelines on Hadoop
Practical Distributed Machine Learning Pipelines on HadoopPractical Distributed Machine Learning Pipelines on Hadoop
Practical Distributed Machine Learning Pipelines on HadoopDataWorks Summit
 

Viewers also liked (20)

Architecting a Scalable Hadoop Platform: Top 10 considerations for success
Architecting a Scalable Hadoop Platform: Top 10 considerations for successArchitecting a Scalable Hadoop Platform: Top 10 considerations for success
Architecting a Scalable Hadoop Platform: Top 10 considerations for success
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoop
 
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialHadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorial
 
Strata NY 2014 - Architectural considerations for Hadoop applications tutorial
Strata NY 2014 - Architectural considerations for Hadoop applications tutorialStrata NY 2014 - Architectural considerations for Hadoop applications tutorial
Strata NY 2014 - Architectural considerations for Hadoop applications tutorial
 
Architectural considerations for Hadoop Applications
Architectural considerations for Hadoop ApplicationsArchitectural considerations for Hadoop Applications
Architectural considerations for Hadoop Applications
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Streaming architecture patterns
Streaming architecture patternsStreaming architecture patterns
Streaming architecture patterns
 
Hadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an exampleHadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an example
 
Realistic Synthetic Generation Allows Secure Development
Realistic Synthetic Generation Allows Secure DevelopmentRealistic Synthetic Generation Allows Secure Development
Realistic Synthetic Generation Allows Secure Development
 
Hadoop for Genomics__HadoopSummit2010
Hadoop for Genomics__HadoopSummit2010Hadoop for Genomics__HadoopSummit2010
Hadoop for Genomics__HadoopSummit2010
 
Hadoop in Validated Environment - Data Governance Initiative
Evolving Hadoop Into An Operational Platform With Data Applications

  • 1. Evolving Hadoop Into An Operational Platform With Data Applications. Nitin Motgi (@nmotgi)
  • 2. Agenda
    • Introduction to operational data applications
    • Challenges with building operational data applications on Hadoop
    • Goals and Motivation for CDAP
    • Introduction to CDAP and Architecture Overview
    • Building Blocks
    • Datasets
    • Programs
    • Application and Application Template
    • Use-cases
  • 3. What are Operational Data Applications? Applications that use data insights to enhance the customer/user experience, achieve a business objective, or improve a business process.
  • 4. Examples
    • 360-Degree Customer View
    • Recommendation Engine
    • Predictive Modeling
    • Fraud Analysis
    • Network Threat Detection
    • Telemetry
    • Time Series Analysis
    • And many more
  • 6. Technology Explosion (timeline of the Hadoop stack; each entry lists what was added on top of the earlier components)
    • 2006: Core Hadoop (HDFS, MR)
    • 2008: HBase, ZooKeeper
    • 2009: Hive, Pig, Mahout
    • 2010: Sqoop, Whirr, Avro
    • 2011: Flume, Bigtop, Oozie, MRUnit, HCatalog
    • 2012: Spark, Impala, Solr, Kafka
    • Present: Sentry, Tez, Parquet, YARN, Knox
  • 7. Challenges
    • Application complexity
    • Many domains to bridge
    • Lots of boilerplate
    • Inconsistent APIs
    • No reusability
    • Lack of developer productivity
  • 10. Motivation
    • Simple yet powerful platform for developers to build applications on Hadoop
    • Expose capabilities rather than features
    • Make Hadoop accessible to developers with no Hadoop knowledge
  • 11. Goals
    • Unified platform for building solutions on Hadoop
    • Simpler application development lifecycle
    • Reusable data and processing patterns
    • Framework-level correctness and consistency
  • 12. Introduction to Cask Data Application Platform
  • 13. Cask Data Application Platform: an open source, integrated, distributed and extensible platform for building data applications on Hadoop.
  • 15. Cask Data App Platform: supports developers, operations, and organizations through the entire enterprise data application lifecycle.
    • Data Lifecycle: Ingest, Explore, Transform, Serve
    • Application Lifecycle: Develop, Test, Deploy, Scale
    • Enterprise Lifecycle: Secure, Manage, Monitor, Operate
  • 16. Architecture: Cask Data Application Platform (CDAP)
    • Datasets: libraries to build reusable data access patterns spanning multiple storage technologies
    • Programs (batch and realtime): standardized containers providing consistency for diverse processing paradigms
    • Runtime Services: services for developers to enable richer apps with less hassle, and for production to enable application and data management
    • Tools & Experience: Maven archetype, testing framework, debugging tools, monitoring tool, web-based application management
    • Other CDAP components in the diagram: Event/Data Ingestion, Egress, Adapters
    • Underlying platform (Hadoop, Cloudera's Enterprise Data Hub): batch processing (MapReduce, Hive, Pig), analytic SQL (Impala), search engine (Cloudera Search), machine learning (Spark, MapReduce, Mahout), stream processing (Spark), 3rd party apps (partners), data management
    • Data application examples: Anomaly Detection, 360° Consumer Profile, Network Analytics, Multi-log Correlation Analytics
  • 17. Application Structure • [Diagram: stages Ingest, Explore, Transform, Serve. Ingest: Streams. Explore: ad-hoc query, JDBC query. Transform: Realtime - Tigon, MR, Spark. Serve: RPC. Datasets (ACID) provide unification across all stages, backed by the Dataset API, SPI & Management Services.]
  • 18. Deployment • CDAP Server: runs the Master, Router and Auth Server services; Highly Available (HA); installed on edge node(s); supports Kerberos - impersonation & perimeter security; manages system services in YARN • System Services (Twill containers): Transactions (Tephra); Metrics Aggregation; Log Aggregation; Dataset Services; Metadata Management Service; Explore Service; Stream Management Service & more
  • 20. Building Blocks • Dataset: encapsulated data access patterns and data model in a reusable, domain-specific API • Program: standardized containers for processing paradigms • Application: programmatic abstraction for composing multiple Datasets and Programs that integrates ingestion, exploration, transformation and serving
  • 22. Dataset Motivation • RDBMS (raw storage interfaces, data modeling, data layout, optimizations and schema): optimizations are pushed closer to storage; applications use SQL to access data (store or retrieve); simpler applications! • Hadoop (raw storage): modeling, layout and optimizations are embedded within applications; hard to scale - lack of reusability • Dataset (raw distributed storage, model, layout, optimizations and optional schema): access through domain-specific APIs with optional SQL interface; optimizations are encapsulated within datasets; simpler applications!
  • 23. Building Blocks - Dataset • Encapsulate a data access pattern and data model in a reusable, domain-specific API • Establishes best practices in schema definition • Abstract away underlying storage platform • Reusable as data storage templates • Easy sharing of stored data: between applications; between batch and real-time processing • Integrated with Transactions for consistency • Integrated testing • Extensible to create your own solutions • Transparent integration with: Hive metastore; MR Input/Output Formats; Spark RDDs
  • 24. Dataset - Types • Secondary Indexes (example use case: entity storage - store customer records indexed by location) • Object Mapping (example use case: entity storage - easily store User instances for user profiles) • Timeseries Data (example use case: any data organized around a time dimension) • Data Cube (example use case: retail product sales reports, web analytics) • Partitioned Fileset (example use case: time-partitioned processing of feeds) • And many more
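The "customer records indexed by location" use case can be illustrated without CDAP at all. What follows is a minimal in-memory sketch under stated assumptions, not CDAP's implementation or API: `Customer`, `IndexedCustomerStore` and their methods are hypothetical names chosen for this illustration. It shows the core idea of a secondary index: every write updates both the primary table and an index keyed by the secondary field, and stale index entries are removed when a record changes.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical record type, used only for this illustration.
final class Customer {
  final String id;
  final String location;

  Customer(String id, String location) {
    this.id = id;
    this.location = location;
  }
}

// In-memory sketch of a secondary index; not a CDAP class.
final class IndexedCustomerStore {
  // Primary table: customer id -> record.
  private final Map<String, Customer> byId = new HashMap<>();
  // Secondary index: location -> ids of the customers at that location.
  private final Map<String, Set<String>> idsByLocation = new HashMap<>();

  void put(Customer customer) {
    Customer previous = byId.put(customer.id, customer);
    if (previous != null) {
      // The record existed already: drop its old index entry first.
      Set<String> old = idsByLocation.get(previous.location);
      if (old != null) {
        old.remove(previous.id);
        if (old.isEmpty()) {
          idsByLocation.remove(previous.location);
        }
      }
    }
    idsByLocation.computeIfAbsent(customer.location, k -> new TreeSet<>()).add(customer.id);
  }

  Customer get(String id) {
    return byId.get(id);
  }

  // Lookup by the secondary field; returns an empty set for unknown locations.
  Set<String> idsAt(String location) {
    Set<String> ids = idsByLocation.get(location);
    return ids == null ? Collections.<String>emptySet() : Collections.unmodifiableSet(ids);
  }
}
```

In a real dataset the two maps would be two storage tables updated in the same transaction, so that readers never see a record without its index entry.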
  • 25. Dataset - Example • A Java library • Table dataset: first name, last name and link to picture in a Table • Fileset dataset: pictures in a Fileset • Instance of the dataset as an HBase table and an HDFS directory • Access using SQL (Hive) • Tigon, MR & Spark can access
  • 26. Dataset - Composite Embedded Datasets
      // Composite dataset: an object-mapped table for contact records
      // plus a file set for their pictures.
      public class ContactsDataset extends AbstractDataset {
        private ObjectMappedTable<Contact> contacts;
        private FileSet pictures;

        public ContactsDataset(DatasetSpecification spec,
                               @EmbeddedDataset("contacts") ObjectMappedTable<Contact> contacts,
                               @EmbeddedDataset("pictures") FileSet pictures) {
          super(spec.getName(), contacts, pictures);
          this.contacts = contacts;
          this.pictures = pictures;
        }

        public void addContact(String nick, Contact contact) {
          contacts.write(nick, contact);
        }

        public Contact getContact(String nick) {
          return contacts.read(nick);
        }
        // continued...
  • 27. Dataset - Transactional Update
      public class ContactsDataset extends AbstractDataset {
        // ...continued
        public void addPhoto(String nick, byte[] photoBytes) throws IOException {
          Contact contact = getContact(nick);
          if (contact.getPicturePath() != null) {
            // delete picture path
          }
          String picturePath = "pic." + nick;
          Location location = pictures.getLocation(picturePath);
          try {
            // Write the picture file first, then record its path on the contact.
            ByteStreams.copy(new ByteArrayInputStream(photoBytes), location.getOutputStream());
            contact.setPicturePath(picturePath);
            contacts.write(nick, contact);
          } catch (IOException e) {
            LOG.error("Got exception: ", e);
            // delete path
            throw e;
          }
        }
      }
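The slide leaves both cleanup steps as comments. One way to fill in the second one is to undo the file write when the metadata write fails, sketched here with in-memory stand-ins rather than CDAP's `FileSet` and `ObjectMappedTable`. `MetadataTable` and `PhotoWriter` are hypothetical names, and the failure is simulated with an unchecked exception.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the table write, which may fail.
interface MetadataTable {
  void write(String nick, String picturePath);
}

// In-memory sketch of "write the file, then the metadata, and compensate on failure".
final class PhotoWriter {
  // Stand-in for the file set: path -> file contents.
  final Map<String, byte[]> files = new HashMap<>();
  private final MetadataTable table;

  PhotoWriter(MetadataTable table) {
    this.table = table;
  }

  void addPhoto(String nick, byte[] photoBytes) {
    String path = "pic." + nick;
    // Remember what was stored before so the write can be undone.
    byte[] previous = files.put(path, photoBytes.clone());
    try {
      table.write(nick, path);
    } catch (RuntimeException e) {
      // Compensate: restore the file state that existed before this call.
      if (previous == null) {
        files.remove(path);
      } else {
        files.put(path, previous);
      }
      throw e;
    }
  }
}
```

The design choice is that the store that is cheaper to roll back (here the file map) is written first, so a failure in the second write leaves no orphaned picture behind.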
  • 28. Dataset - Explorable
      public class ContactsDataset extends AbstractDataset
                                   implements RecordScannable<StructuredRecord> {
        //..
        @Override
        public Type getRecordType() {
          return StructuredRecord.class;
        }

        @Override
        public List<Split> getSplits() {
          return contacts.getSplits();
        }

        @Override
        public RecordScanner<StructuredRecord> createSplitRecordScanner(Split split) {
          return contacts.createSplitRecordScanner(split);
        }
      }
  • 29. Dataset Example - Usage
      public class Contacts extends AbstractApplication {
        @Override
        public void configure() {
          try {
            setName("Contacts");
            setDescription("An application to manage contacts and their pictures");
            createDataset("contacts", ContactsDataset.class);
            // Define programs, other datasets...
          } catch (UnsupportedTypeException e) {
            // cannot happen with Contact
          }
        }
      }
  • 31. Building Blocks - Programs • Standardized containers for processing paradigms • Establishes a unified way of extracting logs & metrics • Compose complex applications - real-time or batch • Seamless integration with Datasets - simple or composite • Provides conceptual integrity across different processing paradigms • Integrated end-to-end testing • Extensible to add new processing paradigms • Leverage common services to ease version management, deployment and management
  • 33. Building Blocks - Application • Programmatic abstraction for composing a use case by combining Datasets and Programs to perform ingestion, transformation and serving.
      public class PurchaseApp extends AbstractApplication {
        @Override
        public void configure() {
          . . .
          addStream(new Stream("purchaseStream"));
          createDataset("frequentCustomers", KeyValueTable.class);
          createDataset("userProfiles", KeyValueTable.class);
          addFlow(new PurchaseFlow());
          addWorkflow(new PurchaseHistoryWorkflow());
          addService(new PurchaseHistoryService());
          addService(UserProfileServiceHandler.SERVICE_NAME, new UserProfileServiceHandler());
          addService(new CatalogLookupService());
          try {
            createDataset("history", PurchaseHistoryStore.class, PurchaseHistoryStore.properties());
            ObjectStores.createObjectStore(getConfigurer(), "purchases", Purchase.class);
          } catch (UnsupportedTypeException e) {
            throw new RuntimeException(e);
          }
        }
      }
  • 34. Building Blocks - Application Template • Is a use-case blueprint • Composed using one or more Programs and Datasets • Supports real-time, batch, or a combination • Highly reusable through configuration & extensible through plugins • Is an application that is reusable through configuration and extensible through plugins • Plugins extend the Application Template by implementing an interface expected by the template • Supported by an end-to-end testing framework • [Diagram: an Application Template exposes a Pluggable Interface; Adapter1, Adapter2 and Adapter3 each pair a Plugin with a configuration (Config1, Config2, Config3).]
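The template, plugin and configuration relationship on this slide can be sketched in plain Java. This is an illustration of the pattern only, not CDAP's Application Template API: `TransformPlugin`, `UpperCasePlugin`, `PrefixPlugin` and `PipelineTemplate` are hypothetical names. The template fixes the processing loop; each adapter is the same template instantiated with a different plugin and configuration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical plugin contract: the interface the template expects.
interface TransformPlugin {
  String transform(String record, Map<String, String> config);
}

// A plugin that ignores its configuration.
final class UpperCasePlugin implements TransformPlugin {
  @Override
  public String transform(String record, Map<String, String> config) {
    return record.toUpperCase(Locale.ROOT);
  }
}

// A plugin driven by the "prefix" configuration key.
final class PrefixPlugin implements TransformPlugin {
  @Override
  public String transform(String record, Map<String, String> config) {
    return config.getOrDefault("prefix", "") + record;
  }
}

// The template: a fixed pipeline whose behavior varies only by plugin and config.
final class PipelineTemplate {
  private final TransformPlugin plugin;
  private final Map<String, String> config;

  PipelineTemplate(TransformPlugin plugin, Map<String, String> config) {
    this.plugin = plugin;
    this.config = config;
  }

  List<String> run(List<String> records) {
    List<String> out = new ArrayList<>();
    for (String record : records) {
      out.add(plugin.transform(record, config));
    }
    return out;
  }
}
```

For example, `new PipelineTemplate(new PrefixPlugin(), Collections.singletonMap("prefix", "eu-"))` plays the role of one adapter in the diagram; swapping the plugin or the map yields another adapter without touching the template.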
  • 35. Use-cases • Scalable and reliable real-time business-critical analytics • Closed-loop recommendation and analytics • Data Ingestion as a Service - real-time and batch • Extendable and reusable use-case blueprints • Data as a Service • Reduce application development and operational complexity • ETL automation - real-time and batch
  • 36. Want to Learn More? • Open source (Apache License v2) • Website: http://cdap.io • Mailing lists: cdap-user@googlegroups.com, cdap-dev@googlegroups.com • IRC: #cdap on freenode.net
  • 37. Questions? • Want to work on these and other challenges? http://cask.co/careers/