Programming Abstractions for Smart Apps on Clouds. Prof. D. Janakiram, Dept. of CSE, IIT Madras
Acknowledgements: Work on Deformable Mesh Abstractions is joint work with Geeta Iyer and Sriram Kailasam. Work on Edge Node File Systems is joint work with Kovendhan. Work on Deformable Mesh Abstractions is funded by Yahoo Research.
Introduction: Cloud computing provides pay-for-use access to compute and storage resources over the Internet. Smart applications have intelligence embedded within the application (e.g., recommender systems). Computation, data requirements, and algorithms are becoming increasingly complex. Popular programming models for the cloud: MapReduce, Dryad. Are these the right abstractions for smart apps?
MapReduce Origins: Primary motivation: to facilitate operations such as indexing, searching, and sorting on massive datasets over large pools of resources. Inspired by the map and reduce primitives in LISP. The programmer expresses the computation as a map over key-value pairs that generates intermediate key-value pairs, followed by a reduce over all values that share the same key. The runtime is responsible for parallelizing the map and reduce tasks and handles other low-level details.
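The map, group-by-key, and reduce steps described above can be sketched in a few lines of single-process Python. This is only an illustration of the programming model, not of any real runtime: the names `map_reduce`, `word_count_mapper`, and `sum_reducer` are hypothetical, and the parallelization and fault handling that a real framework provides are left out.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Run one map/shuffle/reduce pass over (key, value) records in-process."""
    groups = defaultdict(list)
    # Map phase: each input pair may emit any number of intermediate pairs.
    for key, value in records:
        for mid_key, mid_value in mapper(key, value):
            # Shuffle: collect intermediate values under their key.
            groups[mid_key].append(mid_value)
    # Reduce phase: fold all values that share a key into one result.
    return {k: reducer(k, values) for k, values in groups.items()}

def word_count_mapper(doc_id, text):
    # Emit (word, 1) for every word in the document.
    for word in text.split():
        yield word.lower(), 1

def sum_reducer(word, counts):
    return sum(counts)

docs = [("d1", "map reduce map"), ("d2", "reduce all values")]
counts = map_reduce(docs, word_count_mapper, sum_reducer)
print(counts)
```

The user code consists only of the mapper and the reducer; everything inside `map_reduce` stands in for what the runtime would do across machines.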
Limitations and Proposed Extensions. Limitations in the original MR model: Input/output is restricted to key-value pairs. Jobs are loosely synchronized (no connected computation). No support for iteration and recursion. No direct support for multiple inputs to a job. Optimized for batch processing. Different nodes are assumed to perform work at roughly the same rate. Inherent assumption that all tasks require the same amount of time. Extensions: IterativeMR: adds support for iterations; relies on long-running MapReduce tasks and streaming data between iterations. Spark: supports iterations and interactive queries. Each iteration is handled as a separate MapReduce job, incurring job submission overheads. Streaming makes fault tolerance difficult.
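To make the iteration limitation concrete, the sketch below runs an iterative computation as a chain of independent map/reduce passes, with a driver loop that resubmits a job until nothing changes. The algorithm (minimum-label propagation for connected components) and all names here are illustrative assumptions, not something taken from the talk; `jobs_submitted` simply counts how many separate jobs the driver had to launch.

```python
from collections import defaultdict

def run_job(records, mapper, reducer):
    """One in-process map/shuffle/reduce pass (stand-in for a submitted job)."""
    groups = defaultdict(list)
    for key, value in records:
        for mid_key, mid_value in mapper(key, value):
            groups[mid_key].append(mid_value)
    return {k: reducer(k, values) for k, values in groups.items()}

def propagate_min_label(edges, max_iters=20):
    """Label each node with the smallest node id in its connected component."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    labels = {node: node for node in adjacency}
    jobs_submitted = 0

    def mapper(node, label):
        yield node, label            # keep the node's own label
        for neighbour in adjacency[node]:
            yield neighbour, label   # offer the label to each neighbour

    def reducer(node, candidate_labels):
        return min(candidate_labels)

    # Driver loop: every iteration is a fresh job; termination is condition based.
    for _ in range(max_iters):
        new_labels = run_job(labels.items(), mapper, reducer)
        jobs_submitted += 1
        if new_labels == labels:
            break
        labels = new_labels
    return labels, jobs_submitted

labels, jobs = propagate_min_label([(1, 2), (2, 3), (4, 5)])
print(labels, jobs)
```

Even on this five-node graph the driver needs three job submissions, the last one only to detect convergence, which is the kind of per-iteration overhead the slide refers to.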
Basic Database Operations: projection, selection, aggregation; join, Cartesian product, set operations. Only the unary operations can be directly modeled with the original MapReduce framework. There is no direct support for operations over multiple, possibly heterogeneous input data sources. These can be done indirectly by chaining extra MapReduce steps.
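One common workaround for the missing multi-input support, not necessarily the one the talk had in mind, is a reduce-side join: the map step tags every record with the input it came from, so that a single reduce step can pair rows from both inputs that share a join key. The sketch below is single-process, and the function names, the `"L"`/`"R"` tags, and the sample rows are all hypothetical.

```python
from collections import defaultdict

def tag_mapper(tag, row, key_field):
    """Map step: emit (join key, (source tag, row)) for a row of either input."""
    yield row[key_field], (tag, row)

def join_reducer(key, tagged_rows):
    """Reduce step: pair every left row with every right row for this key."""
    left_rows = [row for tag, row in tagged_rows if tag == "L"]
    right_rows = [row for tag, row in tagged_rows if tag == "R"]
    for left in left_rows:
        for right in right_rows:
            yield {**left, **right}

def reduce_side_join(left, right, key_field):
    groups = defaultdict(list)
    # Both inputs go through the same map/shuffle path, distinguished by tag.
    for tag, rows in (("L", left), ("R", right)):
        for row in rows:
            for key, tagged in tag_mapper(tag, row, key_field):
                groups[key].append(tagged)
    joined = []
    for key in sorted(groups):
        joined.extend(join_reducer(key, groups[key]))
    return joined

users = [{"user_id": 1, "name": "asha"}, {"user_id": 2, "name": "ravi"}]
ratings = [
    {"user_id": 1, "item": "book", "stars": 5},
    {"user_id": 1, "item": "film", "stars": 3},
    {"user_id": 3, "item": "song", "stars": 4},
]
joined = reduce_side_join(users, ratings, "user_id")
print(joined)
```

The tagging is what makes the binary operation fit a model whose reduce sees only one stream of values per key.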
Dryad & DryadLINQ: Motivated primarily by parallel databases. Makes the communication graph explicit. The execution graph is expressed as a Directed Acyclic Graph (DAG). DryadLINQ allows computations to be expressed in terms of LINQ operators (similar to SQL operators), which are automatically parallelized by the Dryad execution engine. Supports multiple datasets and runtime optimizations of the complete execution graph.
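The idea of an explicit execution DAG can be sketched with Python's standard `graphlib` module (Python 3.9 or later). This is not Dryad's API: `run_dag`, the vertex names, and the sample data are assumptions made for illustration. Each vertex is a function, each edge says whose output feeds whom, and the runner evaluates vertices in a topological order.

```python
from graphlib import TopologicalSorter

def run_dag(vertices, inputs_of):
    """Evaluate each vertex after all of its inputs, passing results along edges."""
    results = {}
    for name in TopologicalSorter(inputs_of).static_order():
        args = [results[dep] for dep in inputs_of.get(name, [])]
        results[name] = vertices[name](*args)
    return results

def load_users():
    return {1: "asha", 2: "ravi"}

def load_ratings():
    return [(1, "book", 5), (2, "film", 3), (2, "song", 4)]

def keep_high(ratings):
    # Selection: keep ratings of four stars or more.
    return [r for r in ratings if r[2] >= 4]

def join_names(users, ratings):
    # Binary operation over two upstream datasets.
    return [(users[user_id], item) for user_id, item, _ in ratings]

vertices = {
    "users": load_users,
    "ratings": load_ratings,
    "high": keep_high,
    "join": join_names,
    "count": len,
}
# Edges of the DAG: vertex -> ordered list of vertices it reads from.
inputs_of = {
    "users": [],
    "ratings": [],
    "high": ["ratings"],
    "join": ["users", "high"],
    "count": ["join"],
}
results = run_dag(vertices, inputs_of)
print(results["join"], results["count"])
```

Because the whole graph is known before execution starts, a multi-input vertex such as `join` is unproblematic, but the graph also cannot grow while it runs.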
Limitations: Lacks support for recursively spawning new tasks as the computation proceeds. Adaptive computations such as AI planning and branch-and-bound cannot be supported directly.
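A small branch-and-bound example shows why such computations do not fit a task graph fixed in advance: whether a task spawns children depends on the best solution found so far. The sketch below solves a 0/1 knapsack with an explicit task stack; the problem choice, the function name, and the fractional bound are illustrative assumptions, and item weights are assumed to be positive.

```python
def knapsack_branch_and_bound(items, capacity):
    """items: list of (value, weight) with weight > 0. Returns (best value, tasks run)."""
    # Sort by value density so the fractional relaxation below is a valid upper bound.
    items = sorted(items, key=lambda it: it[0] / it[1], reverse=True)

    def upper_bound(index, value, weight):
        remaining = capacity - weight
        bound = value
        for item_value, item_weight in items[index:]:
            if item_weight <= remaining:
                remaining -= item_weight
                bound += item_value
            else:
                bound += item_value * remaining / item_weight
                break
        return bound

    best = 0
    tasks_run = 0
    tasks = [(0, 0, 0)]  # each task: (next item index, value so far, weight so far)
    while tasks:
        index, value, weight = tasks.pop()
        tasks_run += 1
        best = max(best, value)
        # Prune: this task spawns nothing if it cannot beat the current best.
        if index == len(items) or upper_bound(index, value, weight) <= best:
            continue
        item_value, item_weight = items[index]
        tasks.append((index + 1, value, weight))  # child: skip the item
        if weight + item_weight <= capacity:
            # Child: take the item (popped first, so good solutions appear early).
            tasks.append((index + 1, value + item_value, weight + item_weight))
    return best, tasks_run

best, tasks_run = knapsack_branch_and_bound([(60, 10), (100, 20), (120, 30)], 50)
print(best, tasks_run)
```

The number of tasks that end up running depends on the data and on the order in which good solutions are discovered, so it cannot be laid out as a static DAG beforehand.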
Smart Apps: Key aspects / requirements
Different nodes executing in parallel need to communicate; this requires support for a shared communication model.
Data partitioning changes as the computation proceeds.
Efficient support for a fixed number of iterations or condition-based termination.
Real-world graphs may not be captured well by hash-based partitioning; alternate partitioning schemes are needed.
Classes of Applications: AI planning, decision tree algorithms, association rule mining, recommender systems, data mining, graph algorithms, clustering algorithms.
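For the graph algorithms in this list, the earlier point about hash-based partitioning can be illustrated by counting how many edges cross partition boundaries, since each such edge implies communication between nodes. The toy graph (two 4-node cliques joined by one edge), the modulo function standing in for a hash, and the range-based alternative are all assumptions chosen for illustration.

```python
from itertools import combinations

def cut_edges(edges, partition_of):
    """Count edges whose endpoints are assigned to different partitions."""
    return sum(1 for u, v in edges if partition_of(u) != partition_of(v))

# Two tightly connected communities (nodes 0-3 and 4-7) with a single bridge.
clique_a = list(combinations(range(0, 4), 2))
clique_b = list(combinations(range(4, 8), 2))
edges = clique_a + clique_b + [(3, 4)]

# Hash-style placement: ignores graph structure (modulo stands in for a hash).
hash_cut = cut_edges(edges, lambda node: node % 2)
# Structure-aware placement: keeps each community on one partition.
locality_cut = cut_edges(edges, lambda node: node // 4)

print(len(edges), hash_cut, locality_cut)
```

On this graph the hash-style placement cuts most of the edges while the structure-aware one cuts only the bridge, which is the motivation for alternate partitioning schemes.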
Deformable Mesh Abstraction. Focus: A new programming model targeted at a wider class of applications that cannot be modeled efficiently using existing frameworks. At the same time, support MapReduce-like computations efficiently. Bring out a clear separation between programmer expressibility issues and runtime environment issues.
Expressibility Issues
Capturing different programming paradigms efficiently.
Recursive spawning of new tasks at runtime.
Efficient and location-independent communication support.
Changing the "shared nothing" viewpoint.
Support for operating on changing datasets.
[Figure: Programming Paradigms. Labels: Loosely Synchronized, Unconnected, Iterative, Recursive, (all-to-all), (point-to-point), Runtime creation]
Runtime Issues
Offering performance guarantees in unreliable environments.
Handling heterogeneity in terms of
capability
storage

Apache Hadoop India Summit 2011 Keynote talk "Programming Abstractions for Smart Apps on Clouds" by D. Janakiram

  • 1. Programming Abstractions for Smart Apps on Clouds Prof. D. Janakiram, Professor, Dept of CSE, IIT, Madras
  • 2. Acknowledgements: Work on Deformable Mesh Abstractions is joint work with Geeta Iyer and Sriram Kailasam. Work on Edge Node File Systems is joint work with Kovendhan. Work on Deformable Mesh Abstractions is funded by Yahoo Research.
  • 3. Introduction: Cloud computing provides pay-for-use access to compute and storage resources over the Internet. Smart applications: intelligence embedded within the application (e.g. recommender systems). Computation, data requirements and algorithms are becoming increasingly complex. Popular programming models for the cloud: MapReduce, Dryad. Are these the right abstractions for smart apps?
  • 4. MapReduce Origins: Primary motivation: to facilitate operations such as indexing, searching and sorting on massive datasets over large resources. Inspired by the map and reduce primitives in LISP. Computations are performed on key-value pairs to generate intermediate key-value pairs, and all values with the same key are then reduced. The runtime is responsible for parallelizing map and reduce tasks and handles other low-level details.
  • 5. Limitations and Proposed Extensions: Limitations in the original MR model: Input/output restricted to key-value pairs. Jobs are loosely synchronized (no connected computation). No support for iteration and recursion. Doesn’t directly support multiple inputs for a job. Optimized for batch processing. Different nodes are assumed to perform work at roughly the same rate. Inherent assumption that all tasks require the same amount of time. Extensions: IterativeMR: adds support for iterations; relies on long-running MapReduce tasks and streaming data between iterations. Spark: supports iterations and interactive queries. Each iteration is handled as a separate MapReduce job, incurring job submission overheads. Streaming makes fault tolerance difficult.
  • 6. Basic Database Operations: Projection, Selection, Aggregation, Join, Cartesian product, Set operations. Only the unary operations can be directly modeled with the original MapReduce framework. There is no direct support for operations over multiple, possibly heterogeneous input data sources. These can be done indirectly by chaining extra MapReduce steps.
  • 7. Dryad & DryadLINQ: Motivated primarily by parallel databases. Makes the communication graph explicit. The execution graph is expressed as a Directed Acyclic Graph (DAG). DryadLINQ allows computations to be expressed in terms of LINQ operators (similar to SQL operators), which are automatically parallelized by the Dryad execution engine. Supports multiple datasets and runtime optimizations of the complete execution graph.
  • 8. Limitations Lacks support for recursively spawning new tasks as computation proceeds. Adaptive computations like AI planning, branch-and-bound cannot be supported directly.
  • 10. Different nodes executing in parallel need to communicate; this requires support for a shared communication model.
  • 11. Data partitioning changes as the computation proceeds.
  • 12. Efficient support for a fixed number of iterations or condition-based termination.
  • 13. Real-world graphs may not be captured by hash-based partitioning; alternate partitioning schemes are needed. Classes of Applications: AI planning, decision tree algorithms, association rule mining, recommender systems, data mining, graph algorithms, clustering algorithms.
  • 14. Deformable Mesh Abstraction: Focus: a new programming model targeted at a wider class of applications that cannot be modeled efficiently using existing frameworks. At the same time, support MapReduce-like computations efficiently. Bring out a clear separation between programmer expressibility issues and runtime environment issues.
  • 16. capturing different programming paradigms efficiently.
  • 17. recursive spawning of new tasks at runtime.
  • 18. efficient and location independent communication support.
  • 19. changing the “Shared nothing” viewpoint.
  • 20. support operating on changing datasets. [Figure: Programming Paradigms, with labels Unconnected, Iterative, Recursive, (all-to-all), (point-to-point), Runtime creation.]
  • 22. offering performance guarantees on unreliable environments.
  • 27. minimizing synchronization delay between different tasks.
  • 30. The Pipe abstraction supports location-independent communication between different nodes and with the shared structure; it provides read()/write() functions.
  • 31. The Shared Structure can be instantiated as a stack, a queue, or a hash-based structure, depending on application requirements.
  • 32. Recursive Splits can happen within Solve/Combine.
  • 33. Combine can be of 2 types: Reduce-like Combine and Hierarchical Combine. [Figure: Split1 ... Splitn feed Solve tasks; Split1 is split recursively into Split11 and Split12, which feed Solve11 and Solve12; results flow into Combine.]
  • 34. Heuristic-guided Problem Solving: General Methodology (e.g. AI planning): A set of actions is evaluated in parallel on the problem state. Newly generated states are inserted into the queue, based on a heuristic value. The best state is selected from the queue for further processing. Iteration continues until the goal state is reached. Requirements: The state of the queue needs to be preserved across iterations. On-the-fly evaluation of the termination condition decides the number of iterations.
  • 35. Case Study 1: Sapa Planner*. Input: current state and all actions. Splits: based on applicable actions, giving (Current State, appAction1), (Current State, appAction2), (Current State, appAction3), ..., (Current State, appActionn). Solve1, Solve2, Solve3, ..., Solven: evaluate the action on the current state and compute the heuristic. Communication to perform enqueue(). Combine: distributed priority queue sorted by heuristic value. Communication to perform dequeue(): select the next state and repeat again. *M. B. Do and S. Kambhampati, “Sapa: a multi-objective metric temporal planner,” J. Artif. Int. Res., vol. 20, no. 1, pp. 155–194, 2003.
  • 36. Modeling Sapa Planner using DMA: Solve tasks are assigned to different machines and evaluate actions on a particular state in parallel. Recursive split is facilitated through an invokeSplit() call from within Combine. Preliminary Result: Split, Solve and Combine operations are modeled with minimal modification of the sequential planner code. Shared information required for the Split, Solve and Combine operations is loaded only once on the different machines, thus avoiding recursive split overheads.
  • 37. Case Study 2: SGPlan4 Planner*. Goals are split into Subgoal1, Subgoal2, Subgoal3, ..., Subgoaln (splits: constraint-based subgoal partitioning). Solve1: Subgoal1 is split into Subgoal11, Subgoal12, ..., Subgoal1n (splits: based on landmark analysis). Solve11, Solve12, ..., Solve1n: evaluate actions applicable on the current state (splits: based on path optimization). Communication based on global constraints. Combine & evaluate: combine the subplans and update the penalty value of global constraints; check the producible resources and repeat again. *Chen Y. X., Wah B. W. and Hsu C. W. “Temporal planning using subgoal partitioning and resolution in SGPlan”, J. of Artificial Intelligence Research 2006.
  • 38. Edge Node File System (ENFS)*: ENFS Architecture: Metadata management is distributed amongst supernodes, in contrast to centrally managed metadata at the namenode. *K. Ponnavaikko and D. Janakiram, “The edge node file system: A distributed file system for high performance computing,” Scalable Computing: Practice and Experience, vol. 10, pp. 111–114, 2009.
  • 39. Comparing ENFS & HDFS: directory creation, directory stat, recursively changing directory attributes, I/O throughput.
  • 40. Comparing execution time for Andrew Benchmark (AB) runs.
  • 42. Scheduling considers data locality and node capability information.
  • 43. Underlying file system is responsible for providing fault tolerance support.
  • 44. Supernode responsible for maintaining shared storage’s metadata, while the shared storage itself is distributed across cluster nodes.
  • 46. Extending DMA on Hadoop: The clear separation between expressibility issues and runtime issues facilitates extending DMA to the Hadoop environment. Advantages: DMA interfaces can exploit the efficient runtime provided by Hadoop. At the same time, a wider class of applications can be captured.
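
The Split/Solve/Combine loop described for heuristic-guided problem solving (slides 34 to 36) can be illustrated with a small single-machine sketch. Everything here is an assumption made for illustration: the toy problem (reach a target integer with "inc" and "double" actions), the function names `split`, `solve` and `best_first_search`, and the use of a thread pool and a local heap in place of DMA's distributed Solve tasks and distributed priority queue. It is not the DMA API, which the slides do not show.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy actions; a real planner would have domain actions.
ACTIONS = {"inc": lambda s: s + 1, "double": lambda s: s * 2}


def split(state):
    """Split: create one sub-problem per applicable action."""
    return [(state, name) for name in sorted(ACTIONS)]


def solve(task, target):
    """Solve: apply one action and compute a heuristic for the new state."""
    state, name = task
    new_state = ACTIONS[name](state)
    return abs(target - new_state), new_state, name


def best_first_search(start, target, max_iterations=1000):
    """Greedy best-first search; the queue persists across iterations."""
    queue = [(abs(target - start), start, [])]  # stand-in for the shared queue
    seen = {start}
    with ThreadPoolExecutor(max_workers=len(ACTIONS)) as pool:
        for _ in range(max_iterations):
            if not queue:
                return None
            # dequeue(): select the state with the best heuristic value.
            _, state, plan = heapq.heappop(queue)
            if state == target:  # termination condition checked on the fly
                return plan
            # Evaluate all applicable actions on this state in parallel.
            results = pool.map(lambda task: solve(task, target), split(state))
            # Combine: enqueue() each newly generated state.
            for heuristic, new_state, name in results:
                if new_state not in seen:
                    seen.add(new_state)
                    heapq.heappush(queue, (heuristic, new_state, plan + [name]))
    return None
```

Calling `best_first_search(1, 10)` returns a list of action names that leads from 1 to 10. The number of iterations is not known in advance; it depends on when the goal test succeeds, which is the requirement the slides say fixed-iteration models handle poorly.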

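Slide 6 notes that binary operations such as Join have no direct support in the original MapReduce model and must be expressed indirectly. One common workaround, which may or may not be what the talk had in mind, is a reduce-side join: each mapper tags its records with their source, and the reducer pairs up records that share a key. The sketch below simulates this in memory; the datasets and the function names (`map_users`, `map_orders`, `shuffle`, `reduce_join`, `run_join`) are hypothetical and do not use any Hadoop API.

```python
from collections import defaultdict


def map_users(record):
    """Map over the first input: tag each value with its source."""
    user_id, name = record
    yield user_id, ("user", name)


def map_orders(record):
    """Map over the second input: same key space, different tag."""
    user_id, item = record
    yield user_id, ("order", item)


def shuffle(pairs):
    """Group intermediate values by key, as the runtime would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_join(key, values):
    """Reduce: pair every user record with every order record for a key."""
    names = [v for tag, v in values if tag == "user"]
    items = [v for tag, v in values if tag == "order"]
    for name in names:
        for item in items:
            yield key, name, item


def run_join(users, orders):
    """Drive the simulated job: map both inputs, shuffle, then reduce."""
    pairs = []
    for record in users:
        pairs.extend(map_users(record))
    for record in orders:
        pairs.extend(map_orders(record))
    joined = []
    for key, values in sorted(shuffle(pairs).items()):
        joined.extend(reduce_join(key, values))
    return joined
```

The tagging step is the extra work the key-value-only interface forces on the programmer: the two inputs are heterogeneous, so each value has to carry its origin through the shuffle. Keys that appear in only one input produce no output, which gives inner-join semantics.
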
Editor's Notes

  1. Federation of clusters. Clusters: sets of geographically proximal Autonomous Systems (AS), with O(10^3) nodes per cluster. A dynamic set of relatively capable nodes, Supernodes, manage resources within a cluster (such as devices, users, etc.) and portions of the file system namespace. Clusters are connected by a system-wide structured overlay.