This presentation is part of my work for the course 'Big Data Seminar' at TU Berlin within the IT4BI (Information Technology for Business Intelligence) master programme.
Slides from my talk at Big Data Spain 2014 in Madrid.
In this talk, we will discuss our approach to bringing large-scale deep analytics to the masses. R is an extremely popular numerical computing environment, but scientific data processing frequently hits its memory limits. On the other hand, systems for executing data-intensive tasks, such as Hadoop or Stratosphere, are not popular among R users because writing programs in these paradigms is cumbersome. We present an innovative approach to overcome these limitations using the Stratosphere/Apache Flink big data platform by means of an R package and ready-to-use distributed algorithms.
This solution allows the user, with small modifications to the R code, to easily execute distributed scenarios using popular machine learning techniques. We will cover the implementation details of the proposed solution, including the architecture of the system, the functionality implemented and working examples.
In addition, we will cover the differences between our approach and other solutions that integrate R with Hadoop or other large-scale analytics systems. Finally, the results of the performance tests show that this solution is competitive with existing R implementations for small amounts of data and able to scale up to the gigabyte level.
Lessons learnt from applying PyData to GetYourGuide marketing (Jose Luis Lopez Pino)
For all e-commerce sites, marketing is a big part of the business, and marketing efficiency and effectiveness are critical to success. Companies must make many data-driven decisions in order to reach customers that their competitors don't, maximize the revenue of each click, decide wisely which costs to cut, enter new markets, etc.
GetYourGuide has been working for more than two years on building marketing intelligence that allows us to grow our marketing efforts in the travel market without building a huge team or buying extremely expensive tools.
All the decisions are supported by a dedicated system running on the PyData stack that allows marketers to extract valuable insights from data and performs critical marketing tasks: keyword mining, campaign automation, predictive modeling, omni-channel marketing data integration, customer segmentation, pattern mining from click data, etc. (customer segmentation is sketched below).
As a result, in the last 8 months alone we were able to triple our marketing efforts, launch campaigns in 13 markets and automate 75% of our work. But this is not the end of our journey: GetYourGuide is building a Data Science team to understand travelers' needs and wants and make our customers' trips amazing.
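To make the PyData angle concrete, here is a minimal, hypothetical sketch of one of the tasks mentioned above (customer segmentation) using pandas and scikit-learn. The file name, column names and number of clusters are illustrative assumptions, not GetYourGuide's actual pipeline:

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Assumed input: one row per booking, with customer id, value and date.
bookings = pd.read_csv("bookings.csv", parse_dates=["booking_date"])

# Aggregate per customer: recency, frequency and monetary value (RFM).
now = bookings["booking_date"].max()
rfm = bookings.groupby("customer_id").agg(
    recency_days=("booking_date", lambda d: (now - d.max()).days),
    frequency=("booking_date", "count"),
    monetary=("booking_value", "sum"),
)

# Standardise the features and cluster customers into a handful of segments.
features = StandardScaler().fit_transform(rfm)
rfm["segment"] = KMeans(n_clusters=4, random_state=0, n_init=10).fit_predict(features)
print(rfm.groupby("segment").mean())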
This presentation is part of my work for the course 'Big Data Analytics Projects' at TU Berlin within the IT4BI (Information Technology for Business Intelligence) master programme.
In this talk, we will discuss our approach to bringing large-scale deep analytics to the masses. R is an extremely popular numerical computing environment, but scientific data processing frequently hits its memory limits. On the other hand, systems for executing data-intensive tasks, such as Hadoop or Stratosphere, are not popular among R users because writing programs in these paradigms is cumbersome. We present an innovative approach to overcome these limitations using the Stratosphere/Apache Flink big data platform by means of an R package and ready-to-use distributed algorithms.
This solution allows the user, with small modifications to the R code, to easily execute distributed scenarios using popular machine learning techniques. We will cover the implementation details of the proposed solution, including the architecture of the system, the functionality implemented and working examples.
In addition, we will cover the differences between our approach and other solutions that integrate R with Hadoop or other large-scale analytics systems.
Finally, the results of the performance tests show that this solution is competitive with existing R implementations for small amounts of data and able to scale up to the gigabyte level.
A review of some of the content and some of the references for the paper:
Flexible Support for Spatial Decision Making
Shan Gao, John Paynter, and David Sundaram. Proceedings of the 37th Hawaii International Conference on System Sciences, 2004.
The International Journal of Database Management Systems (IJDMS) is a bimonthly open access peer-reviewed journal that publishes articles that contribute new results in all areas of database management systems and their applications. The goal of this journal is to bring together researchers and practitioners from academia and industry to focus on understanding modern developments in this field and establishing new collaborations in these areas.
SYSTEMATIC LITERATURE REVIEW ON RESOURCE ALLOCATION AND RESOURCE SCHEDULING I... (ijait)
The objective of the work is to highlight the key features and afford the finest future directions in the research community of Resource Allocation, Resource Scheduling and Resource Management from 2009 to 2016, exemplifying how research on Resource Allocation, Resource Scheduling and Resource Management has progressively increased in the past decade by inspecting articles and papers from scientific and standard publications. The survey materialized in a three-fold process. Firstly, we investigated the amalgamation of Resource Allocation and Resource Scheduling, and then proceeded with Resource Management. Secondly, we performed a structural analysis of different authors' prominent contributions in the form of tabulation by categories and graphical representation. Thirdly, we grouped work with conceptual similarity in the field and also imparted a summary of all resource allocations. In cloud computing environments, there are two players: cloud providers and cloud users. On one hand, providers hold massive computing resources in their large datacenters and rent resources out to users on a per-usage basis. On the other hand, there are users who
Meeting the NSF DMP Requirement: March 7, 2012 (IUPUI)
March 7 version of the IUPUI workshop Meeting the NSF Data Management Plan Requirement: What you need to know. This workshop is co-sponsored by the Office of the Vice Chancellor for Research and the University Library.
Meeting the NSF DMP Requirement: June 13, 2012 (IUPUI)
June 13 version of the IUPUI workshop Meeting the NSF Data Management Plan Requirement: What you need to know. This workshop is co-sponsored by the Office of the Vice Chancellor for Research and the University Library.
Introduction to Database and Database Management. This presentation gives a basic idea of the differences among terms and types of databases.
It can be used for the first lecture of a Database Management course or a seminar in Information Systems.
It doesn't cover database modelling and languages.
Rethinking Lessons Learned in the PMBoK Process Groups: A Model based on Peop... (Marcirio Chaves)
The Ballistic 2.0 model
Intends to fill a gap in the literature regarding lessons learned (LL)
Based on consolidated literature
Expands the use of the knowledge creation model
Is in tune with PM 2.0 (agile, flexible, dynamic)
Provides a theoretical foundation for future research.
Data integration in a Hadoop-based data lake: A bioinformatics case (IJDKP)
When we work in a data lake, data integration is not easy, mainly because the data is usually stored in raw format. Manually performing data integration is a time-consuming task that requires the supervision of a specialist, who can make mistakes or fail to see the optimal point for data integration among two or more datasets. This paper presents a model to perform heterogeneous in-memory data integration in a Hadoop-based data lake based on a top-k set similarity approach. Our main contribution is the process of ingesting, storing, processing, integrating, and visualizing the data integration points. The algorithm for data integration is based on the Overlap coefficient since it presented better results when compared with the set similarity metrics Jaccard, Sørensen-Dice, and the Tversky index. We tested our model by applying it to eight bioinformatics-domain datasets. Our model presents better results when compared to the analysis of a specialist, and we expect our model can be reused for other domains of datasets.
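The abstract compares the Overlap coefficient with the Jaccard, Sørensen-Dice and Tversky set-similarity metrics. For reference only, and not as the paper's implementation, the standard definitions of these metrics can be written in Python as follows; the two attribute sets are made-up examples:

def jaccard(a, b):
    # Intersection over union.
    return len(a & b) / len(a | b)

def sorensen_dice(a, b):
    # Twice the intersection over the sum of both set sizes.
    return 2 * len(a & b) / (len(a) + len(b))

def overlap(a, b):
    # Overlap (Szymkiewicz-Simpson) coefficient: intersection over the smaller set.
    return len(a & b) / min(len(a), len(b))

def tversky(a, b, alpha=0.5, beta=0.5):
    # Asymmetric generalisation of Jaccard and Dice.
    inter = len(a & b)
    return inter / (inter + alpha * len(a - b) + beta * len(b - a))

# Hypothetical attribute sets from two datasets to be integrated.
x = {"gene_id", "gene_name", "chromosome", "start", "end"}
y = {"gene_id", "chromosome", "position"}
print(jaccard(x, y), sorensen_dice(x, y), overlap(x, y), tversky(x, y))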
'Using Linked Data in Learning Analytics' is a tutorial targeting researchers in Learning Analytics interested in exploiting linked data resources, developers of Learning Analytics solutions that could benefit from Linked Data, and data owners wanting to understand how linked data can help the analysis of their data in relation to other sources of information. The tutorial is described in more detail at http://linkedu.eu/event/lak2013-linkeddata-tutorial/, where learning material related to the topic of the tutorial will also be disseminated.
http://portal.ou.nl/documents/363049/033208ab-9dba-43be-b1d8-80d6423c0654
http://creativecommons.org/licenses/by-nc-sa/3.0/
d'Aquin, M., Dietze, S., Herder, E., Drachsler, H. (Eds.) (2013). Tutorial: Using Linked Data in Learning Analytics. Tutorial given at LAK 2013, the Third Conference on Learning Analytics and Knowledge. Leuven, Belgium.
10-1-13 “Research Data Curation at UC San Diego: An Overview” Presentation Sl... (DuraSpace)
“Hot Topics: The DuraSpace Community Webinar Series,” Series Six: “Research Data in Repositories.” Curated by David Minor, Research Data Curation Program, UC San Diego Library. Webinar 1: “Research Data Curation at UC San Diego: An Overview”
Presented by David Minor & Declan Fleming, Chief Technology Strategist, UC San Diego Library
This presentation is part of my work for the course 'Heterogeneous and Distributed Information Systems' at TU Berlin within the IT4BI (Information Technology for Business Intelligence) master programme.
RDFa: introduction, comparison with microdata and microformats and how to use it (Jose Luis Lopez Pino)
Report for the course 'XML and Web Technologies' of the IT4BI Erasmus Mundus Master's Programme. Introduction, motivation, target domain, schema, attributes, comparing RDFa with RDF, comparing RDFa with Microformats, comparing RDFa with Microdata, how to use RDFa to improve websites, how to extract metadata defined with RDFa, GRDDL and a simple exercise.
RDFa: introduction, comparison with microdata and microformats and how to use it (Jose Luis Lopez Pino)
Presentation for the course 'XML and Web Technologies' of the IT4BI Erasmus Mundus Master's Programme. Introduction, motivation, target domain, schema, attributes, comparing RDFa with RDF, comparing RDFa with Microformats, comparing RDFa with Microdata, how to use RDFa to improve websites, how to extract metadata defined with RDFa, GRDDL and a simple exercise.
What is steganography?
What steganography is NOT
Steganography and cryptography
Why use it?
Physical steganography
Digital steganography techniques
Curious uses of digital steganography
Attacks
Attack techniques
Steganalysis
Watermarks
Presentation prepared for the national CUSL.
You can try the latest version of Visuse at www.visuse.com
More information about the project at http://visuse.wordpress.com
Introduction to Firefox, Mozilla's free browser
Version 2:
- Fixed the image about RAM consumption.
- Included the examples.
- Included the extensions sought for the presentation at Económicas.
Version 3:
- New extensions: Cooliris, Peers and Speed Dial.
- Some features that will be included in upcoming versions of Firefox.
Elevating Tactical DDD Patterns Through Object Calisthenics (Dorra BARTAGUIZ)
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... (DanBrown980551)
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Securing your Kubernetes cluster: a step-by-step guide to success! (KatiaHIMEUR1)
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024 (Albert Hoitingh)
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview, including the concepts of Customer Key and Double Key Encryption.
Kubernetes & AI - Beauty and the Beast!?! @KCD Istanbul 2024 (Tobias Schneck)
As AI technology pushes into IT, I was wondering, as an "infrastructure container Kubernetes guy", how this fancy AI technology gets managed from an infrastructure operations view. Is it possible to apply our lovely cloud-native principles as well? What benefits could both technologies bring to each other?
Let me take these questions and provide you with a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premise strategy we may need to apply it to our own infrastructure from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and what could be beneficial for or limiting your AI use cases in an enterprise environment. An interactive demo will give you some insights into the approaches I already have working for real.
DevOps and Testing slides at DASA Connect (Kari Kakkonen)
Slides by me and Rik Marselis at the 30.5.2024 DASA Connect conference. We discuss what testing is, then what agile testing is, and finally what testing in DevOps is. Finally, we had a lovely workshop with the participants, trying to find different ways to think about quality and testing in different parts of the DevOps infinity loop.
Connector Corner: Automate dynamic content and events by pushing a button (DianaGray10)
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
Generating a custom Ruby SDK for your web service or Rails API using Smithy (g2nightmarescribd)
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
2. Table of contents
1 Introduction
   The problem
   Solutions
2 YARN
   Architecture
   Advantages
   Drawbacks
   Performance
3 Mesos
   Architecture
   Advantages
   Drawbacks
   Performance
4 Omega
   Architecture
   Advantages
   Drawbacks
   Performance
5 Related work
   Resource managers
   Scheduling techniques
6 Conclusions
18. Scheduling techniques
Lottery scheduling [11] (sketched below)
Dynamic Proportional Share Scheduling [7]
Calibration: how does a particular task perform in a particular node? [5]
Stragglers and speculative relaunch [13]
Delay scheduling: achieve locality, relax fairness [12]
Rich resource requests [2]
Optimize short jobs [3]
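Lottery scheduling [11] is the easiest of the techniques above to sketch: each job holds tickets in proportion to its intended share of the cluster, and a random draw decides which job receives the next task slot. The Python sketch below is illustrative only; the job names and ticket counts are assumptions, and it is not taken from YARN, Mesos or Omega:

import random

# Tickets per job, proportional to each job's intended share (assumed values).
tickets = {"etl-job": 50, "ad-hoc-query": 30, "ml-training": 20}

def draw(tickets):
    """Return the job that wins the next task slot."""
    winning_ticket = random.uniform(0, sum(tickets.values()))
    running_total = 0.0
    for job, count in tickets.items():
        running_total += count
        if winning_ticket <= running_total:
            return job
    return job  # floating-point edge case: fall back to the last job

# Over many draws, each job's share of slots approaches its ticket share.
slots = [draw(tickets) for _ in range(10_000)]
for job in tickets:
    print(job, round(slots.count(job) / len(slots), 3))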
20. References I
[1] Ronnie Chaiken, Bob Jenkins, Per-Åke Larson, Bill Ramsey, Darren Shakib, Simon Weaver, and Jingren Zhou. SCOPE: easy and efficient parallel processing of massive data sets. Proceedings of the VLDB Endowment, 1(2):1265–1276, 2008.
[2] Carlo Curino, Djellel Difallah, Chris Douglas, Raghu Ramakrishnan, and Sriram Rao. Reservation-based scheduling: If you're late don't blame us!
[3] Khaled Elmeleegy. Piranha: Optimizing short jobs in Hadoop. Proceedings of the VLDB Endowment, 6(11):985–996, 2013.
21. References II
[4] Michael Isard, Vijayan Prabhakaran, Jon Currey, Udi Wieder, Kunal Talwar, and Andrew Goldberg. Quincy: fair scheduling for distributed computing clusters. In Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles, pages 261–276. ACM, 2009.
[5] Gunho Lee, Byung-Gon Chun, and Randy H Katz. Heterogeneity-aware resource allocation and scheduling in the cloud. In Proceedings of the 3rd USENIX Workshop on Hot Topics in Cloud Computing, HotCloud, volume 11, 2011.
22. References III
[6] Kyong-Ha Lee, Yoon-Joon Lee, Hyunsik Choi, Yon Dohn Chung, and Bongki Moon. Parallel data processing with MapReduce: a survey. ACM SIGMOD Record, 40(4):11–20, 2012.
[7] Thomas Sandholm and Kevin Lai. Dynamic proportional share scheduling in Hadoop. In Job scheduling strategies for parallel processing, pages 110–131. Springer, 2010.
23. References IV
[8] Malte Schwarzkopf, Andy Konwinski, Michael Abd-El-Malek, and John Wilkes. Omega: Flexible, scalable schedulers for large compute clusters. In Proceedings of the 8th ACM European Conference on Computer Systems, EuroSys '13, pages 351–364, New York, NY, USA, 2013. ACM.
[9] Facebook Engineering Team. Under the hood: Scheduling MapReduce jobs more efficiently with Corona.
24. References V
[10] Vinod K. Vavilapalli. Apache Hadoop YARN: Yet Another Resource Negotiator. In Proc. SOCC, 2013.
[11] Carl A Waldspurger and William E Weihl. Lottery scheduling: Flexible proportional-share resource management. In Proceedings of the 1st USENIX conference on Operating Systems Design and Implementation, page 1. USENIX Association, 1994.
25. References VI
[12] Matei Zaharia, Dhruba Borthakur, Joydeep Sen Sarma, Khaled Elmeleegy, Scott Shenker, and Ion Stoica. Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling. In Proceedings of the 5th European conference on Computer systems, pages 265–278. ACM, 2010.
[13] Matei Zaharia, Andy Konwinski, Anthony D Joseph, Randy H Katz, and Ion Stoica. Improving MapReduce performance in heterogeneous environments. In OSDI, volume 8, page 7, 2008.