The document discusses two contributions to performance modeling and management of data centers: 1) Resource allocation for a large-scale cloud environment, and 2) Performance modeling of a distributed key-value store. For resource allocation, the author developed a generic and scalable gossip-based protocol that supports joint allocation of compute and network resources under different management objectives. For performance modeling, the author created analytical models that accurately predict response time distributions and estimate capacity for Spotify's distributed key-value store backend. The models are simple yet obtain accurate results within Spotify's operational range.
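The thesis's own models are not reproduced in this summary; as an illustration of the kind of analytical response-time model involved, the simplest case is an M/M/1 queue, where response time is exponentially distributed with rate (mu - lambda). The traffic numbers below are made up for the example:

```python
import math

def mm1_response_time_cdf(t, arrival_rate, service_rate):
    """P(response time <= t) in an M/M/1 queue: F(t) = 1 - exp(-(mu - lam) * t)."""
    assert service_rate > arrival_rate, "queue must be stable (lambda < mu)"
    return 1.0 - math.exp(-(service_rate - arrival_rate) * t)

def mm1_latency_quantile(p, arrival_rate, service_rate):
    """Inverse CDF: the time within which a fraction p of requests complete."""
    return -math.log(1.0 - p) / (service_rate - arrival_rate)

# Illustrative capacity check: 800 req/s offered to a server rated at 1000 req/s.
p99 = mm1_latency_quantile(0.99, 800.0, 1000.0)  # 99th-percentile latency (s)
```

Closed-form percentiles like this are what make analytical capacity estimation cheap: inverting the quantile for the arrival rate gives the maximum load that still meets a latency objective.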
[poster] A Compare-Aggregate Model with Latent Clustering for Answer Selection - Seoul National University
CIKM 2019
In this paper, we propose a novel method for the sentence-level answer-selection task, one of the fundamental problems in natural language processing. First, we explore the effect of additional information by adopting a pretrained language model to compute the vector representation of the input text and by applying transfer learning from a large-scale corpus. Second, we enhance the compare-aggregate model by proposing a novel latent clustering method to compute additional information within the target corpus and by changing the objective function from listwise to pointwise. To evaluate the performance of the proposed approaches, experiments are performed with the WikiQA and TRECQA datasets. The empirical results demonstrate the superiority of our proposed approach, which achieves state-of-the-art performance on both datasets.
This document provides an introduction to boosted trees. It reviews key concepts in supervised learning like loss functions and regularization. Regression trees make predictions by assigning scores to leaf nodes. Gradient boosting is an algorithm that learns an ensemble of regression trees additively to minimize a loss function. It works by greedily adding trees to improve the model's fit on the training data based on the gradient of the loss function. The trees are learned one at a time by calculating the optimal split at each node to reduce a "structure score" measuring how well the split partitions the data.
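The additive procedure described above can be sketched end to end. The toy implementation below uses depth-1 "stumps" instead of full trees and squared loss, and omits regularization and the structure score; it shows only the core loop of fitting each new tree to the residuals, which are the negative gradient of squared loss:

```python
def fit_stump(xs, residuals):
    """Best single-threshold stump minimizing squared error on the residuals."""
    best = None
    for split in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= split]
        right = [r for x, r in zip(xs, residuals) if x > split]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, split, lmean, rmean)
    _, split, lmean, rmean = best
    return lambda x: lmean if x <= split else rmean

def gradient_boost(xs, ys, n_trees=50, lr=0.3):
    """Additively fit stumps to the residuals (negative gradient of squared loss)."""
    base = sum(ys) / len(ys)
    pred = [base] * len(xs)
    trees = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, pred)]  # -dL/dpred for L = (y - p)^2 / 2
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        pred = [p + lr * tree(x) for p, x in zip(pred, xs)]
    return lambda x: base + lr * sum(t(x) for t in trees)

xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [x * x for x in xs]          # y = x^2, learnable by an ensemble of stumps
model = gradient_boost(xs, ys)
```

The learning rate `lr` is the shrinkage factor: each tree corrects only a fraction of the remaining error, which is why many small trees are added rather than one large one.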
The document describes research on scalable machine learning algorithms for massive astronomical datasets. It discusses 10 common data analysis problems in astronomy like querying, density estimation, regression, classification, etc. For each problem, it lists relevant statistical methods and their computational costs. It emphasizes choosing the right method for the scientific question and developing fast algorithms for bottleneck operations like generalized N-body problems and graphical model inference to enable machine learning on datasets with billions of data points and features.
Knowledge Discovery Query Language (KDQL) - Zakaria Zubi
The document discusses Knowledge Discovery Query Language (KDQL), a proposed query language for interacting with i-extended databases in the knowledge discovery process. KDQL is designed to handle data mining rules and retrieve association rules from i-extended databases. The key points are:
1) KDQL is based on SQL and is intended to support tasks like association rule mining within the ODBC_KDD(2) model for knowledge discovery.
2) It can be used to query i-extended databases, which contain both data and discovered patterns.
3) The KDQL RULES operator allows users to specify data mining tasks like finding association rules that satisfy certain frequency and confidence thresholds.
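The RULES operator's syntax belongs to the paper; to make its semantics concrete, here is a hypothetical Python equivalent of such a query, mining single-antecedent association rules that satisfy the frequency (support) and confidence thresholds mentioned in point 3:

```python
from itertools import combinations

def association_rules(transactions, min_support, min_confidence):
    """Mine rules A -> B meeting support and confidence thresholds,
    the kind of task a KDQL RULES query would express declaratively."""
    n = len(transactions)
    counts = {}
    for t in transactions:
        for size in (1, 2):
            for items in combinations(sorted(t), size):
                counts[items] = counts.get(items, 0) + 1
    rules = []
    pairs = [(k, v) for k, v in counts.items() if len(k) == 2]
    for (a, b), pair_count in pairs:
        support = pair_count / n
        if support < min_support:
            continue
        for ante, cons in ((a, b), (b, a)):
            confidence = pair_count / counts[(ante,)]
            if confidence >= min_confidence:
                rules.append((ante, cons, support, confidence))
    return rules

baskets = [{"bread", "butter"}, {"bread", "butter", "jam"},
           {"bread"}, {"butter"}, {"bread", "butter"}]
rules = association_rules(baskets, min_support=0.4, min_confidence=0.7)
```

In KDQL the thresholds would appear as qualifiers on the RULES operator; here they are plain function arguments.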
“Materials Informatics and Big Data: Realization of 4th Paradigm of Science i... - aimsnist
- Ankit Agrawal is a research associate professor in the Department of Electrical Engineering and Computer Science at Northwestern University who works on materials informatics and big data.
- His research thrusts include the Materials Genome Initiative, developing forward and inverse models to predict materials properties and optimize microstructures using data-driven techniques, and applying machine learning to characterize materials structures.
- Some illustrative projects include using data mining to predict steel fatigue strength and formation energies from DFT calculations, optimizing magnetostrictive Galfenol alloys, and developing deep learning methods for EBSD indexing and microstructure design.
Applications of Machine Learning for Materials Discovery at NREL - aimsnist
Machine learning and artificial intelligence techniques are being applied at NREL to accelerate materials discovery in several ways:
1) Clustering of experimental XRD patterns allows automated structure determination, replacing slow manual analysis.
2) Neural networks can predict optoelectronic properties of molecules from their structure alone, screening millions of candidates.
3) Models are being developed to predict properties not measured in experiments to augment experimental data.
4) End-to-end deep learning on molecular and crystal structures may predict properties with accuracy approaching computationally expensive DFT simulations.
An online semantic enhanced Dirichlet model for short text - Jay Kumarr
The document presents a new online semantic-enhanced Dirichlet model (OSDM) for clustering short text streams. OSDM addresses challenges with existing approaches like semantic ambiguity, concept drift over time, and batch vs online processing. It maintains active topics online using a non-parametric probabilistic graphical model that incorporates semantic information through term co-occurrence and performs automatic topic detection. Experimental results on news, tweet and Reuters datasets show OSDM outperforms other models in clustering performance over data streams and is robust to parameter changes.
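OSDM's full model is probabilistic and non-parametric; as a much simplified sketch of the online assign-or-spawn idea, the one-pass clusterer below assigns each arriving document to the most similar active cluster (cosine similarity over term counts) or opens a new cluster when nothing is similar enough. The similarity threshold is an assumption for illustration, standing in for OSDM's probabilistic decision:

```python
def online_cluster(docs, new_cluster_threshold=0.2):
    """One-pass stream clustering: merge each document into the most similar
    active cluster, or spawn a new cluster when no cluster is similar enough."""
    clusters = []   # each cluster is a dict of term -> count
    labels = []
    for doc in docs:
        counts = {}
        for term in doc.lower().split():
            counts[term] = counts.get(term, 0) + 1
        best, best_sim = None, 0.0
        for i, c in enumerate(clusters):
            dot = sum(v * c.get(k, 0) for k, v in counts.items())
            norm = (sum(v * v for v in counts.values()) ** 0.5
                    * sum(v * v for v in c.values()) ** 0.5)
            sim = dot / norm if norm else 0.0
            if sim > best_sim:
                best, best_sim = i, sim
        if best is None or best_sim < new_cluster_threshold:
            clusters.append(counts)              # spawn a new topic
            labels.append(len(clusters) - 1)
        else:
            for k, v in counts.items():          # merge into the matched topic
                clusters[best][k] = clusters[best].get(k, 0) + v
            labels.append(best)
    return labels

stream = ["stock market rally", "market stocks surge",
          "rain storm warning", "heavy rain storm"]
labels = online_cluster(stream)
```

OSDM additionally weights shared terms by co-occurrence semantics and can retire stale clusters to track concept drift, both omitted here.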
2D/3D Materials screening and genetic algorithm with ML model - aimsnist
JARVIS-ML provides concise summaries of materials properties using machine learning models trained on the extensive data in the JARVIS repositories. It has developed regression and classification models that can predict formation energies, bandgaps, and other material properties in seconds, much faster than traditional DFT calculations. The models use gradient boosting decision trees and feature importance analysis to provide explanations. JARVIS-ML is available as a public web app and API for rapid screening and discovery of new materials.
Smart Metrics for High Performance Material Design - aimsnist
This document discusses smart metrics for high-performance material design using density functional theory (DFT), classical force fields (FF), and machine learning (ML). It provides an overview of the JARVIS database and tools containing over 35,000 materials and classical properties calculated using DFT, FF, and ML methods. Metrics discussed include formation energy, exfoliation energy, elastic constants, surface energy, vacancy energy, grain boundary energy, bandgaps, and other electronic and optical properties important for applications like solar cells. ML models are developed to predict these properties with mean absolute errors within chemical accuracy compared to DFT benchmarks.
The Pupil Has Become the Master: Teacher-Student Model-Based Word Embedding D... - Jinho Choi
Recent advances in deep learning have increased the demand for neural models in real applications. In practice, these applications often need to be deployed with limited resources while maintaining high accuracy. This paper addresses the core of neural models in NLP, word embeddings, and presents a new embedding distillation framework that substantially reduces the dimension of word embeddings without compromising accuracy. A novel distillation ensemble approach is also proposed that trains a highly efficient student model using multiple teacher models. In our approach, the teacher models play a role only during training, so that the student model operates on its own without support from the teacher models during decoding, which makes it eighty times faster and lighter than other typical ensemble methods. All models are evaluated on seven document classification datasets and show a significant advantage over the teacher models in most cases. Our analysis depicts an insightful transformation of word embeddings through distillation and suggests a future direction for ensemble approaches using neural models.
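The paper's distillation objective is not spelled out in this summary; as a hedged toy version of the general idea, the sketch below trains low-dimensional student vectors so that their dot products reproduce the teacher's, using plain gradient descent. The dimensions, step sizes, and the similarity-matching objective itself are illustrative assumptions, not the paper's method:

```python
import random

def distill_embeddings(teacher, d_student=2, steps=500, lr=0.05, seed=0):
    """Fit student vectors s_i minimizing sum_ij (s_i . s_j - t_i . t_j)^2,
    i.e. match the teacher's pairwise similarity structure in fewer dims."""
    rng = random.Random(seed)
    n = len(teacher)
    target = [[sum(a * b for a, b in zip(teacher[i], teacher[j]))
               for j in range(n)] for i in range(n)]
    student = [[rng.gauss(0, 0.1) for _ in range(d_student)] for _ in range(n)]

    def loss():
        return sum((sum(a * b for a, b in zip(student[i], student[j]))
                    - target[i][j]) ** 2 for i in range(n) for j in range(n))

    start = loss()
    for _ in range(steps):
        errs = [[sum(a * b for a, b in zip(student[i], student[j])) - target[i][j]
                 for j in range(n)] for i in range(n)]
        # dL/ds_i = 4 * sum_j err_ij * s_j (err is symmetric)
        grads = [[4 * sum(errs[i][j] * student[j][k] for j in range(n))
                  for k in range(d_student)] for i in range(n)]
        for i in range(n):
            for k in range(d_student):
                student[i][k] -= lr * grads[i][k]
    return student, start, loss()

# 4-d teacher embeddings compressed to 2-d students.
teacher = [[1.0, 0.0, 0.0, 0.0], [0.9, 0.1, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
student, loss_before, loss_after = distill_embeddings(teacher)
```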
This paper surveys massively parallel techniques for indexing multidimensional datasets on GPUs. It discusses existing techniques like Parallel R-Trees that have irregular memory access patterns and are not well-suited for GPUs. It proposes two new algorithms: Massively Parallel Hilbert R-Trees (MPHR) which uses Hilbert curves to cluster data, and Massively Parallel Restart Scanning (MPRS) which transforms tree traversal into sequential data processing without recursion. MPHR and MPRS improve GPU utilization for range queries by avoiding irregular search paths and enabling contiguous memory access patterns.
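The Hilbert ordering that MPHR relies on can be shown in isolation: the classic `xy2d` construction maps grid coordinates to positions along the Hilbert curve, so sorting by this index clusters spatially nearby points together for bulk loading. The GPU traversal machinery itself is omitted:

```python
def xy2d(n, x, y):
    """Distance of (x, y) along the Hilbert curve on an n x n grid
    (n a power of two); nearby points tend to get nearby distances."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:              # rotate/flip the quadrant for the next level
            if rx == 1:
                # reflect; bits above the current scale are never re-read,
                # so flipping them with n-1 is harmless
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d

# Bulk-loading order: sort entries by their Hilbert index.
points = [(3, 1), (0, 0), (1, 2), (2, 2)]
ordered = sorted(points, key=lambda p: xy2d(4, p[0], p[1]))
```

Because neighbors in Hilbert order are neighbors in space, leaves built from consecutive runs of this ordering have small bounding boxes, which is what keeps range-query search paths regular.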
Polymer Genome: An Informatics Platform for Polymer Dielectrics Discovery and... - aimsnist
The Ramprasad Research Group at Georgia Tech uses machine learning and high-throughput computations to screen the chemical space of polymers for materials with applications in energy storage. They have identified several high-performing polymer dielectrics with higher dielectric constants and energy densities than the current standard. The group is working to further improve materials by incorporating metals and exploring morphological complexity with the goal of autonomous polymer design.
This document summarizes research on semi-supervised classification methods for protein crystallization image classification. It describes self-training and YATSI (Yet Another Two Staged Idea) semi-supervised classification approaches applied to a dataset of 2250 protein crystallization images. Experimental results show that naive Bayesian and SMO classifiers benefited from self-training and YATSI, while decision trees, multilayer perceptron, and random forests did not improve. Random forest provided the best overall classification performance. Future work will investigate active learning combined with semi-supervised learning.
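The self-training loop itself is generic and small. The sketch below uses a nearest-centroid base classifier on toy 2-D features; the study used image features with classifiers such as naive Bayes and random forests, so everything here except the loop structure is an illustrative stand-in:

```python
def self_train(labeled, unlabeled, rounds=5, confidence=0.8):
    """Self-training: each round, fit on the labeled set, pseudo-label the
    unlabeled points the model is confident about, and add them to training."""
    labeled = list(labeled)
    unlabeled = list(unlabeled)
    for _ in range(rounds):
        # Fit: one centroid per class.
        sums, counts = {}, {}
        for x, y in labeled:
            s = sums.setdefault(y, [0.0] * len(x))
            for i, v in enumerate(x):
                s[i] += v
            counts[y] = counts.get(y, 0) + 1
        centroids = {y: [v / counts[y] for v in s] for y, s in sums.items()}

        def predict(x):
            dists = {y: sum((a - b) ** 2 for a, b in zip(x, c)) ** 0.5
                     for y, c in centroids.items()}
            best = min(dists, key=dists.get)
            total = sum(dists.values())
            conf = 1.0 - dists[best] / total if total else 1.0
            return best, conf

        keep = []
        for x in unlabeled:
            y, conf = predict(x)
            if conf >= confidence:
                labeled.append((x, y))   # pseudo-label a confident point
            else:
                keep.append(x)
        unlabeled = keep
    return predict

labeled = [((0.0, 0.0), "clear"), ((5.0, 5.0), "crystal")]
unlabeled = [(0.4, 0.2), (4.8, 5.1), (0.1, 0.5), (5.2, 4.9)]
classify = self_train(labeled, unlabeled)
```

YATSI differs mainly in its second stage, where a nearest-neighbor vote over the pseudo-labeled pool weights the final decision.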
This document discusses various machine learning techniques for transfer learning, including unsupervised domain adaptation (UDA), few-shot learning (FSL), zero-shot learning (ZSL), and hypothesis transfer learning (HTL). For UDA, the author proposes graph matching approaches to minimize domain discrepancy between source and target domains. For FSL, a two-stage approach is used to estimate novel class prototypes and variances. For ZSL, an approach is described that uses relational matching, adaptation, and calibration. For HTL, estimating novel class prototypes from source prototypes and sparse target data is discussed. Experimental results demonstrate the effectiveness of the proposed approaches.
Zero-shot Image Recognition Using Relational Matching, Adaptation and Calibra... - Debasmit Das
This document proposes a three-step approach for zero-shot image recognition using relational matching, domain adaptation, and calibration. The approach uses relational matching to find structural correspondences between semantic embeddings and features, domain adaptation to adapt unseen semantic embeddings to the test data domain, and calibration to reduce bias towards seen classes. Experimental results on four datasets show improved zero-shot and generalized zero-shot classification performance compared to previous methods, with domain adaptation providing the most benefit. Analysis of hubness and convergence properties are also presented.
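The calibration step can be illustrated with calibrated stacking, a standard way to reduce seen-class bias in generalized zero-shot prediction: subtract a calibration factor from seen-class scores before taking the argmax. The scores and gamma below are made up, and the paper's exact calibration may differ:

```python
def calibrated_predict(scores, seen_classes, gamma):
    """Generalized zero-shot prediction with calibrated stacking: penalize
    seen-class scores by gamma to offset the bias toward trained classes."""
    adjusted = {c: s - (gamma if c in seen_classes else 0.0)
                for c, s in scores.items()}
    return max(adjusted, key=adjusted.get)

scores = {"cat": 0.9, "dog": 0.85, "zebra": 0.8}   # zebra is the unseen class
seen = {"cat", "dog"}
label_raw = calibrated_predict(scores, seen, gamma=0.0)    # biased toward seen
label_cal = calibrated_predict(scores, seen, gamma=0.2)    # calibrated
```

Gamma is typically chosen on validation data to balance seen- and unseen-class accuracy.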
The document discusses using topic modeling techniques to cluster and classify records from multiple OAI repositories to enhance metadata and subject descriptions. Key steps included preprocessing records, building a vocabulary, running topic modeling to generate 500 topics, organizing topics into broad topical categories, and developing a browser to explore topics and records. Evaluation of the techniques found it worked well for English repositories but requires more testing on other languages and repository types. Potential products and services are proposed like integrating the topics into OAIster for subject search and browse.
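The preprocessing and vocabulary-building steps of such a pipeline can be sketched directly; the stopword list and thresholds below are illustrative assumptions, and the topic-modeling step itself (e.g. LDA over this vocabulary) is omitted:

```python
import re
from collections import Counter

STOPWORDS = {"the", "of", "and", "a", "in", "to", "for"}

def build_vocabulary(records, min_count=2, max_doc_frac=0.9):
    """Tokenize metadata records, drop stopwords, and keep terms that are
    frequent enough overall but not present in nearly every record."""
    docs = [[w for w in re.findall(r"[a-z]+", r.lower()) if w not in STOPWORDS]
            for r in records]
    counts = Counter(w for d in docs for w in d)
    doc_freq = Counter(w for d in docs for w in set(d))
    n = len(docs)
    vocab = sorted(w for w, c in counts.items()
                   if c >= min_count and doc_freq[w] / n <= max_doc_frac)
    return docs, vocab

records = ["The history of railway engineering",
           "Railway bridges and engineering methods",
           "History of medieval bridges"]
docs, vocab = build_vocabulary(records)
```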
[GAN by Hung-yi Lee] Part 1: General introduction of GAN - NAVER Engineering
Generative Adversarial Network and its Applications on Speech and Natural Language Processing, Part 1.
Speaker: Hung-yi Lee (Professor, National Taiwan University)
Date: July 2018
Generative adversarial network (GAN) is a new idea for training models, in which a generator and a discriminator compete against each other to improve generation quality. Recently, GAN has shown amazing results in image generation, and a large number and a wide variety of new ideas, techniques, and applications have been developed based on it. Although there are only a few successful cases so far, GAN has great potential to be applied to text and speech generation to overcome limitations of conventional methods.
In the first part of the talk, I will give an introduction to GAN and provide a thorough review of this technology. In the second part, I will focus on the applications of GAN to speech and natural language processing. I will demonstrate the applications of GAN to voice conversion, unsupervised abstractive summarization, and sentiment-controllable chat-bots. I will also talk about research directions towards unsupervised speech recognition using GAN.
Automated Generation of High-accuracy Interatomic Potentials Using Quantum Data - aimsnist
Sandia National Laboratories is developing SNAP (Spectral Neighbor Analysis Potential) potentials for molecular dynamics simulations. SNAP potentials are fitted to quantum mechanical data using bispectrum components that describe the local atomic environments. SNAP potentials have been shown to accurately reproduce properties of tantalum, including liquid structure and screw dislocation behavior not included in the training data. Work is ongoing to develop multi-element SNAP potentials, including for tungsten-beryllium alloys relevant to modeling plasma-surface interactions in nuclear fusion reactors.
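At its core, the fitting step is a linear least-squares problem: energies from quantum calculations are regressed onto descriptors of the local atomic environments. The sketch below fits such a linear model via the normal equations, with generic two-component descriptors standing in for the bispectrum components and synthetic energies in place of quantum data:

```python
def fit_linear_potential(descriptors, energies):
    """SNAP-style fit: model energies as a linear combination of descriptors,
    solving the normal equations (D^T D) c = D^T e by Gaussian elimination."""
    m = len(descriptors[0])
    A = [[sum(d[i] * d[j] for d in descriptors) for j in range(m)]
         for i in range(m)]
    b = [sum(d[i] * e for d, e in zip(descriptors, energies)) for i in range(m)]
    # Forward elimination with partial pivoting.
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, m):
            f = A[r][col] / A[col][col]
            for c in range(col, m):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    coeffs = [0.0] * m
    for r in range(m - 1, -1, -1):
        coeffs[r] = (b[r] - sum(A[r][c] * coeffs[c]
                                for c in range(r + 1, m))) / A[r][r]
    return coeffs

# Synthetic data generated from a known "potential" E = 2*B1 - 0.5*B2.
descriptors = [[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [0.5, 1.5]]
energies = [2.0 * d[0] - 0.5 * d[1] for d in descriptors]
coeffs = fit_linear_potential(descriptors, energies)
```

Real SNAP fits also include forces and stresses in the regression and weight configurations by importance, which this sketch omits.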
Failing Fastest: What an Effective HTE and ML Workflow Enables for Functional... - aimsnist
This document discusses how high-throughput experimentation (HTE) and machine learning (ML) can accelerate materials discovery for functional metallic glasses (MGs). It describes a round robin experiment between NIST and NREL to synthesize and characterize composition spread samples to test data sharing standards. General trends predicted by ML models often correlate within a given synthesis method but systematic differences can occur between methods. While ML is not a replacement for physics, the combination of HTE and ML can identify promising new materials faster than traditional experimentation alone. Autonomous research platforms may enable an even greater acceleration of the materials discovery process.
Gossip-based resource allocation for green computing in large clouds - Rerngvit Yanggratoke
This document summarizes a research paper on a gossip-based resource allocation protocol called GRMP-Q for server consolidation in large cloud environments. The protocol aims to minimize active servers and allocate resources fairly while adapting dynamically to load changes. It uses a distributed middleware architecture and gossip algorithms to provide scalability without single points of failure. Simulation results show GRMP-Q reduces power usage by shutting down servers, satisfies demand fairly, and reconfigures with low cost compared to optimal solutions. Future work areas include analyzing convergence, supporting heterogeneity, and expanding the architecture.
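The gossip mechanism underlying this protocol family can be shown in miniature: in push-pull averaging, a random pair of nodes repeatedly exchanges state and adopts the average, so every node converges to the global mean with no central coordinator or single point of failure. GRMP-Q itself computes resource allocations rather than plain averages, so this is only the communication skeleton:

```python
import random

def gossip_average(values, rounds=200, seed=1):
    """Push-pull gossip: each round, one random pair of nodes averages
    their state; all nodes converge to the global mean."""
    values = list(values)
    rng = random.Random(seed)
    n = len(values)
    for _ in range(rounds):
        i, j = rng.sample(range(n), 2)     # pick two distinct nodes
        avg = (values[i] + values[j]) / 2
        values[i] = values[j] = avg        # both adopt the average
    return values

loads = [10.0, 0.0, 4.0, 6.0]   # per-server load estimates
final = gossip_average(loads)
```

Note that each exchange conserves the total, so the protocol converges to the true mean; this conservation property is what makes gossip protocols robust to the order and timing of exchanges.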
The document summarizes a presentation about Stack Mobile, a mobile application for managing OpenStack clouds from Android phones. The presentation covers the motivation for developing a mobile-friendly app, the design goals of optimizing the app for mobile and limiting battery usage, and an overview of the features currently supported in the alpha version, including managing instances, volumes, snapshots, and security groups. It concludes by inviting volunteers to test the alpha version and provide feedback.
http://inarocket.com
Learn BEM fundamentals as fast as possible: what BEM (Block, Element, Modifier) is, its syntax, and how it works, with a real example.
The document discusses how personalization and dynamic content are becoming increasingly important on websites. It notes that 52% of marketers see content personalization as critical and 75% of consumers like it when brands personalize their content. However, personalization can create issues for search engine optimization as dynamic URLs and content are more difficult for search engines to index than static pages. The document provides tips for SEOs to help address these personalization and SEO challenges, such as using static URLs when possible and submitting accurate sitemaps.
This document summarizes a study of CEO succession events among the largest 100 U.S. corporations between 2005-2015. The study analyzed executives who were passed over for the CEO role ("succession losers") and their subsequent careers. It found that 74% of passed over executives left their companies, with 30% eventually becoming CEOs elsewhere. However, companies led by succession losers saw average stock price declines of 13% over 3 years, compared to gains for companies whose CEO selections remained unchanged. The findings suggest that boards generally identify the most qualified CEO candidates, though differences between internal and external hires complicate comparisons.
How to Build a Dynamic Social Media Plan - Post Planner
Stop guessing and wasting your time on networks and strategies that don’t work!
Join Rebekah Radice and Katie Lance to learn how to optimize your social networks, the best kept secrets for hot content, top time management tools, and much more!
Watch the replay here: bit.ly/socialmedia-plan
Lightning Talk #9: How UX and Data Storytelling Can Shape Policy, by Mika Aldaba - ux singapore
How can we take UX and Data Storytelling out of the tech context and use them to change the way government behaves?
Showcasing the truth is the highest goal of data storytelling. Because the design of a chart can affect the interpretation of data in a major way, one must wield visual tools with care and deliberation. Using quantitative facts to evoke an emotional response is best achieved with the combination of UX and data storytelling.
Study: The Future of VR, AR and Self-Driving Cars - LinkedIn
We asked LinkedIn members worldwide about their levels of interest in the latest wave of technology: whether they’re using wearables, and whether they intend to buy self-driving cars and VR headsets as they become available. We asked them too about their attitudes to technology and to the growing role of Artificial Intelligence (AI) in the devices that they use. The answers were fascinating – and in many cases, surprising.
This SlideShare explores the full results of this study, including detailed market-by-market breakdowns of intention levels for each technology – and how attitudes change with age, location and seniority level. If you’re marketing a tech brand – or planning to use VR and wearables to reach a professional audience – then these are insights you won’t want to miss.
Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neur... - Alessandro Suglia
Presentation for "Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neural Networks" at the 7th Italian Information Retrieval Workshop.
See paper: http://ceur-ws.org/Vol-1653/paper_11.pdf
Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neur... - Claudio Greco
Slides for the presentation of the paper "Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neural Networks" at the 7th Italian Information Retrieval Workshop.
Smart Metrics for High Performance Material Designaimsnist
This document discusses smart metrics for high-performance material design using density functional theory (DFT), classical force fields (FF), and machine learning (ML). It provides an overview of the JARVIS database and tools containing over 35,000 materials and classical properties calculated using DFT, FF, and ML methods. Metrics discussed include formation energy, exfoliation energy, elastic constants, surface energy, vacancy energy, grain boundary energy, bandgaps, and other electronic and optical properties important for applications like solar cells. ML models are developed to predict these properties with mean absolute errors within chemical accuracy compared to DFT benchmarks.
The Pupil Has Become the Master: Teacher-Student Model-Based Word Embedding D...Jinho Choi
Recent advances in deep learning have facilitated the demand of neural models for real applications. In practice, these applications often need to be deployed with limited resources while keeping high accuracy. This paper touches the core of neural models in NLP, word embeddings, and presents a new embedding distillation framework that remarkably reduces the dimension of word embeddings without compromising accuracy. A novel distillation ensemble approach is also proposed that trains a high-efficient student model using multiple teacher models. In our approach, the teacher models play roles only during training such that the student model operates on its own without getting supports from the teacher models during decoding, which makes it eighty times faster and lighter than other typical ensemble methods. All models are evaluated on seven document classification datasets and show significant advantage over the teacher models for most cases. Our analysis depicts insightful transformation of word embeddings from distillation and suggests a future direction to ensemble approaches using neural models.
This paper surveys massively parallel techniques for indexing multidimensional datasets on GPUs. It discusses existing techniques like Parallel R-Trees that have irregular memory access patterns and are not well-suited for GPUs. It proposes two new algorithms: Massively Parallel Hilbert R-Trees (MPHR) which uses Hilbert curves to cluster data, and Massively Parallel Restart Scanning (MPRS) which transforms tree traversal into sequential data processing without recursion. MPHR and MPRS improve GPU utilization for range queries by avoiding irregular search paths and enabling contiguous memory access patterns.
Polymer Genome: An Informatics Platform for Polymer Dielectrics Discovery and...aimsnist
The Ramprasad Research Group at Georgia Tech uses machine learning and high-throughput computations to screen the chemical space of polymers for materials with applications in energy storage. They have identified several high-performing polymer dielectrics with higher dielectric constants and energy densities than the current standard. The group is working to further improve materials by incorporating metals and exploring morphological complexity with the goal of autonomous polymer design.
This document summarizes research on semi-supervised classification methods for protein crystallization image classification. It describes self-training and YATSI (Yet Another Two Staged Idea) semi-supervised classification approaches applied to a dataset of 2250 protein crystallization images. Experimental results show that naive Bayesian and SMO classifiers benefited from self-training and YATSI, while decision trees, multilayer perceptron, and random forests did not improve. Random forest provided the best overall classification performance. Future work will investigate active learning combined with semi-supervised learning.
This document discusses various machine learning techniques for transfer learning, including unsupervised domain adaptation (UDA), few-shot learning (FSL), zero-shot learning (ZSL), and hypothesis transfer learning (HTL). For UDA, the author proposes graph matching approaches to minimize domain discrepancy between source and target domains. For FSL, a two-stage approach is used to estimate novel class prototypes and variances. For ZSL, an approach is described that uses relational matching, adaptation, and calibration. For HTL, estimating novel class prototypes from source prototypes and sparse target data is discussed. Experimental results demonstrate the effectiveness of the proposed approaches.
Zero-shot Image Recognition Using Relational Matching, Adaptation and Calibra...Debasmit Das
This document proposes a three-step approach for zero-shot image recognition using relational matching, domain adaptation, and calibration. The approach uses relational matching to find structural correspondences between semantic embeddings and features, domain adaptation to adapt unseen semantic embeddings to the test data domain, and calibration to reduce bias towards seen classes. Experimental results on four datasets show improved zero-shot and generalized zero-shot classification performance compared to previous methods, with domain adaptation providing the most benefit. Analysis of hubness and convergence properties are also presented.
The document discusses using topic modeling techniques to cluster and classify records from multiple OAI repositories to enhance metadata and subject descriptions. Key steps included preprocessing records, building a vocabulary, running topic modeling to generate 500 topics, organizing topics into broad topical categories, and developing a browser to explore topics and records. Evaluation of the techniques found it worked well for English repositories but requires more testing on other languages and repository types. Potential products and services are proposed like integrating the topics into OAIster for subject search and browse.
[GAN by Hung-yi Lee]Part 1: General introduction of GANNAVER Engineering
Generative Adversarial Network and its Applications on Speech and Natural Language Processing, Part 1.
발표자: Hung-yi Lee(국립 타이완대 교수)
발표일: 18.7.
Generative adversarial network (GAN) is a new idea for training models, in which a generator and a discriminator compete against each other to improve the generation quality. Recently, GAN has shown amazing results in image generation, and a large amount and a wide variety of new ideas, techniques, and applications have been developed based on it. Although there are only few successful cases, GAN has great potential to be applied to text and speech generations to overcome limitations in the conventional methods.
In the first part of the talk, I will give an introduction to GAN and provide a thorough review of this technology. In the second part, I will focus on the applications of GAN to speech and natural language processing. I will demonstrate the applications of GAN to voice conversion, unsupervised abstractive summarization, and sentiment-controllable chat-bots, and I will also talk about research directions towards unsupervised speech recognition with GAN.
Automated Generation of High-accuracy Interatomic Potentials Using Quantum Data (aimsnist)
Sandia National Laboratories is developing SNAP (Spectral Neighbor Analysis Potential) potentials for molecular dynamics simulations. SNAP potentials are fitted to quantum mechanical data using bispectrum components that describe the local atomic environments. SNAP potentials have been shown to accurately reproduce properties of tantalum, including liquid structure and screw dislocation behavior not included in the training data. Work is ongoing to develop multi-element SNAP potentials, including for tungsten-beryllium alloys relevant to modeling plasma-surface interactions in nuclear fusion reactors.
Failing Fastest: What an Effective HTE and ML Workflow Enables for Functional... (aimsnist)
This document discusses how high-throughput experimentation (HTE) and machine learning (ML) can accelerate materials discovery for functional metallic glasses (MGs). It describes a round robin experiment between NIST and NREL to synthesize and characterize composition spread samples to test data sharing standards. General trends predicted by ML models often correlate within a given synthesis method but systematic differences can occur between methods. While ML is not a replacement for physics, the combination of HTE and ML can identify promising new materials faster than traditional experimentation alone. Autonomous research platforms may enable an even greater acceleration of the materials discovery process.
Gossip-based resource allocation for green computing in large clouds (Rerngvit Yanggratoke)
This document summarizes a research paper on a gossip-based resource allocation protocol called GRMP-Q for server consolidation in large cloud environments. The protocol aims to minimize active servers and allocate resources fairly while adapting dynamically to load changes. It uses a distributed middleware architecture and gossip algorithms to provide scalability without single points of failure. Simulation results show GRMP-Q reduces power usage by shutting down servers, satisfies demand fairly, and reconfigures with low cost compared to optimal solutions. Future work areas include analyzing convergence, supporting heterogeneity, and expanding the architecture.
The document summarizes a presentation about Stack Mobile, a mobile application for managing OpenStack clouds from Android phones. The presentation covers the motivation for developing a mobile-friendly app, the design goals of optimizing the app for mobile and limiting battery usage, and an overview of the features currently supported in the alpha version, including managing instances, volumes, snapshots, and security groups. It concludes by inviting volunteers to test the alpha version and provide feedback.
http://inarocket.com
Learn BEM fundamentals as fast as possible. What is BEM (Block, element, modifier), BEM syntax, how it works with a real example, etc.
The document discusses how personalization and dynamic content are becoming increasingly important on websites. It notes that 52% of marketers see content personalization as critical and 75% of consumers like it when brands personalize their content. However, personalization can create issues for search engine optimization as dynamic URLs and content are more difficult for search engines to index than static pages. The document provides tips for SEOs to help address these personalization and SEO challenges, such as using static URLs when possible and submitting accurate sitemaps.
This document summarizes a study of CEO succession events among the largest 100 U.S. corporations between 2005-2015. The study analyzed executives who were passed over for the CEO role ("succession losers") and their subsequent careers. It found that 74% of passed over executives left their companies, with 30% eventually becoming CEOs elsewhere. However, companies led by succession losers saw average stock price declines of 13% over 3 years, compared to gains for companies whose CEO selections remained unchanged. The findings suggest that boards generally identify the most qualified CEO candidates, though differences between internal and external hires complicate comparisons.
How to Build a Dynamic Social Media Plan (Post Planner)
Stop guessing and wasting your time on networks and strategies that don’t work!
Join Rebekah Radice and Katie Lance to learn how to optimize your social networks, the best kept secrets for hot content, top time management tools, and much more!
Watch the replay here: bit.ly/socialmedia-plan
Lightning Talk #9: How UX and Data Storytelling Can Shape Policy by Mika Aldaba (UX Singapore)
How can we take UX and Data Storytelling out of the tech context and use them to change the way government behaves?
Showcasing the truth is the highest goal of data storytelling. Because the design of a chart can affect the interpretation of data in a major way, one must wield visual tools with care and deliberation. Using quantitative facts to evoke an emotional response is best achieved with the combination of UX and data storytelling.
Study: The Future of VR, AR and Self-Driving Cars (LinkedIn)
We asked LinkedIn members worldwide about their levels of interest in the latest wave of technology: whether they’re using wearables, and whether they intend to buy self-driving cars and VR headsets as they become available. We asked them too about their attitudes to technology and to the growing role of Artificial Intelligence (AI) in the devices that they use. The answers were fascinating – and in many cases, surprising.
This SlideShare explores the full results of this study, including detailed market-by-market breakdowns of intention levels for each technology – and how attitudes change with age, location and seniority level. If you’re marketing a tech brand – or planning to use VR and wearables to reach a professional audience – then these are insights you won’t want to miss.
Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neur... (Alessandro Suglia)
Presentation for "Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neural Networks" at the 7th Italian Information Retrieval Workshop.
See paper: http://ceur-ws.org/Vol-1653/paper_11.pdf
Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neur... (Claudio Greco)
Slides for the presentation of the paper "Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neural Networks" at the 7th Italian Information Retrieval Workshop.
Workflow programming and provenance query model (Rayhan Ferdous)
This document defines key concepts for a workflow programming and provenance query model, including workflows, data, modules, dataflow, and properties. It proposes three fundamental queries - Decide, Sequence, and Map - that can answer provenance questions about workflows. These three queries are shown to be sufficient to address provenance queries posed in several other research works. Query results are proposed to be visualized through techniques like DAGs and tables.
Optimization of Continuous Queries in Federated Database and Stream Processin... (Zbigniew Jerzak)
The constantly increasing number of connected devices and sensors results in increasing volume and velocity of sensor-based streaming data. Traditional approaches for processing high-velocity sensor data rely on stream processing engines. However, the increasing complexity of continuous queries executed on top of high-velocity data has resulted in growing demand for federated systems composed of data stream processing engines and database engines. One of the major challenges for such systems is to devise the optimal query execution plan to maximize the throughput of continuous queries.
In this paper we present a general framework for federated database and stream processing systems, and introduce the design and implementation of a cost-based optimizer for optimizing relational continuous queries in such systems. Our optimizer uses characteristics of continuous queries and source data streams to devise an optimal placement for each operator of a continuous query. This fine level of optimization, combined with the estimation of the feasibility of query plans, allows our optimizer to devise query plans which result in 8 times higher throughput as compared to the baseline approach which uses only stream processing engines. Moreover, our experimental results showed that even for simple queries, a hybrid execution plan can result in 4 times and 1.6 times higher throughput than a pure stream processing engine plan and a pure database engine plan, respectively.
Scalable Machine Learning: The Role of Stratified Data Sharding (inside-BigData.com)
In this deck from the 2019 Stanford HPC Conference, Srinivasan Parthasarathy from Ohio State University presents: Scalable Machine Learning: The Role of Stratified Data Sharding.
"With the increasing popularity of structured data stores, social networks and Web 2.0 and 3.0 applications, complex data formats, such as trees and graphs, are becoming ubiquitous. Managing and learning from such large and complex data stores, on modern computational eco-systems, to realize actionable information efficiently, is daunting. In this talk I will begin by discussing some of these challenges. Subsequently I will discuss a critical element at the heart of this challenge: the sharding, placement, storage and access of such tera- and peta-scale data. In this work we develop a novel distributed framework to ease the burden on the programmer and propose an agile and intelligent placement service layer as a flexible yet unified means to address this challenge. Central to our framework is the notion of stratification, which seeks to initially group structurally (or semantically) similar entities into strata. Subsequently strata are partitioned within this eco-system according to the needs of the application to maximize locality, balance load, minimize data skew or even take into account energy consumption. Results on several real-world applications validate the efficacy and efficiency of our approach. (Note: Joint work with Y. Wang (Airbnb) and A. Chakrabarti (MSR).)"
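The stratify-then-partition idea described in the abstract can be sketched in a few lines (an illustrative Python sketch, not the authors' implementation; the signature function and the round-robin balancing rule are assumptions):

```python
from collections import defaultdict

def stratify(entities, signature):
    """Group structurally similar entities into strata by a signature function."""
    strata = defaultdict(list)
    for e in entities:
        strata[signature(e)].append(e)
    return strata

def place(strata, num_shards):
    """Partition each stratum across shards: spreading a stratum evenly
    balances load while keeping similar entities represented on every shard."""
    shards = [[] for _ in range(num_shards)]
    for _, members in sorted(strata.items()):
        for i, e in enumerate(members):
            shards[i % num_shards].append(e)  # round-robin within a stratum
    return shards

# Toy graph nodes described by (id, degree); degree buckets act as strata.
nodes = [(0, 1), (1, 5), (2, 1), (3, 5), (4, 2), (5, 2)]
strata = stratify(nodes, signature=lambda n: n[1])
shards = place(strata, num_shards=2)
```

Grouping by a structural signature (here, node degree) forms the strata; the round-robin step then partitions each stratum so that no shard is dominated by one kind of entity.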
Srinivasan Parthasarathy, Professor of Computer Science & Engineering, The Ohio State University
Srinivasan Parthasarathy is a Professor of Computer Science and Engineering and the director of the data mining research laboratory at Ohio State. His research interests span databases, data mining and high performance computing. He is among a handful of researchers nationwide to have won both the Department of Energy and National Science Foundation Career awards. He and his students have won multiple best paper awards or "best of" nominations from leading forums in the field, including SIAM Data Mining, ACM SIGKDD, VLDB, ISMB, WWW, ICDM, and ACM Bioinformatics. He chairs the SIAM data mining conference steering committee and serves on the action boards of ACM TKDD and ACM DMKD, leading journals in the field. Since 2012 he has also helped lead the creation of OSU's first-of-its-kind nationwide (USA) undergraduate major in data analytics and serves as one of its founding directors.
Watch the video: https://youtu.be/hOJI8e0p-UI
Learn more: http://web.cse.ohio-state.edu/~parthasarathy.2/
and
http://hpcadvisorycouncil.com/events/2019/stanford-workshop/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
The document proposes a probabilistic approach to ranking results of database queries when multiple tuples satisfy the query criteria. It calculates global and conditional scores for tuples based on attribute correlations learned from the query workload and data, and combines these into a ranking function. User studies on real estate and movie datasets show the conditional ranking approach outperforms a baseline that only considers global scores. The approach can be efficiently implemented using indexing and optimization techniques.
This thesis focuses on performance management techniques for cloud services. It presents work in three key areas: 1) Developing a scalable and generic resource allocation protocol for large cloud environments. 2) Building performance models to predict response times and capacity for a distributed key-value store. 3) Enabling real-time prediction of service metrics using analytics on low-level system statistics. The thesis contributes solutions for these challenging problems and identifies open questions around decentralized resource allocation, online performance management, and analytics-based forecasting at large scales.
Talk given by prof. T.K. Prasad at the workshop on Semantics in Geospatial Architectures: Applications and Implementation. The workshop was held from October 28-29, 2013 at Pyle Center (702 Langdon Street, Madison, WI), University of Wisconsin-Madison.
This document outlines the agenda for a two-day workshop on learning R and analytics. Day 1 will introduce R and cover data input, quality, and exploration. Day 2 will focus on data manipulation, visualization, regression models, and advanced topics. Sessions include lectures and demos in R. The goal is to help attendees learn R in 12 hours and gain an introduction to analytics skills for career opportunities.
Provenance in scientific workflows in e-science (bdemchak)
This document summarizes a survey of data provenance in e-science. It defines data provenance as information that helps determine the derivation history of a data product, starting from its original sources. The document outlines why data provenance is important, its key dimensions, examples of provenance techniques, existing provenance tools, and research frontiers. It also summarizes several systems that track provenance in scientific workflows and databases.
While much of the recent literature in spatial statistics has evolved around addressing the big data issue, practical implementations of these methods on high performance computing systems for truly large data are still rare. We discuss our explorations in this area at the National Center for Atmospheric Research for a range of applications, which can benefit from large scale computing infrastructure. These applications include extreme value analysis, approximate spatial methods, spatial localization methods and statistically-based data compression and are implemented in different programming languages. We will focus on timing results and practical considerations, such as speed vs. memory trade-offs, limits of scaling and ease of use.
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI (Jack Clark)
This document discusses deep reinforcement learning through policy optimization. It begins with an introduction to reinforcement learning and how deep neural networks can be used to approximate policies, value functions, and models. It then discusses how deep reinforcement learning can be applied to problems in robotics, business operations, and other machine learning domains. The document reviews how reinforcement learning relates to other machine learning problems like supervised learning and contextual bandits. It provides an overview of policy gradient methods and the cross-entropy method for policy optimization before discussing Markov decision processes, parameterized policies, and specific policy gradient algorithms like the vanilla policy gradient algorithm and trust region policy optimization.
Training in Analytics, R and Social Media Analytics (Ajay Ohri)
This document provides an overview of basics of analysis, analytics, and R. It discusses why analysis is important, key concepts like central tendency, variance, and frequency analysis. It also covers exploratory data analysis, common analytics software, using R for tasks like importing data, data manipulation, visualization and more. Examples and demos are provided for many common R functions and techniques.
Knowledge Discovery in Environmental Management (Dr. Aparna Varde)
This is a research presentation by Aparna Varde during a summer research visit at the Max Planck Institute for Informatics, Saarbruecken, Germany, in August 2015, within the research group of Dr. Gerhard Weikum. The presentation focuses on various aspects of data mining and knowledge discovery pertinent to environmental science and management. It encompasses three main topics: (1) decision support for the greening of data centers; (2) predictive analysis in urban planning and simulation; and (3) common sense knowledge for domain-specific KBs. It also includes brief highlights on web and text mining in article / collocation error detection as well as in terminology evolution. The presentation is based on her relevant work as of August 2015, serving as an invited talk during this research visit.
This document provides a summary of practical machine learning on big data platforms. It begins with an introduction and agenda, then provides a quick brief on the machine learning process. It discusses the current landscape of open source tools, including evolutionary drivers and examples. It covers case studies from Twitter and their experience. Finally, it discusses architectural forces like Moore's Law and Kryder's Law that are shaping the field. The document aims to present a unified approach for machine learning on big data platforms and discuss how industry leaders are implementing these techniques.
Latency-aware Elastic Scaling for Distributed Data Stream Processing Systems (Zbigniew Jerzak)
Elastic scaling allows a data stream processing system to react to a dynamically changing query or event workload by automatically scaling in or out. Thereby, both unpredictable load peaks as well as underload situations can be handled. However, each scaling decision comes with a latency penalty due to the required operator movements. Therefore, in practice an elastic system might be able to improve the system utilization, however it is not able to provide latency guarantees defined by a service level agreement (SLA). In this paper we introduce an elastic scaling system, which optimizes the utilization under certain latency constraints defined by a SLA. Specifically, we present a model, which estimates the latency spike created by a set of operator movements. We use this model to build a latency-aware elastic operator placement algorithm, which minimizes the number of latency violations. We show that our solution is able to reduce the 90th percentile of the end to end latency by up to 30% and reduce the number of latency violations by 50%. The achieved system utilization for our approach is comparable to a scaling strategy, which does not use latency as optimization target.
The document discusses the need for a new open source database management system called SciDB to address the challenges of storing and analyzing extremely large scientific datasets. SciDB is being designed to handle petabyte-scale multidimensional array data with native support for features important to science like provenance tracking, uncertainty handling, and integration with statistical tools. An international partnership involving scientists, database experts, and a nonprofit company is developing SciDB with initial funding and use cases coming from astronomy, industry, genomics and other domains.
This presentation was given by Jon Wheeler and Karl Benedict of the University of New Mexico during the joint NISO-NFAIS Virtual Conference held on December 7, 2016
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf (Chart Kalyan)
A Mix Chart displays historical data of numbers in a graphical or tabular form. The Kalyan Rajdhani Mix Chart specifically shows the results of a sequence of numbers over different periods.
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf (Malak Abu Hammad)
Discover how MongoDB Atlas and vector search technology can revolutionize your application's search capabilities. This comprehensive presentation covers:
* What is Vector Search?
* Importance and benefits of vector search
* Practical use cases across various industries
* Step-by-step implementation guide
* Live demos with code snippets
* Enhancing LLM capabilities with vector search
* Best practices and optimization strategies
Perfect for developers, AI enthusiasts, and tech leaders. Learn how to leverage MongoDB Atlas to deliver highly relevant, context-aware search results, transforming your data retrieval process. Stay ahead in tech innovation and maximize the potential of your applications.
#MongoDB #VectorSearch #AI #SemanticSearch #TechInnovation #DataScience #LLM #MachineLearning #SearchTechnology
HCL Notes and Domino License Cost Reduction in the World of DLAU (panagenda)
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar, with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able to lower your cost through an optimized configuration and keep it low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc.
- Practical examples and best practices to implement right away
Project Management Semester Long Project - Acuity (jpupo2018)
Acuity is an innovative learning app designed to transform the way you engage with knowledge. Powered by AI technology, Acuity takes complex topics and distills them into concise, interactive summaries that are easy to read & understand. Whether you're exploring the depths of quantum mechanics or seeking insight into historical events, Acuity provides the key information you need without the burden of lengthy texts.
What do a Lego brick and the XZ backdoor have in common? (Speck&Tech)
ABSTRACT: At first glance, a Lego brick and the XZ backdoor might seem to have in common only the fact that both are building blocks, or dependencies, of creative and software projects. In reality, a Lego brick and the case of the XZ backdoor have much more in common than that.
Join the presentation to dive into a story of interoperability, standards, and open formats, and then discuss the important role that contributors play in a sustainable open source community.
BIO: Advocate of free software and of standard, open formats. She has been an active member of the Fedora and openSUSE projects and co-founded the LibreItalia Association, where she was involved in several LibreOffice-related events, migrations, and training courses. She previously worked on LibreOffice migrations and training for several public administrations and private companies. Since January 2020 she has worked at SUSE as a Software Release Engineer for Uyuni and SUSE Manager, and when she is not following her passion for computers and for Geeko, she cultivates her curiosity about astronomy (hence her nickname deneb_alpha).
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx (SitimaJohn)
Ocean Lotus cyber threat actors represent a sophisticated, persistent, and politically motivated group that poses a significant risk to organizations and individuals in the Southeast Asian region. Their continuous evolution and adaptability underscore the need for robust cybersecurity measures and international cooperation to identify and mitigate the threats posed by such advanced persistent threat groups.
Ivanti’s Patch Tuesday breakdown goes beyond patching your applications and brings you the intelligence and guidance needed to prioritize where to focus your attention first. Catch early analysis on our Ivanti blog, then join industry expert Chris Goettl for the Patch Tuesday Webinar Event. There we’ll do a deep dive into each of the bulletins and give guidance on the risks associated with the newly-identified vulnerabilities.
Monitoring and Managing Anomaly Detection on OpenShift.pdf (Tosin Akinosho)
Monitoring and Managing Anomaly Detection on OpenShift
Overview
Dive into the world of anomaly detection on edge devices with our comprehensive hands-on tutorial. This SlideShare presentation will guide you through the entire process, from data collection and model training to edge deployment and real-time monitoring. Perfect for those looking to implement robust anomaly detection systems on resource-constrained IoT/edge devices.
Key Topics Covered
1. Introduction to Anomaly Detection
- Understand the fundamentals of anomaly detection and its importance in identifying unusual behavior or failures in systems.
2. Understanding Edge (IoT)
- Learn about edge computing and IoT, and how they enable real-time data processing and decision-making at the source.
3. What is ArgoCD?
- Discover ArgoCD, a declarative, GitOps continuous delivery tool for Kubernetes, and its role in deploying applications on edge devices.
4. Deployment Using ArgoCD for Edge Devices
- Step-by-step guide on deploying anomaly detection models on edge devices using ArgoCD.
5. Introduction to Apache Kafka and S3
- Explore Apache Kafka for real-time data streaming and Amazon S3 for scalable storage solutions.
6. Viewing Kafka Messages in the Data Lake
- Learn how to view and analyze Kafka messages stored in a data lake for better insights.
7. What is Prometheus?
- Get to know Prometheus, an open-source monitoring and alerting toolkit, and its application in monitoring edge devices.
8. Monitoring Application Metrics with Prometheus
- Detailed instructions on setting up Prometheus to monitor the performance and health of your anomaly detection system.
9. What is Camel K?
- Introduction to Camel K, a lightweight integration framework built on Apache Camel, designed for Kubernetes.
10. Configuring Camel K Integrations for Data Pipelines
- Learn how to configure Camel K for seamless data pipeline integrations in your anomaly detection workflow.
11. What is a Jupyter Notebook?
- Overview of Jupyter Notebooks, an open-source web application for creating and sharing documents with live code, equations, visualizations, and narrative text.
12. Jupyter Notebooks with Code Examples
- Hands-on examples and code snippets in Jupyter Notebooks to help you implement and test anomaly detection models.
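As a minimal illustration of the anomaly-detection step itself, independent of the OpenShift/Kafka stack covered above, here is a rolling z-score detector (an illustrative Python sketch; the window size, warm-up length, and threshold are arbitrary choices, not from the tutorial):

```python
from collections import deque
from math import sqrt

class ZScoreDetector:
    """Flag a reading as anomalous when it deviates from the rolling mean
    by more than `threshold` standard deviations."""
    def __init__(self, window=50, threshold=3.0):
        self.values = deque(maxlen=window)
        self.threshold = threshold

    def update(self, x):
        anomalous = False
        if len(self.values) >= 10:  # wait for a minimal baseline first
            mean = sum(self.values) / len(self.values)
            var = sum((v - mean) ** 2 for v in self.values) / len(self.values)
            std = sqrt(var)
            if std > 0 and abs(x - mean) / std > self.threshold:
                anomalous = True
        self.values.append(x)
        return anomalous

det = ZScoreDetector()
readings = [10.0, 10.1, 9.9, 10.2, 9.8, 10.0, 10.1, 9.9, 10.0, 10.2, 10.1, 95.0]
flags = [det.update(v) for v in readings]  # only the final spike is flagged
```

In a deployment like the one described, a detector of this shape would consume readings from a stream (e.g., Kafka) and expose its flag count as a metric for the monitoring layer.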
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices want to take full advantage of the features available on those devices, but many of those features provide convenience and capability while sacrificing security. This best practices guide outlines steps users can take to better protect their personal devices and information.
Webinar: Designing a schema for a Data Warehouse (Federico Razzoli)
Are you new to data warehouses (DWH)? Do you need to check whether your data warehouse follows the best practices for a good design? In both cases, this webinar is for you.
A data warehouse is a central relational database that contains all measurements about a business or an organisation. This data comes from a variety of heterogeneous data sources, which includes databases of any type that back the applications used by the company, data files exported by some applications, or APIs provided by internal or external services.
But designing a data warehouse correctly is a hard task, which requires gathering information about the business processes that need to be analysed in the first place. These processes must be translated into so-called star schemas, that is, denormalised databases where each table represents either a dimension or facts.
We will discuss these topics:
- How to gather information about a business;
- Understanding dictionaries and how to identify business entities;
- Dimensions and facts;
- Setting a table granularity;
- Types of facts;
- Types of dimensions;
- Snowflakes and how to avoid them;
- Expanding existing dimensions and facts.
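As a toy illustration of the star-schema idea covered in the webinar, the sketch below builds one fact table and two dimension tables in SQLite (illustrative Python; the table and column names are invented, not the webinar's material):

```python
import sqlite3

# A minimal star schema: one fact table referencing two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
-- Fact table at "one row per sale" granularity; measures are additive.
CREATE TABLE fact_sales (
    date_id INTEGER REFERENCES dim_date(date_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    quantity INTEGER, amount REAL);
INSERT INTO dim_date VALUES (1, '2024-06-01', '2024-06', 2024);
INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware');
INSERT INTO fact_sales VALUES (1, 1, 3, 29.97);
""")
# Analytical queries join the fact table to dimensions and aggregate measures.
total = conn.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_sales f JOIN dim_product p USING (product_id)
    GROUP BY p.category
""").fetchone()
```

Note the denormalisation: each dimension table is flat, and the fact table stays narrow so that aggregations over millions of rows remain cheap.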
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...) (Jeffrey Haguewood)
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on integration of Salesforce with Bonterra Impact Management.
Interested in deploying an integration with Salesforce for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Licentiate Defense Slide
1. Contributions to Performance Modeling and
Management of Data Centers
Rerngvit Yanggratoke
Advisor: Prof. Rolf Stadler
Opponent: Prof. Filip De Turck
Licentiate Seminar
October 25, 2013
2. Introduction
Networked services – search and social-network services
Performance of such services is important
[Diagram: clients reach a backend component inside a data center via the Internet]
Performance management of backend components in a data center
1. Resource allocation for a large-scale cloud environment
2. Performance modeling of a distributed key-value store
3. Outline
1. Resource allocation for a large-scale cloud environment
• Motivation
• Problem
• Approach
• Evaluation results
• Related research / publications
2. Performance modeling of a distributed key-value store
• Motivation
• Problem
• Approach
• Evaluation results
• Related research / publications
– Contributions
– Open questions
4. Motivation – Large-scale Cloud
“U.S. Data centers Growing in Size But Declining in Numbers” – IDC Research, 2012.
• Apple's data center in North Carolina, USA (500’000 feet²); expected to host 150’000+ machines
• Amazon data center in Virginia, USA (about 300’000 machines)
• Microsoft's mega data center in Dublin, Ireland (300’000 feet²)
• Facebook planned a massive data center in Iowa, USA (1.4M feet²)
• eBay's data center in Utah, USA (at least 240’000 feet²)
5. Resource Allocation
Select the machine for executing an application that satisfies:
• Resource demands of all applications in the cloud
• Management objectives of the cloud provider
Resource allocation system computes the solution to the problem
6. Problem & Approach
Resource allocation system that supports
– Joint allocation of compute and network resources
– Generic and extensible for different management objectives
– Scalable operation (> 100’000 machines)
– Dynamic adaptation to changes in load patterns
Approach
– We assume a full-bisection-bandwidth network
– Formulate the problem as an optimization problem
– Distributed protocols: gossip-based algorithms
7. An Objective Function f(t)
An objective function f(t) expresses a management objective:
“a smaller f(t) moves the system closer to achieving the objective”
Balanced load objective:
f(t) = Σ_{q ∈ {ω,γ,λ}} k^q Σ_{n ∈ N} (u_n^q(t) / c_n^q)²
Energy efficiency objective:
f(t) = Σ_{n ∈ N} p_n(t)
p_n(t) = energy of machine n at time t
Fairness objective:
f(t) = − Σ_{q ∈ {ω,γ}} k^q Σ_{m ∈ M(t)} v_m^q(t) / d_m^q
Service differentiation objective:
f(t) = − Σ_{q ∈ {ω,γ}} k^q Σ_{m ∈ B(t)} v_m^q(t) / d_m^q
q ∈ {ω, γ, λ} = {CPU, memory, network}
N = set of machines, M(t) = set of applications, B(t) ⊆ M(t)
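To make the notation concrete, the balanced-load objective f(t) = Σ_q k^q Σ_n (u_n^q/c_n^q)² can be evaluated in a few lines. This is an illustrative sketch, not code from the thesis; the dictionary layout and variable names are my assumptions:

```python
def balanced_load_objective(usage, capacity, weights):
    """f(t) = sum over resource types q of k^q * sum over machines n
    of (u_n^q / c_n^q)^2.

    usage[q][n] / capacity[q][n]: utilization and capacity of resource
    q (CPU, memory, network) on machine n; weights[q] is k^q.
    """
    total = 0.0
    for q, k in weights.items():
        total += k * sum(
            (usage[q][n] / capacity[q][n]) ** 2 for n in usage[q]
        )
    return total


# Two machines, two resource types; smaller f means better balance.
usage = {"cpu": {"n1": 0.5, "n2": 0.9}, "net": {"n1": 0.2, "n2": 0.4}}
capacity = {"cpu": {"n1": 1.0, "n2": 1.0}, "net": {"n1": 1.0, "n2": 1.0}}
f = balanced_load_objective(usage, capacity, {"cpu": 1.0, "net": 1.0})
```

A perfectly balanced allocation of the same total load would give a strictly smaller f, which is what drives the protocol's move decisions.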
8. Optimization Problem
Minimize f(t)
Subject to:
(1) q_n^net(t) + Σ_{m ∈ A_n(t)} q_m^r(t) ≤ q_n^c   for q ∈ {ω, γ, λ} and n ∈ N
(2) q_m^r(t) ≥ q_m^d   for q ∈ {ω, γ, λ} and m ∈ M
• (1) are capacity constraints (A_n(t) = applications placed on machine n)
• (2) are demand constraints
• Choosing a specific f(t) states the problem for a specific management objective
• This problem is NP-hard, so we apply a heuristic solution
• How to distribute the computation?
9. Gossip Protocol
• A round-based protocol that relies on pairwise interactions between
nodes to accomplish a global objective
• Each node executes the same code and the size of the exchanged
message is limited
“During a gossip interaction, move an application that minimizes f(t)”
(Illustration: balanced load objective)
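The gossip interaction rule above (“move an application that minimizes f(t)”) can be sketched as follows. This is an illustrative single-resource simplification under the balanced-load objective, not the actual GRMP pseudocode:

```python
import random

def gossip_round(loads, apps):
    """One illustrative gossip round: each node pairs with a random peer
    and moves the application that most reduces sum_n load_n^2, if any
    move reduces it.

    loads: node -> total demand on that node
    apps:  node -> list of per-application demands hosted on that node
    """
    nodes = list(loads)
    for n in nodes:
        peer = random.choice([m for m in nodes if m != n])
        # Moving demand d from n to peer changes sum_n load_n^2 by
        # 2*d*(d + loads[peer] - loads[n]); pick the most negative delta.
        delta = lambda d: 2 * d * (d + loads[peer] - loads[n])
        best = min(apps[n], key=delta, default=None)
        if best is not None and delta(best) < 0:
            apps[n].remove(best)
            apps[peer].append(best)
            loads[n] -= best
            loads[peer] += best


loads = {"a": 10.0, "b": 0.0}
apps = {"a": [5.0, 5.0], "b": []}
gossip_round(loads, apps)  # one round moves one application from a to b
```

Repeated rounds drive the loads toward balance; in the real protocol each node contacts only a bounded set of peers and exchanges small messages, which is what makes the scheme scalable.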
10. Generic Resource Management Protocol (GRMP)
A generic and scalable gossip protocol for resource allocation; the objective function f(t) is plugged in at the protocol's move-selection step.
11. Evaluation Results
[Figures: unbalance under the balanced load objective and relative power consumption under the energy efficiency objective, plotted against the CPU load factor (0.15–0.95) and the network load factor (0.3–1.5), comparing the protocol with the ideal solution]
Scalability:
• System size up to 100'000 machines
• Other parameters do not change
12. Related Research
• Centralized resource allocation system
– (D. Gmach et al., 2008), (A. Verma et al., 2009),
(C. Guo et al., 2010), (H. Hitesh et al., 2011),
(V. Shrivastava et al., 2011), (D. Breitgand and
A. Epstein, 2012), (J.W. Jiang et al., 2012), (O. Biran et al., 2012), …
• Distributed resource allocation system
– (G. Jung et al., 2010), (Yazir et al., 2010),
(D. Jayasinghe et al., 2011), (E. Feller et al., 2012), this thesis
• Initial placement
– (A. Verma et al., 2009), (M. Wang et al., 2011),
(D. Jayasinghe et al., 2011), (X. Zhang et al., 2012)
• Dynamic placement
– (F. Wuhib et al., 2012), (D. Gmach et al., 2008), (Yazir et
al., 2010), (J. Xu and Jose Fortes, 2011) , this thesis
13. Publications
• R. Yanggratoke, F. Wuhib and R. Stadler, “Gossip-based
resource allocation for green computing in large clouds,” In
Proc. 7th International Conference on Network and Service
Management (CNSM), Paris, France, October 24-28, 2011.
• F. Wuhib, R. Yanggratoke, and R. Stadler, “Allocating Compute
and Network Resources under Management Objectives in
Large-Scale Clouds,” Journal of Network and Systems
Management (JNSM), pages 126, 2013.
14. Outline
1. Resource allocation for a large-scale cloud environment
• Motivation
• Problem
• Approach
• Evaluation results
• Related research / publications
2. Performance modeling of a distributed key-value store
• Motivation
• Problem
• Approach
• Evaluation results
• Related research / publications
– Contributions
– Open questions
16. Problem & Approach
• Development of analytical models for
– Predicting response time distribution
– Estimating capacity under different object allocation policies
• Random policy
• Popularity-aware policy
• The models should be
– Accurate for Spotify’s operational range
– Obtainable
– Simple
– Efficient
• Approach
– Simplified architecture
– Probabilistic and stochastic modeling techniques
17. Simplified Architecture for a Particular Site
• Model only Production Storage
• AP selects a storage server uniformly at random
• Ignore network and access-point delays
• Consider steady-state conditions and Poisson arrivals
18. Model for Response Time Distribution
Model for a single storage server:
the probability that a request to a server is served within latency t is
Pr(T ≤ t) = q + (1 − q)(1 − e^{−μ_d (1 − (1 − q) λ / (μ_d n_d)) t})
Model for a cluster of storage servers:
the probability that a request to the cluster is served within latency t is
Pr(T_c ≤ t) = (1/|S|) Σ_{s ∈ S} f(t, n_{d,s}, μ_{d,s}, λ/|S|, q_s)
where f is the single-server distribution above, evaluated with the parameters of server s.
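Under the stated assumptions (Poisson arrivals, steady state, uniform routing), the model can be evaluated numerically. A minimal sketch; the parameter names are mine, with q read as the probability that a request is served without queueing for disk, μ_d and n_d the disk service rate and disk count, and λ the arrival rate:

```python
import math

def server_cdf(t, q, mu_d, n_d, lam):
    """Pr(T <= t) for one storage server: with probability q the request
    is served immediately; otherwise it sees an exponential tail whose
    rate shrinks with the disk-subsystem utilization."""
    rho = (1 - q) * lam / (mu_d * n_d)  # utilization of the n_d disks
    return q + (1 - q) * (1 - math.exp(-mu_d * (1 - rho) * t))

def cluster_cdf(t, servers, lam):
    """Pr(T_c <= t): average over servers, each receiving lam/|S|.
    servers is a list of (q_s, mu_ds, n_ds) tuples."""
    S = len(servers)
    return sum(server_cdf(t, q, mu, nd, lam / S)
               for (q, mu, nd) in servers) / S
```

At t = 0 the single-server CDF equals q, and it rises toward 1 as t grows, matching the shape of the formula on the slide.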
19. Model for Estimating Capacity under Different Object Allocation Policies
Capacity
Ω : max request rate to a server before the QoS is no longer satisfied
Ω_c : max request rate to the cluster such that the request rate to each server is at most Ω
Capacity of a cluster under the popularity-aware policy:
Ω_c(popularity) = Ω · |S|
Capacity of a cluster under the random policy:
Ω_c(random) = Ω · (O + (α/(α−1)) O^{1/α}) / (O_m + (α/(α−1)) O_m^{1/α})
where
O_m ≈ O/|S| + √(2 O ln|S| / |S|) · (1 − log log|S| / (2 log|S|)),  assuming O ≫ |S| log³|S|
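The capacity estimates can be combined into a small calculator. A sketch under my reading of the slide: O is the total number of objects, O_m the maximum number of objects on one server under random placement (a balls-into-bins estimate), α the exponent of the object-popularity distribution, and Ω the per-server capacity; the function names are mine:

```python
import math

def max_objects(O, S):
    """Balls-into-bins estimate of the maximum number of objects on one
    of S servers under random placement (valid when O >> S log^3 S)."""
    return O / S + math.sqrt(2 * O * math.log(S) / S) * (
        1 - math.log(math.log(S)) / (2 * math.log(S)))

def capacity_random(omega, O, S, alpha):
    """Cluster capacity under the random object-allocation policy."""
    g = lambda x: x + alpha * x ** (1.0 / alpha) / (alpha - 1)
    return omega * g(O) / g(max_objects(O, S))

def capacity_popularity(omega, S):
    """Cluster capacity under the popularity-aware policy."""
    return omega * S
```

The random policy always yields less capacity than the popularity-aware one, because the most loaded server saturates first; the gap narrows as objects spread more evenly over the servers.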
22. Related Research
• Distributed key-value store
– Amazon Dynamo (G. Decandia et al., 2007), Cassandra
(A. Lakshman, 2010), Riak, HyperDex, Scalaris
• Performance modeling of a storage device
– (J.D. Garcia et al., 2008), (A.S. Lebrecht, 2009),
(F. Cady et al., 2011), (S. Boboila and P. Desnoyers, 2011),
(P. Desnoyers, 2012), (S. Li and H.H. Huang, 2010)
• White-box vs. black-box performance models for a storage system
– White-box: (A.S. Lebrecht, 2009), (F. Cady et al., 2011), (S. Boboila and P. Desnoyers, 2011), (P. Desnoyers, 2012), this thesis
– Black-box: (J.D. Garcia et al., 2008), (S. Li and H.H. Huang, 2010), (A. Gulati et al., 2011)
23. Publications
• R. Yanggratoke, G. Kreitz, M. Goldmann and R.
Stadler, “Predicting response times for the Spotify backend,”
In Proc. 8th International conference on Network and service
management (CNSM), Las Vegas, NV, USA, October 22–26, 2012. Best Paper award.
• R. Yanggratoke, G. Kreitz, M. Goldmann, R. Stadler and V.
Fodor, “On the performance of the Spotify backend,” accepted
to Journal of Network and Systems Management
(JNSM), 2013.
24. Contributions
• We designed, developed, and evaluated a generic protocol for
resource allocation that
– supports joint allocation of compute and network resources
– is generic and extensible for different management objectives
– enables scalable operation (> 100’000 machines)
– supports dynamic adaptation to changes in load patterns
• We designed, developed, and evaluated performance models
for predicting response time distribution and capacity of a
distributed key-value store
– Simple yet accurate for Spotify’s operational range
– Obtainable
– Efficient
25. Open Questions for Future Research
• Resource allocation for a large-scale cloud environment
– Centralized vs. decentralized resource allocation
– Multiple data centers & telecom clouds
• Performance modeling of a distributed key-value store
– Black-box models for performance predictions
– Online performance management using analytical models
Editor's Notes
Good morning, everybody, and welcome to my licentiate seminar on the topic “Contributions to Performance Modeling and Management of Data Centers”. So, let us begin with an introduction to the topic.
Let me start the introduction by explaining networked services, which are the central theme of this thesis. 1) I believe many of you have used or heard of networked services before, for example, search-engine services like Google or social-network services like Facebook. These services have become part of our daily life because we interact with them very often, e.g., when we want to search for information or follow what our friends are doing. 2) We can all agree that the performance of such services is important. An example of such performance is the response time of a service: when you search for information, you want it to be interactive, you want it to be fast! You do not want to wait half an hour for your query to come back. 3) To understand the performance of these services, let us look at how they work. They typically comprise two main components. 3.1) The frontend components interact directly with the users, like the interfaces on your laptop, web technologies, and mobile applications on Android or iPhone. 3.2) The backend components, which are less visible to the user but equally important, typically reside in a data center; they perform operations for sharing state between applications, for example, web servers that serve data from databases. 4) Performance management of these backend components is the main focus of this thesis. In particular, the thesis focuses on two key problems in the management of backend components residing in a data center. I will go through each of them in the subsequent slides.
Here is an outline of today's talk. For each problem, I will go through the motivation, the problem, our approach, and the results of our work. Then I will summarize related research for both problems, conclude the presentation, and outline some open questions.
This work is about large-scale clouds, so let us begin with the motivation: why should we consider a large-scale cloud? The key reason is that data centers are getting bigger. IDC Research stated last year that “U.S. data centers are growing in size but declining in numbers.” This makes sense, because larger data centers benefit from economies of scale. What we see here is the iCloud data center from Apple: with a massive size of 500,000 square feet, it is expected to host at least 150,000 machines. And it is not just Apple; many other Internet-based companies, such as Amazon, Microsoft, Facebook, and eBay, all have plans for these kinds of massive-scale data centers. The key message here is that large-scale data centers are the trend. Let us move on to see what kind of problem this thesis deals with.
In particular, the first problem in this thesis relates to resource allocation. What do I mean by resource allocation? I focus on the application placement problem, that is, answering the question: how do we select the machine for executing an application such that we satisfy the resource demands of all applications in the cloud (applications need resources to operate, for example CPU cores or memory for temporary computation) as well as the management objectives of the cloud provider?
In particular, the first problem we consider is designing and evaluating a resource allocation system that…
So, this is our optimization problem. Essentially, it minimizes f(t) under the following constraints.
A round-based protocol that relies on pairwise interactions between nodes to accomplish a global objective. Each node executes the same code, and the size of the exchanged message is limited.
Here is an outline of today's talk. For each problem, I will go through the motivation, the problem, our approach, and the results of our work. Then I will summarize related research for both problems, conclude the presentation, and outline some open questions.