PNUTS is a massively parallel and geographically distributed database system for Yahoo!’s web applications. It provides data storage organized as hashed or ordered tables, low latency for large numbers of concurrent requests including updates and queries, and novel per-record consistency guarantees. It is a hosted, centrally managed, and geographically distributed service, and utilizes automated load-balancing and failover to reduce operational complexity. The first version of the system is currently serving in production. This presentation describes the motivation for PNUTS and the design and implementation of its table storage and replication layers, and then presents experimental results.
Reactive Programming, Traits and Principles. What is Reactive, where does it come from, and what is it good for? How does it differ from event-driven programming? Is it only functional?
This talk was given during DockerCon EU 2018.
It ain't just a whim - to be able to continue innovating, we’ve moved our good old static production to containers. We needed to be elastic, fast, reliable and production ready at any time - that's why we chose Docker. But like in most enterprises, lots of our apps run on the JVM and most JVMs’ ergonomics assume they “own” the server they are running on. So how do you containerize JVM apps? Should you really increase JVM heap if you have spare memory? What about OS caches? What are the differences between JDK 8, 9 and 10 when it comes to container awareness? Outages because of out of memory errors? Slowness because of long garbage collection and poor environment visibility? Long story short, in this session, we’ll look at the gotchas of running JVM apps in containers and teach you how to avoid costly mistakes.
Top 3 things attendees will learn:
1. Key differences between various JVM versions relevant for containerized Java apps.
2. Best practices for running JVM in containers.
3. Avoiding common pitfalls when running containerized JVM applications.
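A quick way to see the ergonomics problem the talk describes is to ask the JVM, from inside the container, what it believes its limits are. This is a minimal illustration (not code from the talk): on JDK 8 before 8u191 these numbers reflect the host, while JDK 10+ (and 8u191+) honor cgroup limits and let you size the heap with `-XX:MaxRAMPercentage` instead of a hard-coded `-Xmx`.

```java
// Prints the resources the JVM thinks it has. Run it inside a container
// with and without cgroup limits to see whether your JDK is container-aware.
public class ContainerErgonomics {
    static int visibleCpus() {
        return Runtime.getRuntime().availableProcessors();
    }
    static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }
    public static void main(String[] args) {
        System.out.println("CPUs visible to the JVM : " + visibleCpus());
        System.out.println("Max heap (bytes)        : " + maxHeapBytes());
        // On container-aware JDKs, prefer -XX:MaxRAMPercentage over -Xmx so
        // the heap scales with the container's memory limit and leaves room
        // for metaspace, thread stacks, and OS caches.
    }
}
```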
Kafka monitoring using Prometheus and Grafana (wonyong hwang)
A guide to configuring Prometheus for monitoring a Kafka cluster, and to connecting Grafana to visualize the collected metrics.
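The setup can be sketched with a minimal Prometheus scrape configuration. This assumes each broker exposes JMX metrics through the Prometheus JMX exporter agent on port 7071; the hostnames and port are placeholders, not values from the talk.

```yaml
# Minimal sketch: scrape Kafka broker metrics exposed by the JMX exporter.
scrape_configs:
  - job_name: 'kafka'
    static_configs:
      - targets: ['broker1:7071', 'broker2:7071', 'broker3:7071']
```

Grafana then points at Prometheus as a data source and builds dashboards over these metrics.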
Boosting I/O Performance with KVM io_uring (ShapeBlue)
Storage performance is becoming much more important. KVM io_uring attempts to bring the I/O performance of a virtual machine to almost the same level as bare metal. Apache CloudStack has supported io_uring since version 4.16. Wido will show the difference in performance io_uring brings to the table.
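For reference, enabling io_uring for a guest disk is a one-attribute change in the libvirt domain XML that CloudStack drives underneath (it requires libvirt 6.3+ and QEMU 5.0+); the device paths below are placeholders.

```xml
<!-- Sketch of a libvirt disk definition using the io_uring AIO backend. -->
<disk type='block' device='disk'>
  <driver name='qemu' type='raw' cache='none' io='io_uring'/>
  <source dev='/dev/sdb'/>
  <target dev='vda' bus='virtio'/>
</disk>
```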
Wido den Hollander is the CTO of CLouDinfra, an infrastructure company offering total webhosting solutions. CLDIN provides datacenter, IP and virtualization services for the companies within TWS. Wido den Hollander is a PMC member of the Apache CloudStack Project and a Ceph expert. He started with CloudStack 9 years ago. What attracted his attention was the simplicity of CloudStack and the fact that it is an open-source solution. Over the years Wido became a contributor, a PMC member, and he was VP of the project for a year. He is one of our most active members, putting a lot of effort into keeping the project active and transforming it into a turnkey solution for cloud builders.
-----------------------------------------
The CloudStack European User Group 2022 took place on 7th April. The day saw a virtual get-together for the European CloudStack community, hosting 265 attendees from 25 countries. The event hosted 10 sessions from leading CloudStack experts, users and skilful engineers from the open-source world, including technical talks, user stories, and presentations of new features and integrations.
------------------------------------------
About CloudStack: https://cloudstack.apache.org/
These are lecture materials for the NHN NEXT game server programming course. The minimum necessary theory is organized around questions (students work out the answers individually and receive feedback), and matching hands-on (implementation) assignments are included.
Note that server architecture is covered in a separate course and is not included in this lecture.
Understanding and optimizing metrics for Apache Kafka monitoring (SANG WON PARK)
As Apache Kafka's role in big data architectures keeps growing in size and importance, concerns about its performance are growing as well.
While working on various projects, I studied the metrics needed to monitor Apache Kafka and summarized the configuration settings used to optimize them.
[Understanding and optimizing metrics for Apache Kafka monitoring]
Covers the metrics needed for Apache Kafka performance monitoring and summarizes how to optimize performance from four perspectives (throughput, latency, durability, availability), for each of the three modules that make up Kafka (Producer, Broker, Consumer) …
[Understanding metrics for Apache Kafka monitoring]
To monitor the state of Apache Kafka, you need to look at the metrics produced by four sources: System (OS), Producer, Broker, and Consumer.
This post organizes the producer/broker/consumer indicators around the JMX metrics exposed by the JVM.
It does not cover every metric; it focuses on the ones I found meaningful from my own perspective.
[Optimizing Apache Kafka performance configuration]
Performance goals are divided into four categories (throughput, latency, durability, availability), and the post summarizes which Kafka configuration options to adjust, and how, for each goal.
After applying the tuned parameters, you should run performance tests and monitor the extracted metrics to keep optimizing for your particular workload.
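To make the trade-off between the four goals concrete, here is a hedged sketch of two producer profiles built from standard Kafka producer configuration keys. The specific values are illustrative trade-offs, not recommendations from the original post.

```java
import java.util.Properties;

// Two contrasting Kafka producer profiles: one leaning toward throughput,
// one toward durability. Latency and availability pull on the same knobs.
public class ProducerTuning {
    static Properties throughputProfile() {
        Properties p = new Properties();
        p.setProperty("acks", "1");               // leader-only ack: faster, less durable
        p.setProperty("linger.ms", "20");         // wait up to 20 ms to fill a batch
        p.setProperty("batch.size", "65536");     // larger batches per partition
        p.setProperty("compression.type", "lz4"); // cheaper bytes on the wire
        return p;
    }
    static Properties durabilityProfile() {
        Properties p = new Properties();
        p.setProperty("acks", "all");                // wait for all in-sync replicas
        p.setProperty("enable.idempotence", "true"); // no duplicates on retry
        return p;
    }
    public static void main(String[] args) {
        System.out.println("throughput: " + throughputProfile());
        System.out.println("durability: " + durabilityProfile());
    }
}
```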
Source: http://www.opennaru.com/redhat/jboss/
JBoss EAP, the world's leading open-source middleware
JBoss EAP (JBoss® Enterprise Application Platform) provides enterprise-grade security, performance, and scalability in any IT environment, including clouds and containers.
It is the world's most widely used open-source web application server supporting the Java EE standards.
Because it is open-source software, it is inexpensive to adopt, and Red Hat's strong engineering provides the quality and technical support appropriate for enterprise middleware.
There is a single license, the GNU Lesser General Public License (LGPL), but two distributions: the community version (WildFly) and the enterprise version (JBoss EAP).
With a paid Red Hat subscription, the enterprise version, JBoss EAP, gives you access to pre-certified patches and upgrades for the latest JBoss software.
We built a performance-testing environment to validate the performance of various Hadoop-ecosystem software. It was built with ELK and JMeter and applied to Kafka.
You can adjust various options and run simulations to match the performance requirements of your project.
About two years have passed since we first applied it, but it remains useful not only for Kafka but also for other Hadoop-ecosystem and custom solutions.
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa... (Spark Summit)
Kapil Malik and Arvind Heda will discuss a solution for interactive querying of large-scale structured data, stored in a distributed file system (HDFS / S3), in a scalable and reliable manner using a unique combination of Spark SQL, Apache Zeppelin and Spark Job-server (SJS) on YARN. The solution is production-tested and can cater to thousands of queries processing terabytes of data every day. It contains the following components:
1. Zeppelin server: a custom interpreter is deployed, which decouples the Spark context from the user notebooks. It connects to the remote Spark context on Spark Job-server. A rich set of APIs is exposed for the users. The user input is parsed, validated and executed remotely on SJS.
2. Spark Job-server: a custom application is deployed, which implements the set of APIs exposed by the Zeppelin custom interpreter as one or more Spark jobs.
3. Context router: it routes user queries from the custom interpreter to one of many Spark Job-servers / contexts.
The solution has the following characteristics:
* Multi-tenancy: there are hundreds of users, each having one or more Zeppelin notebooks. All these notebooks connect to the same set of Spark contexts for running a job.
* Fault tolerance: the notebooks do not use the Spark interpreter, but a custom interpreter connecting to a remote context. If one Spark context fails, the context router sends user queries to another context.
* Load balancing: the context router identifies which contexts are under heavy load or responding slowly, and selects the most suitable context for serving a user query.
* Efficiency: Alluxio is used for caching common datasets.
* Elastic resource usage: Spark dynamic allocation is used for the contexts. This ensures that cluster resources are blocked by this application only when it is doing actual work.
Presentation for Papers We Love at QCon NYC '17. I didn't write the paper; the good people at Facebook did. But I sure enjoyed reading it and presenting it.
Flink Forward San Francisco 2018 keynote: Stephan Ewen - "What turns stream p..." (Flink Forward)
Stream processing is a powerful paradigm, especially when backed by a system like Apache Flink. With each release and year, we see Flink being used for more challenging use cases and applications.
But beyond the individual application (though it may be grand and challenging in itself), stream processing is a much broader building block: the foundational piece of a platform that brings together the different parts of a data architecture. A platform that integrates data analytics, data ingestion, SQL, Machine Learning, data provenance, databases, and other aspects of a data-driven infrastructure in a meaningful way.
In this keynote, we look at what goes into building a stream processing platform that is more than the sum of its parts.
I promise that understanding NoSQL is as easy as playing with LEGO bricks! Google's Bigtable, presented in 2006, is the inspiration for Apache HBase: let's take a deep dive into Bigtable to better understand HBase.
Intro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming (Apache Apex)
Presenter: Devendra Tagare - DataTorrent engineer, contributor to Apex, and data architect experienced in building high-scalability big data platforms.
Apache Apex is a next generation native Hadoop big data platform. This talk will cover details about how it can be used as a powerful and versatile platform for big data.
Apache Apex is a native Hadoop data-in-motion platform. We will discuss the architectural differences between Apache Apex and Spark Streaming, and how these differences affect use cases like ingestion, fast real-time analytics, data movement, ETL, fast batch, very low latency SLA, high throughput and large-scale ingestion.
We will cover fault tolerance, low latency, connectors to sources/destinations, smart partitioning, processing guarantees, computation and scheduling model, state management and dynamic changes. We will also discuss how these features affect time to market and total cost of ownership.
If you're building relational, time-series, IoT, or real-time architectures using Hadoop, you will find Apache Kudu an attractive choice. With Kudu, you'll be able to build your applications more simply and with fewer moving parts.
Hadoop has become faster and more capable, and has continued to narrow the gap compared to traditional database technologies. However, for developers looking for up-to-the-second analytics on fast-moving data, some important gaps remain that prevent many applications from transitioning to Hadoop-based architectures. Users are often caught between a rock and a hard place: columnar formats such as Apache Parquet offer extremely fast scan rates for analytics, but little to no ability for real-time modification or row-by-row indexed access. Online systems such as HBase offer very fast random access, but scan rates that are too slow for large scale data warehousing and analytical workloads.
This talk will describe Kudu, the new addition to the open source Hadoop ecosystem with out-of-the-box integration with Apache Spark and Apache Impala. Kudu fills the gap described above to provide a new option to achieve fast scans and fast random access from a single API.
From Backups To Time Travel: A Systems Perspective on Snapshots (NuoDB)
Many applications today are dependent on databases. Access to past states of database data enables new kinds of useful queries: time-traveling queries. With time travel, application developers can analyze and predict trends in changing data over time, detect data anomalies, and recover from user error such as accidental deletion of data (without relying on a cumbersome database restore). System administrators want simple and efficient backups. Database snapshots can bridge this gap and provide both, without disrupting performance.
This talk dives into snapshots as a database system service. We will discuss design choices for snapshots and time travel, and how those choices impact applications. You will learn about novel research results on how to add snapshots to a database system in a modular way, and we will touch on the challenges and opportunities that arise when that database is distributed.
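The "snapshots bridge backups and time travel" idea can be sketched as a toy copy-on-write store: a snapshot is just a captured version, and it stays readable after later writes. This is a conceptual illustration, not NuoDB's design; all names are made up.

```java
import java.util.HashMap;
import java.util.Map;

// Toy copy-on-write key-value store. Taking a snapshot is O(1): capture the
// current version. Writers never mutate old maps, so old versions stay valid
// for time-traveling reads.
public class SnapshotStore {
    private Map<String, String> current = new HashMap<>();

    public void put(String key, String value) {
        Map<String, String> next = new HashMap<>(current); // copy on write
        next.put(key, value);
        current = next;
    }
    public Map<String, String> snapshot() {
        return current; // frozen: future puts replace `current`, not this map
    }
    public static void main(String[] args) {
        SnapshotStore db = new SnapshotStore();
        db.put("balance", "100");
        Map<String, String> before = db.snapshot(); // lightweight "backup"
        db.put("balance", "0");                     // say, an accidental update
        System.out.println("now:  " + db.snapshot().get("balance")); // 0
        System.out.println("then: " + before.get("balance"));        // 100
    }
}
```

Real systems version pages or records rather than whole maps, but the invariant is the same: a snapshot names an immutable version.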
Modern Stream Processing With Apache Flink @ GOTO Berlin 2017 (Till Rohrmann)
In our fast-moving world it becomes more and more important for companies to gain near real-time insights from their data to make faster decisions. These insights do not only provide a competitive edge over one's rivals but also enable a company to create completely new services and products. Predictive user interfaces and online recommendations, among others, become possible when large amounts of data can be processed in real time.
Apache Flink, one of the most advanced open source distributed stream processing platforms, allows you to extract business intelligence from your data in near real-time. With Apache Flink it is possible to process billions of messages with milliseconds latency. Moreover, its expressive APIs allow you to quickly solve your problems, ranging from classical analytical workloads to distributed event-driven applications.
In this talk, I will introduce Apache Flink and explain how it enables users to develop distributed applications and process analytical workloads alike. Starting with Flink’s basic concepts of fault-tolerance, statefulness and event-time aware processing, we will take a look at the different APIs and what they allow us to do. The talk will be concluded by demonstrating how we can use Flink’s higher level abstractions such as FlinkCEP and StreamSQL to do declarative stream processing.
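The event-time and statefulness concepts above can be illustrated without any Flink dependency: events carry their own timestamps, state is buffered, and a window only fires once a watermark passes its end, so out-of-order events are still counted. This is a plain-Java sketch of the idea, not Flink's actual API.

```java
import java.util.ArrayList;
import java.util.List;

// One event-time window [0, windowEnd). Events arrive out of order; the
// window fires (sums its buffer) only when a watermark passes windowEnd.
public class EventTimeWindow {
    final long windowEnd;
    final List<Long> buffer = new ArrayList<>();
    Long result = null;

    EventTimeWindow(long windowEnd) { this.windowEnd = windowEnd; }

    void onEvent(long eventTime, long value) {
        if (eventTime < windowEnd) buffer.add(value); // belongs to this window
    }
    void onWatermark(long watermark) {
        if (watermark >= windowEnd && result == null) {
            long sum = 0;
            for (long v : buffer) sum += v;
            result = sum; // fire exactly once
        }
    }
    public static void main(String[] args) {
        EventTimeWindow w = new EventTimeWindow(100);
        w.onEvent(10, 1);
        w.onEvent(95, 2);
        w.onWatermark(90);  // window not complete yet, nothing fires
        w.onEvent(50, 3);   // out-of-order event, still counted
        w.onWatermark(120); // watermark passed 100: window fires
        System.out.println("window sum = " + w.result); // 6
    }
}
```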
Architectural Comparison of Apache Apex and Spark Streaming (Apache Apex)
This presentation discusses architectural differences between Apache Apex and Spark Streaming, and how these differences affect use cases like ingestion, fast real-time analytics, data movement, ETL, fast batch, very low latency SLA, high throughput and large-scale ingestion.
Also, it will cover fault tolerance, low latency, connectors to sources/destinations, smart partitioning, processing guarantees, computation and scheduling model, state management and dynamic changes. Further, it will discuss how these features affect time to market and total cost of ownership.
Big Data Streams Architectures. Why? What? How? (Anton Nazaruk)
With the current zoo of technologies and the different ways they interact, it is a big challenge to architect a system (or adapt an existing one) that conforms to low-latency big data analysis requirements. Apache Kafka, and the Kappa Architecture in particular, are drawing more and more attention away from the classic Hadoop-centric technology stack. The new Consumer API gave a significant boost in this direction. Microservices-based stream processing and the new Kafka Streams tend to be a synergy in the big data world.
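The Kappa Architecture's core premise, that the log is the source of truth and any materialized view can be rebuilt by replaying it from the beginning, can be shown in miniature. An illustrative sketch; the account-balance example is made up.

```java
import java.util.List;

// Kappa in miniature: state is a pure function of the event log, so any view
// can be rebuilt by deterministically replaying the log from offset zero.
public class KappaReplay {
    static long replayBalance(List<Long> eventLog) {
        long state = 0;
        for (long delta : eventLog) state += delta; // same log => same state
        return state;
    }
    public static void main(String[] args) {
        List<Long> log = List.of(100L, -30L, 25L);
        System.out.println("rebuilt state = " + replayBalance(log)); // 95
    }
}
```

In a real deployment the log lives in Kafka with long retention, and "replay" means resetting a consumer (or Kafka Streams application) to the earliest offset.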
An approach towards greening the digital display system (Tarik Reza Toha)
Signage displays, used to convey messages or information, have evolved from conventional to digital. Conventional signage, which may be handwritten or printed paper, is being replaced by the digital displays industries favor for their attractive features for engaging consumers. However, extensive use of digital signage displays contributes a notable amount to a region's power consumption (about 1000 W for a 14-inch × 48-inch display). In this work, we devise a novel approach for reducing the power consumption of digital signage while satisfying human visibility requirements by exploiting duty cycling. Our proposed technique reduces the power consumed by a digital display by a significant amount (about 14.54% compared with the existing display system) while keeping expected human vision in view.
Many-Objective Performance Enhancement in Computing Clusters (Tarik Reza Toha)
In a heterogeneous computing cluster, the cluster objectives conflict with each other. Selecting the right combination of machines is necessary to enhance cluster performance and to optimize all the cluster objectives. In this paper, we perform empirical performance analyses of a real cluster with our year-long collected data, formulate a new many-objective optimization problem for clusters, and integrate a greedy approach with the existing NSGA-III algorithm to solve this problem. Our experimental results show that our approach performs better than existing optimization approaches.
Exploiting a Synergy between Greedy Approach and NSGA for Scheduling in Compu... (Tarik Reza Toha)
Computing clusters are evaluated using different performance metrics, which often conflict with each other when optimized together. In such conflicting cases, compounded by the frequently heterogeneous environment, it is difficult for cluster administrators to efficiently schedule machines, i.e., to select the right number and the right combination of machines. In this paper, we develop a technique through which cluster administrators can select the right set of machines to enhance energy efficiency and cluster performance. To do so, we first perform extensive laboratory experiments over a period of more than one year. Based on empirical analyses of the data collected from the experiments, we formulate a many-objective optimization problem for clusters and integrate a greedy approach with the Non-dominated Sorting Genetic Algorithm (NSGA-III) to solve it. We demonstrate through both real experimentation and simulation that our approach mostly performs better than existing approaches in the literature.
Predicting Human Count through Environmental Sensing in Closed Indoor Settings (Tarik Reza Toha)
Detecting the count of human beings accurately in a closed indoor environment is crucial in diverse application areas including search and rescue, surveillance, customer analytics, abnormal event detection, human gait characterization, congestion analysis and many more. It also has significant importance in preventing intrusion into a secured indoor space such as a bank vault. Sensor-based technologies (for example camera, PR, etc.) are becoming more popular day by day, as conventional methodologies are not good enough to ensure enhanced security in a closed indoor environment. Since the sensors used in these technologies have to be deployed in visible places, there is a possibility of an intruder damaging them. Therefore, this paper proposes a novel methodology for detecting the human count in such closed indoor settings that can be deployed in any hidden place. Here, the human count is estimated from four environmental gaseous parameters (carbon dioxide, liquefied petroleum gas or LPG, nitrogen dioxide, and sulfur dioxide) and two weather parameters (temperature and humidity). Real experiments were done under closed, controlled settings, and counting was done using machine learning algorithms such as Bagging, Random Forest, IBk, and J48. We achieve more than 99% accuracy for some of the classifiers in detecting the number of humans present.
Automatic Fabric Defect Detection with a Wide-And-Compact Network (Tarik Reza Toha)
Automatic detection of fabric defects is an important process for the textile industry. Besides detection accuracy, an automatic fabric defect detection solution for a resource-limited system also requires superior performance in terms of processing time and simplicity. This paper proposes a compact convolutional neural network architecture for the detection of a few common fabric defects. The proposed architecture uses several micro-architectures with a multilayer perceptron to optimize the network. The main component of a micro-architecture is constructed using multi-scale analysis, filter factorization, multiple-location pooling, and parameter reduction to improve detection accuracy in a compact model. Experimental results show that, compared to mainstream convolutional neural network architectures, the proposed network achieved superior detection accuracy with a much smaller model size. It worked well not only for fabric defect detection, but also for object recognition on a few public datasets.
Binarization of degraded document images based on hierarchical deep supervise... (Tarik Reza Toha)
The binarization of degraded document images is a challenging problem in document analysis. Binarization is a classification process in which intra-image pixels are assigned to one of two classes: foreground text and background. Most algorithms are built on low-level features in an unsupervised manner, and their consequent inability to fully utilize input-domain knowledge considerably limits the separation of background noise from the foreground. In this paper, a novel supervised binarization method is proposed, in which a hierarchical deep supervised network (DSN) architecture is learned for the prediction of text pixels at different feature levels. With higher-level features, the network can differentiate text pixels from background noise, so that the severe degradations that occur in document images can be managed. Meanwhile, foreground maps predicted at lower feature levels show higher visual quality at boundary areas. Compared with those of traditional algorithms, binary images generated by our architecture have cleaner backgrounds and better-preserved strokes. The proposed approach achieves state-of-the-art results over widely used DIBCO datasets, revealing the robustness of the presented method.
Beyond Counting: Comparisons of Density Maps for Crowd Analysis Tasks—Countin... (Tarik Reza Toha)
For crowded scenes, the accuracy of object-based computer vision methods declines when the images are low-resolution and objects have severe occlusions. Taking counting methods for example, almost all the recent state-of-the-art counting methods bypass explicit detection and adopt regression-based methods to directly count the objects of interest. Among regression-based methods, density map estimation, where the number of objects inside a subregion is the integral of the density map over that subregion, is especially promising because it preserves spatial information, which makes it useful for both counting and localization (detection and tracking). With the power of deep convolutional neural networks (CNNs) the counting performance has improved steadily. The goal of this paper is to evaluate density maps generated by density estimation methods on a variety of crowd analysis tasks, including counting, detection, and tracking. Most existing CNN methods produce density maps with resolution that is smaller than the original images, due to the downsample strides in the convolution/pooling operations. To produce an original-resolution density map, we also evaluate a classical CNN that uses a sliding window regressor to predict the density for every pixel in the image. We also consider a fully convolutional adaptation, with skip connections from lower convolutional layers to compensate for loss in spatial information during upsampling. In our experiments, we found that the lower-resolution density maps sometimes have better counting performance. In contrast, the original-resolution density maps improved localization tasks, such as detection and tracking, compared with bilinear upsampling the lower-resolution density maps. Finally, we also propose several metrics for measuring the quality of a density map, and relate them to experiment results on counting and localization.
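The counting-by-integration property the paper relies on is easy to state in code: the object count in any subregion is the sum (discrete integral) of the density map over that subregion. A minimal sketch with made-up density values.

```java
// Counting from a density map: integrate (sum) the per-pixel density over a
// subregion to get the estimated object count inside it.
public class DensityCount {
    static double countIn(double[][] density, int r0, int r1, int c0, int c1) {
        double sum = 0;
        for (int r = r0; r < r1; r++)
            for (int c = c0; c < c1; c++)
                sum += density[r][c];
        return sum;
    }
    public static void main(String[] args) {
        // Toy 3x3 density map whose total mass is 3 objects.
        double[][] d = {
            {0.2, 0.3, 0.0},
            {0.5, 1.0, 0.0},
            {0.0, 0.0, 1.0},
        };
        System.out.println("total count = " + countIn(d, 0, 3, 0, 3)); // 3.0
        System.out.println("left half   = " + countIn(d, 0, 3, 0, 2)); // 2.0
    }
}
```

This is why density maps serve both counting (integrate over the whole image) and localization (the spatial distribution of the mass is preserved).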
BGPC: Energy-Efficient Parallel Computing Considering Both Computational and ... (Tarik Reza Toha)
Parallel computing has become popular nowadays due to its computing efficiency and cost effectiveness. However, in parallel computing systems, the computation demands a set of machines instead of a single machine, and therefore consumes a significant amount of power compared to single-machine computing systems. Moreover, a noticeable amount of power is needed to maintain the optimum temperature in the working environment of the parallel systems. This is generally known as the cooling power required by the systems.
Although several power-saving parallel computing schemes have already been proposed in the literature to minimize the computational power consumption of a parallel system, designing a scheme that considers both computational and cooling power consumption with low-cost resources is yet to be investigated. Therefore, in this thesis, we propose a low-cost power-saving scheme that simultaneously considers both computational and cooling power consumption. We design a machine learning framework, BGPC, which tries to find the number of machines to activate that is optimal, or at least near-optimal, in terms of minimum total energy consumption, with minimal overhead.
To predict total energy, we need to predict response time, computational power, and cooling power. We fit different machine learning algorithms for these predictions using a year-long collection of training data; k-nearest neighbors, Support Vector Machine for regression, and Additive Regression using Random Forest show the highest accuracy for these predictions, respectively. We implement the BGPC framework in our test-bed alongside two green methods and a static method. Our framework outperforms the green methods, with a slight degradation of QoS compared to the best QoS provider, the static method.
Towards Simulating Non-lane Based Heterogeneous Road Traffic of Less Develope...Tarik Reza Toha
Microscopic traffic simulators have become efficient tools to conduct different analytic studies on roads, vehicles, behavior of drivers, and critical intersections, which lead towards a well-planned traffic solution. Devising a realistic and sustainable traffic solution requires replication of the real traffic scenario in a simulator. For example, to simulate the traffic streams of developing and under developed countries, we need to simulate non-lane based heterogeneous traffic stream, i.e., motorized and non-motorized vehicles, road traffic behaviors such as irregular pedestrian, illegal parking, violation of laws pertaining lanes, etc. However, most of the existing traffic simulators are unable to mimic the unstructured road traffic streams of less developed countries with their diversified behaviors. Therefore, in this work, we propose a new microscopic traffic simulator to handle nonlane based heterogeneous traffic stream and on road traffic behaviors that generally occurred in the road networks of cities in less developed countries. Our simulator receives network topology, traffic routes, and traffic demand flow rates as input, visualizes the traffic flows, and provides traffic statistics. To evaluate sustainability of our proposed simulator in real-life scenarios, we calibrate the simulator using real traffic data. Our evaluation reveals 99% accuracy in terms of travel time.
GMC: Greening MapReduce Clusters Considering both Computation Energy and Cool...Tarik Reza Toha
Increased processing power of MapReduce clusters generally enhances performance and availability at the cost of substantial energy consumption that often incurs higher operational costs (e.g., electricity bills) and negative environmental impacts (e.g., carbon dioxide emissions). There exist a few greening methods for computing clusters in the literature that focus mainly on computational energy consumption leaving cooling energy, which occupies a significant portion of the total energy consumed by the clusters. To this extent, in this paper, we propose a machine learning based approach named as Green MapReduce Cluster (GMC) that reduces the total energy consumption of a MapReduce cluster considering both computational energy and cooling energy. GMC predicts the number of machines that results in minimum total energy consumption. We perform the prediction through applying different machine learning techniques over year-long data collected from a real setup. We evaluate performance of GMC over a real testbed. Our evaluation reveals that GMC reduces total energy consumption by up to 47% compared to other alternatives while experiencing marginal throughput degradation in a few cases.
Signage display, which is used to convey message or information, has evolved from conventional to digital display. Conventional signage which may be hand written or printed papers are being wiped out by digital displays used by industries because of its attractive features of efficient involvement of consumers. However, extensive use of digital signage displays contributes a notable amount of power consumption (about 1000W for a 14inch × 48inch display) of a region. In this literature, we have devised a novel approach for reducing power consumption of digital signage as well as satisfying human visibility by exploiting duty cycle. Our proposed technique is capable of relinquishing a significant amount (about 14.54% in comparison with existing display system) of power consumption occurred by digital display by keeping an eye on expected human vision.
Workload-Based Prediction of CPU Temperature and Usage for Small-Scale Distri...Tarik Reza Toha
The recent boost in the usage of high-performance computing systems in small research environments, such as those found at many universities, stipulates the need of smallscale distributed systems. Owning to the rapid growth in both computing power and heat, development of proper thermal and resource management becomes crucial concern of the research community along with the vendors to ensure efficiency for such systems. Moreover, an accurate and relatively fast strategy is needed for adaptation of different sizes of workload in such systems. Therefore, in this paper, we focus on developing simple prediction models of CPU temperature and usage for the systems. We investigate impacts of macro-level parameters such as the number of machines and different sizes of workload on CPU temperature and usage via real experiment. Our experimental results reveal that for a certain size of workload, the variation in CPU temperature and usage is minimal in response to a change in the number of machines, which does not hold in the reverse way. Hence, we develop workload-based prediction models for CPU temperature and usage. We evaluate the accuracy of our models by comparing the values calculated based on these models against the measurements found from real implementation.
Towards Making an Anonymous and One-Stop Online Reporting System for Third-Wo...Tarik Reza Toha
Under-reporting is one of the main causes of failure to solve social problems, which obstruct national development in thirdworld countries. A one-stop online reporting system can facilitate minimizing the extent of under-reporting, which is yet to be developed for general people of third-world countries. Therefore, in this paper, we propose a generic online reporting system where one can submit report anonymously, even without registration. Our system aims at propagating the reports to the respective authorities such as law enforcement agencies, anti-corruption commission, city corporation, policy makers, human rights commissions, etc., after a reviewing process. The system will also publish the reports without disclosing identities of the reporters to disseminate the information among public and to collect the public opinions about the reports.
Sparse Mat: A Tale of Devising A Low-Cost Directional System for Pedestrian C...Tarik Reza Toha
Pedestrian counting is required in diversified places such as shopping malls, touristic spots, etc., however, a low-cost solution to this problem is yet to be proposed in the literature. Therefore, in this paper, we propose a new solution for pedestrian counting that exploits only a small number of COTS sensors (94% less than that used in the existing Eco-Counter solution). To do so, we propose detail designs and two different algorithms for separately sensing step-down and step-up phenomena that we find while walking. User evaluation of real implementations of both our algorithms confirms an average accuracy of up to 93% through sensing the step-up phenomena.
uReporter, an open public reporting system(SD)Tarik Reza Toha
In day-to-day life, we, the common people, face different types of social problems around us. But most often these issues cannot be reported properly to the proper authorities due to some massive roadblocks between victim and concerned authorities. Security threat, political pressure, lack of knowledge about responsible authorities are the most prevalent obstacles in our country. Also, there exists negligence of related authority like Police, RAB etc. To overcome these roadblocks, we want to build uReporter, a unified online reporting system, which will send the reports as soon as possible to proper authorities after successful completion of several validation steps and reporters' personal information will be hidden. For this purpose, we will build a central repository system to store the reports from common people. Here, people can also share their experiences, supplement the previous reports, compliment about any authority. We will generate periodical reports through mining the valid data and public survey which depicts the real scenario of the society.
This slide is special for master students (MIBS & MIFB) in UUM. Also useful for readers who are interested in the topic of contemporary Islamic banking.
Biological screening of herbal drugs: Introduction and Need for
Phyto-Pharmacological Screening, New Strategies for evaluating
Natural Products, In vitro evaluation techniques for Antioxidants, Antimicrobial and Anticancer drugs. In vivo evaluation techniques
for Anti-inflammatory, Antiulcer, Anticancer, Wound healing, Antidiabetic, Hepatoprotective, Cardio protective, Diuretics and
Antifertility, Toxicity studies as per OECD guidelines
How to Add Chatter in the odoo 17 ERP ModuleCeline George
In Odoo, the chatter is like a chat tool that helps you work together on records. You can leave notes and track things, making it easier to talk with your team and partners. Inside chatter, all communication history, activity, and changes will be displayed.
Safalta Digital marketing institute in Noida, provide complete applications that encompass a huge range of virtual advertising and marketing additives, which includes search engine optimization, virtual communication advertising, pay-per-click on marketing, content material advertising, internet analytics, and greater. These university courses are designed for students who possess a comprehensive understanding of virtual marketing strategies and attributes.Safalta Digital Marketing Institute in Noida is a first choice for young individuals or students who are looking to start their careers in the field of digital advertising. The institute gives specialized courses designed and certification.
for beginners, providing thorough training in areas such as SEO, digital communication marketing, and PPC training in Noida. After finishing the program, students receive the certifications recognised by top different universitie, setting a strong foundation for a successful career in digital marketing.
Thinking of getting a dog? Be aware that breeds like Pit Bulls, Rottweilers, and German Shepherds can be loyal and dangerous. Proper training and socialization are crucial to preventing aggressive behaviors. Ensure safety by understanding their needs and always supervising interactions. Stay safe, and enjoy your furry friends!
How to Build a Module in Odoo 17 Using the Scaffold MethodCeline George
Odoo provides an option for creating a module by using a single line command. By using this command the user can make a whole structure of a module. It is very easy for a beginner to make a module. There is no need to make each file manually. This slide will show how to create a module using the scaffold method.
Read| The latest issue of The Challenger is here! We are thrilled to announce that our school paper has qualified for the NATIONAL SCHOOLS PRESS CONFERENCE (NSPC) 2024. Thank you for your unwavering support and trust. Dive into the stories that made us stand out!
A review of the growth of the Israel Genealogy Research Association Database Collection for the last 12 months. Our collection is now passed the 3 million mark and still growing. See which archives have contributed the most. See the different types of records we have, and which years have had records added. You can also see what we have for the future.
Acetabularia Information For Class 9 .docxvaibhavrinwa19
Acetabularia acetabulum is a single-celled green alga that in its vegetative state is morphologically differentiated into a basal rhizoid and an axially elongated stalk, which bears whorls of branching hairs. The single diploid nucleus resides in the rhizoid.
1. PNUTS: Yahoo!’s Hosted Data Serving Platform
VLDB ’08, Auckland, New Zealand
Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver, and Ramana Yerneni
Presented by
Tarik Reza Toha
#1017052013
2. Outline
• Background and motivation
• Related work
• Proposed methodology
– Data storage and retrieval
– Asynchronous replication and consistency
• Experimental evaluation
• Conclusion and future work
4. Modern Web Applications (contd.)
[Figure: a table of user records (6 Jimi, 8 Mary, 12 Sonja, 15 Brandon, 16 Mike, 17 Bob), each carrying an XML payload such as:
<photo>
<title>Flower</title>
<url>www.flickr.com</url>
</photo>]
5. Requirements of Modern Web Applications
• Scalability
– Architectural scalability: scale during periods of rapid growth with minimal operational effort
• Response time and geographic scope
– Fast response time for geographically distributed users
• High availability and fault tolerance
– Read and even write data during failures
• Relaxed consistency guarantees
– Eventual consistency: update one replica first and then update the others
6. DBMS for Modern Web Applications
• Traditional DBMSs offer:
– Complex queries
– Strong transactions
• Modern web applications need:
– Simplified queries
• No joins or aggregations
– Relaxed consistency
• Applications can tolerate stale or reordered data
7. Existing Database Management Systems
• Bigtable: A Distributed Storage System for Structured Data [Google, Inc.]
– Chang et al., OSDI, 2006
– Provides record-oriented access to very large tables
– Lacks geographic replication
– Lacks rich database functionality
• Secondary indexes
• Materialized views
• Creating multiple tables
• Hash-organized tables
8. Existing Database Management Systems (contd.)
• Dynamo: Amazon’s Highly Available Key-value Store
– DeCandia et al., SIGOPS, 2007
– A highly available system
– Provides geographic replication via a gossip mechanism
– Uses an eventual consistency model
• Creates temporary inconsistencies
– Uses hash tables
• Some storage nodes become hot-spots
9. Existing Database Management Systems (contd.)
• Distributed filesystems
– Ceph, Boxwood, Sinfonia
– Store objects
– Inappropriate for databases
– Not scalable
• Distributed hash tables (peer-to-peer)
– Chord, Pastry
– Provide object routing and replication
– Lack an ordered table abstraction
– Focus on reliable routing and object replication in the face of massive node turnover
10. Platform for Nimble Universal Table Storage
PNUTS is a massively parallel and geographically distributed database system for Yahoo!’s web applications. It provides data storage organized as hashed or ordered tables, low latency for large numbers of concurrent requests including updates and queries, and novel per-record consistency guarantees.
11. Proposed Architecture of PNUTS
[Figure: the Parts table replicated across regions]
CREATE TABLE Parts (
ID VARCHAR,
StockNumber INT,
Status VARCHAR
…
)
• Parallel database
• Geographic replication
• Indexes and views
• Structured, flexible schema
• Hosted, managed infrastructure
12. Detailed Architecture of PNUTS
[Figure: data-path components: clients, REST API, routers, tablet controller, message broker, storage units]
13. Detailed Architecture of PNUTS (contd.)
[Figure: a local region (clients, REST API, routers, tablet controller, storage units) connected to remote regions via YMB]
14. Tablets in Hash Table
[Figure: a fruit table (Name, Description, Price) whose records (Apple, Avocado, Banana, Grape, Kiwi, Lemon, Lime, Orange, Strawberry, Tomato) are assigned to Tablets 1-3 by hashing the name into the space 0x0000-0xFFFF, with split points at 0x2AF3 and 0x911F]
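The lookup sketched in the figure can be written down in a few lines. This is a minimal illustration, not PNUTS code: the hash function, the split points (taken from the slide's 0x2AF3 and 0x911F example), and the tablet names are all assumptions.

```python
import bisect
import hashlib

# Upper bounds of the hash range owned by Tablets 1..3 (from the slide's
# example); 0xFFFF is the maximum of the 16-bit hash space.
SPLITS = [0x2AF3, 0x911F, 0xFFFF]
TABLETS = ["tablet1", "tablet2", "tablet3"]

def hash16(key: str) -> int:
    """Map a primary key into the 16-bit hash space used in the example."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:2], "big")

def tablet_for(key: str) -> str:
    """Find the tablet whose hash range contains hash16(key)."""
    return TABLETS[bisect.bisect_left(SPLITS, hash16(key))]
```

Because the router only keeps the sorted split points, the lookup is a binary search rather than a per-key directory, which is what lets a single router cover many tablets.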
15. Tablets in Ordered Table
[Figure: the same fruit table partitioned into Tablets 1-3 by primary-key ranges over the key space A-Z, with split points at H and Q]
16. Single Query in PNUTS
[Figure: get(k): (1) the client sends "get key k" to a router, (2) the router forwards the request to the storage unit holding k, (3) the storage unit returns the record for key k, (4) the router returns it to the client]
17. Range Queries in PNUTS
[Figure: the router (scatter-gather engine) holds an interval mapping (MIN-Canteloupe → SU1, Canteloupe-Lime → SU3, Lime-Strawberry → SU2, Strawberry-MAX → SU1) over records Apple through Watermelon; a scan(Grapefruit…Pear) is split into Grapefruit…Lime, sent to SU3, and Lime…Pear, sent to SU2]
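The scatter step can be sketched directly from the interval mapping on this slide. A minimal illustration under assumed data structures (the boundary list, owner list, and function name are not from PNUTS):

```python
import bisect

# Tablet split keys and the storage unit owning each interval, taken from
# the slide's example mapping (MIN-Canteloupe -> SU1, Canteloupe-Lime -> SU3,
# Lime-Strawberry -> SU2, Strawberry-MAX -> SU1).
BOUNDARIES = ["Canteloupe", "Lime", "Strawberry"]
OWNERS = ["SU1", "SU3", "SU2", "SU1"]

def scatter(lo, hi):
    """Split the scan range [lo, hi) at tablet boundaries and yield
    (sub_lo, sub_hi, storage_unit) for each piece."""
    cuts = [lo] + [b for b in BOUNDARIES if lo < b < hi] + [hi]
    for sub_lo, sub_hi in zip(cuts, cuts[1:]):
        yield sub_lo, sub_hi, OWNERS[bisect.bisect_right(BOUNDARIES, sub_lo)]
```

For the slide's query, `scatter("Grapefruit", "Pear")` yields the two sub-scans sent to SU3 and SU2; the gather step then merges the per-unit results back into one ordered stream.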
18. Update Operation in PNUTS
[Figure: set(v): (1) the client sends "write key k" to a router, (2) the router forwards the write to the master storage unit, (3) the update is published to the message brokers, (4-5) the brokers acknowledge SUCCESS, (6) the brokers asynchronously deliver the write to the storage units in other regions, (7-8) the sequence number for key k is returned through the router to the client]
19. Load Balancing via Tablet Splitting
• Each storage unit holds many tablets (horizontal partitions of the table)
• Tablets may grow over time; overfull tablets split
• A storage unit may become a hotspot
• Shed load by moving tablets to other storage units
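The splitting policy can be sketched as a toy function. The threshold, data structures, and median-split rule here are illustrative assumptions, not PNUTS internals:

```python
# Toy threshold: the maximum number of records per tablet in this sketch.
THRESHOLD = 4

def maybe_split(tablet):
    """Return [tablet] if it is small enough, else two halves split at the
    median key; one half can then be moved to a less loaded storage unit."""
    if len(tablet) <= THRESHOLD:
        return [tablet]
    keys = sorted(tablet)
    mid = keys[len(keys) // 2]
    return [{k: v for k, v in tablet.items() if k < mid},
            {k: v for k, v in tablet.items() if k >= mid}]
```

Splitting at a key (rather than rehashing) keeps each half a contiguous range, so the router only needs to insert one new boundary into its interval mapping.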
21. Consistency Levels
• Eventual consistency
– Transactions:
• Alice changes status from “Sleeping” to “Awake”
• Alice changes location from “Home” to “Work”
[Figure: Region 1 applies Awake first: (Alice, Home, Sleeping) → (Alice, Home, Awake) → (Alice, Work, Awake). Region 2 applies Work first: (Alice, Home, Sleeping) → (Alice, Work, Sleeping) → (Alice, Work, Awake). The final state is consistent, but the “invalid” state (Alice, Work, Sleeping) is visible in Region 2.]
22. Consistency Levels (contd.)
• Timeline consistency
– Transactions:
• Alice changes status from “Sleeping” to “Awake”
• Alice changes location from “Home” to “Work”
[Figure: both regions apply the updates in the same order: Region 1 goes (Alice, Home, Sleeping) → (Alice, Home, Awake) → (Alice, Work, Awake), and Region 2 goes (Alice, Home, Sleeping) → (Alice, Work, Awake) once both updates arrive, so no invalid intermediate state is ever visible.]
23. Consistency via Mastership
[Figure: replicas of the table in several regions, each record carrying a per-record master region (E, W, C); all writes to a record are forwarded to its master region first and then propagated to the replicas in the other regions, so every replica applies that record’s updates in the same order]
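Per-record mastership is what makes the timeline of slide 22 possible. A minimal sketch (all class names and structures are illustrative, not PNUTS code): the master serializes writes to a record and stamps each with an increasing sequence number, and other replicas apply updates in sequence order.

```python
class MasterReplica:
    """Serializes all writes to the records it masters."""
    def __init__(self):
        self.versions = {}   # key -> (seq, value)
        self.log = []        # updates published to the message broker

    def write(self, key, value):
        seq = self.versions.get(key, (0, None))[0] + 1
        self.versions[key] = (seq, value)
        self.log.append((key, seq, value))   # propagated asynchronously

class Replica:
    """A non-master replica applying broker-delivered updates in order."""
    def __init__(self):
        self.versions = {}

    def apply(self, key, seq, value):
        # The broker delivers per-record updates in order; ignore stale ones.
        if seq > self.versions.get(key, (0, None))[0]:
            self.versions[key] = (seq, value)
```

Because sequence numbers are assigned at a single point per record, two replicas can lag behind each other but can never disagree on the order of a record’s versions.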
24. Failover in PNUTS
[Figure: when the region mastering some records fails (X), mastership of those records is transferred to another region (OVERRIDE W → E) so that writes can continue]
25. Consistency Models in PNUTS
• PNUTS supports both the eventual and the timeline consistency model
– Applications can choose which kind of table to create
• What happens to a record with primary key “Brian”?
[Figure: the record’s timeline: insert, then updates producing versions v. 1 through v. 8 within Generation 1, and finally a delete]
26. Some APIs of Timeline Model in PNUTS
[Figure: the timeline v. 1 … v. 8 in Generation 1; read-any may be served by a stale version rather than the current one]
Read-any:
• Read-any returns a possibly stale version of the record
‒ Served using a local copy
• It can be used for displaying a user’s friend’s status in a social networking application, as it is not absolutely essential to get the most up-to-date value
27. Some APIs of Timeline Model in PNUTS (contd.)
[Figure: read-latest always returns the current version of the record; write installs a new current version at the head of the timeline]
28. Some APIs of Timeline Model in PNUTS (contd.)
[Figure: “Read ≥ v. 6” may be served by v. 6 or any newer version]
Read-critical(required version):
• Read-critical returns a version of the record that is strictly newer than, or the same as, the required version
• It can be used when a user writes a record and then wants to read a version of the record that definitely reflects his changes
29. Some APIs of Timeline Model in PNUTS (contd.)
[Figure: “Write if = v. 7” against a record whose current version is v. 8 fails with ERROR]
Test-and-set-write(required version):
• Test-and-set-write performs the requested write to the record if and only if the present version of the record is the same as the required version
‒ A row-level locking mechanism
• It can be used to implement writing a record based on a previous read, e.g., incrementing the value of a counter
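The counter example can be sketched concretely. This is a minimal illustration under an assumed in-memory store (the class and method names are not the PNUTS API): the write succeeds only if the record’s current version equals the version the caller previously read.

```python
class VersionedStore:
    """Toy versioned record store for illustrating test-and-set-write."""
    def __init__(self):
        self.data = {}   # key -> (version, value)

    def read_latest(self, key):
        return self.data.get(key, (0, None))

    def test_and_set_write(self, key, required_version, value):
        version, _ = self.data.get(key, (0, None))
        if version != required_version:
            return False                       # a concurrent write happened
        self.data[key] = (version + 1, value)
        return True

def increment(store, key):
    """Counter increment built on test-and-set-write, with a retry loop."""
    while True:
        version, value = store.read_latest(key)
        if store.test_and_set_write(key, version, (value or 0) + 1):
            return
```

If another writer slips in between the read and the write, the version check fails and the loop simply re-reads and retries, so no increment is ever lost.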
30. Recovery via YMB
• Yahoo! Message Broker (YMB) [redo log]
– Topic-based publish/subscribe system
– Data is considered “committed” when it has been published to YMB
– At some point after being committed, an update is asynchronously propagated to the other regions and applied to their replicas
• Recovery via YMB
– The tablet controller requests a copy from a particular remote replica (the “source tablet”)
– A “checkpoint message” is published to YMB to ensure that any in-flight updates at the time the copy is initiated are applied to the source tablet
– The source tablet is copied to the destination region
– Backups are used in practice
31. Other Features
• Notifications
– One pub-sub topic per tablet
– Clients know about tables instead of tablets
– Automatically subscribed to all tablets, even as tablets are added or removed
– Undelivered notifications are handled in the usual way
• Hosted database service
– Centrally managed database service shared by multiple applications
32. Experimental Setup
• Three PNUTS regions

Region              | Machine                      | Servers/region
West 1, West 2      | 2.8 GHz Xeon, 4 GB RAM       | 5 SU, 2 YMB, 1 Router, 1 Tablet controller
East                | Quad 2.13 GHz Xeon, 4 GB RAM | 5 SU, 2 YMB, 1 Router, 1 Tablet controller

• Workload
– 1200-3600 requests/second
– 0-50% writes
– 80% locality
• Insert operation latency

Region              | Latency (hash table) | Latency (ordered table)
West 1 (master)     | 75.6 ms              | 33 ms
West 2 (non-master) | 131.5 ms             | 105.8 ms
East (non-master)   | 315.5 ms             | 324.5 ms
34. Conclusion and Future Work
• Existing DBMSs fail to provide rich database functionality and low latency at massive scale
• PNUTS uses asynchronous geographic replication to ensure low write latency
– Per-record timeline consistency provides useful guarantees to applications without sacrificing scalability
– The message broker serves both as the replication mechanism and as the redo log of the database
– Flexible mapping of tablets to storage units supports automated failover and load balancing
• Future work
– Indexes and materialized views
– Bundled updates
– Batch query processing (MapReduce)
35. Subsequent Advancements
• Asynchronous View Maintenance for VLSD Databases
– Agarwal et al., SIGMOD, 2009
– Indexes and views
• A Batch of PNUTS: Experiences Connecting Cloud Batch and Serving Systems
– Silberstein et al., SIGMOD, 2011
– PNUTS-Hadoop
• Where in the World is My Data?
– Kadambi et al., VLDB, 2011
– Selective replication
36. Indexes and Views
• Remote view table
– A regular table, but updated by the view maintainer instead of a client
[Figure: update path through YMB, the storage unit (SU), and the view maintainer (VM)]
37. PNUTS-Hadoop
• Reading from PNUTS
[Figure: Hadoop map tasks, each scanning one key range of the PNUTS table: scan(0x0-0x2), scan(0x2-0x4), scan(0x8-0xa), scan(0xa-0xc), scan(0xc-0xe)]
1. Split the PNUTS table into ranges
2. Each Hadoop task is assigned a range
3. The task uses the PNUTS scan API to retrieve the records in its range
4. The record reader feeds the scanned records to the map function
• Writing to PNUTS
[Figure: map or reduce tasks issue set calls through the PNUTS router]
1. Call PNUTS set to write the output
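The read path above can be sketched in a few lines. This is an illustrative sketch, not the PNUTS-Hadoop code: the function names and the range-splitting rule are assumptions matching the slide’s scan(0x0-0x2) … scan(0xc-0xe) example.

```python
def split_ranges(lo, hi, n_tasks):
    """Split the key space [lo, hi) into n_tasks contiguous ranges."""
    step = (hi - lo) // n_tasks
    cuts = [lo + i * step for i in range(n_tasks)] + [hi]
    return list(zip(cuts, cuts[1:]))

def run_map_task(scan, key_range, map_fn):
    """One Hadoop task: drive the scan API over its assigned range and
    feed each retrieved record to the map function."""
    for record in scan(*key_range):
        map_fn(record)
```

Because each task scans a disjoint range, the tasks can run in parallel against different storage units without coordinating with one another.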
38. Selective Replication
• If a European user’s record is never accessed in Asia, it does not make sense to pay the bandwidth and disk costs to maintain an Asian replica
• Static placement
– Per-record constraints
– The client sets mandatory and disallowed regions
• Dynamic placement
– Create replicas in regions where the record is read
– Evict replicas from regions where the record is not read
– Lease-based
• When a replica is read, it is guaranteed to survive for a time period
• Eviction is lazy; when the lease expires, the replica is deleted on the next write
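The lease-based dynamic policy can be sketched as follows. All names and the lease length here are illustrative assumptions, not the paper’s implementation: a read in a region creates or renews that region’s lease, and eviction is lazy, so an expired replica is only dropped on the next write.

```python
# Illustrative lease length: seconds a replica is guaranteed to survive
# after a read in its region.
LEASE = 60.0

class ReplicaManager:
    def __init__(self, clock):
        self.clock = clock        # injected time source, e.g. time.monotonic
        self.leases = {}          # (key, region) -> lease expiry time

    def on_read(self, key, region):
        """A read creates or renews the replica lease in that region."""
        self.leases[(key, region)] = self.clock() + LEASE

    def live_replicas(self, key, regions):
        """Called on a write: keep only regions whose lease is still valid;
        replicas with expired leases are dropped lazily at this point."""
        now = self.clock()
        return [r for r in regions if self.leases.get((key, r), 0.0) > now]
```

Injecting the clock keeps the sketch testable; in a real system the same idea avoids propagating writes (and paying storage) for replicas nobody has read recently.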