OpenOffice.org Performance Analysis - improving responsiveness on older PCs (Osdev)
A client with 1x,000 users has gradually migrated to OpenOffice.org over several years. They reported many complaints about the speed of OpenOffice.org compared to MS Office. This kind of complaint is familiar to us, but the magnitude of the problems (e.g., open-file times measured in minutes) seemed unusual, so we investigated to identify their kind and cause and reach a conclusion that could be used to improve the situation in our client's case.
We found that the problem depends solely on hardware, notably RAM, relative to file size. Our client happens to have an old IT infrastructure with many old PCs, e.g., Pentiums with 256 MB of RAM. Add enough RAM (>512 MB) and OpenOffice.org runs fine for files of average size; add more (e.g., >1024 MB) to open larger files.
Here is the presentation of our findings.
As the demand for computing power is quickly increasing in the automotive domain, car manufacturers and tier-one suppliers are gradually introducing multicore ECUs in their electronic architectures. These multicore ECUs offer new features such as higher levels of parallelism, which ease compliance with the safety requirements introduced by ISO 26262 and can be taken advantage of in various other automotive use cases. These new features also bring more complexity to the design, development, and verification of software applications; hence, OEMs and suppliers will require new tools and methodologies for deployment and validation. In this paper, we present the main use cases for multicore ECUs and then focus on one of them: the problem of scheduling numerous elementary software components (called runnables) on a limited set of identical cores. In the context of an automotive design, we assume the use of the static task partitioning scheme, which provides simplicity and better predictability for ECU designers by comparison with a global scheduling approach. We show how the global scheduling problem can be addressed as two sub-problems: partitioning the set of runnables and building the schedule on each core. We then prove that neither sub-problem can be solved optimally due to its algorithmic complexity, and present low-complexity heuristics to partition and build a schedule of the runnable set on each core before discussing schedulability verification methods. Finally, we assess the performance of our approach on realistic case studies.
This slide deck explains how best to deal with state in scalable systems, i.e. pushing it to the system boundaries (client, data store) and avoiding state in between.
It then picks two scenarios, one in the frontend part and one in the backend part of a system, and shows concrete techniques to deal with them.
The frontend part examines how to deal with the session state of servlet containers in scalable scenarios and introduces the concept of a shared session cache layer, including an example implementation using Redis.
The backend part examines how to deal with the potential data inconsistencies that can occur when maximum availability of the data store is required and eventual consistency is used. The usual approach is to resolve inconsistencies manually by implementing business-specific logic or, even worse, by asking the user to resolve them. A purely technical solution called CRDTs (Conflict-free Replicated Data Types) is then shown. CRDTs, based on sound mathematical concepts, are self-stabilizing data structures that offer a generic way to resolve inconsistencies in an eventually consistent data store. Besides some theory, examples are shown to give a feeling for how CRDTs work in practice.
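To give a concrete feel for the idea, here is a minimal sketch of one of the simplest CRDTs, a grow-only counter (G-Counter). This is an illustrative example, not code from the talk: each replica increments only its own slot, and merging takes the element-wise maximum, so replicas converge regardless of the order in which they exchange state.

```python
class GCounter:
    """Grow-only counter CRDT: one slot per replica id (illustrative sketch)."""

    def __init__(self):
        self.counts = {}  # replica id -> that replica's local count

    def increment(self, replica_id, n=1):
        # A replica only ever increments its own slot.
        self.counts[replica_id] = self.counts.get(replica_id, 0) + n

    def value(self):
        # The counter's value is the sum over all replica slots.
        return sum(self.counts.values())

    def merge(self, other):
        """Element-wise max: commutative, associative, and idempotent,
        which is what lets replicas merge in any order and converge."""
        merged = GCounter()
        for rid in self.counts.keys() | other.counts.keys():
            merged.counts[rid] = max(self.counts.get(rid, 0),
                                     other.counts.get(rid, 0))
        return merged
```

Merge being commutative, associative, and idempotent is exactly what makes the conflict resolution generic: no business-specific logic and no user intervention is needed to reconcile concurrent updates.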
Using Apache Spark to analyze large datasets in the cloud presents a range of challenges. Different stages of your pipeline may be constrained by CPU, memory, disk and/or network IO. But what if all those stages have to run on the same cluster? In the cloud, you have limited control over the hardware your cluster runs on.
You may have even less control over the size and format of your raw input files. Performance tuning is an iterative and experimental process. It’s frustrating with very large datasets: what worked great with 30 billion rows may not work at all with 400 billion rows. But with strategic optimizations and compromises, 50+ TiB datasets can be no big deal.
Using the Spark UI and simple metrics, we explore how to diagnose and remedy issues in jobs:
Sizing the cluster based on your dataset (shuffle partitions)
Ingestion challenges – well begun is half done (globbing S3, small files)
Managing memory (sorting GC – when to go parallel, when to go G1, when offheap can help you)
Shuffle (give a little to get a lot – configs for better out of box shuffle) – Spill (partitioning for the win)
Scheduling (FAIR vs FIFO, is there a difference for your pipeline?)
Caching and persistence (it’s the cost of doing business, so what are your options?)
Fault tolerance (blacklisting, speculation, task reaping)
Making the best of a bad deal (skew joins, windowing, UDFs, very large query plans)
Writing to S3 (dealing with write partitions, HDFS and s3DistCp vs writing directly to S3)
Presented at Spark+AI Summit Europe 2019
https://databricks.com/session_eu19/apache-spark-at-scale-in-the-cloud
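As a rough illustration of the first bullet (sizing shuffle partitions to the dataset), the arithmetic can be sketched without a cluster. The ~200 MiB-per-partition target and the helper name below are assumptions for this sketch, not figures from the talk:

```python
def suggest_shuffle_partitions(dataset_bytes,
                               target_partition_bytes=200 * 1024**2,
                               min_partitions=200):
    """Rough heuristic: enough partitions that each holds roughly
    target_partition_bytes of shuffle data, never dropping below
    Spark's default of 200 (an assumed rule of thumb)."""
    needed = -(-dataset_bytes // target_partition_bytes)  # ceiling division
    return max(min_partitions, needed)

# e.g. a 50 TiB shuffle at ~200 MiB per partition
fifty_tib = 50 * 1024**4
print(suggest_shuffle_partitions(fifty_tib))  # 262144
```

The resulting value would then be applied via the `spark.sql.shuffle.partitions` configuration, and revisited as the dataset grows.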
VMworld 2015: Extreme Performance Series - vSphere Compute & Memory (VMworld)
In this session we'll dive deep into how the vSphere compute and memory schedulers work to provide the same level of performance as bare metal. Hosted by two outstanding performance engineers, the session reviews concepts such as how and when vSphere schedules vCPUs, how virtual machines are idled, virtual machine memory overhead, and how large memory pages help or hurt performance. If you want to understand what vSphere does at an atomic level, you don't want to miss this advanced session.
This is an updated version of my JITServer talk that I will present at Open Source Summit North America in May 2023
The Next Frontier in Open Source Java Compilers: Just-In-Time Compilation as a Service
For Java developers, the Just-In-Time (JIT) compiler is key to improved performance. However, in a container world, the performance gains are often negated due to CPU and memory consumption constraints. To help solve this issue, the Eclipse OpenJ9 JVM provides JITServer technology, which separates the JIT compiler from the application.
JITServer allows the user to employ much smaller containers, enabling a higher density of applications and resulting in cost savings for end users and/or cloud providers. Because the CPU and memory surges due to JIT compilation are eliminated, the user has a much easier task of provisioning resources for their application. Additional advantages include faster ramp-up time, better control over resources devoted to compilation, increased reliability (JIT compiler bugs no longer crash the application), and amortization of compilation costs across many application instances.
We will dig into JITServer technology, showing the challenges of implementation, detailing its strengths and weaknesses and illustrating its performance characteristics. For the cloud audience we will show how it can be deployed in containers, demonstrate its advantages compared to a traditional JIT compilation technique and offer practical recommendations about when to use this technology.
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni... (MLconf)
Fast, Cheap and Deep – Scaling Machine Learning: Distributed high throughput machine learning is both a challenge and a key enabling technology. Using a Parameter Server template we are able to distribute algorithms efficiently over multiple GPUs and in the cloud. This allows us to design very fast recommender systems, factorization machines, classifiers, and deep networks. This degree of scalability allows us to tackle computationally expensive problems efficiently, yielding excellent results e.g. in visual question answering.
State of ICS and IoT Cyber Threat Landscape Report 2024 preview (Prayukth K V)
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
More Related Content
Similar to Performance and Availability Tradeoffs in Replicated File Systems
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova... (Ramesh Iyer)
In today's fast-changing business world, companies that fail to adapt and embrace new ideas often struggle to keep up with the competition. Fostering a culture of innovation takes work: it takes vision, leadership, and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
JMeter webinar - integration with InfluxDB and Grafana (RTTS)
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
Securing your Kubernetes cluster: a step-by-step guide to success! (KatiaHIMEUR1)
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Essentials of Automations: Optimizing FME Workflows with Parameters (Safe Software)
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Epistemic Interaction - tuning interfaces to provide information for AI support (Alan Dix)
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Transcript: Selling digital books in 2024: Insights from industry leaders - T... (BookNet Canada)
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality (Inflectra)
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf (91mobiles)
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
UiPath Test Automation using UiPath Test Suite series, part 3 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation introduction
UI automation sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Key Trends Shaping the Future of Infrastructure.pdf (Cheryl Hung)
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud, and open source: how these areas are likely to mature and develop over the short and long term, and how organisations can position themselves to adapt and thrive.
Performance and Availability Tradeoffs in Replicated File Systems
1. Performance and Availability Tradeoffs in Replicated File Systems
Peter Honeyman
Center for Information Technology Integration
University of Michigan, Ann Arbor
2. Acknowledgements
• Joint work with Dr. Jiaying Zhang
• Now at Google
• This was a chapter of her dissertation
• Partially supported by
• NSF/NMI GridNFS
• DOE/SciDAC Petascale Data Storage Institute
• NetApp
• IBM ARC
9. Parameters
• Failure-free, single-server run time
• Can be estimated or measured
• Our focus is on 1 to 10 days
10. Parameters
• Replication overhead
• Penalty associated with replication to backup servers
• Proportional to RTT
• Ratio can be measured by running with a backup server a few msec away
11. Parameters
• Recovery time
• Time to detect failure of the primary server and switch to a backup server
• Not a sensitive parameter
13. Server failure
• Estimated by analyzing PlanetLab ping data
• 716 nodes, 349 sites, 25 countries
• All-pairs, 15 minute interval, 1/04 to 6/05
• 692 nodes were alive throughout
• We ascribe missing pings to node failure and network partition
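The counting rule on this slide (a missed ping means the node was down) can be sketched as a toy calculation. The function below is purely illustrative and assumes one boolean per 15-minute probe interval:

```python
def availability_and_failures(pings):
    """pings: one boolean per 15-minute probe; False is a missed ping,
    which we ascribe to node failure or network partition.
    Returns (fraction of intervals up, number of failure events),
    counting one event per up-to-down transition."""
    up_fraction = sum(pings) / len(pings)
    failures = sum(1 for prev, cur in zip(pings, pings[1:])
                   if prev and not cur)
    return up_fraction, failures
```

Applied over all 692 surviving nodes and eighteen months of probes, this kind of tally yields the per-node failure rates the simulation needs.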
15. Correlated failures
P(n nodes down | 1 node down), by nodes per site:

failed nodes        2      3      4      5
     2            0.526  0.593  0.552  0.561
     3                   0.546  0.440  0.538
     4                          0.378  0.488
     5                                 0.488
number of sites    259     65     21     11
16. Correlated failures
[Figure: average failure correlation (0 to 0.25) vs. RTT (25 to 175 ms), with linear fits per group size:]

nodes   slope          y-intercept
2       -2.4 × 10^-4   0.195
3       -2.3 × 10^-4   0.155
4       -2.3 × 10^-4   0.134
5       -2.4 × 10^-4   0.119
17. Run-time model
• Discrete event simulation for expected run time and utilization
[State diagram: start → run → end, with fail → recover transitions returning to run]
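The model on this slide can be sketched as a tiny discrete-event simulation. The version below is a hedged reconstruction, not the authors' simulator: it assumes exponential failure inter-arrival times and that replication preserves progress across a failover, so each failure costs only the recovery time.

```python
import random

def simulate_run_time(work, mtbf, recovery, rng):
    """Wall-clock hours to finish `work` hours of compute when the
    primary can fail (exponential inter-failure times, mean `mtbf`)
    and each failure costs `recovery` hours to detect and switch to
    a backup; progress is preserved by replication (assumption)."""
    elapsed, done = 0.0, 0.0
    while done < work:
        next_failure = rng.expovariate(1.0 / mtbf)
        progress = min(next_failure, work - done)
        done += progress
        elapsed += progress
        if done < work:            # a failure interrupted the run
            elapsed += recovery    # pay detection + failover time
    return elapsed

# e.g. a one-day job, 10-day MTBF, 15-minute recovery
rng = random.Random(42)
runs = [simulate_run_time(work=24, mtbf=240, recovery=0.25, rng=rng)
        for _ in range(2000)]
utilization = 24 / (sum(runs) / len(runs))
```

Utilization here is failure-free run time divided by expected actual run time, matching the quantity plotted on the following slides.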
18. Simulation results
one hour; no replication: utilization = .995
[Surface plots: utilization (0.6 to 1.0) vs. RTT and write intensity (0.0001 to 0.1), for one backup and four backups]
19. Simulation results
one day; no replication: utilization = .934
[Surface plots: utilization (0.6 to 1.0) vs. RTT and write intensity (0.0001 to 0.1), for one backup and four backups]
20. Simulation results
ten days; no replication: utilization = .668
[Surface plots: utilization (0.50 to 1.00) vs. RTT, for one backup and four backups]
21. Simulation discussion
• Replication improves utilization for long-running jobs
• Multiple backup servers do not improve utilization (due to low PlanetLab failure rates)
22. Simulation discussion
• Distant backup servers improve utilization for light writers
• Distant backup servers do not improve utilization for heavy writers
• Implications for checkpoint interval …
23. Checkpoint interval
calculated on the back of a napkin
[Figures: one day, 20% checkpoint overhead; ten days, 2% checkpoint overhead (one backup server); ten days, 2% checkpoint overhead (four backup servers)]
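The napkin arithmetic is not spelled out on the slide; one standard back-of-the-envelope for picking a checkpoint interval is Young's approximation, shown here purely as an illustration of the kind of calculation involved (the example numbers are assumptions, not the talk's figures).

```python
import math

def young_interval(checkpoint_cost_hours, mtbf_hours):
    """Young's approximation for the optimal checkpoint interval:
    sqrt(2 * C * MTBF), where C is the cost of one checkpoint."""
    return math.sqrt(2 * checkpoint_cost_hours * mtbf_hours)

# e.g. a 6-minute checkpoint (0.1 h) against a 100-hour MTBF
print(round(young_interval(0.1, 100), 2))  # 4.47
```

Cheaper checkpoints or a lower failure rate both stretch the optimal interval, which is why the overhead percentages on the slide differ so much between the one-day and ten-day regimes.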
24. Work in progress
• Realistic failure data
• Storage and processor failure
• PDSI failure data repository
• Realistic checkpoint costs — help!
• Realistic replication overhead
• Depends on amount of computation
• Less than 10% for NAS Grid Benchmarks
25. Conclusions
• Conventional wisdom holds that consistent mutable replication in large-scale distributed systems is too expensive to consider
• Our study suggests otherwise
26. Conclusions
• Consistent replication in large-scale distributed storage systems is feasible and practical
• Superior performance
• Rigorous adherence to conventional file system semantics
• Improved utilization
27. Thank you for your attention!
www.citi.umich.edu
Questions?