Nobody Knows What It’s Like To Be the Bad Man: The Development Process for th... (Work-Bench)
Delivered by Max Kuhn, Pfizer Global R&D, and Zachary Deane-Mayer, Cognius, at the inaugural New York R Conference in New York City at Work-Bench on Friday, April 24th, and Saturday, April 25th.
Actor Concurrency Bugs: A Comprehensive Study on Symptoms, Root Causes, API U... (Raffi Khatchadourian)
Actor concurrency is becoming increasingly important in the development of real-world software systems. Although actor concurrency may be less susceptible to some multithreaded concurrency bugs, such as low-level data races and deadlocks, it comes with its own bugs that may be different. However, the fundamental characteristics of actor concurrency bugs, including their symptoms, root causes, API usages, examples, and differences when they come from different sources are still largely unknown. Actor software development can significantly benefit from a comprehensive qualitative and quantitative understanding of these characteristics, which is the focus of this work, to foster better API documentation, development practices, testing, debugging, repairing, and verification frameworks. To conduct this study, we take the following major steps. First, we construct a set of 186 real-world Akka actor bugs from Stack Overflow and GitHub via manual analysis of 3,924 Stack Overflow questions, answers, and comments and 3,315 GitHub commits, messages, original and modified code snippets, issues, and pull requests. Second, we manually study these actor bugs and their fixes to understand and classify their symptoms, root causes, and API usages. Third, we study the differences between the commonalities and distributions of symptoms, root causes, and API usages of our Stack Overflow and GitHub actor bugs. Fourth, we discuss real-world examples of our actor bugs with these symptoms and root causes. Finally, we investigate the relation of our findings with those of previous work and discuss their implications. A few findings of our study are: (1) symptoms of our actor bugs can be classified into five categories, with Error as the most common symptom and Incorrect Exceptions as the least common, (2) root causes of our actor bugs can be classified into ten categories, with Logic as the most common root cause and Untyped Communication as the least common, (3) a small number of Akka API packages are responsible for most of API usages by our actor bugs, and (4) our Stack Overflow and GitHub actor bugs can differ significantly in commonalities and distributions of their symptoms, root causes, and API usages. While some of our findings agree with those of previous work, others sharply contrast.
Build systems orchestrate how human-readable source code is translated into executable programs. In a software project, source code changes can induce changes in the build system (a.k.a. build co-changes). It is difficult for developers to identify when build co-changes are necessary due to the complexity of build systems. Prediction of build co-changes works well if there is a sufficient amount of training data to build a model. However, in practice, for new projects, there exists a limited number of changes. Using training data from other projects to predict the build co-changes in a new project can help improve the performance of the build co-change prediction. We refer to this problem as cross-project build co-change prediction.
In this paper, we propose CroBuild, a novel cross-project build co-change prediction approach that iteratively learns new classifiers. CroBuild constructs an ensemble of classifiers by iteratively building classifiers and assigning them weights according to its prediction error rate. Given that only a small proportion of code changes are build co-changing, we also propose an imbalance-aware approach that learns a threshold boundary between those code changes that are build co-changing and those that are not in order to construct classifiers in each iteration. To examine the benefits of CroBuild, we perform experiments on 4 large datasets including Mozilla, Eclipse-core, Lucene, and Jazz, comprising a total of 50,884 changes. On average, across the 4 datasets, CroBuild achieves an F1-score of up to 0.408. We also compare CroBuild with other approaches such as a basic model, AdaBoost proposed by Freund et al., and TrAdaBoost proposed by Dai et al. On average, across the 4 datasets, the CroBuild approach yields an improvement in F1-scores of 41.54%, 36.63%, and 36.97% over the basic model, AdaBoost, and TrAdaBoost, respectively.
Having trouble managing dependencies with golang? Here's how to resolve those issues using some of the best tools built by the community for the community.
The Impact of Code Review Coverage and Participation on Software Quality (Shane McIntosh)
Software code review, i.e., the practice of having third-party team members critique changes to a software system, is a well-established best practice in both open source and proprietary software domains. Prior work has shown that the formal code inspections of the past tend to improve the quality of software delivered by students and small teams. However, the formal code inspection process mandates strict review criteria (e.g., in-person meetings and reviewer checklists) to ensure a base level of review quality, while the modern, lightweight code reviewing process does not. Although recent work explores the modern code review process qualitatively, little research quantitatively explores the relationship between properties of the modern code review process and software quality. Hence, in this paper, we study the relationship between software quality and: (1) code review coverage, i.e., the proportion of changes that have been code reviewed, and (2) code review participation, i.e., the degree of reviewer involvement in the code review process. Through a case study of the Qt, VTK, and ITK projects, we find that both code review coverage and participation share a significant link with software quality. Low code review coverage and participation are estimated to produce components with up to two and five additional post-release defects, respectively. Our results empirically confirm the intuition that poorly reviewed code has a negative impact on software quality in large systems using modern reviewing tools.
Collaborative DevOps is crucial to fast project development. With an MPL (Modular Pipeline Library), such collaboration between teams is quick and easy.
Architecting The Future - WeRise Women in Technology (Daniel Barker)
Kubernetes and Docker are two of the top open source projects, and they’re built around abstractions and metadata. These two concepts are the key to architecting in the future. Come with me as I dig a little deeper into these concepts within k8s and Docker and provide some examples from my own work.
Presented at CPPP Conference, Paris, on 15th June 2019.
You've inherited some legacy code: it's valuable, but it doesn't have tests, and it wasn't designed to be testable, so you need to start refactoring! But you can't refactor safely until the code has tests, and you can't add tests without refactoring. How can you ever break out of this loop?
I present a new C++ library for applying Llewellyn Falco's "Approval Tests" approach to testing cross-platform C++ code - for both legacy and green-field systems, and a range of testing frameworks.
I describe its use in some real-world situations, including how to quickly lock down the behaviour of legacy code. I show how to quickly achieve good test coverage, even for very large sets of inputs. Finally, I also describe some general techniques I learned along the way.
This webinar by Oleksandr Navka (Lead Software Engineer, Consultant, GlobalLogic) was delivered at Java Community Webinar #2 on September 17, 2020.
Webinar agenda:
- Tools for testing
- Features of creating a context for testing Spring applications
- Context caching to speed up integration testing
More details and presentation: https://www.globallogic.com/ua/about/events/java-community-webinar-2/
Good Practices for Developing Scientific Software Frameworks: The WRENCH fram... (Rafael Ferreira da Silva)
What is the best way to write secure and reliable applications?
- Good Enough Practices in Scientific Computing (https://arxiv.org/abs/1609.00037)
- Best Practices for Scientific Computing (https://doi.org/10.1371/journal.pbio.1001745)
- Scientific Software Best Practices (https://scientific-software-best-practices.readthedocs.io/en/latest/)
- Best Practices for Scientific Software (https://software.ac.uk/blog/2017-11-29-best-practices-scientific-software)
OpenStack documentation includes a series of documents for administrators and API users, all of which need to be translated. The continuous development of these documents makes translation management difficult, but the process can be automated with continuous integration. This slide deck introduces the process and the technologies used for translation management during OpenStack documentation internationalization. It also includes a demo of creating a Chinese version of the manuals.
Preparing and submitting a package to CRAN - Jean Sanderson, Sheffield R User... (Paul Richards)
Presentation by Jean Sanderson given at the June R Users Group meeting, describing how to build an R package and submit it to CRAN, based on her experiences with the "adwave" package.
Collecting and Leveraging a Benchmark of Build System Clones to Aid in Qualit... (Shane McIntosh)
Build systems specify how sources are transformed into deliverables, and hence must be carefully maintained to ensure that deliverables are assembled correctly. Similar to source code, build systems tend to grow in complexity unless specifications are refactored. This paper describes how clone detection can aid in quality assessments that determine if and where build refactoring effort should be applied. We gauge cloning rates in build systems by collecting and analyzing a benchmark comprising 3,872 build systems. Analysis of the benchmark reveals that: (1) build systems tend to have higher cloning rates than other software artifacts, (2) recent build technologies tend to be more prone to cloning, especially of configuration details like API dependencies, than older technologies, and (3) build systems that have fewer clones achieve higher levels of reuse via mechanisms not offered by build technologies. Our findings aided in refactoring a large industrial build system containing 1.1 million lines.
But when you call me Bayesian I know I’m not the only one (Work-Bench)
Delivered by Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University, at the inaugural New York R Conference in New York City at Work-Bench on Friday, April 24th, and Saturday, April 25th.
Delivered by Josh Katz (Graphics Editor, The New York Times) at the 2016 New York R Conference on April 8th and 9th at Work-Bench. See the rest of the conference videos & presentations at http://www.rstats.nyc.
Presented to the Chicago R User Group, Jan 29, 2015
Good data analysis is reproducible. If someone else can’t independently replicate your results from your data, the consequences can be severe. With R, a major challenge for reproducibility is the ever-changing package ecosystem: it's all too easy to develop an R script using packages, only to find that collaborators download later versions of those packages when they attempt to reproduce your results, and the outcome can be unpredictable!
In this talk I'll introduce the Reproducible R Toolkit, and the "checkpoint" package, included with Revolution R Open, and describe some best practices for writing reliable, reproducible R code with packages.
Presented at useR! 2014, July 2, 2014.
The R ecosystem is in a state of near constant change. While a new version of the R engine is now released just once a year, 2-3 patches are usually released in the interim. On top of that, new versions of R packages on CRAN are released at a rate of several per day (and that’s not counting packages that are part of the BioConductor project or hosted elsewhere on the Web).
While this rapid change is a boon for the advancement of R, it can cause problems for package authors[1] and also for scientists and their peers who may need to reliably reproduce the results of an R script (possibly dependent on a number of packages) months or even years down the line. In this talk we propose a downstream distribution of CRAN packages that provides for the reproducibility of R scripts and reduces the impact of dependencies for packages authors.
Presentation from the "[JOI] TOTVS Developers Joinville - Java #1" meetup, held on 07/08/2019.
** Java news, GraalVM, and Quarkus
** From zero to the cloud with Java and Kubernetes
This session covers how unit testing of Spark applications is done and recommends the best way to do it. This includes writing unit tests with and without the Spark Testing Base package, a Spark package containing base classes to use when writing tests with Spark.
SPACK: A Package Manager for Supercomputers, Linux, and macOS (inside-BigData.com)
“HPC software is becoming increasingly complex. The space of possible build configurations is combinatorial, and existing package management tools do not handle these complexities well. Because of this, most HPC software is built by hand. This talk introduces “Spack”, an open-source tool for scientific package management which helps developers and cluster administrators avoid having to waste countless hours porting and rebuilding software. Spack uses concise package recipes written in Python to automate builds with arbitrary combinations of compilers, MPI versions, and dependency libraries. With Spack, users can rapidly install software without knowing how to build it; developers can efficiently manage automatic builds of tens or hundreds of dependency libraries; and HPC center staff can deploy many versions of software for thousands of users.”
Watch the video: http://insidehpc.com/2017/04/spack-package-manager-supercomputers-linux-macos/
Learn more: https://spack.io/
and
http://www.hpcadvisorycouncil.com/events/2017/swiss-workshop/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
FASTEN presentation at OSS2021, by Michele Scarlato, Endocode, May 12, 2021, ... (Fasten Project)
The FASTEN project aims to support DevOps teams and help developers track, manage, and master dependencies. FASTEN’s goal is to develop a toolchain that provisions and collects project information, security alerts, and repositories from well-known and widely used services. It merges this information into a data stream, performs analysis, stores it, and, consequently, builds a call graph for each analyzed project. The gathered information is made available through a REST API and Web UI, and continuous integration provides developers with updated and sanitized versions of their dependencies. One part of this toolchain will be an Open Source license analysis. This analysis should perform a verification and compatibility check on licenses used in Open Source projects and facilitate development from a user perspective, as well as create industry-relevant information on license infringements. This functionality will be presented in this talk.
FASTEN has received funding from the European Union's Horizon 2020 research and innovation programme. It is carried out by a Consortium composed of AUEB, TUDelft, University of Milan-Bicocca, Endocode, OW2, SIG, and XWIKI.
How to implement continuous delivery with enterprise Java middleware? (Thoughtworks)
The goal of Continuous Delivery is to move your production release frequency from months to weeks or even days. This all sounds great, but is Continuous Delivery achievable in a complex enterprise IT environment running Java EE middleware such as WebLogic, WebSphere or JBoss?
In this deck, Andrew Phillips (VP Products, XebiaLabs) and Sriram Narayan (Product Principal, ThoughtWorks Studios) examine the challenges of Continuous Delivery in a complex environment, the key drivers and benefits for moving to Continuous Delivery, and simple ways to get started. We also demonstrate a Java EE delivery pipeline using ThoughtWorks Go and XebiaLabs Deployit that helps you get started and addresses the challenges commonly encountered in enterprise environments.
The New InterSystems: Open Source, Meetups, Hackathons (Timur Safin)
Presentation for the 1st InterSystems Meetup in Minsk:
- How the new and improved InterSystems is changing its practices;
- open-source repositories, meetups, and hackathons;
- CPM (a package manager) as a good example of an open-source project
Similar to Reproducibility with Checkpoint & RRO (20)
The Next Generational Shift In Enterprise Infrastructure Has Arrived. If SlideShare is broken, please download the report here: https://www.scribd.com/document/352452857/2017-Enterprise-Almanac
AI to Enable Next Generation of People Managers (Work-Bench)
In our work with hundreds of top fast growth startups and globally-distributed Fortune 1000 corporations in our enterprise tech ecosystem here in NYC, the most common refrain we hear is: "managing people is hard."
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... (John Andrews)
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2... (pchutichetpong)
M Capital Group (“MCG”) expects demand to keep evolving alongside supply, as institutional investment rotates out of offices and into work from home (“WFH”), and as the need for data storage keeps expanding with global internet usage, which experts predict will reach 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
2. OUR COMPANY
The leading provider of advanced analytics software and services based on open source R, since 2007
OUR PRODUCT
REVOLUTION R: The enterprise-grade predictive analytics application platform based on the R language
SOME KUDOS
“This acquisition will help customers use advanced analytics within Microsoft data platforms” -- Joseph Sirosh, CVP C+E
3. What is Reproducibility?
“The goal of reproducible research is to tie specific instructions to data analysis and experimental data so that scholarship can be recreated, better understood and verified.”
CRAN Task View on Reproducible Research (Kuhn)
Method + Environment -> Results
A process for:
– Sharing the method
– Describing the environment
– Recreating the results
xkcd.com/242/
4. Reproducibility – why do we care?
Academic / Research:
– Verify results
– Advance research
Business:
– Production code
– Reliability
– Reusability
– Collaboration
– Regulation
www.nytimes.com/2011/07/08/health/research/08genes.html
http://arxiv.org/pdf/1010.1092.pdf
5. Observations
R versions are pretty manageable:
– Major versions just once a year
– Patches rarely introduce incompatible changes
Good solutions for literate programming:
– RStudio / knitr / R Markdown
OS/Hardware is not the major cause of problems
The big problem is with packages:
– CRAN is in a state of continual flux
6. Package dependency explosion
An R script file using the 6 most popular packages
Any updated package = potential reproducibility error!
http://blog.revolutionanalytics.com/2014/10/explore-r-package-connections-at-mran.html
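As a rough illustration of that explosion (a sketch, not from the deck), the miniCRAN package can enumerate the full dependency closure of a handful of popular packages; the exact package list below is an assumption, not necessarily the slide's six:
# Sketch: count the CRAN dependency closure of a few popular packages
library(miniCRAN)
pkgs <- c("ggplot2", "dplyr", "tidyr", "stringr", "knitr", "Rcpp")  # assumed list
deps <- pkgDep(pkgs, suggests = FALSE)  # recursively resolve Depends/Imports/LinkingTo
length(deps)  # dozens of packages, any one of which may change on CRAN tomorrow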
8. Reproducible R Toolkit
Static CRAN mirror in Revolution R Open
– CRAN packages fixed with each RRO update
Daily CRAN snapshots
– Storing every package version since September 2014
– Hosted at mran.revolutionanalytics.com/snapshot
Write and share scripts synced to a specific snapshot date
– checkpoint package installed with RRO
projects.revolutionanalytics.com/rrt/
9. Using checkpoint
Add 2 lines to the top of your script
library(checkpoint)
checkpoint("2015-01-28")
That’s it?
Optionally, check the R version as well
library(checkpoint)
checkpoint("2015-01-28", R.version="3.1.3")
Or, whichever date you want
10. Using checkpoint
Easy to use: add 2 lines to the top of each script
library(checkpoint)
checkpoint("2014-09-17")
For the package author:
– Use package versions available on the chosen date
– Installs packages local to this project
• Allows different package versions to be used simultaneously
For a script collaborator:
– Automatically installs required packages
• Detects required packages (no need to manually install!)
– Uses same package versions as script author to ensure reproducibility
11. R script with packages having dependencies
## Adapted from https://gist.github.com/abresler/46c36c1a88c849b94b07
## Blog post Jan 21
## http://blog.revolutionanalytics.com/2015/01/a-beautiful-story-about-nyc-weather.html
library(checkpoint)
checkpoint("2015-02-05",R="3.1.2")
library(dplyr)
library(tidyr)
library(magrittr)
library(ggplot2)
12. First time the script is run
checkpoint: Part of the Reproducible R Toolkit from Revolution Analytics
http://projects.revolutionanalytics.com/rrt/
> checkpoint("2015-02-05",R="3.1.2")
Scanning for packages used in this project
|==========================================================================================| 100%
- Discovered 5 packages
Installing packages used in this project
- Installing ‘dplyr’
also installing the dependencies ‘assertthat’, ‘R6’, ‘Rcpp’, ‘magrittr’, ‘lazyeval’, ‘DBI’, ‘BH’
package ‘assertthat’ successfully unpacked and MD5 sums checked
package ‘R6’ successfully unpacked and MD5 sums checked
package ‘Rcpp’ successfully unpacked and MD5 sums checked
package ‘magrittr’ successfully unpacked and MD5 sums checked
package ‘lazyeval’ successfully unpacked and MD5 sums checked
package ‘DBI’ successfully unpacked and MD5 sums checked
package ‘BH’ successfully unpacked and MD5 sums checked
package ‘dplyr’ successfully unpacked and MD5 sums checked
- Installing ‘ggplot2’
also installing the dependencies ‘colorspace’, ‘stringr’, ‘RColorBrewer’, ‘dichromat’, ‘munsell’, ‘labeling’, ‘plyr’, ‘digest’, ‘gtable’, ‘reshape2’,
‘scales’, ‘proto’, ‘MASS’
13. Lots of dependent packages loaded
package ‘colorspace’ successfully unpacked and MD5 sums checked
package ‘stringr’ successfully unpacked and MD5 sums checked
package ‘RColorBrewer’ successfully unpacked and MD5 sums checked
package ‘dichromat’ successfully unpacked and MD5 sums checked
package ‘munsell’ successfully unpacked and MD5 sums checked
package ‘labeling’ successfully unpacked and MD5 sums checked
package ‘plyr’ successfully unpacked and MD5 sums checked
package ‘digest’ successfully unpacked and MD5 sums checked
package ‘gtable’ successfully unpacked and MD5 sums checked
package ‘reshape2’ successfully unpacked and MD5 sums checked
package ‘scales’ successfully unpacked and MD5 sums checked
package ‘proto’ successfully unpacked and MD5 sums checked
package ‘MASS’ successfully unpacked and MD5 sums checked
package ‘ggplot2’ successfully unpacked and MD5 sums checked
- Previously installed ‘magrittr’
- Installing ‘tidyr’
also installing the dependency ‘stringi’
package ‘stringi’ successfully unpacked and MD5 sums checked
package ‘tidyr’ successfully unpacked and MD5 sums checked
checkpoint process complete
14. Checkpoint tips for script authors
Work within a project:
– Dedicated folder with scripts, data and output
– e.g. /Users/Joe/Rstudio_Projects/weather
Create a master .R script file beginning with:
library(checkpoint)
checkpoint("DATE")
– Package versions used will be as of this date
Don’t use install.packages directly:
– Use library() and checkpoint does the rest
– You can have different package versions installed for different projects at the same time!
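A minimal sketch of such a master script, following these tips; the data file, its columns, and the packages used are hypothetical, chosen only to make the shape concrete:
library(checkpoint)
checkpoint("2015-02-05")                  # pin all CRAN packages to this snapshot date
library(dplyr)                            # declare packages with library(); no install.packages()
weather <- read.csv("data/weather.csv")   # hypothetical input inside the project folder
weather %>% group_by(month) %>% summarise(avg_temp = mean(temp))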
15. Sharing projects with checkpoint
Just share your script or project folder!
Recipient only needs:
– compatible R version
– checkpoint package (installed with RRO)
– Internet connection to MRAN (at least first time)
Checkpoint takes care of:
– Installing CRAN packages
• Binaries (ease of installation)
• Correct versions (reproducibility)
• Dependencies (ease of installation)
– Eliminating conflicts with other installed packages
16. The checkpoint magic
The checkpoint() call does all this:
Scans project for required packages
Installs required packages and dependencies
– Packages installed specific to project
– Versions specific to checkpoint date
• Installed in ~/.checkpoint/DATE
• Skips packages if already installed (2nd run through)
Reconfigures package search path
– Points only to project-specific library
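One way to see that reconfigured search path for yourself; a sketch, with the printed path given as an illustrative example of the ~/.checkpoint/DATE layout rather than verbatim output:
library(checkpoint)
checkpoint("2015-01-28")
.libPaths()
# the first entry now points inside ~/.checkpoint/2015-01-28/..., so subsequent
# library() calls resolve only against the project-specific, date-pinned library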
17. MRAN checkpoint server
Checkpoint uses MRAN’s downstream CRAN mirror with daily snapshots.
[Diagram: CRAN is mirrored to cran.revolutionanalytics.com (Windows binaries virus-scanned); the checkpoint server snapshots that mirror at midnight UTC, publishing daily snapshots at mran.revolutionanalytics.com/snapshot/, which the checkpoint package targets via library(checkpoint); checkpoint("2015-01-28").]
18. checkpoint server - implementation
Checkpoint uses MRAN’s downstream CRAN mirror with daily snapshots.
rsync to mirror CRAN daily
– Only downloads changed packages
zfs to store incremental snapshots
– Storage only required for new packages
Organizes snapshots into a labelled hierarchy
– mran.revolutionanalytics.com/snapshot/YYYY-MM-DD
MRAN hosted by high-performance cloud provider
– Provisioned for availability and latency
https://github.com/RevolutionAnalytics/checkpoint-server
19. Using non-CRAN packages reproducibly
Today, checkpoint only manages packages from CRAN
GitHub: use install_github with a specific checkin hash
install_github("ramnathv/rblocks",
ref="a85e748390c17c752cc0ba961120d1e784fb1956")
BioConductor: use packages from a specific BioConductor release
– Not as easy as it seems!
Private packages / behind the firewall
– use miniCRAN to create a local, static repository
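For the private-package case, a hedged miniCRAN sketch; pkgDep() and makeRepo() are miniCRAN functions, while the repository path and example package are illustrative:
library(miniCRAN)
repo <- "~/local-miniCRAN"                      # illustrative location
dir.create(repo, recursive = TRUE)
pkgs <- pkgDep("data.table", suggests = FALSE)  # dependency closure of an example package
makeRepo(pkgs, path = repo, type = "source")    # download the packages and index the repo
# install.packages() can then use repos = paste0("file://", normalizePath(repo))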
20. Why use checkpoint?
Write and share R code whose results can be reproduced, even if new (and possibly incompatible) package versions are released later.
Share R scripts with others that will automatically install the appropriate package versions (no need to manually install CRAN packages).
Write R scripts that use older versions of packages, or packages that are no longer available on CRAN.
Install packages (or package versions) visible only to a specific project, without affecting other R projects or R users on the same system.
Manage multiple projects that use different package versions.
21. What about packrat?
Packrat is flexible and powerful
– Supports non-CRAN packages (e.g. github)
– Allows mix-and-matching package versions
– Requires shipping all package source
– Requires recipients to build packages from source
Checkpoint is simple
– Reproducibility from one script
– Simple for recipients to reproduce results
– Only allows use of CRAN packages versions that have been tested
together
– Requires Web access (and availability of MRAN)
rstudio.github.io/packrat/
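For contrast, the corresponding packrat workflow, sketched with packrat's documented entry points (the project path is illustrative):
library(packrat)
packrat::init("~/Rstudio_Projects/weather")  # snapshot this project's package versions
# ... later, on a collaborator's machine, inside the shared project folder:
packrat::restore()                           # rebuild those exact versions from the bundled sources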
22. Revolution R Open is:
An enhanced open source distribution of R
Compatible with all R-related software
Multi-threaded for performance
Focused on reproducibility
Open source (GPLv2 license)
Available for Windows, Mac OS X, Ubuntu, Red Hat and OpenSUSE
Includes checkpoint
Designed to work with RStudio
Side-by-side installations
Download from mran.revolutionanalytics.com
23. CRAN mirrors and RRO
Revolution R Open ships with a fixed default CRAN mirror
– Currently, 1 March 2015 snapshot (v 8.0.2)
– Soon: (v 8.0.3)
All users of the same version get the same CRAN package versions by default
– regardless of when “install.packages” is run
Use checkpoint to access newer package versions
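You can inspect which snapshot a session points at; a sketch, where the URL shown is an assumed example of the MRAN snapshot pattern rather than captured output:
getOption("repos")
# on RRO this is a fixed, dated snapshot, e.g.
# "https://mran.revolutionanalytics.com/snapshot/2015-03-01"
library(checkpoint)
checkpoint("2015-04-01")  # assumed later date: pulls newer versions without editing repos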
24. MRAN
The Managed R Archive Network
Download Revolution R Open
Learn about R and RRO
Explore R Packages
Explore Task Views
R tips and applications
Daily CRAN snapshots
mran.revolutionanalytics.com
25. Thank you.
Learn more at:
projects.revolutionanalytics.com/rrt/
Find archived webinars at:
revolutionanalytics.com/webinars
www.revolutionanalytics.com
1.855.GET.REVO
Twitter: @RevolutionR
Joseph Rickert
Data Scientist, Community Manager
Revolution Analytics
@RevoJoe
Joseph.rickert@revolutionanalytics.com
26. Multi-threaded performance
Intel MKL replaces standard BLAS/LAPACK algorithms (Windows/Linux)
Pipelined operations
– Optimized for Intel, works for all architectures
High-performance algorithms, sequential and parallel
– Uses as many threads as there are available cores
– Control with: setMKLthreads(<value>)
No need to change any R code
Included in RRO binary distribution
More at the Revolutions blog
27. # Create a local checkpoint library
library(checkpoint)
checkpoint("2014-11-14")
> library(checkpoint)
checkpoint: Part of the Reproducible R Toolkit from Revolution Analytics
http://projects.revolutionanalytics.com/rrt/
Warning message:
package ‘checkpoint’ was built under R version 3.1.2
> checkpoint("2014-11-14")
Scanning for loaded pkgs
Scanning for packages used in this project
Installing packages used in this project
Warning: dependencies ‘stats’, ‘tools’, ‘utils’, ‘methods’, ‘graphics’, ‘splines’, ‘grid’, ‘grDevices’ are not available
also installing the dependencies ‘bitops’, ‘stringr’, ‘digest’, ‘jsonlite’, ‘lattice’, ‘RCurl’, ‘rjson’, ‘statmod’,
‘survival’, ‘XML’, ‘httr’, ‘Matrix’
package ‘bitops’ successfully unpacked and MD5 sums checked
package ‘stringr’ successfully unpacked and MD5 sums checked
package ‘digest’ successfully unpacked and MD5 sums checked
package ‘jsonlite’ successfully unpacked and MD5 sums checked
package ‘lattice’ successfully unpacked and MD5 sums checked
package ‘RCurl’ successfully unpacked and MD5 sums checked
package ‘rjson’ successfully unpacked and MD5 sums checked
package ‘statmod’ successfully unpacked and MD5 sums checked
package ‘survival’ successfully unpacked and MD5 sums checked
package ‘XML’ successfully unpacked and MD5 sums checked
package ‘httr’ successfully unpacked and MD5 sums checked
package ‘Matrix’ successfully unpacked and MD5 sums checked
package ‘h2o’ successfully unpacked and MD5 sums checked
package ‘miniCRAN’ successfully unpacked and MD5 sums checked
package ‘igraph’ successfully unpacked and MD5 sums checked
28. SVD of a 100,000 x 2,000 matrix
RRO (R 3.1.1): user 78.08, system 1.18, elapsed 42.51
R 3.1.2: user 431.44, system 0.95, elapsed 432.70
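A hedged way to reproduce this comparison yourself, scaled down so it runs quickly; setMKLthreads() is the RRO-only control from the multi-threading slide:
set.seed(1)
m <- matrix(rnorm(10000 * 500), nrow = 10000)  # smaller stand-in for 100,000 x 2,000
system.time(svd(m))       # compare elapsed time under RRO (MKL) vs. plain R
# under RRO, vary the thread count and re-run to see the scaling:
# setMKLthreads(1)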
29. Revolution R Plus Technical Support
Technical support for R, from the R experts.
Developers: Email and phone support, 8AM-6PM Mon-Fri
Production Servers and Hadoop: 24x7 Severity-1 coverage
On-line case management and knowledgebase
Access to technical resources, documentation and user forums
Exclusive on-line webinars from community experts
Guaranteed response times
Also available: expert hands-on and on-line training for R, from Revolution Analytics AcademyR.
www.revolutionanalytics.com/plus