This document discusses various programming language trends including functional programming languages like Haskell, purely object-oriented languages, and Scala. It also covers database trends like NoSQL databases Cassandra and MongoDB. Programming tools like Docker and Vagrant are mentioned. The document discusses paradigms like static vs dynamic typing and strong vs weak typing. It provides examples and resources for languages including Haskell, Erlang, Scala, and databases like Cassandra.
Third-Party Software Library Reuse: From Adoption to Migration - Ali Ouni
AndroLib is a search-based approach for recommending libraries for Android apps. It uses NSGA-II to generate a set of non-dominated library recommendation solutions based on three objectives: maximizing recommended library co-usage, maximizing library functional diversity, and maximizing reuse from successful apps. This addresses limitations of existing approaches which do not customize recommendations for Android or consider libraries together. An evaluation compares AndroLib to state-of-the-art techniques to show it achieves more accurate recommendations.
Data lineage has gained popularity in the Machine Learning community as a way to make models and datasets easier to interpret and to help developers debug their ML pipelines by enabling them to go from a model to the dataset/user who trained it. Data provenance and lineage is the process of building up the history of how a data artifact came to be. This history of derivations and interactions can provide a better context for data discovery, debugging, as well as auditing. In this area, others, such as Google and Databricks, have made small steps.
In the Hopsworks approach presented, provenance information is collected implicitly through unobtrusive instrumentation of Jupyter notebooks and Python code - what we call 'implicit provenance'.
Course: Bioinformatics for Biomedical Research (2014).
Session: 2.2- Introduction to Galaxy. A web-based genome analysis platform.
Statistics and Bioinformatics Unit (UEB) & High Technology Unit (UAT) of the Vall d'Hebron Research Institute (www.vhir.org), Barcelona.
Using Apache Spark with IBM SPSS Modeler with Dr. Steve Poulin.
An introduction to Apache Spark and its integration with IBM SPSS Modeler: why integrate, and what kinds of benefits does it bring?
A high-level review of the integration process, with advice on which enhanced features to pay attention to and which common pitfalls to avoid.
BigDataSpain 2016: Stream Processing Applications with Apache Apex - Thomas Weise
Stream processing applications built on Apache Apex run on Hadoop clusters and typically power analytics use cases where availability, flexible scaling, high throughput, low latency and correctness are essential. These applications consume data from a variety of sources, including streaming sources like Apache Kafka, Kinesis or JMS, file based sources or databases. Processing results often need to be stored in external systems (sinks) for downstream consumers (pub-sub messaging, real-time visualization, Hive and other SQL databases etc.). Apex has the Malhar library with a wide range of connectors and other operators that are readily available to build applications. We will cover key characteristics like partitioning and processing guarantees, generic building blocks for new operators (write-ahead-log, incremental state saving, windowing etc.) and APIs for application specification.
Apache Kafka is an open-source message broker project developed by the Apache Software Foundation and written in Scala. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.
This is the first part of the presentation.
The second part is available here:
http://www.slideshare.net/knoldus/introduction-to-apache-kafka-part-2
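For a flavor of the producer API mentioned in the overview, here is a minimal Scala sketch; the broker address and topic name are illustrative placeholders, not taken from the presentation.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object ProducerSketch extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")  // placeholder broker address
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

  val producer = new KafkaProducer[String, String](props)
  // Publish a single record to a placeholder topic.
  producer.send(new ProducerRecord[String, String]("events", "key-1", "hello, kafka"))
  producer.close()
}
```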
Emerging Technologies/Frameworks in Big Data - Rahul Jain
A short overview presentation on emerging technologies/frameworks in Big Data, covering Apache Parquet, Apache Flink, and Apache Drill, along with basic concepts of columnar storage and Dremel.
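As a quick illustration of columnar storage in practice, the sketch below writes and reads a Parquet file with Spark; the paths and data are invented for the example.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-sketch").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1, "alice"), (2, "bob")).toDF("id", "name")
df.write.mode("overwrite").parquet("/tmp/users.parquet")  // stored column by column on disk
val back = spark.read.parquet("/tmp/users.parquet")
back.select("name").show()  // a columnar layout lets Spark read only the needed column
```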
Kafka Connect allows data ingestion into and out of Kafka topics from external systems. It uses connectors that define how to read/write data from sources like files or databases and map them to Kafka topics. Connectors contain a SourceConnector that runs on the leader node and distributes work, and SourceTasks that do the actual data ingestion work. Sink connectors work similarly, to move data from Kafka topics to external systems. While Kafka Connect provides a simple way to integrate systems with Kafka, it lacks some capabilities, such as exactly-once delivery and backpressure control for ingestion speed.
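To make the connector model above concrete, here is a sketch of a standalone source connector configuration, modeled on the FileStreamSource connector that ships with Kafka; the file and topic names are placeholders.

```properties
# connect-file-source.properties - a minimal standalone source connector (placeholder names)
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=/tmp/input.txt
topic=connect-file-events
```

The worker splits this connector into SourceTasks (here just one) that tail the file and publish each line to the topic.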
Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ... - DataStax Academy
This document discusses real-time log analysis using Mesos, Docker, Kafka, Spark, Cassandra and Solr at scale. It provides an overview of the architecture, describing how data from various sources like syslog can be ingested into Kafka via Docker producers. It then discusses consuming from Kafka to write to Cassandra in real-time and running Spark jobs on Cassandra data. The document uses these open source tools together in a reference architecture to enable real-time analytics and search capabilities on streaming data.
The document describes an approach called Doc2Spec that infers resource specifications from natural language API documentation. It extracts method descriptions and class hierarchies from documentation using NLP. It builds action-resource pairs and infers automata specifications. An evaluation on 5 libraries found the inferred specifications have high precision and recall, and are useful for detecting real bugs in open source projects. Future work includes improving specification templates and applying it to other documentation.
Intro to Big Data Analytics using Microsoft Machine Learning Server with Spark - Alex Zeltov
Alex Zeltov - Intro to Big Data Analytics using Microsoft Machine Learning Server with Spark
By combining enterprise-scale R analytics software with the power of Apache Hadoop and Apache Spark, Microsoft R Server for HDP or HDInsight gives you the scale and performance you need. Multi-threaded math libraries and transparent parallelization in R Server handle up to 1000x more data at up to 50x faster speeds than open-source R, helping you train more accurate models for better predictions. R Server works with the open-source R language, so all of your R scripts run without changes.
Microsoft Machine Learning Server is your flexible enterprise platform for analyzing data at scale, building intelligent apps, and discovering valuable insights across your business with full support for Python and R. Machine Learning Server meets the needs of all constituents of the process – from data engineers and data scientists to line-of-business programmers and IT professionals. It offers a choice of languages and features algorithmic innovation that brings the best of open source and proprietary worlds together.
R support is built on a legacy of Microsoft R Server 9.x and Revolution R Enterprise products. Significant machine learning and AI capabilities enhancements have been made in every release. In 9.2.1, Machine Learning Server adds support for the full data science lifecycle of your Python-based analytics.
This meetup will NOT be a data science intro or an R programming intro. It is about working with data and big data on MLS.
- How to Scale R
- Work with R and Hadoop + Spark
- Demo of MLS on HDP/HDInsight server with RStudio
- How to operationalize model deployment using the MLS web-service operationalization features on MLS Server or on the cloud Azure ML (PaaS) offering
Speaker Bio:
Alex Zeltov is a Big Data Solutions Architect / Software Engineer / Programmer Analyst / Data Scientist with over 19 years of industry experience in Information Technology, most recently in Big Data and Predictive Analytics. He currently works as a Global Black Belt Technical Specialist at Microsoft, where he concentrates on Big Data and Advanced Analytics use cases. Before joining Microsoft, he worked as a Sr. Solutions Engineer at Hortonworks, where he specialized in the HDP and HDF platforms.
This document provides an overview of Kafka and Spark Streaming. It discusses key concepts of Kafka like brokers, topics, producers and consumers. It also explains how Spark Streaming works using micro-batches and time window APIs. Finally, it discusses reliability aspects and references for further reading.
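A minimal sketch of the micro-batch and time-window APIs the overview refers to, assuming a simple socket text source; the batch and window durations are illustrative.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("window-sketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))  // 10-second micro-batches

val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines
  .flatMap(_.split("\\s+"))
  .map((_, 1))
  .reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(10))  // 60s window, sliding every 10s

counts.print()
ssc.start()
ssc.awaitTermination()
```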
Bringing Complex Event Processing to Spark Streaming - DataWorks Summit
Complex event processing (CEP) is about identifying business opportunities and threats in real time by detecting patterns in data and taking appropriate automated action. Example business use cases for CEP include location-based marketing, smart inventories, targeted ads, Wi-Fi offloading, fraud detection, churn prediction, fleet management, predictive maintenance, security incident event management, and many more. While Spark Streaming provides a distributed, resilient framework for ingesting events in real time, effort is still needed to build CEP applications. This is because CEP use cases require correlation of events, which in turn requires us to treat every incoming event as a discrete occurrence in time; Spark Streaming treats the entire batch of events as a single occurrence. Many CEP use cases also require alerts to be fired even when there is no incoming event. An example of such a use case is firing an alert when an order-shipped event is NOT received within the SLA time following an order-received event. At Oracle we have adopted a few neat techniques, like running continuous query engines as long-running tasks and using empty batches as triggers, to bring complex event processing to Spark Streaming.
Join us to learn more about CEP for Spark, the fastest-growing data processing platform in the world.
Speakers
Prabhu Thukkaram, Senior Director, Product Development, Oracle
Hoyong Park, Architect, Oracle
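As a rough sketch of the "empty batches as triggers" idea mentioned in the abstract: foreachRDD runs once per batch interval even when the batch is empty, so it can double as a timer for SLA checks. The names below (Order, pending, slaMillis) are hypothetical, not Oracle's implementation.

```scala
import org.apache.spark.streaming.dstream.DStream
import scala.collection.mutable

case class Order(id: String, receivedAt: Long)

val pending = mutable.Map.empty[String, Order]  // orders still awaiting an order-shipped event
val slaMillis = 60000L                          // illustrative SLA of one minute

def watchSla(orderReceived: DStream[Order]): Unit =
  orderReceived.foreachRDD { rdd =>
    rdd.collect().foreach(o => pending.put(o.id, o))  // toy driver-side bookkeeping
    // This block also runs for empty batches, so overdue orders are caught on time.
    val now = System.currentTimeMillis()
    pending.valuesIterator
      .filter(o => now - o.receivedAt > slaMillis)
      .foreach(o => println(s"ALERT: order ${o.id} missed its shipping SLA"))
  }
```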
Online Tweet Sentiment Analysis with Apache Spark - Davide Nardone
Sentiment Analysis (SA) applies Natural Language Processing (NLP), text analysis, and computational linguistics to extract and identify subjective information in source material. A fundamental task of SA is to classify the polarity of a given text at the document, sentence, or feature/aspect level - whether the opinion expressed is positive, negative, or neutral. Usually, this analysis is performed offline using Machine Learning (ML) techniques. In this project, two online tweet classification methods are proposed, which exploit the well-known framework Apache Spark for processing the data and the tool Apache Zeppelin for data visualization.
Large-Scale Data Science on Hadoop (Intel Big Data Day) - Uri Laserson
The document discusses data science workflows on Hadoop. It describes data science as involving three phases - data plumbing to ingest and transform data, exploratory analytics to investigate and analyze data, and operational analytics to build and deploy models. It provides examples of tools used for each phase including Spark, Hadoop streaming, SAS, and Python for exploratory analytics, and MLlib and Spark for operational analytics. The document also discusses lambda architectures for handling both batch and real-time analytics.
Cloudera - Using Morphlines for On-the-Fly ETL by Wolfgang Hoschek - Hakka Labs
In this talk, Senior Software Engineer Wolfgang Hoschek from Cloudera discusses Morphlines, the easy way to build and integrate ETL apps for Hadoop. The talk was recorded at the StumbleUpon offices.
Cloudera Morphlines is a new open source framework that reduces the time and skills necessary to integrate, build, and change Hadoop processing applications that extract, transform, and load data into Apache Solr, Apache HBase, HDFS, enterprise data warehouses, or analytic online dashboards.
Wolfgang Hoschek is a Software Engineer on the Platform team and the lead developer on Morphlines. He is a former CERN fellow and received his Ph.D. from the Technical University of Vienna, Austria, and his M.S. from the University of Linz, Austria.
This document discusses improvements to ORC support in Apache Spark 2.3. It describes previous issues with ORC performance and compatibility in Spark. The current approach in Spark 2.3 introduces a new native ORC file format that provides significantly better performance compared to the previous Hive ORC implementation. It allows configuring the ORC implementation and reader type. The document also demonstrates ORC usage in Spark and PySpark. Benchmark results show the native ORC reader provides up to 15x faster performance for scans and predicate pushdown. Future work items are discussed to further improve ORC support in Spark.
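The configuration the document describes can be sketched as follows; the path and column name are placeholders, while `spark.sql.orc.impl` and `spark.sql.orc.filterPushdown` are the Spark 2.3 settings referred to above.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("orc-sketch").getOrCreate()
import spark.implicits._

spark.conf.set("spark.sql.orc.impl", "native")          // new native reader; "hive" selects the old path
spark.conf.set("spark.sql.orc.filterPushdown", "true")  // let predicates be pushed into the ORC scan

val df = spark.read.orc("/data/events.orc")             // placeholder path
df.filter($"status" === "ok").show()                    // this filter can be evaluated during the scan
```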
This was a short introduction to the Scala programming language.
My colleague and I presented these slides in the Programming Language Design and Implementation course at K.N. Toosi University of Technology.
In this article I present some of the most useful Scala libraries and frameworks that help ASSIST Software engineers develop highly scalable applications supporting concurrency and non-blocking operations.
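As a small taste of the concurrency style these talks cover, here is a sketch using Scala Futures; the fetch functions are invented for the example.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Stand-ins for asynchronous calls (e.g., to a database or web service).
def fetchUser(id: Int): Future[String] = Future { s"user-$id" }
def fetchScore(user: String): Future[Int] = Future { user.length * 10 }

// The for-comprehension sequences the two calls without blocking a thread.
val result: Future[Int] =
  for {
    user  <- fetchUser(42)
    score <- fetchScore(user)
  } yield score

println(Await.result(result, 2.seconds))  // blocking only here, for the demo
```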
Model-based Analysis of Large Scale Software Repositories - Markus Scheidgen
1) The document discusses a model-based framework for analyzing large scale software repositories. It involves reverse engineering software from version control systems to create abstract syntax tree models, applying transformations and queries to derive metrics and insights, and using Scala for flexible queries and transformations.
2) Two example analyses are described: calculating design structure matrices and propagation costs, and detecting cross-cutting concerns by analyzing co-changed methods within commits.
3) The goal is to enable scalable, language-independent analysis of ultra-large repositories through model-based techniques instead of analyzing raw code directly. This allows abstracting different languages and repositories with common models and analyses.
Introduction to Roslyn and its use in program development - PVS-Studio
Roslyn is a platform which provides the developer with powerful tools to parse and analyze code. It's not enough just to have these tools, you should also understand what they are needed for. This article is intended to answer these questions. Besides this, you will find details about the static analyzer development which uses Roslyn API.
The document discusses key concepts related to memory models in C#, including:
1. The compilation process involves lexical analysis, parsing, semantic analysis, optimization, and code generation.
2. Value types are stored on the stack while reference types are stored on the heap.
3. The garbage collector performs memory management by freeing up unused memory on the heap.
A presentation at Twitter's official developer conference, Chirp, about why we use the Scala programming language and how we build services in it. Provides a tour of a number of libraries and tools, both developed at Twitter and otherwise.
Citation Networks present us with a wide variety of problems. This project interprets a large number of Computer Science Research Papers from the DBLP archives and predicts a field in which a certain author is likely to contribute in the near future.
Building .NET Core tools using the Roslyn API by Arthur Tabatchnic at .Net fo... - DevClub_lv
The Roslyn C# API gives developers access to the C# language parsing and generation capabilities. This can be leveraged in many ways by building tools that fully understand the code and are capable of generating new code with full comprehension of your codebase. In this talk I will walk you through the basic concepts of Roslyn and language parsing, building a small tool using these principles, and packaging it as a .NET Core tool for easy distribution and usage via the .NET Core CLI.
This presentation is motivated by the continuous growth of Scala's popularity thanks to the many new concepts it offers. It therefore makes perfect sense to take a closer look at this language. Besides the language itself, its ecosystem is also very important, which is why I will focus on the Scala ecosystem in this presentation.
Analyzing the Evolution of Testing Library Usage in Open Source Java Projects - Ahmed Zerouali
This document analyzes the evolution of testing library usage in open source Java projects. It finds that JUnit is the most frequently used testing library. It also examines which libraries are introduced when in a project's lifetime, which libraries are used over time, and whether projects migrate between competing libraries. The analysis is based on over 4,500 Java projects from GitHub that use Maven and have been active for at least two years. It finds that many libraries are used simultaneously and that JUnit is dominant, but that a small percentage of projects do migrate between libraries.
Apache Maven and its impact on Java 9 (JavaOne 2017) - Robert Scholte
Maven is a build tool that manages projects and dependencies. It aims to make the build process easy and to provide a uniform build system. With Java 9, Maven faces challenges due to the module system's differences from the classpath; specifically, the modulepath does not support all the file types the classpath does. This causes issues for Maven, which relies on dependencies being on the classpath. Additionally, automatic modules generated from JAR files can cause problems for projects. Maven cannot directly generate module descriptors, but tools like jdeps can assist with initial descriptors. Overall, while challenges exist, Maven can still be used for most projects with Java 9.
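For instance, the jdeps tool that ships with JDK 9 can propose an initial module descriptor for an existing JAR; the file and directory names below are placeholders.

```
jdeps --generate-module-info out mylib.jar
```

This writes a draft module-info.java under the out directory, which can then be reviewed and refined by hand.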
The Essence of the VivaCore Code Analysis Library - PVS-Studio
The article tells developers about the VivaCore library: the preconditions of its creation, its capabilities, structure, and scope of use. This article was written while VivaCore was still in development, so some details of the final implementation may differ from the features described here. But this won't prevent developers from getting acquainted with the general working principles of the library and its mechanisms for analyzing and processing C and C++ source code.
Building an application upon Semantic Web models: a brief overview of Apache Jena and the OWL API.
Semantic Web course
e-Lite group (https://elite.polito.it)
Politecnico di Torino, 2017
A General Purpose Extensible Scanning Query Architecture for Ad Hoc Analytics - Flurry, Inc.
We present Burst, an analytic query system with a scalable and flexible approach to performing low-latency ad hoc analysis over large complex datasets. The architecture consists of hardware-efficient scan techniques and a language facility to transform an extensible set of ad hoc declarative queries into imperative physical scan plans. These plans are multicast across all nodes/cores of a two-level sharded/distributed ingestion, storage, and execution topology and executed. The first release of this system is the query engine behind the Flurry Explorer product. Here we explore the design details of that system as well as the incremental ingestion pipeline enhancement currently being implemented for the next major release.
ESSIR LivingKnowledge DiversityEngine tutorial - Jonathon Hare
The document summarizes a symposium on bias and diversity in information retrieval testbeds. It introduces the Diversity Engine, which provides collections, annotation tools, and an evaluation framework to allow for collaborative and comparable research on indexing and searching documents annotated with various metadata like entities, bias, trust, and multimedia features. It describes the architecture, design decisions, supported document collections and formats, analysis modules for text and images, indexing and search functionality using Solr, application development, and evaluation framework. An example application is demonstrated by indexing a sample collection and making it searchable.
Ruby on Rails is a web application framework that uses the Ruby programming language. It allows developers to create and manage web applications that interact with relational databases through a web-based user interface. Rails emphasizes conventions over configuration, making assumptions about conventions to decrease configuration. It also utilizes metaprogramming techniques and scaffolding to increase developer productivity. Rails follows the model-view-controller architecture and embraces test-driven development.
Ruby on Rails is a web application framework for Ruby that allows developers to create or manage web applications that manipulate relational databases from a web-based user interface. Rails emphasizes convention over configuration, making assumptions about naming and directory structure that allow for rapid development. It implements the model-view-controller architecture and embraces test-driven development. Rails aims to increase productivity by reducing repetition and configuration through conventions and metaprogramming techniques.
Ruby on Rails is a web application framework that allows developers to build dynamic web applications using the Ruby programming language. It emphasizes convention over configuration and aims to increase developer productivity. Some key features of Rails include its use of the MVC framework, Active Record for database access, scaffolding for quickly building CRUD applications, and support for metaprogramming techniques. Rails also supports test-driven development and provides environments for development, testing and production.
TrustArc Webinar - 2024 Global Privacy Survey - TrustArc
How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024?
In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores.
See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
HCL Notes and Domino License Cost Reduction in the World of DLAU (webinar in German) - panagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU and the licenses under the CCB and CCX model have been a hot topic for many in the HCL community since last year. As a Notes or Domino customer, you may be struggling with unexpectedly high user counts and license fees. You may be wondering how this new kind of licensing works and what benefits it brings you. Above all, you surely want to stay within your budget and save costs wherever possible. We understand that, and we want to help!
We explain how to resolve common configuration problems that can cause more users to be counted than necessary, and how to identify and remove superfluous or unused accounts to save money. There are also some practices that can lead to unnecessary expenses, for example when a person document is used instead of a mail-in for shared mailboxes. We show you such cases and their solutions. And of course we explain the new license model.
Join this webinar, in which HCL Ambassador Marc Thomas and guest speaker Franz Walder bring you up to speed on this new world. It gives you the tools and know-how to stay on top of things. You will be able to reduce your costs through an optimized Domino configuration and keep them low going forward.
These topics will be covered:
- Reducing license costs by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how best to use it
- Tips for common problem areas, such as team mailboxes, functional/test users, etc.
- Practical examples and best practices to implement right away
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Climate Impact of Software Testing at Nordic Testing Days - Kari Kakkonen
My slides from Nordic Testing Days, June 6, 2024.
The climate impact and sustainability of software testing are discussed in the talk. ICT and testing must carry their part of global responsibility to help with climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Sustainability can be added to the quality characteristics and then measured continuously. Test environments can be used less, at smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence - IndexBug
Imagine a world where machines not only perform tasks but also learn, adapt, and make decisions. This is the promise of Artificial Intelligence (AI), a technology that's not just enhancing our lives but revolutionizing entire industries.
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
Removing Uninteresting Bytes in Software Fuzzing - Aftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speed up fuzzing campaigns by pinpointing and eliminating the uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format). Our preliminary results show that AFL+DIAR does not only discover new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Driving Business Innovation: Latest Generative AI Advancements & Success Story - Safe Software
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed with the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has utilized Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos to illustrate the full capabilities of FME in AI-driven processes.
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0! - SOFTTECHHUB
As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.
What do a Lego brick and the XZ backdoor have in common? - Speck&Tech
ABSTRACT: At first glance, what a Lego brick and the XZ backdoor have in common might be that both are building blocks, or dependencies, of creative and software projects. In reality, a Lego brick and the XZ backdoor case have much more in common than that.
Join the presentation to immerse yourself in a story of interoperability, standards, and open formats, and then discuss the important role contributors play in a sustainable open source community.
BIO: Advocate of free software and of standard, open formats. She has been an active member of the Fedora and openSUSE projects and co-founded the LibreItalia Association, where she was involved in several events, migrations, and training activities related to LibreOffice. She previously worked on LibreOffice migrations and training courses for several public administrations and private companies. Since January 2020 she has worked at SUSE as a Software Release Engineer for Uyuni and SUSE Manager, and when she is not pursuing her passion for computers and for Geeko, she cultivates her curiosity about astronomy (hence her nickname, deneb_alpha).
HCL Notes and Domino License Cost Reduction in the World of DLAU - panagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able lower your cost through an optimized configuration and keep it low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc
- Practical examples and best practices to implement right away
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024 - Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
Automating the Generation of Benchmark Suites
1. SOFTWARE TECHNIK
Automating the Generation of Benchmark Suites
Creation, Assessment, and Management of Effective Test Corpora
Ben Hermann
@benhermann
Joint work with Lisa Nguyen Quang Do, Michael Eichberg, Karim Ali, and Eric Bodden
National Java Resource Workshop @ SPLASH, Vancouver
October 23rd, 2017
3. Evaluation of Code Analyses
• Compare results of an analysis against
• A ground truth (to show soundness)
• A previous analysis (to show improvement, e.g., in precision)
4. Evaluation of Code Analyses
[Diagram: the new analysis and previous analyses each analyze an evaluation corpus, on which the ground truth is based.]
8.-13. Construction of a Corpus
Size
Content
Representativeness
Permanence
(Criteria from Tempero et al. 2010)
Sources
Purpose
How to determine this? How to achieve this?
15.-18. Sourcing Projects for the Corpus
(Addresses the Size and Content criteria.)
[Diagram: ABM collects projects from sources such as GitHub, BitBucket, …, and builds them into compiled projects.]
Criteria such as size, license, or programming language apply.
We currently support maven and sbt, but are expanding (e.g., gradle).
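To make the criteria-based selection concrete, a minimal sketch of the filtering step might look like the following; the Repo fields, accepted licenses, and size threshold are hypothetical, not ABM's actual schema.

```scala
// Hypothetical sketch of filtering candidate repositories by corpus criteria.
case class Repo(name: String, sizeKb: Long, license: String, language: String, buildTool: String)

val matchesCriteria: Repo => Boolean = r =>
  r.language == "Java" &&                                     // content criterion
  Set("maven", "sbt").contains(r.buildTool) &&                // must be buildable by the pipeline
  Set("MIT", "Apache-2.0", "EPL-1.0").contains(r.license) &&  // redistribution-friendly licenses
  r.sizeKb > 100                                              // size criterion

def selectCandidates(repos: Seq[Repo]): Seq[Repo] = repos.filter(matchesCriteria)
```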
25. Representativeness in Custom Collections
We used the three algorithms to construct respective call graphs for a large set of libraries: the 100 most used distinct Java-related libraries from the Maven Central Repository. The set is representative of a wide range of libraries.
It contains very small (e.g., JUnit) to very large (e.g., Scala Library) libraries; libraries developed primarily in an industrial context (e.g., Guava) or in an open-source setting (e.g., Apache Commons); libraries from very different domains: testing (e.g., Hamcrest, Mockito), databases (e.g., HSQLDB), bytecode engineering (e.g., cglib), runtime environments (e.g., Scala Runtime), containers (e.g., Netty), and also general utility libraries (e.g., osgi.core).
Additionally, it contains two libraries that have unusual properties: jsr305 and easymockclassextension both do not contain a single instance method call. The jsr305 project is just a collection of annotations, and easymockclassextension only contains interface definitions and a few classes with static methods.
Lastly, the set also contains libraries that are written in other languages, such as Scala (e.g., ScalaTest), whose compilers only use a subset of the JVM's concepts. The Scala compiler, e.g., does not use package and protected visibility. This significantly limits our possibilities to identify the library-private implementation (recall that LibCHA_CPA identifies a library's private implementation based on the evaluation of the code elements' visibilities). For each library, we also downloaded all of its dependencies to build complete class hierarchies for them.
Description of the Darmstadt Library Corpus (DLC) from: Michael Reif, Michael Eichberg, Ben Hermann, Johannes Lerch, and Mira Mezini. 2016. Call Graph Construction for Java Libraries. In Proceedings of the 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2016).
34. How Hermes Works
[Diagram: corpus candidates flow into Hermes, which produces an optimal corpus. Hermes is driven by feature queries built on OPAL, followed by a manual or automatic selection step. OPAL was introduced at SOAP 2014, Hermes at SOAP 2017.]
A sketch of the automatic selection idea follows below.
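To make the selection step concrete, here is a minimal sketch of automatic selection as a greedy set cover, assuming each corpus candidate is described by the set of feature IDs Hermes detected in it. Hermes' actual selection is more involved; this only illustrates the underlying idea.

// A hedged sketch: pick a small subset of candidates that together
// still exhibit every feature observed across the whole candidate set.
object GreedyCorpusSelection {

  def select(candidates: Map[String, Set[String]]): Set[String] = {
    var uncovered = candidates.values.flatten.toSet
    var corpus    = Set.empty[String]
    while (uncovered.nonEmpty) {
      // Greedily take the candidate covering the most uncovered features.
      val (best, features) =
        candidates.maxBy { case (_, fs) => (fs & uncovered).size }
      corpus += best
      uncovered --= features
    }
    corpus
  }
}

For example, given the hypothetical candidates junit -> {Reflection, Thread}, guava -> {Unsafe, Reflection}, and netty -> {Unsafe, Thread, JDBC}, the greedy pass picks netty first and then one of the other two to cover Reflection, yielding a two-project corpus.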
39. Feature Queries

trait FeatureQuery {
  // …
  def apply[S](
    projectConfiguration: ProjectConfiguration,
    project: Project[S],
    rawClassFiles: Traversable[(da.ClassFile, S)]
  ): TraversableOnce[Feature[S]]
  // …
}

projectConfiguration: identifier, project JAR files, library JAR files, statistics
project: complete reified project information (classes, fields, methods, bodies, etc.)
rawClassFiles: raw class file information (e.g., for extracting information from the constant pool)
Return value: list of detected features in the codebase (id, frequency of occurrence, (opt.) locations)
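As a hedged sketch of a concrete query against this trait: only the apply signature is taken from the slide; the import paths, the featureIDs member, and the Feature(id, count) constructor are assumptions about the surrounding Hermes/OPAL API.

// Illustrative only: import paths, featureIDs, and the Feature
// constructor are assumptions; the apply signature matches the slide.
import org.opalj.br.analyses.Project
import org.opalj.da
import org.opalj.hermes.{Feature, FeatureQuery, ProjectConfiguration}

object NativeMethodsQuery extends FeatureQuery {

  // The identifier under which this query reports its feature.
  val featureIDs: Seq[String] = Seq("NativeMethods")

  override def apply[S](
      projectConfiguration: ProjectConfiguration,
      project:              Project[S],
      rawClassFiles:        Traversable[(da.ClassFile, S)]
  ): TraversableOnce[Feature[S]] = {
    // Count all methods in the corpus candidate that are declared
    // native; the result pairs the feature id with its frequency.
    val nativeMethods =
      project.allClassFiles.iterator.flatMap(_.methods).count(_.isNative)
    Seq(Feature[S](featureIDs.head, nativeMethods))
  }
}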
41. Already Implemented Queries
• Existence of Bytecode Instructions
• Class File Versions
• Class Types
• Trivial Reflection
• Fan-In/Fan-Out
• Field Access
• Method w/o Returns
• Method Types
• Various Metrics
• Recursive Data Structures
• Size of Inheritance Tree
• API Usage
43. Feature Queries for API Usage
• Bytecode Instrumentation
• Class Loader
• GUI
• Crypto
• JDBC
• Reflection
• System
• Thread
• Unsafe
A sketch of how such a categorization can work follows below.
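As a rough illustration of the idea behind these API-usage queries, kept independent of OPAL's actual bytecode iterators: given the fully qualified targets of all call sites in a project (however they were extracted), bucket them into API categories. The CallSite type and the prefix lists are illustrative assumptions.

// Hypothetical call-site record; in Hermes this information would come
// from the reified project, not from a plain data class like this.
final case class CallSite(callerClass: String, targetClass: String)

object ApiUsage {
  // Category name -> package/class prefixes that count towards it.
  private val categories: Map[String, Set[String]] = Map(
    "Reflection" -> Set("java.lang.reflect.", "java.lang.Class"),
    "JDBC"       -> Set("java.sql."),
    "Thread"     -> Set("java.lang.Thread"),
    "Unsafe"     -> Set("sun.misc.Unsafe")
  )

  // Frequency of call sites per API category.
  def profile(callSites: Seq[CallSite]): Map[String, Int] =
    categories.map { case (name, prefixes) =>
      name -> callSites.count(cs => prefixes.exists(cs.targetClass.startsWith))
    }
}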
44. Constructing a Minimal Corpus
• Dead-Path Analysis [FSE15]
• Original evaluation conducted on the complete Qualitas Corpus
• The minimal corpus consists of only 5 of the 100 projects in the Qualitas Corpus
• Evaluation time cut from 16.77 minutes to 2.82 minutes (~6x faster), while coverage is only 1.06% below that of the original corpus
50. Collection Permanence
ABM stores and retains collection definitions. You can download the corpus (the collected projects) and provide it on your own infrastructure, or publish the complete corpus and use a DOI for papers. We would love to see more services like this.
Addresses the criterion: Permanence
51. Bringing it all together
[Diagram: projects are collected from GitHub, BitBucket, …, and built by ABM; Hermes inspects the result; the complete corpus is published with a DOI for papers.]
52. Automating the Generation of Benchmark Suites
Creation, Assessment, and Management of Effective Test Corpora
Ben Hermann (@benhermann)
Joint work with Michael Reif, Michael Eichberg, and Mira Mezini
Thank you!