Engineering your cloud infrastructure using CHEF. This presentation was given as part of my application to the University of Ottawa for a tenure-track professorship in the Faculty of Engineering. The focus was on using CHEF for infrastructure as code, with a small tangent discussing a MapReduce example. The presentation is partially in English and French.
GraphFrames Access Methods in DSE Graph (Jim Hatcher)
GraphFrames is a powerful feature in Spark that allows you to harness Spark's distributed computing framework to operate on your Graph. Tasks like data ingestion, schema migrations, and analytical jobs can all be run against your Graph. In DSE Graph, there are several methods to leverage GraphFrames including Gremlin, Spark SQL, and Motif. This presentation walks through the basics of using GraphFrames with DSE Graph; then shows how these different methods can be used and how you can evaluate which one is the best for your use case.
Session 1.5 supporting virtual integration of linked data with just-in-time... (semanticsconference)
This document discusses supporting virtual integration of Linked Data through just-in-time query recompilation. It presents a technique for compiling input queries into target SPARQL queries over individual data sources: microcompilers encode knowledge of the data schemas, and query skeletons provide templates. Experiments show that the overhead is mostly standard, that the approach enables queries not otherwise possible, and that it is efficient when expanding queries. Future work includes optimizing the overhead and investigating other languages and templating.
This document describes a project that provides methods for estimating the cardinality of conjunctive queries over RDF data. It discusses the key modules including an RDF loader, parser and importer to extract triples from RDF files and load them into a database. The database stores the triples and generates statistics tables. A cardinality estimator takes a conjunctive query and database statistics to output an estimate of the query's cardinality.
This document provides a summary of a guided tour of the Pythonian Museum focused on analyzing NASA HDF data products using Python. The tour covered 4 main Python packages for working with HDF4, HDF5, netCDF4 and GDAL data formats. It provided examples of reading and visualizing different NASA Earth science data products stored in these formats, including tools for handling latitude/longitude variables, projections, scaling and subsetting large datasets. The goal was to demonstrate common patterns and tips for working with NASA HDF Earth observation data in Python.
The document discusses NASA HDF/HDF-EOS data formats and how to work with them for Earth science data analysis. It begins with an overview of the complexity of NASA HDF data for new users. It then covers common HDF data formats like HDF4, HDF5, HDF-EOS2 and HDF-EOS5, explaining the differences and which APIs to use for each. It provides code examples for reading HDF-EOS data in C and Fortran. It also introduces tools that can help dump and extract HDF data in ASCII format for broader tool compatibility. Overall, the document aims to help new users understand and start working with NASA's HDF data products.
Using PostGIS To Add Some Spatial Flavor To Your Application (Steven Pousty)
- PostGIS adds spatial capabilities to PostgreSQL: types like points, lines, and polygons, and functions like area and distance. It enables spatial queries and analysis.
- To install PostGIS, you need PostgreSQL and libraries like Proj and GEOS. Packages are available for many platforms.
- With PostGIS, you can import spatial data like shapefiles, perform queries using spatial filters and functions, simplify geometries, and more to build mapping and location-based applications.
The document discusses LogicBlox, a database company aiming to create a single "iPhone of databases" that can replace many specialized databases. It presents LogicBlox as offering a declarative query language called LogiQL, ACID transactions, and the ability to handle transactional, analytical, graph, and document data within a single database. The document provides examples of how LogicBlox can be used for graph analysis tasks like counting cliques and calculating PageRank, significantly outperforming other technologies through its optimized algorithms and data structures.
This document appears to be a collection of technical information related to learning technologies and standards. It includes descriptions of specifications such as SCORM, xAPI, IMS Caliper, and standards organizations like ISO/IEC and IMS Global. Sections cover topics including learning analytics, virtual reality, mobile learning, and predictions for the year 2030 related to artificial intelligence and personalized education. The document contains many copyright notices and does not provide any clear overall summary or context.
The document discusses providing easy access to HDF data via NCL, IDL, and MATLAB. It presents examples and code snippets for reading HDF data from various NASA data centers like GES DISC, MODAPS, NSIDC, and LP-DAAC into the three software packages. Common issues when working with HDF files like HDF-EOS2 swaths with dimension maps and different ways metadata is stored are also addressed. The overall goal is to help lower the learning curve for users who want to analyze HDF data in their favorite analysis packages.
The main actor of this alchemic wonder will be a peculiar substance called "R". Some peasants describe "R" as "a software environment for statistical computing and graphics" – but they'll get burned at the stake anyway...
Every company generates various kinds of data, be it accounting data, time records of employees, quality assurance sensor data ... or anything else.
Most of this data just exists, and your company doesn't profit from it. I'd like to show you how to get started making more out of these hidden treasures using R.
We'll start with a very quick introduction to R. I will show you how R basically works, how it compares to Excel, and why it will speed up your journey of transforming data into gold.
Equipped with a basic understanding of R, I will take you on an expedition into what you can do with R – especially in combination with several other open source tools. Two very interesting tools in combination with R are LaTeX and Freeboard.
LaTeX allows you to generate beautiful-looking reports based on your data.
Freeboard is an open-source dashboard that can access and present data from various sources.
Using real-world examples, I will demonstrate how different kinds of reports are created at dkd to support our team and management in their controlling tasks.
After upgrading an Oracle database from version 18.7 to 19.12, queries began encountering errors such as ORA-07445 and ORA-00600, causing instance crashes. The errors seemed related to parsing and transformation components in the SQL processing pipeline. Adding more physical memory resolved the issue, indicating the new database version required more memory than the previous configuration.
ComputeFest 2012: Intro To R for Physical Sciences (alexstorer)
This document provides an introduction to the R programming language presented by Alex Storer at ComputeFest 2012. It discusses why R should be used over other languages like MATLAB and Python, provides examples of basic R syntax and functions, and walks through an example of loading climate data and creating plots to visualize rainfall anomalies over time. The goal is to provide attendees with a foundation of R basics while working through a real data analysis problem.
Lucene and Solr provide a number of options for query parsing, and these are valuable tools for creating powerful search applications. This presentation, given at the 2013 Lucene Revolution, reviews the role that advanced query parsing can play in building systems, including relevancy customization; taking input from user-interface variables such as position on a website or geographical indicators; selecting which sources are to be searched; and incorporating third-party data sources. Query parsing can also enhance data security. Best practices for building and maintaining complex query parsing rules are discussed and illustrated. Chief Architect Paul Nelson provides this compelling presentation.
Search Technologies provides relevancy tuning services for Solr. For further information, see http://www.searchtechnologies.com/solr-lucene-relevancy.html
http://www.searchtechnologies.com
The NPOESS program uses the Unified Modeling Language (UML) to describe the format of the HDF5 files it produces. For each unique type of data product, the HDF5 storage organization and the means of retrieving the data are the same. This provides a consistent data-retrieval interface for manual and automated users of the data, without which custom development and cumbersome maintenance would be required. The data formats are described using UML to provide a profile of HDF5 files.
This poster will show each unique data type so far produced by NPOESS, and the contents of the files. We will also have overhead snapshots of the data contents.
Query optimizers and people have one thing in common: the better they understand their data, the better they can do their jobs. Optimizing queries is hard if you don't have good estimates for the sizes of the intermediate join and aggregate results. Data profiling is a technique that scans data, looking for patterns within the data such as keys, functional dependencies, and correlated columns. These richer statistics can be used in Apache Calcite's query optimizer, and the projects that use it, such as Apache Hive, Phoenix and Drill. We describe how we built a data profiler as a table function in Apache Calcite, review the recent research and algorithms that made it possible, and show how you can use the profiler to improve the quality of your data.
A talk given by Julian Hyde at Apache: Big Data, Miami, on May 16th 2017.
The process of sorting has been one of those problems in computer science that have been around almost from the beginning of time. For example, the tabulating machine (IBM, 1890s Census) was the first early data-processing unit able to sort data cards for people in the USA. After all, the first census took around 7 years to finish, making all the stored data obsolete; hence the need for sorting. What is more, studying the different techniques of sorting allows for a more precise introduction of the algorithm concept. Some corrections were made to a bound for max-heapify... My deepest apologies for the mistakes!
Spark is an open-source cluster computing framework. It started as a project in 2009 at UC Berkeley and was open sourced in 2010. It has over 300 contributors from 50+ organizations. Spark uses Resilient Distributed Datasets (RDDs) to enable in-memory computing across a cluster. RDDs provide a programming model for distributed datasets that can be created from external storage or by transforming existing RDDs, and they support operations like map, filter, and reduce to perform distributed computations lazily.
This document discusses Inmobi's analytics platform called Grill, which provides a unified analytics experience. Grill supports multiple execution engines and storage systems for Hive queries on data cubes. It rewrites queries to the most efficient execution engine and stores query histories. Grill provides a pluggable architecture and analytics capabilities on Inmobi's large Hadoop data warehouse.
Introduction to Spark: Or how I learned to love 'big data' after all. (Peadar Coyle)
Slides from a talk I will give in early 2016 at the Luxembourg Data Science Meetup. The aim is to give an introduction to Apache Spark from a machine-learning expert's point of view, based on various other tutorials out there. It is aimed at non-specialists.
The MathWorks introduced MATLAB support for HDF5 in 2002 via three high-level functions: HDF5INFO, HDF5READ, and HDF5WRITE. These functions worked well for their purpose of providing simple interfaces to a complicated file format, but MATLAB users requested finer control over their HDF5 files and the HDF5 library. MATLAB 7.3 (R2006b) adds this precise level of support for version 1.6.5 of the HDF5 library via a close mapping of the HDF5 C API to MATLAB function calls.
This presentation will briefly introduce the earlier, high-level HDF5 interface (and its limitations) before showing in detail the low-level HDF5 functions. It will show how to interact with the HDF5 library and files using the thirteen classes of functions in MATLAB, which encapsulate groupings of functionality found in the HDF5 C API. But because MATLAB is itself a higher-level language than C, we will also present MATLAB's extensions and modifications of the HDF5 C API that make it more MATLAB-like, work with defined values, and perform ID and memory management.
Wrapping a library like HDF5 requires a great deal of effort and design, and we will briefly present a general-purpose mechanism for creating close mappings between library interfaces and an application like MATLAB. One of our goals in this presentation is to facilitate communication with The HDF Group about how The MathWorks builds our HDF5 interfaces in order to ease adoption of future versions of the HDF5 library in large, general-purpose applications.
The document discusses using cURL and API documentation tools like rspec_api_documentation and raddocs to test and document APIs. It provides information on getting order lists from an example API using cURL, notes that tests can serve as documentation, and lists the rspec_api_documentation and raddocs gems for generating API documentation from tests. The author invites questions and provides contact information.
Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package) (Jeffrey Breen)
The document describes a Big Data workshop held on March 10, 2012 at the Microsoft New England Research & Development Center in Cambridge, MA. The workshop focused on using R and Hadoop, with an emphasis on RHadoop's rmr package. The document provides an introduction to using R with Hadoop and discusses several R packages for working with Hadoop, including RHIPE, rmr, rhdfs, and rhbase. Code examples are presented demonstrating how to calculate average departure delays by airline and month from an airline on-time performance dataset using different approaches, including Hadoop streaming, hive, RHIPE and rmr.
R is an open-source statistical programming language that can be used for data analysis and visualization. The document provided an introduction to R including how to install R, create variables, import and assemble data, perform basic statistical analyses like t-tests and linear regression, and create plots and graphs. Key functions and concepts introduced included using c() to combine values into vectors, reading in data from CSV files, using lm() for linear regression, and the basic plot() function.
The document discusses Apache Arrow and DataFusion. DataFusion is a query execution framework written in Rust that uses Apache Arrow as its in-memory format. It allows for customizable query execution through extensible logical and execution plans. DataFusion provides SQL query support and a DataFrame API, and powers systems like FLOCK and ROAPI through its optimized execution engine.
OpenPOWER Webinar from University of Delaware - Title: OpenMP (offloading) o... (Ganesan Narayanasamy)
This presentation discusses the ongoing project of building a validation and verification (V&V) test suite for the widely popular directive-based parallel programming model, OpenMP. The talk presents results for the OpenMP offloading features implemented in various compilers targeting Summit, among other systems. This project is open source, and the SOLLVE V&V team welcomes collaborations.
Is there a way that we can build our Azure Data Factory all with parameters b... (Erwin de Kreuk)
Is there a way that we can build our Data Factory all with parameters, all based on metadata? Yes there is, and I will show you how. During this session I will show how you can load incremental or full datasets from your SQL database into your Azure Data Lake. The next step is to track the history of these extracted tables; we will do this with Azure Databricks using Delta Lake. The last step is to make this data available in Azure SQL Database or Azure Synapse Analytics. Oh, and we want some logging from our processes as well. A lot to talk about and demo during this session.
The document outlines an agenda for a two-day programming and AI event. Day 1 covers introduction to Python programming, essential Python for AI, machine learning programming, and machine learning project deployment. Day 2 covers fullstack web development, machine learning integration, AI integration, and a case study. Each day includes registration periods, topic sessions, and coffee breaks.
Pivotal Greenplum is a massively parallel processing (MPP) database for analytics. It provides high performance for data warehousing and big data analytics workloads. Key features include its ability to load and query data in parallel across multiple CPUs and disks, support for SQL and analytical functions and libraries like MADlib, and deployment on public clouds or on-premises. Pivotal Greenplum can be used for both structured and unstructured data and integrates with other Pivotal products like GemFire, Data Flow, and the Pivotal Data Suite for analytics workflows.
Graph analytics can be used to analyze a social graph constructed from email messages on the Spark user mailing list. Key metrics like PageRank, in-degrees, and strongly connected components can be computed using the GraphX API in Spark. For example, PageRank was computed on the 4Q2014 email graph, identifying the top contributors to the mailing list.
The document outlines an agenda for a conference on Apache Spark and data science, including sessions on Spark's capabilities and direction, using DataFrames in PySpark, linear regression, text analysis, classification, clustering, and recommendation engines using Spark MLlib. Breakout sessions are scheduled between many of the technical sessions to allow for hands-on work and discussion.
I summarize requirements for an "Open Analytics Environment" (aka "the Cauldron"), and some work being performed at the University of Chicago and Argonne National Laboratory towards its realization.
The document describes the C3 Compute Capacity Calculator tool, which calculates the compute capacity needed for Hadoop jobs to meet required processing time Service Level Agreements (SLAs). The tool estimates the number of map and reduce slots or containers needed based on analyzing job execution data. It helps projects estimate capacity needs for onboarding to production Hadoop clusters and integrates with other tools for job submission and monitoring. The tool uses a rule-based approach and handles various Hadoop job types including Pig scripts and Oozie workflows.
In this talk we look to the future by introducing Spark as an alternative to the classic Hadoop MapReduce engine. We describe the most important differences between the two, detail the main components that make up the Spark ecosystem, and introduce basic concepts for getting started with the development of simple applications on it.
Decentralized Evolution and Consolidation of RDF Graphs (Aksw Group)
This document discusses decentralized evolution and consolidation of RDF graphs. It proposes using techniques from distributed version control systems (DVCS) like Git to track changes to RDF graphs. Key contributions include formalizing operations for committing changes, branching, merging graphs, and reverting commits. Strategies for merging graphs with conflicts like three-way merging are presented. An evaluation of a prototype implementation demonstrates it can correctly track changes and merge graphs while providing good performance. Future work includes improving support for the full framework and applying it to real world knowledge bases.
Python business intelligence (PyData 2012 talk) (Stefan Urbanek)
What is the state of business intelligence tools in Python in 2012? How is Python used for data processing and analysis? Different approaches for business data and scientific data.
Video: https://vimeo.com/53063944
Presto: Optimizing Performance of SQL-on-Anything Engine (DataWorks Summit)
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Bloomberg, Comcast, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.
With the ever-growing list of connectors to new data sources such as Azure Blob Storage, Elasticsearch, Netflix Iceberg, Apache Kudu, and Apache Pulsar, the recently introduced cost-based optimizer in Presto must account for heterogeneous inputs with differing and often incomplete data statistics. This talk explores this topic in detail and discusses the best use cases for Presto across several industries. In addition, we present recent Presto advancements such as geospatial analytics at scale and the project roadmap going forward.
GraphX: Graph analytics for insights about developer communities (Paco Nathan)
The document provides an overview of Graph Analytics in Spark. It discusses Spark components and key distinctions from MapReduce. It also covers GraphX terminology and examples of composing node and edge RDDs into a graph. The document provides examples of simple traversals and routing problems on graphs. It discusses using GraphX for topic modeling with LDA and provides further reading resources on GraphX, algebraic graph theory, and graph analysis tools and frameworks.
Simplifying observability for Serverless projects (Luciano Mammino)
Have you ever considered that your Lambda functions could fail without you even noticing? If the answer is "yes", it is probably because you have already been burned playing with the cloud, where errors and failures are always around the corner. Unfortunately, we cannot prevent every failure, but we can be notified when something goes wrong so we can react promptly. But how do you configure your AWS environment to reach a good level of observability? If you have already tried using CloudWatch, you know how complex it can be. In this talk, we explore the topic of observability for serverless applications on AWS. We discuss problems and best practices. Finally, I propose a tool that automates the CloudWatch configuration for 80% of use cases in a few minutes!
5th in the AskTOM Office Hours series on graph database technologies. https://devgym.oracle.com/pls/apex/dg/office_hours/3084
PGQL: A Query Language for Graphs
Learn how to query graphs using PGQL, an expressive and intuitive graph query language that's a lot like SQL. With PGQL, it's easy to get going writing graph analysis queries to the database in a very short time. Albert and Oskar show what you can do with PGQL, and how to write and execute PGQL code.
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ... (Julian Hyde)
A talk given by Julian Hyde and Tomer Shiran at Hadoop Summit, Dublin.
Data scientists and analysts want the best API, DSL or query language possible, not to be limited by what the processing engine can support. Polyalgebra is an extension to relational algebra that separates the user language from the engine, so you can choose the best language and engine for the job. It also allows the system to optimize queries and cache results. We demonstrate how Ibis uses Polyalgebra to execute the same Python-based machine learning queries on Impala, Drill and Spark. And we show how to build Polyalgebra expressions in Calcite and how to define optimization rules and storage handlers.
The document discusses polyalgebra, an extended form of relational algebra that can handle complex data types like nested records and streaming data. It allows various data processing engines and SQL query engines to operate over different data sources using a single optimization framework. The document outlines the ecosystem of data stores, engines, and frameworks that can be used with polyalgebra and Calcite's rule-based query planning system. It provides examples of how relational algebra expressions capture the logic of SQL queries and how rules are used to optimize query plans.
Programming Foundation Models with DSPy - Meetup Slides (Zilliz)
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
Building Production Ready Search Pipelines with Spark and Milvus (Zilliz)
Spark is a widely used ETL tool for processing, indexing, and ingesting data into the serving stack for search. Milvus is a production-ready open-source vector database. In this talk we show how to use Spark to process unstructured data to extract vector representations, and push the vectors to the Milvus vector database for search serving.
1. Outline
• Motivations
• Infrastructure as Code
• Examples
• Research Opportunities
2. Engineering in the Clouds
Engineering your cloud infrastructure using CHEF
Dr. Andrew Forward
aforward@gmail.com / @a4word
June 19, 2013
8. Managing the Infrastructure (2)
Manual, and error-prone, even with documentation:
• Fedora: Java 1.6, PHP 5.1
• Ubuntu: Java 1.7, PHP
9. Managing Projects
Web App, Monitoring, Build Machine, Documentation
Production, Staging, Test, Demo, Load, QA
How do we integrate the best tools into our projects?
10. So, we use DevOps
And, in particular, infrastructure as code (using Chef).
11. Origins of DevOps
• August 2008: Andrew Shafer, "Agile Infrastructure" (Agile 2008)
• June 23, 2009: John Allspaw and Paul Hammond, "10+ deploys per day" (Velocity)
• Oct 30-31, 2009: Patrick Debois, first DevOpsDays
• March 18, 2011: Cameron Haight, a new IT support model (DevOps)
• 2012: enterprise adoption of DevOps
22. MapReduce
• GFS (Google File System) and MapReduce in 2004
• HDFS and Hadoop open sourced under Apache
• Parallel processing on hundreds of nodes
• BigTable in 2006, and HBase was born
• Store data in massive tables (billions of rows / millions of columns)
• Retrieve key/value pairs in real time
• Google later released
• Sawzall (query language) in 2005
• Pig & Hive (batch queries) in 2008
• Spanner (online queries such as joins / transactions) in 2012
23. Map, Shuffle, Reduce
Both Map and Reduce are stateless, so they can be parallelized with ease (the MapReduce framework manages the distribution of the processing parts and the consolidation of the results).
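To make the flow concrete, here is a minimal single-process sketch in Ruby (the map_reduce helper and the word-count example are illustrative, not Hadoop): because map and reduce are stateless, each call could run on a different node, and only the shuffle needs coordination.

# Minimal single-process sketch of map / shuffle / reduce (illustrative only)
def map_reduce(records, mapper, reducer)
  pairs  = records.flat_map { |record| mapper.call(record) }  # Map: emit (key, value) pairs
  groups = pairs.group_by(&:first)                            # Shuffle: group values by key
                .transform_values { |kvs| kvs.map(&:last) }
  groups.map { |key, values| [key, reducer.call(key, values)] }.to_h  # Reduce per key
end

# Word count as the classic example:
docs    = ["to be or not to be"]
mapper  = ->(doc) { doc.split.map { |word| [word, 1] } }
reducer = ->(_word, counts) { counts.sum }
p map_reduce(docs, mapper, reducer)  # => {"to"=>2, "be"=>2, "or"=>1, "not"=>1}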
24. Example: Top Collaborators
• Analyze author collaborations, e.g. the publications below
• For simplicity, we will refer to the authors as:
• A: Andrew Forward
• O: Omar Badreddin
• T: Timothy C. Lethbridge
• G: Gunther Mussbacher
• J: Janice Singer
O. Badreddin and A. Forward, "Model Oriented Programming: An Empirical Study of Comprehension," CASCON 2012.
T. Lethbridge, G. Mussbacher, and O. Badreddin, "Teaching UML Using Umple: Applying Model-Oriented Programming in the Classroom," CSEE&T 2011, pp. 421-428.
T. C. Lethbridge, J. Singer, and A. Forward, "How software engineers use documentation: the state of the practice," IEEE Software special issue: The State of the Practice of Software Engineering, Nov/Dec 2003, pp. 35-39.
O. Badreddin and T. Lethbridge, "Combining Experiments and Grounded Theory to Evaluate a Research Prototype: Lessons from the Umple Model-Oriented Programmin…
25. Group Authors (Map)
• First, map the list of authors to each other:
O: (O A)
A: (O A)
T: (T G O)
G: (T G O)
O: (T G O)
T: (T J A)
J: (T J A)
A: (T J A)
O: (O T)
T: (O T)
• Which is then grouped / shuffled as:
O: (O A) (T G O) (O T)
A: (O A) (T J A)
T: (T G O) (T J A) (O T)
G: (T G O)
J: (T J A)
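Continuing in the same illustrative Ruby style, and assuming each paper is reduced to its list of author initials, this map-and-shuffle step might be sketched as:

papers = [%w[O A], %w[T G O], %w[T J A], %w[O T]]  # each paper as its author list

# Map: every author on a paper emits (author, full author list)
map_output = papers.flat_map do |authors|
  authors.map { |author| [author, authors] }
end

# Shuffle: group the emitted lists by author
grouped = map_output.group_by(&:first)
                    .transform_values { |pairs| pairs.map(&:last) }
# grouped["O"] => [["O", "A"], ["T", "G", "O"], ["O", "T"]]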
26. Count Collaborations (Reduce)
• Identify all collaborations (union the lists):
O: (O A G T)
A: (A O T J)
T: (O A T G J)
G: (T G O)
J: (T J A)
• Or, count the collaborations between authors:
O: (O 3) (A 1) (T 2) (G 1)
A: (O 1) (A 2) (T 1) (J 1)
T: (T 3) (G 1) (A 1) (O 2) (J 1)
G: (T 1) (G 1) (O 1)
J: (T 1) (J 1) (A 1)
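Both reductions are then a single transformation over the grouped lists (a sketch continuing the code above; tally requires Ruby 2.7+):

# Reduce, variant 1: union the lists to get each author's collaborators
collaborators = grouped.transform_values { |lists| lists.flatten.uniq }
# collaborators["O"] => ["O", "A", "T", "G"]

# Reduce, variant 2: count how often each co-author appears
counts = grouped.transform_values { |lists| lists.flatten.tally }
# counts["O"] => {"O"=>3, "A"=>1, "T"=>2, "G"=>1}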
27. Common Collaborators (Chain)
• What if we wanted to know which collaborators two authors have in common?
• Map the reduction for more results:
(O O): (O A G T)
(A O): (O A G T)
(G O): (O A G T)
(O T): (O A G T)
(G T): (T G O)
(G G): (T G O)
(O G): (T G O)
(A A): (A O T J)
(A O): (A O T J)
(A T): (A O T J)
(A J): (A O T J)
(J T): (T J A)
(J J): (T J A)
(A J): (T J A)
(O T): (O A T G J)
(A T): (O A T G J)
(T T): (O A T G J)
(G T): (O A T G J)
(J T): (O A T G J)
28. Common Collaborators (cont'd)
• Join collaboration pairs (shuffle):
(O O): (O A G T)
(A O): (O A G T) (A O T J)
(O T): (O A G T) (O A T J)
(J T): (T J A) (O A T J)
(A J): (T J A) (A O T J)
(J J): (T J A)
(A A): (A O T J)
(A T): (A O T J) (O A T J)
(G T): (T G O) (O A T J)
(G G): (T G O)
(G O): (T G O)
(T T): (O A T G J)
29. Common Collaborators (Reduce)
• Reduce by taking the union of the lists, and then removing the pair's own authors
• So, if A visits T's research profile, he would see that they have both collaborated with O and J
(O O): (A T)
(A O): (T)
(O T): (A)
(J T): (A)
(A J): (T)
(J J): (T A)
(A A): (O T J)
(A T): (O J)
(G T): (O)
(G G): (T O)
(O G): (T)
(T T): (O A G J)
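The chained job can be sketched the same way, reusing the collaborators hash from the earlier reduce: map each author's list to (pair, list) for every co-author, shuffle by pair, then apply the slide's union-then-remove rule.

# Chain: feed the first reduction back through map / shuffle / reduce
pair_output = collaborators.flat_map do |author, list|
  list.map { |other| [[author, other].sort, list] }  # key by collaboration pair
end

common = pair_output.group_by(&:first)
                    .transform_values { |pairs| pairs.map(&:last) }
                    .map { |pair, lists| [pair, lists.flatten.uniq - pair] }
                    .to_h
# common[["A", "T"]] applies the union-then-remove rule to the pair (A, T)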
30. Chef + MapReduce
• Manage the installation of your Hadoop / HDFS deployment
• Configure single-node servers for algorithm testing, but multi-node for production
• Enable dynamic / elastic provisioning
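A hypothetical Chef recipe for such a node might look like the sketch below; the hadoop package name, file paths, attribute keys, and service name are illustrative assumptions, not a published Hadoop cookbook.

# Illustrative recipe: provision a Hadoop node (resource names are assumptions)
package "hadoop"  # assumes a "hadoop" package in your configured repositories

directory "/var/lib/hadoop" do
  owner "hadoop"
  recursive true
end

# A single attribute switches between a single-node test box and a
# multi-node production cluster.
template "/etc/hadoop/core-site.xml" do
  source "core-site.xml.erb"
  variables(cluster_mode: node["hadoop"]["cluster_mode"])  # "single" or "multi"
  notifies :restart, "service[hadoop-namenode]"
end

service "hadoop-namenode" do
  action [:enable, :start]
end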
39. Related Work
Roberto Di Cosmo, Stefano Zacchiroli, and Gianluigi Zavattaro discuss a formal component model for managing infrastructure in the cloud [1].
J. Weinman quantifies the benefits of cloud computing and defines a mechanism, called Cloudonomics, for axiomatically defining and analyzing cloud benefits [2].
Gunawi et al. [3] introduced Failure as a Service (FaaS), probably best known through "Chaos Monkey" [4], a product open sourced by Netflix. Faraz Faghri et al. [5] developed failure scenarios as a service (FSaaS) for Hadoop clusters.
40. Related Work (cont'd)
[1] K. Gohil, N. Alapati, and S. Joglekar, "Towards behavior driven operations (BDOps)," 3rd International Conference on Advances in Recent Technologies in Communication and Computing (ARTCom 2011), 2011, pp. 262-264. DOI: 10.1049/ic.2011.0095.
[2] J. Weinman, "Cloudonomics: A rigorous approach to cloud benefit quantification," The Journal of Software Technology, 14:10-18, October 2011.
[3] H. S. Gunawi, T. Do, J. M. Hellerstein, I. Stoica, D. Borthakur, and J. Robbins, "Failure as a Service (FaaS): A cloud service for large-scale, online failure drills," Technical Report UCB/EECS-2011-87, EECS Department, University of California, Berkeley, July 2011.
[4] Chaos Monkey. http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html
[5] F. Faghri et al., "Failure Scenario as a Service (FSaaS) for Hadoop Clusters."
[6] X. Zhang, S. Dwarkadas, G. Folkmanis, and K. Shen, "Processor hardware counter statistics as a first-class system resource," in Proceedings of the 11th USENIX Workshop on Hot Topics in Operating Systems (HOTOS '07), Berkeley, CA, USA, 2007, pp. 14:1-14:6. USENIX Association.
[7] M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg, "Quincy: fair scheduling for distributed computing clusters," in Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles (SOSP '09), New York, NY, USA, 2009, pp. 261-276. ACM.
[8] "DevOps: A software revolution in the making?" Cutter IT Journal, 24(8), 2011. Special issue.
[9] S. McIntosh, B. Adams, Y. Kamei, T. Nguyen, and A. E. Hassan, "An empirical study of build maintenance effort," in Proc. of the Intl. Conf. on Software Engineering (ICSE), 2011, pp. 141-150.
[10] J. A. Whittaker, J. Arbon, and J. Carollo, How Google Tests Software. Addison-Wesley Professional, April 2012.
[11] R. DeLine, "Avoiding packaging mismatch with flexible packaging," in Proc. of the Intl. Conf. on Software Engineering (ICSE), 1999, pp. 97-106.
[12] A. van der Hoek and A. L. Wolf, "Software release management for component-based software," Softw. Pract. Exper., vol. 33, pp. 77-98, January 2003.
[13] J. Humble and D. Farley, Continuous Delivery, 1st ed. Addison-Wesley, August 2010.
[14] T. Fitz, "Continuous deployment at IMVU: Doing the impossible fifty times a day," http://goo.gl/qPT6, February 2009.
[15] S. Shankland, "Google ethos speeds up Chrome release cycle," http://goo.gl/vNvlr, July 2010.
[16] F. Khomh, T. Dhaliwal, Y. Zou, and B. Adams, "Do faster releases improve software quality? An empirical case study of Mozilla Firefox," in Proc. of the Working Conf. on Mining Software Repositories (MSR), 2012.
[17] M. Armbrust et al., "Above the clouds: A Berkeley view of cloud computing," Tech. Rep. UCB/EECS-2009-28, EECS Department, University of California, Berkeley, 2009.
[18] D. Spinellis, "Don't Install Software by Hand," IEEE Software, vol. 29, no. 4, 2012, pp. 86-87. DOI: 10.1109/MS.2012.85.
[19] S. Hosono, Jiafu He, Xuemei Liu, Lin Li, He Huang, and S. Yoshino, "Fast Development Platforms and Methods for Cloud Applications," IEEE Asia-Pacific Services Computing Conference (APSCC), 2011, pp. 94-101. DOI: 10.1109/APSCC.2011.75.
[20] R. S. Montero, "Building IaaS Clouds and the art of virtual machine management," International Conference on High Performance Computing and Simulation (HPCS), 2012, p. 573. DOI: 10.1109/HPCSim.2012.6266975.
42. Chef basics
We can…
• Install operating systems
• Install software
• Start / stop services
• Configure repeatedly (idempotent)
Using…
• Ohai
• Chef-client
• Chef-server
• Chef-solo
• Knife
• Shef
http://www.opscode.com/chef/
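A minimal sketch of that declarative, idempotent style, using arbitrary example resources: each block states a desired end state, so converging the same recipe twice changes nothing the second time.

package "ntp"  # install software

service "ntp" do  # start / stop services
  action [:enable, :start]
end

file "/etc/motd" do  # manage files and configuration
  content "Managed by Chef - do not edit by hand\n"
  mode "0644"
end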
44. Ohai Can…
Attribute: Description
node['platform']: the platform on which a node is running
node['platform_version']: the version of the platform
node['ipaddress']: the IP address for a node; if the node has a default route, this is the IPv4 address for the interface
node['macaddress']: the MAC address for a node
node['fqdn']: the fully qualified domain name for a node
node['hostname']: the host name for the node
node['domain']: the domain for the node
node['recipes']: a list of recipes associated with a node (and part of that node's run-list)
node['roles']: a list of roles associated with a node (and part of that node's run-list)
node['ohai_time']: the time at which Ohai was last run
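Inside a recipe these Ohai attributes are available on the node object; for example, a sketch that branches on the detected platform (the package names are illustrative):

case node["platform"]
when "ubuntu", "debian"
  package "apache2"  # Debian-family package name
else
  package "httpd"    # Red Hat-family package name
end

log "provisioning" do
  message "Configuring #{node['fqdn']} at #{node['ipaddress']}"
end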
46. Common Chef Resources
• Package (e.g. yum, apt-get)
• File
• Directory
• Template (using ERB)
• Service (e.g. Upstart)
• Execute (e.g. tar -xzvf …)
• Cron
• Git
• Group
• Mount
• User
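A sketch combining a few of these resources; the repository URL, paths, and schedule are illustrative assumptions:

user "deploy" do
  home  "/home/deploy"
  shell "/bin/bash"
end

git "/opt/myapp" do
  repository "https://example.com/myapp.git"  # hypothetical repository
  revision   "master"
  user       "deploy"
end

cron "nightly-backup" do
  hour    "2"
  minute  "0"
  command "/opt/myapp/backup.sh"
end

execute "extract-assets" do
  command "tar -xzf /tmp/assets.tar.gz -C /opt/myapp"
  not_if { ::File.exist?("/opt/myapp/assets") }  # guard keeps it idempotent
end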
47. Example Chef Roles
• Load Balancer
• Database Master / Slave
• File / Media Server
• Web Server
• Build Server
• Application Specific Server
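In Chef, a role is essentially a named run-list plus attributes. A sketch of a web-server role in the Ruby role DSL, where the cookbook and attribute names are illustrative:

name "web_server"
description "Front-end web server"
run_list "recipe[nginx]", "recipe[myapp::web]"   # recipes assumed to exist
default_attributes "myapp" => { "port" => 8080 }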
Today, I will address the subject of infrastructure as code and show how we can use tools like Chef to manage our systems. My subject is highly relevant, given that our infrastructure is becoming more and more virtualized and elastic.
In other words, this is engineering in the clouds.
What is a cloud in computing? A cloud is defined principally by two things:
First, it is a virtual service offering that can be hosted anywhere, at any time (e.g. Amazon EC2 or Google Apps). Second, resource capacity is dynamic and can be increased or decreased simply on demand.
So far, three common usage models are recognized:
a) IaaS, infrastructure as a service: a service of virtual servers that can be configured on demand.
b) PaaS, platform as a service: databases, software, and APIs (programming interfaces), e.g. Heroku.
c) SaaS, software as a service, e.g. Salesforce, Google Apps.
Newer variants extend the list:
d) NaaS, network as a service: a newer concept that considers network connectivity (e.g. VPN or bandwidth-on-demand).
e) HaaS, hardware as a service.
f) MaaS, metal as a service.
g) FSaaS, failure scenarios as a service.
In addition, these models can be deployed in three ways:
a) private cloud
b) public cloud
c) hybrid cloud
These definitions were set out in September 2011 by NIST, the National Institute of Standards and Technology.
Here are several examples of the types of services available from companies such as Google, Microsoft, and Amazon. In this talk we will deal only with IaaS, together with the application of infrastructure programming using the CHEF software.
http://en.wikipedia.org/wiki/Infrastructure_as_a_service
IaaS providers: Amazon EC2, AirVM, Azure Services Platform, DynDNS, Google Compute Engine, HP Cloud, iland, Joyent, LeaseWeb, Linode, NaviSite, Oracle Infrastructure as a Service, Rackspace, ReadySpace Cloud Services, ReliaCloud, SAVVIS, SingleHop, and Terremark.
http://en.wikipedia.org/wiki/Cloud_computing
PaaS providers: AWS Elastic Beanstalk, Cloud Foundry, Heroku, Force.com, EngineYard, Mendix, OpenShift, Google App Engine, AppScale, Windows Azure Cloud Services, OrangeScape and Jelastic.
SaaS providers: Google Apps, Microsoft Office 365, Petrosoft, Onlive, GT Nexus, Marketo, Casengo, TradeCard, Salesforce and CallidusCloud.
Next, I present the problems that motivate this seminar.
First, infrastructure management. How do we manage our servers in a measured, controlled, and consistent way? To elaborate, I will take a small example. Imagine that you have a Linux server with PHP and Java, and that your application works perfectly. The department decides it is time to upgrade and gives you a new Linux server with PHP and Java. But your application no longer works at all. Why?
Your server was changed from Fedora to Ubuntu, Java 1.7 was installed, and PHP was not installed correctly. The manual process of configuring servers, even if you have a "script", is error-prone and above all inconsistent, which leads to problems. Or rather, to opportunities to study and improve the process.
Second, how can we integrate tools to better measure, control, and monitor our software (as engineers do), even when working in a small team? Beyond your software itself, your environment may include automated tests, monitoring, a build server, quality-assurance machines, and documentation (wiki, API, etc.). And you may have several environments, such as production, staging, test, and so on. It is impractical to configure each environment individually, since each may span several servers and the infrastructure changes frequently, with new software versions, configurations, etc. It is possible, yes, but it is not practical, and it is not engineering.
DevOps gives us new programming languages that facilitate systems operations, following the CAMS principles: culture, automation, measurement, and sharing.
Ref: http://techli.com/collabnet-UC4-software
http://www.opscode.com/blog/2010/07/16/what-devops-means-to-me/
The concept took shape in 2009 through interactions between Andrew Shafer and Patrick Debois, along with a talk given by John Allspaw and Paul Hammond. The first DevOpsDays conference was held in October 2009. Momentum continued on Twitter and online, and in 2011 a Gartner analyst, Cameron Haight, predicted 20% adoption of DevOps by 2020. Shortly afterwards, DevOps appeared in the enterprise.
REFERENCES
http://www.devopsdays.org/
http://agile2008.agilealliance.org/
http://www.slideshare.net/jallspaw/10-deploys-per-day-dev-and-ops-cooperation-at-flickr
We will now discuss Chef, a language that facilitates DevOps. But there are many more options available:
http://www.opscode.com/chef/
Chef is an open source systems integration framework built to bring the benefits of configuration management to your entire infrastructure. You write source code to describe how you want each part of your infrastructure to be built, then apply those descriptions to your servers. The result is a fully automated infrastructure: when a new server comes on line, the only thing you have to do is tell Chef what role it should play in your architecture.
http://saltstack.com/community.html
Salt is a powerful remote execution manager that can be used to administer and provision servers in a fast and efficient way. Salt allows commands to be executed across large groups of servers. This means systems can be easily managed, but data can also be easily gathered. Quick introspection into running systems becomes a reality. Remote execution is usually used to set up a certain state on a remote system. Salt addresses this problem as well: the Salt state system uses Salt state files.
http://cfengine.com/
Cfengine, the world technology leader in datacenter automation, based on state-of-the-art research and development, is used by more than 5000 companies on millions of machines world-wide. Versatile and lightweight, Cfengine is the preferred solution for the most exacting system administrators.
http://puppetlabs.com/puppet/what-is-puppet/
Puppet Data Center Automation Solution helps you save time, gain visibility into your server environment, and ensure consistency across your IT infrastructure.
http://www.nico.schottelius.org/software/cdist/
cdist is an alternative to other configuration management systems like cfengine, bcfg2, chef and puppet. But cdist ticks differently.
http://www.cobblerd.org/
Cobbler is a Linux installation server that allows for rapid setup of network installation environments. It glues together and automates many associated Linux tasks so you do not have to hop between lots of various commands and applications when rolling out new systems, and, in some cases, changing existing ones. It can help with installation, DNS, DHCP, package updates, power management, configuration management orchestration, and much more.
https://github.com/crafterm/sprinkle
Sprinkle is a software provisioning tool you can use to build remote servers with, e.g. to install a Rails or Sinatra stack on a brand new slice directly after it has been created.
http://palletops.com/
Pallet is a platform for agile and programmatic automation of infrastructure in the cloud, on server racks or directly on virtual machines. Pallet provides cloud provider and operating system independence, and allows for an unprecedented level of customization.
http://www.ansibleworks.com/
Ansible is a radically simple model-driven configuration management, deployment, and command execution framework. Other tools in this space have been too complicated for too long, require too much…
http://rexify.org/
(R)?ex - manage all your boxes from a central point - Datacenter Automation and Configuration Management.
https://code.google.com/p/munki/
munki is a set of tools that, used together with a webserver-based repository of packages and package metadata, can be used by OS X administrators to manage software installs (and in many cases removals) on OS X client machines.
http://rundeck.org/
RunDeck is an open source automation service with a web console, command line tools and a WebAPI. It lets you easily run automation tasks across a set of nodes.
http://pongasoft.github.io/glu/docs/latest/html/index.html
glu is an open source deployment and monitoring automation platform.
http://nadarei.co/mina/
Mina is a really fast deployer and server automation tool written in Ruby.
http://docs.fabfile.org/en/1.6/
Fabric is a Python library and command-line tool for streamlining the use of SSH for application deployment or systems administration tasks.
https://github.com/crowbar/crowbar
Crowbar is an Open Source platform for server provisioning and deployment from bare metal. It provides server discovery, firmware upgrades, and operating system installation using PXE Boot. It deploys applications on top of functioning operating systems using Chef.
http://trac.mcs.anl.gov/projects/bcfg2
Bcfg2 helps system administrators produce a consistent, reproducible, and verifiable description of their environment, and offers visualization and reporting tools to aid in day-to-day administrative tasks. It is the fifth generation of configuration management tools developed in the Mathematics and Computer Science Division of Argonne National Laboratory. It is based on an operational model in which the specification can be used to validate and optionally change the state of clients.
http://commando.io/
A web-based interface for streamlining the use of SSH for deployments & system administration tasks across groups of remote servers.
http://wiki.smartfrog.org/wiki/display/sf/SmartFrog+Home
SmartFrog is a powerful and flexible Java-based software framework for configuring, deploying and managing distributed software systems. SmartFrog helps you to encapsulate and manage systems so they are easy to configure and reconfigure, and so that they can be automatically installed, started and shut down. It provides orchestration capabilities so that subsystems can be started (and stopped) in the right order. It also helps you to detect and recover from failures. Such systems typically have multiple software components running across a network of computing resources, where the components must work together to deliver the functionality of the system as a whole. It's critical that the right components are running in the right places, that the components are individually and collectively correctly configured, and that they are correctly combined to create the complete system. This profile fits many of the services and applications that run on today's computing infrastructures. SmartFrog consists of: a Language for defining configurations, providing powerful system modelling capabilities and an expressive notation for describing system configurations; a secure, distributed Runtime System for deploying software components and managing running software systems; and a Library of SmartFrog components that implement the SmartFrog component model and provide a wide range of services and functionality.
Other projects:
http://www.aeolusproject.org/about.html
https://juju.ubuntu.com/
http://www.cloudfoundry.com/
Chef is a language written in Ruby, and with it we can: install operating systems; install software, files, and configurations; start and stop services; and finally, configure things repeatedly (an idempotent process).
Resources are configured using recipes. In these recipes, we can create files and documents, install software, or configure a connection to a database. One or more recipes form part of a cookbook. A role is a group of recipes used together (e.g. a combination of recipes to install a LAMP server). The DNA is the combination of recipes / roles that defines a server's configuration, which is the equivalent of a node. Finally, nodes can be grouped by environment, for example production or staging.
Inspired by and thanks to: Adam Jacob (@adamhjk); released in 2009 under the Apache License, Version 2.0.
http://learnchef.getharvest.com/introduction.html#mapping
Another way to visualize Chef is through the file structure of a cookbook. Here we see that a cookbook can include attributes, definitions, files, templates, recipes, etc. As time permits, we will go through several examples.
Images: https://github.com/json-schema
First, bootstrapping CHEF. Chef is a complex piece of software. Bootstrap is a project that lets us install Chef on our server. This bootstrap configuration is used with chef-solo, so we do not need a dedicated Chef server. Next, we could install a Monit monitoring service, or a build server. Finally, the example we will discuss the most is a MapReduce example.
There is an open source cookbook for this recipe, chef-monit: https://github.com/aforward/chef-monit. I will present certain examples depending on the time allotted to me.
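For the chef-solo setup mentioned above, the wiring can be as small as a solo.rb pointing at local cookbooks; the paths and run-list below are illustrative:

# solo.rb - minimal chef-solo configuration
cookbook_path "/var/chef/cookbooks"

# Converge with a JSON node definition naming the desired recipes:
#   chef-solo -c solo.rb -j node.json
# where node.json contains: { "run_list": ["recipe[monit]"] }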
c = Collabs.CLI.run(["-o"])          # run the collaborators demo tool
Dict.fetch(c, "Timothy Lethbridge")  # look up one author's collaborators
Dict.keys(c)                         # list all authors found
I have spoken to you about the forward-looking conception of cloud computing and of infrastructure as code using Chef. This is a large-scale project that I intend to develop within the department. In my view, the research avenues include:
Research on infrastructure programming. We can study, experiment with, and improve the way we manage cloud computing services. This includes research activities such as building a taxonomy of DevOps software categories, analyzing and experimenting with available software, and investigating correlations between application development and infrastructure development.
In addition, there is the avenue of research in the cloud in general, where we use software like Chef to simplify the configuration of our clouds. This frees us to concentrate on our Big Data problems, for example providing better tools for literature review by analyzing the authors and references within publications.
I am happy to answer your questions. Thank you, ladies and gentlemen, for your attention.
Why work at a university? I love research, I love passing on my knowledge, and I love collaborating with private industry. I love doing research in computing and, finally, I love working with the industrial sector on research and development.
[1] In particular, to deal with elastic infrastructure where you must manage the upgrading of multiple servers, called Aeolus. Di Cosmo et al. formalize, through state machines, the core scenarios for infrastructure management, including package installation; services; redundancy / capacity planning / conflicts; and creating and destroying resources.
Chef comprises several tools. First, Ohai, a system that lets us discover the complete composition of a system: software, versions, etc. Next, the client, which installs and configures a system; the server, which coordinates the activities of all the servers; and chef-solo, which lets a single machine play the role of a distribution system. Then Knife, which manages all of Chef's installation commands. And finally Shef, with an S and not a C, a program used to debug the problems that arise. Returning to Chef with a capital C:
Ohai: system profiler / asset collector
Chef-client
Chef-server
Knife: package management, installer
Shef: interactive debugger