Probabilistic algorithms for fun and pseudorandom profit - Tyler Treat
There's an increasing demand for real-time data ingestion and processing. Systems like Apache Kafka, Samza, and Storm have become popular for this reason. This type of high-volume, online data processing presents an interesting set of new challenges, namely, how do we drink from the firehose without getting drenched? Explore some of the fundamental primitives used in stream processing and, specifically, how we can use probabilistic methods to solve the problem.
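The abstract doesn't say which probabilistic structures the talk covers, so as one hedged illustration of the general idea, here is a minimal Bloom filter in Python: a fixed-size bit array that answers set-membership queries with no false negatives and a tunable false-positive rate, which is exactly the memory-for-exactness trade that makes these structures attractive for high-volume streams. The parameters and names below are illustrative only.

```python
import hashlib

class BloomFilter:
    """Approximate set membership: no false negatives, tunable false-positive rate."""

    def __init__(self, m=1 << 20, k=5):
        self.m = m                      # number of bits in the filter
        self.k = k                      # number of hash positions per item
        self.bits = bytearray(m // 8)

    def _positions(self, item):
        # Derive k bit positions from a single SHA-256 digest of the item.
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.k):
            chunk = digest[4 * i:4 * i + 4]
            yield int.from_bytes(chunk, "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

seen = BloomFilter()
seen.add("user:42")
print("user:42" in seen, "user:99" in seen)  # True, (almost certainly) False
```

Counting analogues such as the count-min sketch and HyperLogLog follow the same pattern: bounded memory, bounded and quantifiable error.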
The document describes running a sample analysis using the CloudBurst tool on Hadoop to map 100,000 Illumina reads against a Streptococcus suis reference genome, allowing up to 3 mismatches. It provides details of the sample input data formats and locations, the commands used to format the data and run CloudBurst, and snippets of the output showing the mapping job's progress and its successful completion in under 20 minutes.
This document summarizes lecture slides from a university computer science class on operating system processes and virtual memory. The slides covered:
- Last week's discussion of process creation in Rust and the fork system call.
- The plan for this week's lectures, including how the kernel makes processes and diving into the fork.c source code.
- An overview of virtual memory and how it provides memory isolation between processes using paging and segmentation.
- Details of how x86 processors implement virtual memory using segmentation and paging tables, address translation, and handling of page faults.
- The history and evolution of virtual memory from early mainframe systems to modern desktop processors.
Putting a Fork in Fork (Linux Process and Memory Management) - David Evans
The document discusses several topics related to the computer science class cs4414 at the University of Virginia:
- Updates were due Sunday at 11:59pm including progress updates and scheduling design reviews.
- Tuesday's class will feature a guest lecture on authentication using single sign-on.
- The last class covered translation lookaside buffers and paging/segmentation concepts.
- A code sample is shown and analyzed that causes a segmentation fault due to accessing memory outside the allocated space.
- Details are provided on limiting resources and viewing process limits.
The document discusses PIG on Storm. It begins with an example of using PIG to perform tokenization, grouping, and counting of sentences. It then shows how the same operations can be done in Storm directly and in PIG running on Storm. The document outlines different Storm execution modes and how PIG can provide state management and sliding windows. It also introduces the concept of a hybrid mode where parts of a PIG script can run on Storm or MapReduce automatically.
The document discusses the process for context switching between tasks in Rust. It explains that the current task is grabbed from thread-local storage and its ability to sleep is checked. The next task and cleanup function are prepared. Unsafe transmutes are used to get mutable references to tasks. The task contexts are swapped using a raw operation, placing the scheduler and next task in the proper locations. On the return swap, the cleanup function is immediately run.
Kurator is an open-source workflow platform for data curation tools. It aims to detect and flag data quality issues, repair issues when possible with human curation as needed, and track provenance of automatic and human edits. Kurator uses scientific workflow systems like Kepler to automate computational aspects of curation. It also employs script-based approaches and YesWorkflow annotations to provide workflow views and capture provenance from scripts. This allows leveraging existing tools and programming expertise while providing workflow benefits such as automation, scaling, and provenance tracking.
InfluxDB IOx Tech Talks: The Impossible Dream: Easy-to-Use, Super Fast Softw... - InfluxData
The document discusses how an easy-to-use and fast database can have a complicated implementation for developers. It outlines four key areas: 1) Flexible writing schema requires schema merging at read time. 2) Fast reads prune non-covered data chunks through predicate push-down. 3) Loading duplicated data necessitates data deduplication and compaction operations. 4) Quick data deletion still needs data elimination at read time or in the background. The document provides examples to illustrate the tradeoffs between user and developer requirements.
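As a rough sketch of the deduplication point (an illustration of the idea only, not IOx's actual implementation), merging overlapping chunks can keep the last-written value for each (series key, timestamp) pair:

```python
# Hypothetical rows: (series_key, timestamp, value). When chunks overlap,
# the most recently written value for a (series_key, timestamp) pair wins.
def deduplicate(chunks):
    latest = {}
    for chunk in chunks:                      # later chunks override earlier ones
        for series_key, ts, value in chunk:
            latest[(series_key, ts)] = value
    # Return the surviving rows sorted by series key and time.
    return sorted((k, ts, v) for (k, ts), v in latest.items())

chunk_a = [("cpu,host=a", 1, 0.5), ("cpu,host=a", 2, 0.7)]
chunk_b = [("cpu,host=a", 2, 0.9)]            # duplicate point, newer write
print(deduplicate([chunk_a, chunk_b]))
# [('cpu,host=a', 1, 0.5), ('cpu,host=a', 2, 0.9)]
```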
This document provides an overview of troubleshooting streaming replication in PostgreSQL. It begins with introductions to write-ahead logging and replication internals. Common troubleshooting tools are then described, including built-in views and functions as well as third-party tools. Finally, specific troubleshooting cases are discussed such as replication lag, WAL bloat, recovery conflicts, and high CPU recovery usage. Throughout, examples are provided of how to detect and diagnose issues using the various tools.
R can be used for large scale data analysis on Hadoop through tools like Rhadoop, R + Hadoop Streaming, and Rhipe. The document demonstrates using Rhadoop to analyze a large mortality dataset on Hadoop. It shows how to install Rhadoop, run a word count example, and analyze causes of death from the dataset using MapReduce jobs with R scripts. While Rhadoop allows scaling R to big data on Hadoop, its development is ongoing so backward compatibility must be considered. Other options like Pig with R may provide better integration than Rhadoop alone.
This document discusses using PostgreSQL statistics to optimize performance. It describes various statistics sources like pg_stat_database, pg_stat_bgwriter, and pg_stat_replication that provide information on operations, caching, and replication lag. It also provides examples of using these sources to identify issues like long transactions, temporary file growth, and replication delays.
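For example, pg_stat_database exposes per-database commit counts, temp-file activity, and buffer hits/reads, from which a cache hit ratio can be derived. A small sketch using the psycopg2 driver (the connection string is a placeholder):

```python
import psycopg2

# Placeholder DSN; adjust to your environment.
conn = psycopg2.connect("dbname=postgres user=postgres host=localhost")
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT datname,
               xact_commit,
               temp_files,
               round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 2) AS cache_hit_pct
        FROM pg_stat_database
        ORDER BY cache_hit_pct NULLS LAST
    """)
    for datname, commits, temp_files, hit_pct in cur.fetchall():
        print(f"{datname}: commits={commits} temp_files={temp_files} cache_hit={hit_pct}%")
```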
This document provides an overview of pgCenter, an open source tool for monitoring and managing PostgreSQL databases. It summarizes pgCenter's main features, which include displaying statistics on databases, tables, indexes and functions; monitoring long running queries and statements; managing connections to multiple PostgreSQL instances; and performing administrative tasks like viewing logs, editing configuration files, and canceling queries. Use cases and examples of how pgCenter can help optimize PostgreSQL performance are also provided.
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab - CloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2skCodH
This CloudxLab Understanding MapReduce tutorial helps you to understand MapReduce in detail. Below are the topics covered in this tutorial:
1) Thinking in Map / Reduce
2) Understanding Unix Pipeline
3) Examples to understand MapReduce
4) Merging
5) Mappers & Reducers
6) Mapper Example
7) Input Split
8) mapper() & reducer() Code
9) Example - Count number of words in a file using MapReduce
10) Example - Compute Max Temperature using MapReduce
11) Hands-on - Count number of words in a file using MapReduce on CloudxLab (a Python sketch of the word-count example follows this list)
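The tutorial's own hands-on examples run on CloudxLab; as a hedged sketch of the word-count idea from items 9 and 11, a mapper and reducer written as Hadoop Streaming scripts in Python could look like the following (file names and the invocation are assumptions, not taken from the tutorial):

```python
#!/usr/bin/env python3
# mapper.py - emit "word<TAB>1" for every word on stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py - sum counts per word; Hadoop delivers input sorted by key
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

Locally the two scripts compose exactly like the Unix pipeline of item 2: cat input.txt | ./mapper.py | sort | ./reducer.py.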
The document describes how to use Gawk to perform data aggregation from log files on Hadoop by having Gawk act as both the mapper and reducer to incrementally count user actions and output the results. Specific user actions are matched and counted using operations like incrby and hincrby and the results are grouped by user ID and output to be consumed by another system. Gawk is able to perform the entire MapReduce job internally without requiring Hadoop.
Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of computers. It consists of HDFS for distributed storage and MapReduce for distributed processing. HDFS stores large files across multiple machines and provides high throughput access to application data. MapReduce allows processing of large datasets in parallel by splitting the work into independent tasks called maps and reduces. Companies use Hadoop for applications like log analysis, data warehousing, machine learning, and scientific computing on large datasets.
The webinar covered new features and updates to the Nephele 2.0 bioinformatics analysis platform. Key updates included a new website interface, improved performance through a new infrastructure framework, the ability to resubmit jobs by ID, and interactive mapping file submission. New pipelines for 16S analysis using DADA2 and quality control preprocessing were introduced, and the existing 16S mothur pipeline was updated. The quality control pipeline provides tools to assess data quality before running microbiome analyses through FastQC, primer/adapter trimming with cutadapt, and additional quality filtering options. The webinar emphasized the importance of data quality checks and highlighted troubleshooting tips such as examining the log file for error messages when jobs fail.
So you want to get started with Hadoop, but how? This session will show you how to get started with Hadoop development using Pig. Prior Hadoop experience is not needed.
Thursday, May 8th, 02:00pm-02:50pm
A PhD student in the Graduate Institute of Electrical Engineering at National Taiwan University who devotes his time to promoting the R language. He has organized numerous R workshops and regularly shares his R experience with the Taiwan R User Group. He has extensive hands-on experience with R, from data collection and cleaning through analysis to report production, and specializes in building tailored R data-analysis systems for project needs and in writing high-performance algorithms with R and C++.
This document provides an introduction to using RHadoop to interface R with Hadoop. It recommends downloading a Cloudera VM with CentOS, CDH5.3, R 3.x, and Java 1.7 installed. It then recommends downloading RHadoop packages and installing rhdfs, rhbase, rmr2, and plyrmr packages in R. It provides guidance on getting started with RHadoop, including ensuring required packages are installed, enabling HDFS, and potentially configuring the JAVA_HOME environment variable. It also provides pointers for debugging RHadoop programs when run with different Hadoop backends and modes.
Data Modeling, Normalization, and De-Normalization | PostgresOpen 2019 | Dimi... - Citus Data
As a developer using PostgreSQL one of the most important tasks you have to deal with is modeling the database schema for your application. In order to achieve a solid design, it’s important to understand how the schema is then going to be used as well as the trade-offs it involves.
As Fred Brooks said: “Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowcharts; they’ll be obvious.”
In this talk we're going to see practical normalisation examples and their benefits, and also review some anti-patterns and their typical PostgreSQL solutions, including Denormalization techniques thanks to advanced Data Types.
The document provides interview questions and answers related to Hadoop. It discusses common InputFormats in Hadoop like TextInputFormat, KeyValueInputFormat, and SequenceFileInputFormat. It also describes concepts like InputSplit, RecordReader, partitioner, combiner, job tracker, task tracker, jobs and tasks relationship, debugging Hadoop code, and handling lopsided jobs. HDFS, its architecture, replication, and reading files from HDFS is also covered.
Spencer Christensen
There are many aspects to managing an RDBMS. Some of these are handled by an experienced DBA, but there are a good many things that any sys admin should be able to take care of if they know what to look for.
This presentation will cover basics of managing Postgres, including creating database clusters, overview of configuration, and logging. We will also look at tools to help monitor Postgres and keep an eye on what is going on. Some of the tools we will review are:
* pgtop
* pg_top
* pgfouine
* check_postgres.pl
Check_postgres.pl is a great tool that can plug into your Nagios or Cacti monitoring systems, giving you even better visibility into your databases.
The document provides an overview of various Apache Pig features including:
- The Grunt shell which allows interactive execution of Pig Latin scripts and access to HDFS.
- Advanced relational operators like SPLIT, ASSERT, CUBE, SAMPLE, and RANK for transforming data.
- Built-in functions and user defined functions (UDFs) for data processing. Macros can also be defined.
- Running Pig in local or MapReduce mode and accessing HDFS from within Pig scripts.
PostgreSQL has advanced in many ways but bloat remains a challenge. A solution for this in development is zheap, a new storage format in which only the latest version of the data is kept in main storage and the old version will be moved to an undo log. In this presentation delivered at Postgres Vision 2018, Robert Haas, a Major Contributor to the PostgreSQL project who is leading development of zheap at EnterpriseDB, where he is Vice President, Chief Database Architect, explains the project.
This document discusses PostgreSQL backups and disaster recovery. It covers the need for different types of backups like logical and physical backups. It discusses how to store backups and automate the backup process. The document also covers how to validate backups are working properly and tools that can be used. It emphasizes that both logical and physical backups are important to have for different recovery scenarios. Automation is recommended to manage the complex backup processes.
This document summarizes a presentation on a Stata module called "parallel" for parallel computing. It discusses the motivation for parallel computing given large administrative datasets and powerful computers. It describes how parallel works by splitting datasets and tasks across computer clusters to accelerate computations. Benchmarks show parallel processing provides near-linear speedups on tasks like Monte Carlo simulations and reshaping large databases. The syntax and usage focuses on applications well-suited for parallel like simulations and loops, while noting commands like regressions may not benefit as much.
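The split-and-recombine idea is easy to see outside Stata as well; here is a purely illustrative Python sketch (not the parallel module itself) that spreads a Monte Carlo estimate of pi across worker processes:

```python
import random
from multiprocessing import Pool

def count_hits(n):
    """Count random points that fall inside the unit quarter-circle."""
    hits = 0
    for _ in range(n):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

if __name__ == "__main__":
    total, workers = 4_000_000, 4
    per_worker = total // workers
    with Pool(workers) as pool:
        # Each worker simulates an independent slice; results are summed at the end.
        hits = sum(pool.map(count_hits, [per_worker] * workers))
    print("pi ~", 4 * hits / total)
```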
Pig Latin is a data flow language for analyzing large datasets that provides an alternative to SQL and MapReduce. It allows programmers to write scripts as a sequence of steps and handles optimization and parallelization. Pig Latin supports user-defined functions, flexible nested data models, and interactive debugging. The language is implemented by compiling logical query plans into physical MapReduce jobs and allows for lazy execution. It is well suited for tasks like temporal analysis, session analysis, and rollups on large log and web crawl data.
GridSQL is an open source distributed database built on PostgreSQL that allows it to scale horizontally across multiple servers by partitioning and distributing data and queries. It provides significantly improved performance over a single PostgreSQL instance for large datasets and queries by parallelizing processing across nodes. However, it has some limitations compared to PostgreSQL such as lack of support for advanced SQL features, slower transactions, and need for downtime to add nodes.
Performance Tuning Cheat Sheet for MongoDB - Severalnines
Bart Oles - Severalnines AB
Database performance affects organizational performance, and we tend to look for quick fixes when under stress. But how can we better understand our database workload and factors that may cause harm to it? What are the limitations in MongoDB that could potentially impact cluster performance?
In this talk, we will show you how to identify the factors that limit database performance. We will start with the free MongoDB Cloud monitoring tools. Then we will move on to log files and queries. To be able to achieve optimal use of hardware resources, we will take a look into kernel optimization and other crucial OS settings. Finally, we will look into how to examine performance of MongoDB replication.
This document provides an overview and introduction to PostgreSQL for new users. It covers getting started with PostgreSQL, including installing it, configuring authentication and logging, upgrading to new versions, routine maintenance tasks, hardware recommendations, availability and scalability options, and query tuning and optimization. The document is presented as a slide deck with different sections labeled by letters (e.g. K-0, S-0, U-0).
Beyond Breakpoints: A Tour of Dynamic Analysis - C4Media
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2dXUUTG.
Nathan Taylor provides an introduction to the dynamic analysis research space, suggesting integrating these techniques into various internal tools. Filmed at qconnewyork.com.
Nathan Taylor is a software developer currently employed at Fastly, where he works on making the Web faster through high performance content delivery. Previous gigs have included hacking on low-level systems software such as Java runtimes at Twitter and, prior to that, the Xen virtual machine monitor in grad school.
The document discusses GRelC, a project that aims to design and deploy the first Grid Database Management System (Grid-DBMS) for the Globus community. It describes how GRelC allows for dynamic and transparent access to distributed, heterogeneous databases in a grid environment. Key features of GRelC include authentication, authorization, access control policies, data encryption, and support for single and multi-query operations across multiple database management systems.
Leveraging Hadoop in your PostgreSQL Environment - Jim Mlodgenski
This talk will begin with a discussion of the strengths of PostgreSQL and Hadoop. We will then lead into a high level overview of Hadoop and its community of projects like Hive, Flume and Sqoop. Finally, we will dig down into various use cases detailing how you can leverage Hadoop technologies for your PostgreSQL databases today. The use cases will range from using HDFS for simple database backups to using PostgreSQL and Foreign Data Wrappers to do low latency analytics on your Big Data.
Non-Relational Databases: This hurts. I like it. - Onyxfish
The document discusses non-relational databases, providing an overview of their characteristics and comparing them to relational databases. It outlines some popular non-relational database platforms, and uses the example of an open government project to demonstrate how CouchDB could be used to store and query schema-less data in a scalable way.
Reproducible Computational Pipelines with Docker and Nextflow - inside-BigData.com
This document summarizes a presentation about using Docker and Nextflow to create reproducible computational pipelines. It discusses two major challenges in computational biology being reproducibility and complexity. Containers like Docker help address these challenges by creating portable and standardized environments. Nextflow is introduced as a workflow framework that allows pipelines to run across platforms and isolates dependencies using containers, enabling fast prototyping. Examples are given of using Nextflow with Docker to run pipelines on different systems like HPC clusters in a scalable and reproducible way.
This document provides an overview of Google's Bigtable distributed storage system. It describes Bigtable's data model as a sparse, multidimensional sorted map indexed by row, column, and timestamp. Bigtable stores data across many tablet servers, with a single master server coordinating metadata operations like tablet assignment and load balancing. The master uses Chubby, a distributed lock service, to track which tablet servers are available and reassign tablets if servers become unreachable.
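To make the data model concrete, here is a toy, in-memory rendering of a map keyed by (row, column, timestamp); this only illustrates the indexing scheme described above and has nothing to do with Bigtable's actual tablet storage:

```python
# (row_key, column_family:qualifier, timestamp) -> value
table = {}

def put(row, column, timestamp, value):
    table[(row, column, timestamp)] = value

def read_row(row):
    """Return all cells of a row, newest timestamp first within each column."""
    cells = [(col, ts, val) for (r, col, ts), val in table.items() if r == row]
    return sorted(cells, key=lambda c: (c[0], -c[1]))

put("com.example/index.html", "contents:", 3, "<html>v3</html>")
put("com.example/index.html", "contents:", 1, "<html>v1</html>")
put("com.example/index.html", "anchor:cnn.com", 2, "Example")
print(read_row("com.example/index.html"))
```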
Oracle GoldenGate is a software package for real-time data integration and replication between heterogeneous systems. It enables solutions for high availability, real-time data integration, transactional change data capture, data replication, transformations, and verification. Oracle GoldenGate provides core data movement capabilities as well as visual management and monitoring tools. It supports replication between various database technologies like Oracle, IBM DB2, Microsoft SQL Server, and others. GoldenGate allows organizations to access and deliver real-time enterprise data across different systems.
Logs: Can’t Hate Them, Won’t Love Them: Brief Log Management Class by Anton C... - Anton Chuvakin
Logging is essential for security, operations, and compliance. However, common mistakes in log management include not logging at all, not reviewing logs, retaining logs for too short a time, prioritizing log collection, ignoring application logs, and only searching for known bad events. Effective log management requires collecting all relevant logs and retaining them for appropriate time periods according to a well-defined strategy.
String Comparison Surprises: Did Postgres lose my data? - Jeremy Schneider
Comparisons are fundamental to computing - and comparing strings is not nearly as straightforward as you might think. Come learn about the history, nuance and surprises of “putting words in order” that you never knew existed in computer science, and how that nuance impacts both general programming and SQL programming. Next, walk through a few actual scenarios and demonstrations using PostgreSQL as a user and administrator, which you can re-run yourself later for further study, including one way you could easily corrupt your self-managed PostgreSQL database if you aren't prepared. Finally we’ll dive into an explanation of the surprising behaviors we saw in PostgreSQL, and learn more about user and administrative features PostgreSQL provides related to localized string comparison.
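A quick way to see why "putting words in order" depends on the collation rules in effect rather than on the bytes themselves is to sort the same list under raw codepoint order and under a linguistic locale; the Python sketch below assumes the en_US.UTF-8 locale is installed on the system:

```python
import locale

words = ["zebra", "Apple", "apple", "Zürich"]

# Codepoint order: every uppercase letter sorts before every lowercase one,
# so "Zürich" lands before "apple" and "zebra".
print(sorted(words))

# Linguistic order under a specific locale: case and accents are weighted
# differently, giving a dictionary-style ordering instead.
locale.setlocale(locale.LC_COLLATE, "en_US.UTF-8")   # assumed to be installed
print(sorted(words, key=locale.strxfrm))
```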
Whitepaper: Mining the AWR repository for Capacity Planning and Visualization - Kristofferson A
This scenario discusses how to mine Oracle AWR reports to better understand database performance. AWR provides detailed performance data that can be queried and visualized to analyze workload characteristics, notice trends, and identify bottlenecks over time. The document shows an example SQL query that retrieves disk I/O metrics from the AWR for specified time intervals. It emphasizes correlating performance across the database, operating system, and application to fully diagnose overall response time issues. Mining AWR effectively in a time-series format with relevant statistics side-by-side allows for quick trend analysis, statistical modeling, and faster performance optimization.
PostgreSQL is a free and open-source relational database management system that provides high performance and reliability. It supports replication through various methods including log-based asynchronous master-slave replication, which the presenter recommends as a first option. The upcoming PostgreSQL 9.4 release includes improvements to replication such as logical decoding and replication slots. Future releases may add features like logical replication consumers and SQL MERGE statements. The presenter took questions at the end and provided additional resources on PostgreSQL replication.
This document discusses TensorFlow, an open-source machine learning framework. It describes how TensorFlow works using graphs to represent computations and can be run on CPUs, GPUs, or in a distributed manner across multiple devices. It also introduces elearn, a TensorFlow as a Service platform that handles infrastructure concerns like distributed storage, GPU/CPU resource management, and model versioning to simplify machine learning development.
NLP-Focused Applied ML at Scale for Global Fleet Analytics at ExxonMobil - Databricks
ExxonMobil leveraged machine learning at scale using Databricks to extract insights from equipment maintenance logs and improve operations. The logs contained both structured and unstructured text data across a global fleet maintained in legacy systems, limiting traditional analysis. By ingesting and enriching over 60 million records using natural language processing, the system identified outliers, enabled capacity planning, and prioritized maintenance tasks, projected to save millions annually through more effective reliability and maintenance guidance.
This document discusses polyglot persistence using Spring Data. It describes how Spring Data provides a common programming model for data access across different data stores like SQL databases, NoSQL databases and more. It provides examples of defining entities, repository interfaces and queries using Spring Data's JPA, MongoDB and QueryDSL modules. Spring Data aims to improve developer productivity by simplifying data access code and enabling applications to use multiple data sources.
Similar to tranSMART Community Meeting 5-7 Nov 13 - Session 3: transmart-data (20)
tranSMART Community Meeting 5-7 Nov 13 - Session 3: The TraIT user stories fo... - David Peyruc
This document provides an overview of the TraIT project and existing demonstrators using tranSMART. It discusses the TraIT roadmap and user stories being implemented at the Netherlands Cancer Institute. Key points include:
- TraIT aims to support translational research through integrated data and tools across clinical, imaging, biobanking and experimental domains.
- Existing demonstrators using tranSMART include DeCoDe (colorectal cancer) and PCMM (prostate cancer).
- The roadmap involves enhancing tranSMART functionality based on user needs and integrating additional data sources.
- At NKI, tranSMART will provide an integrated research data warehouse with clinical and research data from various sources and departments.
tranSMART Community Meeting 5-7 Nov 13 - Session 3: Characterization of the c... - David Peyruc
tranSMART Community Meeting 5-7 Nov 13 - Session 3: Characterization of the cell phenotypes involved in metastasis
Characterization of the cell phenotypes involved in metastasis: Using tranSMART to enable high-throughput heterogeneous data integration and analysis
Brian Athey, University of Michigan
tranSMART Community Meeting 5-7 Nov 13 - Session 5: Advancing tranSMART Analy... - David Peyruc
tranSMART Community Meeting 5-7 Nov 13 - Session 5: Advancing tranSMART Analytical Capabilities with Knowledge Content
Sirimon Ocharoen, Thomson Reuters
To effectively analyze data in tranSMART, biological analysis/knowledge-based approach is needed. Through a case study, we will demonstrate how system biology content can be integrated in tranSMART to enable functional analysis and biological interpretation. We will also share our experience and user feedbacks from various projects.
tranSMART Community Meeting 5-7 Nov 13 - Session 5: Recent tranSMART Lessons ... - David Peyruc
tranSMART Community Meeting 5-7 Nov 13 - Session 5: Recent tranSMART Lessons Learned in Academic and Life Science Settings
Dan Housman, Recombinant by Deloitte
The Recombinant by Deloitte team has worked with organizations such as Kimmel Cancer Center as a model to adapt existing mature i2b2 implementations to meet business and scientific needs. Other organizations are increasingly focused on how to use cloud and high performance computing models to achieve different performance levels. Advanced initiatives are progressing to link commercial tools such as Qlikview to explore tranSMART data and to solve for key gaps in scientific pipelines. Dan will present recent lessons learned, new capabilities, and some of the impact on the path forwards for future tranSMART updates.
tranSMART Community Meeting 5-7 Nov 13 - Session 5: EMIF (European Medical In... - David Peyruc
The document discusses the European Medical Information Framework (EMIF) project. EMIF aims to create a platform and framework to integrate patient-level health data from across Europe to enable new research insights. Specifically, EMIF is developing tools and standards to pool data from various sources on over 48 million subjects from 7 EU countries. This will support research on predictors of metabolic diseases and Alzheimer's disease. EMIF is using the tranSMART platform to load clinical trial data and cohorts on over 33,000 subjects for analysis. The goal is for EMIF to become a trusted European hub for healthcare data to optimize clinical research.
tranSMART Community Meeting 5-7 Nov 13 - Session 5: The Accelerated Cure Proj... - David Peyruc
tranSMART Community Meeting 5-7 Nov 13 - Session 5: The Accelerated Cure Project MS Repository Dataset as a Case Study
Stephen Wicks, Rancho Biosciences
The Accelerated Cure Project for Multiple Sclerosis is a non-profit focused on accelerating research for a cure for MS. One of their major projects over the last decade has been the generation of the ACP Repository, a collection of biological samples and associated clinical data from approximately 3200 case or control participants. More than 75 studies are underway or have been completed, in both industry and academic settings, using samples from the ACP Repository. Rancho BioSciences has partnered with ACP through Orion Bionetworks to curate and load these datasets and associated clinical CRFs into tranSMART. In this talk, we will describe the rich ACP dataset and discuss our experiences in preparing the data for analysis in tranSMART.
tranSMART Community Meeting 5-7 Nov 13 - Session 3: Modularization (Plug‐Ins,... - David Peyruc
The document discusses the development of new plugins for the TranSMART platform to add genomic visualization capabilities. It describes requirements like adding an HTML5 genome browser and supporting visualization of genomic variants and copy number variation data. It then details the process of consulting the community to choose the Dalliance genome browser and MyDAS backend, and extending the core API to support these plugins. The plugins were implemented and added to TranSMART to provide the new genomic visualization features.
The document outlines the key roles and values of a foundation to support the tranSMART platform, including:
- Stimulating awareness of project activities, functionalities, and data standards through communications
- Coordinating data curation and identifying opportunities for collaboration or common interest data sets
- Providing an app store for translational research plugins with various pricing models
- Ensuring quality, education, and training
It proposes establishing working groups and hiring a full-time community manager to address issues like lack of data transparency, siloed development, and ineffective project communications. The manager would facilitate engagement, updates, and synergies across stakeholders.
tranSMART Community Meeting 5-7 Nov 13 - Session 3: Pfizer’s Recent Use of tr... - David Peyruc
The document summarizes Pfizer's use of the tranSMART platform for various genomics and clinical data analyses including genome-wide association studies (GWAS), supporting exploratory data types like metabolomics and FACS data, and large collaborative efforts like the Alzheimer's Disease Neuroimaging Initiative (ADNI) and Parkinson's Progression Markers Initiative (PPMI) datasets. It also discusses analytical integration with Genedata Expressionist and plans for future enhancements to tranSMART like improved GWAS support and additional genotype data. Contributors to these efforts are acknowledged.
tranSMART Community Meeting 5-7 Nov 13 - Session 3: tranSMART and the One Min... - David Peyruc
tranSMART Community Meeting 5-7 Nov 13 - Session 3: tranSMART and the One Mind for Research Data Exchange Portal
Jeff Grethe, One Mind for Research
One Mind for Research (http://1mind4research.org) is an independent, non-partisan, nonprofit organization dedicated to curing the diseases of the brain and eliminating the stigma and discrimination associated with mental illness and brain injuries. tranSMART will be a core application within the One Mind Brain Data Exchange Portal, scheduled to launch publicly in 2014. Traumatic Brain Injury (TBI) affects an estimated 10 million people worldwide, and tranSMART is one of the core applications within the portal used by researchers who are looking to improve diagnostics and discover more effective treatments for patients suffering from CNS- and TBI-related diseases.
tranSMART Community Meeting 5-7 Nov 13 - Session 3: tranSMART a Data Warehous... - David Peyruc
tranSMART Community Meeting 5-7 Nov 13 - Session 3: tranSMART a Data Warehouse for Translational Medicine at Takeda Pharmaceuticals International
Dave Marberg, Takeda
We have used the tranSMART platform to construct a warehouse containing data from several Takeda clinical trials, proprietary preclinical drug activity studies, 1600 Gene Expression Omnibus studies, and data from TCGA, CCLE, and other sources. All gene expression data has been globally normalized. We extended the tranSMART platform with a set of R function calls to enable cross-study queries and analysis via the rich toolset available in R. The utility of the data warehouse is exemplified by a study in which we built a predictive model for drug sensitivities. The model was trained on gene expression and IC50 data from cell lines and was found to correctly predict drug activity in oncology indications.
tranSMART Community Meeting 5-7 Nov 13 - Session 3: transmart’s application t... - David Peyruc
tranSMART Community Meeting 5-7 Nov 13 - Session 3: tranSMART’s Application to Clinical Biomarker Discovery Studies in Sanofi
Sherry Cao, Sanofi
This presentation will discuss challenges we are encountering in clinical biomarker discovery studies and how we are using tranSMART to help address them.
tranSMART Community Meeting 5-7 Nov 13 - Session 3: Simulation in tranSMART - David Peyruc
Dave King gave a presentation on November 6, 2013 about interactive visualization with tranSMART. The presentation explained that tranSMART supports modular, abstracted visualization through application programming interfaces, improving connectivity, and concluded that these features are important for interactive data visualization.
tranSMART Community Meeting 5-7 Nov 13 - Session 3: Clinical Biomarker Discovery - David Peyruc
tranSMART Community Meeting 5-7 Nov 13 - Session 3: tranSMART’s Application to Clinical Biomarker Discovery Studies in Sanofi
Sherry Cao, Sanofi
This presentation will discuss challenges we are encountering in clinical biomarker discovery studies and how we are using tranSMART to help address them.
tranSMART Community Meeting 5-7 Nov 13 - Session 2: Developing a TR Community... - David Peyruc
tranSMART Community Meeting 5-7 Nov 13 - Session 2: Developing a Translational Research Community around the tranSMART Platform
Keith Elliston, tranSMART Foundation
tranSMART Community Meeting 5-7 Nov 13 - Session 2: Herding Cat - David Peyruc
This document discusses managing open source communities and projects. It notes that open source communities involve not just developers but also users, installers, documentation writers, and support staff. Contributions come from new code, bug fixes, documentation, training materials, and feature requests. Projects need coordination, communication through mailing lists and meetings, and quality assurance through testing. Both incentives like acknowledging contributions and treats like involvement opportunities help encourage participation and "herd the cats" of an open source community.
tranSMART Community Meeting 5-7 Nov 13 - Session 2: MongoDB: What, Why And When - David Peyruc
tranSMART Community Meeting 5-7 Nov 13 - Session 2: MongoDB: What, Why And When
Massimo Brignoli, MongoDB Inc
The presentation will illustrate what MongoDB is, the advantages of the document based approach and some of the use cases where MongoDB is a perfect fit.
tranSMART Community Meeting 5-7 Nov 13 - Session 2: Creating a Comprehensive ... - David Peyruc
tranSMART Community Meeting 5-7 Nov 13 - Session 2: Creating a Comprehensive Clinical and 'Omics Information Commons on Autism
Paul Avillach, Harvard University
HCL Notes and Domino License Cost Reduction in the World of DLAU - panagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able to lower your cost through an optimized configuration and keep it low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc.
- Practical examples and best practices to implement right away
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors - DianaGray10
Join us to learn how UiPath Apps can directly and easily interact with prebuilt connectors via Integration Service--including Salesforce, ServiceNow, Open GenAI, and more.
The best part is you can achieve this without building a custom workflow! Say goodbye to the hassle of using separate automations to call APIs. By seamlessly integrating within App Studio, you can now easily streamline your workflow, while gaining direct access to our Connector Catalog of popular applications.
We’ll discuss and demo the benefits of UiPath Apps and connectors including:
Creating a compelling user experience for any software, without the limitations of APIs.
Accelerating the app creation process, saving time and effort
Enjoying high-performance CRUD (create, read, update, delete) operations for seamless data management.
Speakers:
Russell Alfeche, Technology Leader, RPA at qBotic and UiPath MVP
Charlie Greenberg, host
5th LF Energy Power Grid Model Meet-up Slides - DanBrown980551
5th Power Grid Model Meet-up
It is with great pleasure that we extend to you an invitation to the 5th Power Grid Model Meet-up, scheduled for 6th June 2024. This event will adopt a hybrid format, allowing participants to join us either through an online Microsoft Teams session or in person at TU/e located at Den Dolech 2, Eindhoven, Netherlands. The meet-up will be hosted by Eindhoven University of Technology (TU/e), a research university specializing in engineering science & technology.
Power Grid Model
The global energy transition is placing new and unprecedented demands on Distribution System Operators (DSOs). Alongside upgrades to grid capacity, processes such as digitization, capacity optimization, and congestion management are becoming vital for delivering reliable services.
Power Grid Model is an open source project from Linux Foundation Energy and provides a calculation engine that is increasingly essential for DSOs. It offers a standards-based foundation enabling real-time power systems analysis, simulations of electrical power grids, and sophisticated what-if analysis. In addition, it enables in-depth studies and analysis of the electrical power grid’s behavior and performance. This comprehensive model incorporates essential factors such as power generation capacity, electrical losses, voltage levels, power flows, and system stability.
Power Grid Model is currently being applied in a wide variety of use cases, including grid planning, expansion, reliability, and congestion studies. It can also help in analyzing the impact of renewable energy integration, assessing the effects of disturbances or faults, and developing strategies for grid control and optimization.
What to expect
For the upcoming meetup we are organizing, we have an exciting lineup of activities planned:
- Insightful presentations covering two practical applications of the Power Grid Model.
- An update on the latest advancements in Power Grid Model technology during the first and second quarters of 2024.
- An interactive brainstorming session to discuss and propose new feature requests.
- An opportunity to connect with fellow Power Grid Model enthusiasts and users.
Introduction of Cybersecurity with OSS at Code Europe 2024 - Hiroshi SHIBATA
I develop the Ruby programming language, RubyGems, and Bundler, which are package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS) examples from Ruby and RubyGems.
The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS.
Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application.
I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.
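As one concrete, hedged illustration of the dependency hygiene discussed here (the exact tooling covered in the talk may differ), a Ruby project's lockfile can be checked against known advisories with the bundler-audit gem:

# A minimal sketch using bundler-audit; this tool is an assumption, not necessarily the one from the talk
gem install bundler-audit
bundle-audit update    # fetch the latest ruby-advisory-db data
bundle-audit check     # scan Gemfile.lock for gems with known CVEs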
How information systems are built or acquired puts information, which is what they should be about, in a secondary place. Our language adapted accordingly, and we no longer talk about information systems but applications. Applications evolved in a way to break data into diverse fragments, tightly coupled with applications and expensive to integrate. The result is technical debt, which is re-paid by taking even bigger "loans", resulting in an ever-increasing technical debt. Software engineering and procurement practices work in sync with market forces to maintain this trend. This talk demonstrates how natural this situation is. The question is: can something be done to reverse the trend?
What is an RPA CoE? Session 1 – CoE VisionDianaGray10
In the first session, we will review the organization's vision and how this has an impact on the COE Structure.
Topics covered:
• The role of a steering committee
• How do the organization’s priorities determine CoE Structure?
Speaker:
Chris Bolin, Senior Intelligent Automation Architect Anika Systems
Your One-Stop Shop for Python Success: Top 10 US Python Development Providersakankshawande
Simplify your search for a reliable Python development partner! This list presents the top 10 trusted US providers offering comprehensive Python development services, ensuring your project's success from conception to completion.
Ivanti’s Patch Tuesday breakdown goes beyond patching your applications and brings you the intelligence and guidance needed to prioritize where to focus your attention first. Catch early analysis on our Ivanti blog, then join industry expert Chris Goettl for the Patch Tuesday Webinar Event. There we’ll do a deep dive into each of the bulletins and give guidance on the risks associated with the newly-identified vulnerabilities.
HCL Notes and Domino license cost reduction in the world of DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU and licensing under the CCB and CCX models have been a hot topic in the HCL community since last year. As a Notes or Domino customer, you may be struggling with unexpectedly high user counts and license fees. You may be wondering how this new kind of licensing works and what benefit it brings you. Above all, you surely want to stay within your budget and save costs wherever possible. We understand that, and we want to help!
We explain how to resolve common configuration problems that can cause more users to be counted than necessary, and how to identify and remove superfluous or unused accounts to save money. There are also approaches that can lead to unnecessary expenses, for example when a person document is used instead of a mail-in for shared mailboxes. We show you such cases and their solutions. And of course we explain the new license model.
Join this webinar, in which HCL Ambassador Marc Thomas and guest speaker Franz Walder introduce you to this new world. It will give you the tools and know-how to keep track of everything. You will be able to reduce your costs through an optimized Domino configuration and keep them low in the future.
Topics covered:
- Reducing license costs by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how best to use it
- Tips for common problem areas, such as team mailboxes, functional/test users, etc.
- Practical examples and best practices to implement right away
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...Jason Yip
The typical problem in product engineering is not bad strategy, so much as “no strategy”. This leads to confusion, lack of motivation, and incoherent action. The next time you look for a strategy and find an empty space, instead of waiting for it to be filled, I will show you how to fill it in yourself. If you’re wrong, it forces a correction. If you’re right, it helps create focus. I’ll share how I’ve approached this in the past, both what works and lessons for what didn’t work so well.
Driving Business Innovation: Latest Generative AI Advancements & Success StorySafe Software
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed with the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has utilized Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos to illustrate the full capabilities of FME in AI-driven processes.
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/temporal-event-neural-networks-a-more-efficient-alternative-to-the-transformer-a-presentation-from-brainchip/
Chris Jones, Director of Product Management at BrainChip, presents the “Temporal Event Neural Networks: A More Efficient Alternative to the Transformer” tutorial at the May 2024 Embedded Vision Summit.
The expansion of AI services necessitates enhanced computational capabilities on edge devices. Temporal Event Neural Networks (TENNs), developed by BrainChip, represent a novel and highly efficient state-space network. TENNs demonstrate exceptional proficiency in handling multi-dimensional streaming data, facilitating advancements in object detection, action recognition, speech enhancement and language model/sequence generation. Through the utilization of polynomial-based continuous convolutions, TENNs streamline models, expedite training processes and significantly diminish memory requirements, achieving notable reductions of up to 50x in parameters and 5,000x in energy consumption compared to prevailing methodologies like transformers.
Integration with BrainChip’s Akida neuromorphic hardware IP further enhances TENNs’ capabilities, enabling the realization of highly capable, portable and passively cooled edge devices. This presentation delves into the technical innovations underlying TENNs, presents real-world benchmarks, and elucidates how this cutting-edge approach is positioned to revolutionize edge AI across diverse applications.
GraphRAG for Life Science to increase LLM accuracyTomaz Bratanic
GraphRAG for the life science domain, where you retrieve information from biomedical knowledge graphs using LLMs to increase the accuracy and performance of generated answers.
In the realm of cybersecurity, offensive security practices act as a critical shield. By simulating real-world attacks in a controlled environment, these techniques expose vulnerabilities before malicious actors can exploit them. This proactive approach allows manufacturers to identify and fix weaknesses, significantly enhancing system security.
This presentation delves into the development of a system designed to mimic Galileo's Open Service signal using software-defined radio (SDR) technology. We'll begin with a foundational overview of both Global Navigation Satellite Systems (GNSS) and the intricacies of digital signal processing.
The presentation culminates in a live demonstration. We'll showcase the manipulation of Galileo's Open Service pilot signal, simulating an attack on various software and hardware systems. This practical demonstration serves to highlight the potential consequences of unaddressed vulnerabilities, emphasizing the importance of offensive security practices in safeguarding critical infrastructure.
Taking AI to the Next Level in Manufacturing.pdfssuserfac0301
Read Taking AI to the Next Level in Manufacturing to gain insights on AI adoption in the manufacturing industry, such as:
1. How quickly AI is being implemented in manufacturing.
2. Which barriers stand in the way of AI adoption.
3. How data quality and governance form the backbone of AI.
4. Organizational processes and structures that may inhibit effective AI adoption.
6. Ideas and approaches to help build your organization's AI strategy.
Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API?
Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose?
Which one is cheapest? Which one is fastest? Which one will scale to meet our needs?
Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!
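To make the comparison concrete, here is a hedged sketch of the simplest of those options, static hosting on S3; the bucket name is illustrative, and newer AWS accounts block public access by default, so a bucket policy or a CloudFront distribution is usually needed on top:

# Hypothetical static-site deployment to S3 (bucket name made up for illustration)
aws s3 mb s3://my-example-site
aws s3 website s3://my-example-site --index-document index.html --error-document error.html
aws s3 sync ./build s3://my-example-site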
3. Typical Branch Distribution
Grails Code:
- transmartApp (without full repo history, always with wrong ancestry information ⇒ merging quite difficult)
- RModules (if you’re lucky), but analyses definitions in DB not provided
Database:
- SQL scripts on top of GPL 1.0 dump or later. Probably insufficient/won’t apply
- Stored procedures for ETL. Overlapping definitions with yours, but no history ⇒ merging quite difficult
- Manual fixups always required (even if just permissions/synonyms)
4. Typical Branch Distribution (II)
ETL:
- High variability in strategies
- Instructions/sample data rarely provided
- Kettle scripts are problematic
Solr/Rserve/Configuration:
- Solr schemas/dataimport.xml perpetually forgotten
- Idem for information on R packages
- Sample configuration rarely provided
5. Versioning Control
Version control used ONLY for Grails Code...
But often squashed and with wrong ancestor information.
Forget about database, Solr, most of ETL.
Result:
- Merges are very difficult
- Changes cannot easily be tracked
- Changes’ wherefores are unknown
- Regressions are introduced (no conflicts)
- Collaboration is based on e-mail attachments
6. Automation
Even with all the pieces...
- Setting up a new branch takes days; weeks for non-basic functionality
- No reproducibility in the process!
Result:
- Devs driven away from a fully local environment (too much work)
- Robust environment for CI passed over (too much work)
- Bugs cannot be reliably reproduced (see also: no consistent usage of VCS)
- Time wasted with deployment-specific mistakes/inconsistencies
7. Why?!
“The ‘source code’ for a work means the preferred form of the work for making modifications to it.” (GPL v3, section 1)
Is everyone holding back “source code”?
More likely explanation: no appropriate tooling being used.
8. Situation for tranSMART 1.1
The situation is much better! Some problems remain, though.
The Good:
- Create/populate DB is easy
- Most stuff is versioned
- CI for builds
- Image available
- Public issue tracking
The Bad:
- No Oracle support
- Changes to DB scripts/seed data are ad hoc (lax structure)
- No mechanism to support/compare schemas with other branches
- R analyses are JSON blobs in TSVs
- No VCS for Solr or Rserve/images’ setup
- Setting up Solr/Rserve is time-consuming
- Population of DB with sample data is still time-consuming
- Config changes required for dev
9. Description of transmart-data
We developed transmart-data to address most of these problems. transmart-data is a set of scripts for managing tranSMART’s environment and certain application data (e.g. Solr schemas, DDL, seed data), which is used by the scripts and sometimes generated by them. It has a makefile-based interface.
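To make that interface concrete, here is a minimal first-run sketch assembled from commands shown on the slides that follow (configuration on slide 12, DDL and seed data on slide 14); it assumes a local checkout of transmart-data and a PostgreSQL target:

# Minimal first-run sketch, assuming a checkout of transmart-data and PostgreSQL
cp vars.sample vars    # start from the sample configuration (slide 12)
vim vars               # edit the PG*/ORA* connection settings
source vars            # export the variables so make can pick them up
make postgres          # create the DDL and load the seed data (slide 14)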
10. transmart-data: Purposes
Purposes of transmart-data:
1. Allow setting up a complete dev environment quickly (< 30 min)
2. Bring versioning to the database schema and Solr files
3. Set up the Solr runtime
4. Invoke ETL pipelines
5. Set up Rserve
Target audience: Programmers
11. transmart-data: Non-purposes
Non-purposes of transmart-data:
1. Setting up a production environment (some components can be used)
2. New users evaluating tranSMART (use a pre-built image)
3. Building transmartApp or its plugin dependencies (build them yourself or use artifacts from Bamboo/Nexus)
12. Configuration
Environment variable based configuration:
cp vars.sample vars
vim vars      # edit file
source vars
# contents of vars:
PGHOST=/tmp
PGPORT=5432
PGDATABASE=transmart
PGUSER=$USER
PGPASSWORD=
TABLESPACES=$HOME/pg/tablespaces/
PGSQL_BIN=$HOME/pg/bin/
ORAHOST=localhost
ORAPORT=1521
ORASID=orcl
ORAUSER="sys as sysdba"
ORAPASSWORD=mypassword
ORACLE_MANAGE_TABLESPACES=0
# continues...
13. Database Schema Management
Support for Oracle and Postgres.
Postgres:
- Uses pg_dump(all)
- Parses the dump files
# Dump
make -C postgres/ddl dump
make -C postgres/ddl/GLOBAL extensions.sql roles.sql
# Load
make -C postgres/ddl load
Oracle:
- Queries dba_* tables
- Dumps DDL w/ DBMS_METADATA
# Dump
make -C oracle/ddl dump
# Load
make oracle
14. Seed Data
Only Postgres for now.
# Dump
# Tables to dump are listed in postgres/data/<schema>_list
make -C postgres/data dump
make -C postgres/common minimize_diffs
# Load
make -C postgres/data load
# Load DDL and data
make postgres
Only for basic stuff with no ETL!
Pretty fast (DDL + data loaded in 10s)
15. ETL (I)
Unified interface for ETL.
Prepare dataset:
1. Prepare ETL-specific source files
2. Prepare file with ETL-specific params
3. Upload dataset to CDN (optional)
For each new ETL pipeline, support must be added.
Load dataset:
make -C samples/{oracle,postgres} load_<type>_<studyid>
# Example:
make -C samples/postgres load_clinical_GSE8581
Everything is automated!
17. RModules Analyses (tsApp-DB)
Situation in transmartApp-DB:
update searchapp.plugin_module
set params = '{"id":"survivalAnalysis","converter":{"R":["source(''||PLUGINSCRIPTDIRECTORY||Common/dataBuilders.R'')",
  "source(''||PLUGINSCRIPTDIRECTORY||Common/ExtractConcepts.R'')","source(''||PLUGINSCRIPTDIRECTORY||Common/collapsingData.R'')",
  "source(''||PLUGINSCRIPTDIRECTORY||Common/BinData.R'')","source(''||PLUGINSCRIPTDIRECTORY||Survival/BuildSurvivalData.R'')",
  "\tSurvivalData.build(\n\tinput.dataFile=''||TEMPFOLDERDIRECTORY||Clinical/clinical.i2b2trans'',\n\tconcept.time=''||TIME||'',
  \n\tconcept.category=''||CATEGORY||'',\n\tconcept.eventYes=''||EVENTYES||'',\n\tbinning.enabled=''||BINNING||'',
  \n\tbinning.bins=''||NUMBERBINS||'',\n\tbinning.type=''||BINNINGTYPE||'',\n\tbinning.manual=''||BINNINGMANUAL||'',
  \n\tbinning.binrangestring=''||BINNINGRANGESTRING||'',\n\tbinning.variabletype=''||BINNINGVARIABLETYPE||'',
  \n\tinput.gexFile=''||TEMPFOLDERDIRECTORY||mRNA/Processed_Data/mRNA.trans'',\n\tinput.snpFile=''||TEMPFOLDERDIRECTORY||SNP/snp.trans'',
  \n\tconcept.category.type=''||TYPEDEP||'',\n\tgenes.category=''||GENESDEP||'',\n\tgenes.category.aggregate=''||AGGREGATEDEP||'',
  \n\tsample.category=''||SAMPLEDEP||'',\n\ttime.category=''||TIMEPOINTSDEP||'',\n\tsnptype.category=''||SNPTYPEDEP||'')\n\t"]},
  "name":"Survival Analysis","dataFileInputMapping":{"CLINICAL.TXT":"TRUE","SNP.TXT":"snpData","MRNA_DETAILED.TXT":"mrnaData"},
  "dataTypes":{"subset1":["CLINICAL.TXT"]},"pivotData":false,"view":"SurvivalAnalysis",
  "processor":{"R":["source(''||PLUGINSCRIPTDIRECTORY||Survival/CoxRegressionLoader.r'')","CoxRegression.loader(input.filename=''outputfile'')",
  "source(''||PLUGINSCRIPTDIRECTORY||Survival/SurvivalCurveLoader.r'')","SurvivalCurve.loader(input.filename=''outputfile'', concept.time=''||TIME||'')"]},
  "renderer":{"GSP":"/survivalAnalysis/survivalAnalysisOutput"}, ... (goes on)'
where module_name = 'pgsurvivalAnalysis';
Not very nice...
18. RModules Analyses (transmart-data)
In transmart-data:
- One file per analysis
- Files can be generated from DB data
- Sanely formatted
- But we really want to remove this from the DB!
array (
  'id' => 'heatmap',
  'name' => 'Heatmap',
  'dataTypes' =>
  array (
    'subset1' =>
    array (
      0 => 'CLINICAL.TXT',
    ),
  ),
  'dataFileInputMapping' =>
  array (
    'CLINICAL.TXT' => 'FALSE',
    'SNP.TXT' => 'snpData',
    'MRNA_DETAILED.TXT' => 'TRUE',
  ),
  'pivotData' => false,
  ...
19. Rserve
Targets for Rserve:
- Download/build R
- Install R packages
- Start Rserve
- Install System V init script for Rserve
- Idem for systemd
cd R
make -j8 bin/root/R
# some packages don't support concurrent builds
make install_packages
make start_Rserve
make start_Rserve.dbg
TRANSMART_USER=tomcat7 sudo -E make install_rserve_init
TRANSMART_USER=tomcat7 sudo -E make install_rserve_unit
20. Solr
- Solr (4.5.0) automatically downloaded and configured
- Solr cores automatically created
- User only needs to create a schema file and dataconfig.xml
# setup & start solr (psql)
make start
# just configure
make solr_home
make <core>_full_import
make <core>_delta_import
make clean_cores
ORACLE=1 make start
21. transmartApp Configuration
Out-of-tree config management:
- Targets for installing files
- Zero configuration for dev!
- Customization allowed without touching the target files
- Only supports our branches
- But a lot of configuration should be in-tree instead!
# install everything
# (previous files are backed up)
make install
# just one file:
make install_Config.groovy
make install_BuildConfig.groovy
make install_DataSource.groovy
# customizations in:
# Config-extra.php
# BuildConfig.groovy (limited)