Real-time Analytics with Trino and Apache Pinot (Xiang Fu)
Trino Summit 2021:
An overview of the Trino Pinot Connector, which combines the flexibility of Trino's full SQL support with the power of Apache Pinot's real-time analytics, giving you the best of both worlds.
Retrieval Augmented Generation in Practice: Scalable GenAI platforms with k8s... (Mihai Criveti)
Mihai is the Principal Architect for Platform Engineering and Technology Solutions at IBM, responsible for Cloud Native and AI Solutions. He is a Red Hat Certified Architect, CKA/CKS, a leader in the IBM Open Innovation community, and an advocate for open source development. Mihai is driving the development of Retrieval Augmented Generation platforms and solutions for Generative AI at IBM that leverage WatsonX, vector databases, LangChain, HuggingFace and open source AI models.
Mihai will share lessons learned building Retrieval Augmented Generation, or “Chat with Documents”, platforms and APIs that scale and deploy on Kubernetes. His talk will cover use cases for Generative AI, limitations of Large Language Models, and the use of RAG, vector databases and fine tuning to overcome model limitations and build solutions that connect to your data, provide content grounding, limit hallucinations and form the basis of explainable AI. In terms of technology, he will cover LLAMA2, HuggingFace TGIS, SentenceTransformers embedding models using Python, LangChain, and the Weaviate and ChromaDB vector databases. He’ll also share tips on writing code using LLMs, including building an agent for Ansible and containers.
Scaling factors for Large Language Model Architectures:
• Vector Database: consider sharding and High Availability
• Fine Tuning: collecting data to be used for fine tuning
• Governance and Model Benchmarking: how are you testing your model performance over time, with different prompts, one-shot, and various parameters
• Chain of Reasoning and Agents
• Caching embeddings and responses
• Personalization and Conversational Memory Database
• Streaming Responses and optimizing performance. A fine tuned 13B model may perform better than a poor 70B one!
• Calling 3rd party functions or APIs for reasoning or other types of data (e.g., LLMs are terrible at reasoning and prediction, so consider calling other models)
• Fallback techniques: fallback to a different model, or default answers
• API scaling techniques, rate limiting, etc.
• Async, streaming and parallelization, multiprocessing, GPU acceleration (including embeddings), generating your API using OpenAPI, etc.
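Two of the scaling factors listed above, caching embeddings and falling back to a different model or a default answer, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the speaker's implementation: `embed`, `primary_model`, `backup_model` and `DEFAULT_ANSWER` are hypothetical stand-ins, and the "embedding" is just a hash of the text so that the example runs without any model.

```python
import hashlib
from functools import lru_cache
from typing import Callable, Sequence, Tuple

DEFAULT_ANSWER = "Sorry, I cannot answer that right now."  # hypothetical default reply


@lru_cache(maxsize=10_000)
def embed(text: str) -> Tuple[float, ...]:
    """Hypothetical stand-in for an embedding model: hashes the text to 4 floats.

    lru_cache means a repeated text is served from memory instead of
    recomputing (or re-requesting) the embedding.
    """
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return tuple(b / 255.0 for b in digest[:4])


def answer_with_fallback(prompt: str, models: Sequence[Callable[[str], str]]) -> str:
    """Try each model in order; return a default answer if every one fails."""
    for model in models:
        try:
            return model(prompt)
        except Exception:
            continue  # this model failed, so try the next one
    return DEFAULT_ANSWER


def primary_model(prompt: str) -> str:
    """Hypothetical primary model that is currently down."""
    raise TimeoutError("primary model unavailable")


def backup_model(prompt: str) -> str:
    """Hypothetical smaller model used as a fallback."""
    return f"backup answer to: {prompt}"
```

Calling `answer_with_fallback("hi", [primary_model, backup_model])` returns the backup model's reply, and `embed.cache_info()` shows how many embedding requests were served from the cache.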
Reproducible AI using MLflow and PyTorch (Databricks)
Model reproducibility is becoming the next frontier for successfully building and deploying AI models in both research and production scenarios. In this talk, we will show you how to build reproducible AI models and workflows using PyTorch and MLflow that can be shared across your teams, providing traceability and speeding up collaboration on AI projects.
DAT304_Amazon Aurora Performance Optimization with MySQL (Kamal Gupta)
Amazon Aurora services are MySQL- and PostgreSQL-compatible relational database engines with the speed, reliability, and availability of high-end commercial databases at one-tenth the cost. This session introduces you to Amazon Aurora, explores the capabilities and features of Aurora, explains common use cases, and helps you get started with Aurora.
An introduction to Elasticsearch's advanced relevance ranking toolbox (Elasticsearch)
The hallmark of a great search experience is always delivering the most relevant results, quickly, to every user. The difficulty lies behind the scenes, in making that happen elegantly and at scale. From App Search’s intuitive drag-and-drop interface to the advanced relevance capabilities built into the core of Elasticsearch, Elastic offers a range of tools for developers to tune relevance ranking and create incredible search experiences. In this session, we’ll explore some of Elasticsearch’s advanced relevance ranking features, such as dense vector fields, BM25F, ranking evaluation, and more. Plus, we’ll give you some ideas for how these features are being used by other Elastic users to create world-class, category-defining search experiences.
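For readers unfamiliar with the BM25 family mentioned above, the following is a minimal sketch of plain single-field BM25 scoring. It is not BM25F (which additionally weights multiple fields) and it is not Elasticsearch's implementation. The function name `bm25_scores` and the defaults `k1=1.2`, `b=0.75` are choices made for this illustration, and the documents are assumed to be already tokenized.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score every tokenized document against the query with plain BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue  # term absent from this document
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates (k1) and is normalized by document length (b).
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

A document that repeats a query term scores higher than an equally long document that mentions it once, and a document without the term scores zero.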
Presto: Optimizing Performance of SQL-on-Anything Engine (DataWorks Summit)
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Bloomberg, Comcast, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, Presto has experienced unprecedented growth in popularity over the last few years, in both on-premises and cloud deployments over object stores, HDFS, NoSQL and RDBMS data stores.
With the ever-growing list of connectors to new data sources such as Azure Blob Storage, Elasticsearch, Netflix Iceberg, Apache Kudu, and Apache Pulsar, the recently introduced Cost-Based Optimizer in Presto must account for heterogeneous inputs with differing and often incomplete data statistics. This talk will explore this topic in detail and discuss the best use cases for Presto across several industries. In addition, we will present recent Presto advancements, such as geospatial analytics at scale, and the project roadmap going forward.
Distributed Trace & Log Analysis using ML (Jorge Cardoso)
The field of AIOps, also known as Artificial Intelligence for IT Operations, uses advanced technologies to dramatically improve the monitoring, operation, and troubleshooting of distributed systems. Its main premise is that operations can be automated using monitoring data to reduce the workload of operators (e.g., SREs or production engineers). Our current research explores how AIOps, together with many related fields (deep learning, machine learning, distributed traces, graph analysis, time-series analysis, sequence analysis, advanced statistics, NLP and log analysis), can be used to effectively detect, localize, predict, and remediate failures in large-scale cloud infrastructures (>50 regions and AZs) by analyzing service management data (e.g., distributed traces, logs, events, alerts, metrics). In particular, this talk will describe how one monitoring data structure, the distributed trace, can be analyzed using deep learning to identify anomalies in its spans. This capability empowers operators to quickly identify which components of a distributed system are faulty.
Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa... (Databricks)
Airbnb has a wide variety of ML problems ranging from models on traditional structured data to models built on unstructured data such as user reviews, messages and listing images. The ability to build, iterate on, and maintain healthy machine learning models is critical to Airbnb’s success. Many ML Platforms cover data collection, feature engineering, training, deploying, productionalization, and monitoring but few, if any, do all of the above seamlessly.
Bighead aims to tie together various open source and in-house projects to remove incidental complexity from ML workflows. Bighead is built on Python and Spark and can be used in modular pieces as each ML problem presents unique challenges. Through standardization of the path to production, training environments and the methods for collecting and transforming data on Spark, each model is reproducible and iterable.
This talk covers the architecture, the problems that each individual component and the overall system aim to solve, and a vision for the future of machine learning infrastructure. Bighead is widely adopted at Airbnb, and we have a variety of models running in production. We have seen the overall model development time go down from many months to days on Bighead. We plan to open source Bighead to allow the wider community to benefit from our work.
MLflow is an MLOps tool that enables data scientists to quickly productionize their machine learning projects. To achieve this, MLflow has four major components: Tracking, Projects, Models, and Registry. MLflow lets you train, reuse, and deploy models with any library and package them into reproducible steps. MLflow is designed to work with any machine learning library and requires minimal changes to integrate into an existing codebase. In this session, we will cover the common pain points of machine learning developers, such as experiment tracking, reproducibility, deployment tooling and model versioning. Get ready to get your hands dirty with a quick ML project that uses MLflow and is released to production, so you can understand the MLOps lifecycle.
In this exclusive Premier Inside Out, you will hear from Druid committer Slim Bouguerra, Staff Software Engineer and Product Manager Will Xu. These Hortonworkers will explain the vision of these components, review new features, share some best practices and answer your questions.
View the webinar here: https://hortonworks.com/webinar/hortonworks-premier-apache-druid/
Improving the Life of Data Scientists: Automating ML Lifecycle through MLflow (Databricks)
The data science lifecycle consists of multiple iterative steps: data collection, data cleaning/exploration, feature engineering, model training, model deployment and scoring among others. The process is often tedious and error-prone and requires considerable human effort. Apart from these challenges, when it comes to leveraging ML in enterprise applications, especially in regulated environments, the level of scrutiny for data handling, model fairness, user privacy, and debuggability is very high. In this talk, we present the basic features of Flock, an end-to-end platform that facilitates adoption of ML in enterprise applications. We refer to this new class of applications as Enterprise Grade Machine Learning (EGML). Flock leverages MLflow to simplify and automate some of the steps involved in supporting EGML applications, allowing data scientists to spend most of their time on improving their ML models. Flock makes use of MLflow for model and experiment tracking but extends and complements it by providing automatic logging, deeper integration with relational databases that often store confidential data, model optimizations and support for the ONNX model format and the ONNX Runtime for inference. We will also present our ongoing work on automatically tracking lineage between data and ML models which is crucial in regulated environments. We will showcase Flock’s features through a demo using Microsoft’s Azure Data Studio and MLflow.
In this presentation we discuss several concepts, including word representation using SVD as well as neural-network-based techniques. In addition, we cover core concepts such as cosine similarity and atomic and distributed representations.
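The contrast between atomic and distributed representations, and the role of cosine similarity, can be illustrated with a short sketch. The vocabulary and the dense vectors below are made-up values chosen for this example, not data from the presentation.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def one_hot(word: str, vocabulary: List[str]) -> List[float]:
    """Atomic representation: a single 1 at the word's index in the vocabulary."""
    return [1.0 if w == word else 0.0 for w in vocabulary]


# Hypothetical vocabulary and hand-picked dense vectors, for illustration only.
VOCABULARY = ["cat", "dog", "car"]
DENSE = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}
```

With one-hot vectors every pair of distinct words has similarity 0, so "cat" is no closer to "dog" than to "car". With the dense vectors, `cosine_similarity(DENSE["cat"], DENSE["dog"])` is higher than `cosine_similarity(DENSE["cat"], DENSE["car"])`, which is the property distributed representations are meant to provide.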
Keras Tutorial For Beginners | Creating Deep Learning Models Using Keras In P... (Edureka!)
** AI & Deep Learning Training: https://www.edureka.co/ai-deep-learning-with-tensorflow **
This Edureka "Keras Tutorial" (Deep Learning Blog Series: https://goo.gl/4zxMfU) gives you a quick and insightful look at how Keras works, along with an interesting use case. We will cover the following topics:
Agenda:
What is Keras?
Who makes Keras?
Who uses Keras?
What Makes Keras special?
Working principle of Keras
Keras Models
Understanding Execution
Implementing a Neural Network
Use-Case with Keras
Coding in Colaboratory
Session in a minute
Check out our Deep Learning blog series: https://bit.ly/2xVIMe1
Check out our complete Youtube playlist here: https://bit.ly/2OhZEpz
Using LLVM to accelerate processing of data in Apache Arrow (DataWorks Summit)
Most query engines follow an interpreter-based approach where a SQL query is translated into a tree of relational algebra operations then fed through a conventional tuple-based iterator model to execute the query. We will explore the overhead associated with this approach and how the performance of query execution on columnar data can be improved using run-time code generation via LLVM.
Generally speaking, the best case for optimal query execution performance is a hand-written query plan that does exactly what is needed by the query for the exact same data types and format. Vectorized query processing models amortize the cost of function calls. However, research has shown that hand-written code for a given query plan has the potential to outperform the optimizations associated with a vectorized query processing model.
Over the last decade, the LLVM compiler framework has seen significant development. Furthermore, the database community has realized the potential of LLVM to boost query performance by implementing JIT query compilation frameworks. With LLVM, a SQL query is translated into a portable intermediate representation (IR), which is subsequently converted into machine code for the desired target architecture.
Dremio is built on top of Apache Arrow’s in-memory columnar vector format. The in-memory vectors map directly to the vector type in LLVM and that makes our job easier when writing the query processing algorithms in LLVM. We will talk about how Dremio implemented query processing logic in LLVM for some operators like FILTER and PROJECT. We will also discuss the performance benefits of LLVM-based vectorized query execution over other methods.
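The difference between interpreting an expression tree for every row and generating specialized code once per query can be shown with a small Python analogy. This sketch does not use LLVM or Arrow and is not Dremio's implementation: it generates Python source instead of LLVM IR, and the tuple-based expression format and all function names are assumptions made for this example.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
# Expression tree: ("col", name), ("const", value) or (op, left, right).
BINARY_OPS = ("+", "*", ">")


def interpret(expr: tuple, row: Row) -> Any:
    """Interpreter approach: walk the expression tree again for every row."""
    kind = expr[0]
    if kind == "col":
        return row[expr[1]]
    if kind == "const":
        return expr[1]
    left, right = interpret(expr[1], row), interpret(expr[2], row)
    if kind == "+":
        return left + right
    if kind == "*":
        return left * right
    if kind == ">":
        return left > right
    raise ValueError(f"unknown operator: {kind}")


def to_source(expr: tuple) -> str:
    """Translate the tree into Python source text (the analogue of emitting IR)."""
    kind = expr[0]
    if kind == "col":
        return f"row[{expr[1]!r}]"
    if kind == "const":
        return repr(expr[1])
    if kind in BINARY_OPS:
        return f"({to_source(expr[1])} {kind} {to_source(expr[2])})"
    raise ValueError(f"unknown operator: {kind}")


def compile_expr(expr: tuple) -> Callable[[Row], Any]:
    """Code generation approach: build one specialized function per query."""
    source = f"lambda row: {to_source(expr)}"
    return eval(compile(source, "<generated>", "eval"))


def filter_project(rows: List[Row], predicate: Callable[[Row], Any],
                   projection: Callable[[Row], Any]) -> List[Any]:
    """FILTER then PROJECT over a list of rows."""
    return [projection(r) for r in rows if predicate(r)]
```

For a predicate such as `(">", ("col", "price"), ("const", 10))`, `compile_expr` produces the equivalent of `lambda row: (row['price'] > 10)`, so the tree walk and operator dispatch happen once at query compilation time rather than once per row.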
Speaker
Siddharth Teotia, Dremio, Software Engineer
How to Write the Fastest JSON Parser/Writer in the World (Milo Yip)
How RapidJSON was developed to achieve the highest performance among 20 C/C++ JSON libraries. Benchmarks, some C++ design, algorithms and low-level optimizations are covered.
Presentation from March 18th, 2013 Triangle Java User Group on Taming Text. Presentation covers search, question answering, clustering, classification, named entity recognition, etc. See http://www.manning.com/ingersoll for more.
This talk is from Distributed Data Summit SF 2018 - http://distributeddatasummit.com/2018-sf/sessions#chella
Audit logging is one of the most critical features in an enterprise-ready database in terms of security compliance. Furthermore, live traffic troubleshooting is critical for operators to resolve production issues quickly. While past versions have lacked these critical features, the Cassandra team understood the need for better solutions, and in the upcoming release of Cassandra both of these features come out of the box, which makes Cassandra even more awesome to work with. Cassandra now supports audit logging and query logging as part of C* itself. In this talk, the audience will learn how to enable, configure, and tune audit logging for their C* clusters and how to log live traffic/queries for several needs, including troubleshooting or even live traffic replay.
BSides LV 2016 - Beyond the tip of the iceberg - fuzzing binary protocols for... (Alexandre Moneger)
This presentation shows that code coverage guided fuzzing is possible in the context of network daemon fuzzing.
Some fuzzers are black-box while others are protocol aware. For protocol-aware fuzzers, the fuzzer writers typically model the protocol specification and implement packet awareness logic in the fuzzer. Unfortunately, a fuzzer being protocol aware does not guarantee that sufficient code paths have been reached.
The presentation deals with specific scenarios where the target protocol is completely unknown (proprietary) and no source code or protocol specs are accessible. The tool developed builds a feedback loop between the client and the server components using the concept of "gate functions". A gate function triggers monitoring. The pintool component tracks the binary code coverage for all the functions until it reaches an exit gate. By instrumenting such gated functions, the tool is able to measure code coverage during packet processing.
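The feedback loop described above can be sketched as a toy coverage-guided fuzzer. This is not the tool from the presentation: the talk instruments a closed-source binary with a pintool, whereas this sketch replaces that instrumentation with an explicit `hits` set filled in by a hypothetical `parse_packet` target. All names and the packet format are assumptions made for the example.

```python
import random
from typing import List, Set, Tuple


def parse_packet(data: bytes, hits: Set[str]) -> None:
    """Hypothetical target: records which branches a packet reaches."""
    hits.add("entry")  # entry gate: coverage monitoring starts here
    if len(data) >= 1 and data[0] == 0x7F:
        hits.add("magic_ok")
        if len(data) >= 2 and data[1] < 4:
            hits.add("known_type")
    hits.add("exit")  # exit gate: coverage monitoring stops here


def mutate(data: bytes, rng: random.Random) -> bytes:
    """Overwrite one randomly chosen byte (data must be non-empty)."""
    buf = bytearray(data)
    buf[rng.randrange(len(buf))] = rng.randrange(256)
    return bytes(buf)


def fuzz(seed: bytes, rounds: int, rng: random.Random) -> Tuple[List[bytes], Set[str]]:
    """Keep only the mutated inputs that reach branches not seen before."""
    corpus = [seed]
    covered: Set[str] = set()
    parse_packet(seed, covered)
    for _ in range(rounds):
        candidate = mutate(rng.choice(corpus), rng)
        hits: Set[str] = set()
        parse_packet(candidate, hits)
        if not hits <= covered:  # the candidate reached something new
            corpus.append(candidate)
            covered |= hits
    return corpus, covered
```

Because inputs that unlock new branches are kept and mutated further, the fuzzer can get past the first check and then explore the second one, without any knowledge of the packet format.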
Messaging, interoperability and log aggregation - a new framework (Tomas Doran)
In this talk, I will cover why log files are horrible, logging log lines and more structured performance metrics from large-scale production applications, and building reliable, scalable and flexible large-scale software systems in multiple languages.
Why (almost) all log formats are horrible will be explained, and why JSON is a good solution for logging will be discussed, along with a number of message queuing, middleware and network transport technologies, including STOMP, AMQP and ZeroMQ.
The Message::Passing framework will be introduced, along with the logstash.net project, with which the Perl code is interoperable. These are pluggable frameworks in Ruby/Java/JRuby and Perl with pre-written sets of inputs, filters and outputs for many different systems, message formats and transports.
They were initially designed to be aggregators and filters of data for logging. However they are flexible enough to be used as part of your messaging middleware, or even as a replacement for centralised message queuing systems.
You can have your cake and eat it too: an architecture which is flexible, extensible, scalable and distributed. Build discrete, loosely coupled components which just pass messages to each other easily.
Integrate and interoperate with your existing code and code bases easily, consume from or publish to any existing message queue, logging or performance metrics system you have installed.
Simple examples using common input and output classes will be demonstrated using the framework, as will easily adding your own custom filters. A number of common messaging middleware patterns will be shown to be trivial to implement.
Some higher level use-cases will also be explored, demonstrating log indexing in ElasticSearch and how to build a responsive platform API using webhooks.
Interoperability is also an important goal for messaging middleware. The logstash.net project will be highlighted and we'll discuss crossing the single language barrier, allowing us to have full integration between java, ruby and perl components, and to easily write bindings into libraries we want to reuse in any of those languages.
Industry - Program analysis and verification - Type-preserving Heap Profiler ... (ICSM 2011)
Paper: Type-preserving Heap Profiler for C++
Authors: József Mihalicza, Zoltán Porkoláb and Ábel Gábor
Session: "Industry Track Session 4: Program analysis and Verification"
The new Actor representation in Akka Typed allows formulations that lend themselves to monadic interpretation or introspection. This leads us to explore possibilities for expressing and verifying dynamic properties like the adherence to a communication protocol between multiple agents as well as the safety properties of that protocol on a global level. Academic research in this area is far from complete, but there are interesting initial results that we explore in this session: precisely how much purity and reasoning can we bring to the distributed world?
Hacker Halted 2014 - RDP Fuzzing And Why the Microsoft Open Protocol Specific... (EC-Council)
Over the past year, Tripwire Security Researchers Tyler Reguly and Andrew Swoboda have invested numerous hours into understanding the Microsoft Remote Desktop Protocol, specifically the pre-authentication portions of RDP. The Microsoft Open Protocol Specifications were heavily utilized for this project and, while both researchers had used the specifications before, neither had fully realized their usefulness to security researchers. This session will be a discussion of The Microsoft Open Protocol Specification with RDP as the example. The culmination of the session will be the release of a new RDP Fuzzer and a discussion around the vulnerabilities it has already discovered.
Attendees can expect to walk away with a strong understanding of the Microsoft Open Protocol Specifications and how they can leverage them to build protocol implementations and fuzzers, as well as investigate inherent flaws and discover new vulnerabilities. Attendees will have a better understanding of the pre-authentication RDP connection sequence and exactly what data is exchanged and what an attacker can deduce from this communication. Finally, attendees will gain insight into new RDP vulnerabilities.
This is a powerpoint presentation that I put together discussing best practices with Ansible, although it more specifically targets ansible playbooks. The topics include content organization, tips for writing playbooks, discussion around idempotency and its importance, the power of jinja2 within ansible, and finishes with some lessons learned.
This presentation was delivered on July 30th at WP Engine's office for the Austin Ansible MeetUp.
Security is a very important aspect of web applications. In order to protect sensitive data we should use cryptography. But does cryptography mean security? Absolutely not, especially if developers do not use it properly. In this talk I would like to present some best practices in PHP to implement secure cryptography using the extensions mcrypt, Hash and OpenSSL.
A Process is No One - Jared Atkinson and Robby Winchester
Does your organization want to start Threat Hunting, but you're not sure how to begin? Most people start with collecting ALL THE DATA, but data means nothing if you're not able to analyze it properly. This talk begins with the often overlooked first step of hunt hypothesis generation which can help guide targeted collection and analysis of forensic artifacts. We will demonstrate how to use the MITRE ATTACK Framework and our five-phase Hypothesis Generation Process to develop actionable hunt processes, narrowing the scope of your Hunt operation and avoiding "analysis paralysis." We will then walk through a detailed case study of detecting access token impersonation/manipulation from concept to technical execution by way of the Hypothesis Generation Process. Along the way, we will detail some of the most common access token manipulations in use and detail the defensive detection implications for each of these cases. This comprehensive case study will better arm both attackers and defenders with how to better utilize their toolset to detect or avoid detection of token theft and manipulation.
Natural Language Processing using JavaScript "Natural" Library. This deck covers Natural Language Understanding using JavaScript "Natural" library in detail
Attackers don’t just search for technology vulnerabilities, they take the easiest path and find the human vulnerabilities. Drive-by web attacks, targeted spear phishing, and more are commonplace today with the goal of delivering custom malware. In a world where custom advanced malware handily evades signature and blacklisting approaches, and does not depend on application software vulnerabilities, how do we understand when our environments are compromised? What are the telltale signs that compromise activity has started, and how can we move to arrest a compromise in progress before the attacker laterally moves and reinforces their position? The penetration testing community knows these signs and artifacts of advanced malware presence, and it is up to us to help educate defenders on what to look for.
OpenAI GPT in Depth - Questions and Misconceptions (Ivo Andreev)
OpenAI GPT in depth – misconceptions and questions you would like answered
Have you ever wondered why GPT models work? Do you ask questions like:
How does GPT work? Why does the same problem receive different answers for different users? Is there a way to improve explainability? Can GPT model provide its sources? Why does Bing chat work differently? What are my ways to have better performance and improve completions? How can I work with data in my enterprise? What practical business cases could a generative AI model fit solving?
If you are tired of sessions just scratching the surface of OpenAI GPT, this one will go deeper and answer questions like why, why not and how.
Epistemic Interaction - tuning interfaces to provide information for AI support (Alan Dix)
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf (91mobiles)
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Generative AI Deep Dive: Advancing from Proof of Concept to Production (Aggregage)
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
Climate Impact of Software Testing at Nordic Testing Days (Kari Kakkonen)
My slides at Nordic Testing Days 6.6.2024
The talk discusses the climate impact and sustainability of software testing. ICT and testing must carry their part of the global responsibility to help with climate warming. We can minimize the carbon footprint but we can also have a carbon handprint, a positive impact on the climate. Sustainability can be added to the quality characteristics, and then measured continuously. Test environments can be used less, on a smaller scale and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Removing Uninteresting Bytes in Software Fuzzing (Aftab Hussain)
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speed up fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutils' readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Pushing the limits of ePRTC: 100ns holdover for 100 days (Adtran)
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Transcript: Selling digital books in 2024: Insights from industry leaders - T... (BookNet Canada)
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
A tale of scale & speed: How the US Navy is enabling software delivery from l... (sonjaschweigert1)
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
Accelerate your Kubernetes clusters with Varnish Caching (Thijs Feryn)
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
DevOps and Testing slides at DASA Connect (Kari Kakkonen)
Slides by Rik Marselis and me from the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps is. We finished with a lovely workshop in which the participants tried to find different ways to think about quality and testing in different parts of the DevOps infinity loop.
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf (Peter Spielvogel)
Building better applications for business users with SAP Fiori.
• What is SAP Fiori and why it matters to you
• How a better user experience drives measurable business benefits
• How to get started with SAP Fiori today
• How SAP Fiori elements accelerates application development
• How SAP Build Code includes SAP Fiori tools and other generative artificial intelligence capabilities
• How SAP Fiori paves the way for using AI in SAP apps
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do... (UiPathCommunity)
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
State of ICS and IoT Cyber Threat Landscape Report 2024 preview (Prayukth K V)
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
PHP Frameworks: I want to break free (IPC Berlin 2024) (Ralf Eggert)
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
2. ● Speculative ideas with specific techniques
● Python is great for NLP, ML, simple text processing
Overview
3. Author of Text Processing with NLTK Cookbook
Contributor to Bad Data Handbook
Blog @ StreamHacker.com
Helped create Seahorse / Gnome Keyring (GPG UI)
CTO @ InsightEngines.com
About me
9. • Edit distance (a.k.a. Levenshtein distance)
• Fuzzywuzzy
• Can use to identify similar strings
• Ex: Google vs Go0gle = edit distance 1
Fuzzy Matching
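The "Google vs Go0gle" example can be checked with a short dynamic-programming sketch of Levenshtein distance. This is a plain-Python illustration, not the Fuzzywuzzy implementation; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


distance = edit_distance("Google", "Go0gle")
```

A small distance relative to the string length suggests two strings are near-duplicates, which is how it can be used to spot look-alike names.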
10. • Transform text into discrete values
• Use for data analysis, machine learning
• Art, not science
Feature Extraction
11. • Date parsing with dateutil
• Regex patterns
• Grammars with pyparsing
• Automatic log parsing with Logpai logparser
Parsing
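As a rough sketch of the "regex patterns" bullet, the snippet below pulls fields out of one syslog-style line. The patterns, the `parse_record` helper, and the month/day/year reading of the timestamp are all assumptions made for illustration; the standard-library `datetime.strptime` stands in for dateutil so the example has no dependencies.

```python
import re
from datetime import datetime

FIELDS_RE = re.compile(r"\(([^)]*)\)")    # text inside the first (...) group
PAIR_RE = re.compile(r"(\w+):\s*(\w+)")   # "Key: value" pairs
PID_RE = re.compile(r"\bpid=(\d+)")       # process id
STAMP_RE = re.compile(r"\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}")


def parse_record(line):
    """Extract a timestamp, the parenthesised key/value pairs, and the pid."""
    record = {}
    stamp = STAMP_RE.search(line)
    if stamp:
        # Assumption: the numeric date is month/day/two-digit year.
        record["timestamp"] = datetime.strptime(stamp.group(0), "%m/%d/%y %H:%M:%S")
    group = FIELDS_RE.search(line)
    if group:
        for key, value in PAIR_RE.findall(group.group(1)):
            record[key.lower()] = value
    pid = PID_RE.search(line)
    if pid:
        record["pid"] = int(pid.group(1))
    return record


line = ("Sep 19 19:18:40 acmepayroll syslog: 04/30/10 12:18:51 39480627 wksh: "
        "HANDLING TELNET CALL (User: root, Branch: ABCDE, Client: 10101) pid=9644")
parsed = parse_record(line)
```

Hand-written patterns like these are brittle across log formats, which is the gap that grammar-based or automatic log parsers aim to close.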
13. • acmepayroll -> aa
• User -> Aa
• ABCDE -> AA
• 10101 -> nn
• pid=9644 -> aa=nn
Token Shapes
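The slide gives five examples but not the exact rule, so the sketch below uses one rule set that happens to reproduce them: split on non-alphanumeric characters, map each character to a class (`a`, `A`, `n`), collapse repeated classes, and double the symbol when a segment has a single class. The function names are hypothetical.

```python
import re


def _char_class(ch):
    """Map a character to its shape symbol."""
    if ch.isdigit():
        return "n"
    return "A" if ch.isupper() else "a"


def _segment_shape(segment):
    """Shape of one alphanumeric run, e.g. 'User' -> 'Aa', 'ABCDE' -> 'AA'."""
    classes = [_char_class(ch) for ch in segment]
    # Collapse consecutive repeats of the same class.
    collapsed = [c for i, c in enumerate(classes) if i == 0 or c != classes[i - 1]]
    if len(collapsed) == 1:
        return collapsed[0] * 2  # a uniform run is written as a doubled symbol
    return "".join(collapsed)


def token_shape(token):
    """Replace each alphanumeric run with its shape; keep punctuation as-is."""
    parts = re.split(r"([^A-Za-z0-9]+)", token)
    return "".join(
        _segment_shape(p) if re.fullmatch(r"[A-Za-z0-9]+", p) else p
        for p in parts
    )


shapes = {t: token_shape(t) for t in ["acmepayroll", "User", "ABCDE", "10101", "pid=9644"]}
```

Because many distinct tokens share a shape, counting shapes instead of raw tokens gives a much smaller vocabulary to compare across records.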
14. Log -> Token Shapes & Date Parsing
date aa syslog: date nn wksh: AA AA AA (User: aa,
Branch: AA, Client: nn) pid=nn
Sep 19 19:18:40 acmepayroll syslog: 04/30/10 12:18:51
39480627 wksh: HANDLING TELNET CALL (User: root,
Branch: ABCDE, Client: 10101) pid=9644
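The transformation above can be approximated with the sketch below. It assumes two date formats, keeps any token ending in a colon as a literal label, keeps the key of a `key=value` token, and uses a simplified shape rule (each run of lowercase, uppercase, or digits becomes `aa`, `AA`, or `nn`). These rules are inferred from this one example and are not taken from the talk's code.

```python
import re

DATE_RE = re.compile(
    r"[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}"   # syslog style: Sep 19 19:18:40
    r"|\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}"        # numeric: 04/30/10 12:18:51
)
RUN_RE = re.compile(r"[a-z]+|[A-Z]+|[0-9]+")
DATE_MARK = "\0"  # temporary placeholder so masked dates are not reshaped


def _run_shape(text):
    """Replace each run of one character class with a two-letter symbol."""
    def replace(match):
        first = match.group(0)[0]
        if first.isdigit():
            return "nn"
        return "AA" if first.isupper() else "aa"
    return RUN_RE.sub(replace, text)


def _token_template(token):
    if token == DATE_MARK or token.endswith(":"):
        return token  # masked date or a label such as "syslog:" / "(User:"
    if "=" in token:
        key, value = token.split("=", 1)
        return key + "=" + _run_shape(value)  # keep the key, shape the value
    return _run_shape(token)


def log_template(line):
    """Mask dates, then reduce the remaining tokens to shapes."""
    masked = DATE_RE.sub(DATE_MARK, line)
    shaped = " ".join(_token_template(t) for t in masked.split())
    return shaped.replace(DATE_MARK, "date")


line = ("Sep 19 19:18:40 acmepayroll syslog: 04/30/10 12:18:51 39480627 wksh: "
        "HANDLING TELNET CALL (User: root, Branch: ABCDE, Client: 10101) pid=9644")
template = log_template(line)
```

Records that reduce to the same template can then be grouped and counted as one record type.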
15. • Count tokens across all records & types (i.e. ssh)
• How uniform are tokens within a record type?
• Mostly uniform ~= clean data
• In a given record, does it have rare tokens?
• Rare = anomaly?
Identifying Rare Tokens
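One way to sketch the counting step: compute how many records of a type contain each token, then list, per record, the tokens that appear in fewer than a chosen share of records. The `min_fraction` threshold and the sample records (already reduced to token shapes) are made up for illustration.

```python
from collections import Counter


def rare_tokens(records, min_fraction=0.5):
    """For each record, return the tokens found in fewer than
    min_fraction of all records (by document frequency)."""
    token_sets = [set(r.split()) for r in records]
    doc_freq = Counter()
    for tokens in token_sets:
        doc_freq.update(tokens)
    cutoff = min_fraction * len(records)
    return [sorted(t for t in tokens if doc_freq[t] < cutoff) for tokens in token_sets]


# Hypothetical ssh records, already converted to token shapes.
ssh_records = [
    "date aa sshd: Accepted password for aa from nn.nn.nn.nn",
    "date aa sshd: Accepted password for aa from nn.nn.nn.nn",
    "date aa sshd: Accepted password for aa from nn.nn.nn.nn",
    "date aa sshd: Accepted password for aa from nn.nn.nn.nn via AA",
]
flagged = rare_tokens(ssh_records)
# Share of records with no rare tokens: a crude uniformity measure.
uniform_share = sum(1 for f in flagged if not f) / len(flagged)
```

A high `uniform_share` suggests the record type is clean, and the few records with a non-empty rare-token list are the candidates to inspect.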
16. 1. Log record -> feature extraction
2. Features -> Classifier
3. Classifier returns class probabilities
Classification
• Must train on good labeled data
• Binary classification is most accurate
• Scikit-learn has many options
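The three steps (features, classifier, class probabilities) can be sketched without dependencies using a tiny multinomial naive Bayes over word tokens. This is a stand-in for a scikit-learn classifier, not its API; the class name and the training records are invented for the example.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Multinomial naive Bayes with add-one smoothing over whitespace tokens."""

    def fit(self, records, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(records, labels):
            tokens = text.lower().split()       # step 1: feature extraction
            self.token_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict_proba(self, text):
        tokens = text.lower().split()
        total = sum(self.class_counts.values())
        log_scores = {}
        for label, n in self.class_counts.items():   # step 2: score each class
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            log_scores[label] = score
        # Step 3: normalise log scores into class probabilities.
        peak = max(log_scores.values())
        weights = {k: math.exp(v - peak) for k, v in log_scores.items()}
        norm = sum(weights.values())
        return {k: w / norm for k, w in weights.items()}


train = [
    ("sshd accepted password for root", "ssh"),
    ("sshd failed password for admin", "ssh"),
    ("cron job started backup", "other"),
    ("kernel usb device connected", "other"),
]
model = TinyNaiveBayes().fit([t for t, _ in train], [y for _, y in train])
probs = model.predict_proba("sshd failed password for root")
```

The two-class setup mirrors the slide's point that binary classification tends to be the most accurate; four training records are far too few for real use.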
17.
18. ● Spam vs Ham
● Sentiment & Opinion analysis: positive vs negative
● Fraud
Real World Classification
19. 1. Train on record type (ssh vs everything else)
2. What has type ssh but doesn’t classify?
3. What is not ssh but does classify?
Log Classification Anomalies
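The comparison in steps 2 and 3 amounts to listing disagreements between the known record type and the classifier's prediction, in both directions. The sketch below shows only that comparison; `toy_predict` is a hypothetical keyword rule standing in for a trained classifier, and the records are invented.

```python
def classification_anomalies(records, labels, predict, target="ssh"):
    """Return (missed, unexpected): records labelled target that do not
    classify as target, and records of other types that do."""
    missed = [r for r, y in zip(records, labels)
              if y == target and predict(r) != target]
    unexpected = [r for r, y in zip(records, labels)
                  if y != target and predict(r) == target]
    return missed, unexpected


def toy_predict(record):
    # Hypothetical stand-in for a classifier trained on ssh vs everything else.
    return "ssh" if "password" in record or "publickey" in record else "other"


records = [
    "sshd Accepted password for root",
    "sshd Received disconnect from 10.0.0.5",
    "cron session opened",
    "login Failed password for admin",
]
labels = ["ssh", "ssh", "other", "other"]
missed, unexpected = classification_anomalies(records, labels, toy_predict)
```

Both lists are worth reviewing: `missed` holds unusual records of the known type, and `unexpected` holds records from elsewhere that look like it.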
21. ● No training needed (unsupervised)
● Group by feature similarity / distance
● Must operate on large batch of records
● Scikit-learn has many options
● Gensim for topic modeling
Clustering
22.
23. 1. Cluster a few different record types
2. Does each type correspond to a single cluster?
3. Which records don’t cluster well? (far from centroid)
Data Clustering Anomalies
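A minimal sketch of these steps, assuming two toy features (token count and digit count) and a bare-bones k-means that seeds its centroids with the first k points. In practice a library clusterer such as scikit-learn's would replace the `kmeans` helper; the records are invented, and the naive seeding only works here because the first two records are of different types.

```python
import math


def features(record):
    """Two toy features: token count and number of digit characters."""
    return (float(len(record.split())), float(sum(ch.isdigit() for ch in record)))


def kmeans(points, k, iterations=10):
    centroids = list(points[:k])  # naive init: the first k points
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            groups[nearest].append(p)
        # Move each centroid to the mean of its group (keep it if the group is empty).
        centroids = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    labels = [min(range(k), key=lambda i: math.dist(p, centroids[i])) for p in points]
    return centroids, labels


records = [
    "sshd Accepted password for root port 22",
    "cron job started",
    "sshd Accepted password for alice port 22",
    "cron job finished",
    "sshd Accepted password for bob port 22",
    "sshd Accepted password for root port 22 after 9 failed attempts from 10 hosts",
]
points = [features(r) for r in records]
centroids, labels = kmeans(points, k=2)
# Rank records by distance from their own centroid, farthest first.
by_distance = sorted(
    ((math.dist(p, centroids[c]), r) for p, c, r in zip(points, labels, records)),
    reverse=True,
)
```

Comparing `labels` with the known record types answers step 2, and the head of `by_distance` answers step 3.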
24. ● A.k.a. Novelty / Outlier detection
● A.k.a. One-class classification
● Learn from good data set
● Identify new records that don’t fit
● Scikit-learn has a few options
● Automated anomaly detection with Logpai loglizer
Anomaly Detection
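The "learn from good data, flag what doesn't fit" idea can be sketched with a deliberately simple detector: remember the vocabulary of known-good records and score a new record by the share of tokens never seen before. This is a toy stand-in for the one-class models in scikit-learn or loglizer; the class name, threshold, and records are assumptions.

```python
class VocabularyNoveltyDetector:
    """Scores a record by the fraction of its tokens unseen in good data."""

    def fit(self, good_records):
        self.known = set()
        for record in good_records:
            self.known.update(record.lower().split())
        return self

    def score(self, record):
        tokens = record.lower().split()
        if not tokens:
            return 0.0
        unseen = sum(1 for t in tokens if t not in self.known)
        return unseen / len(tokens)

    def is_novel(self, record, threshold=0.3):
        return self.score(record) > threshold


good = [
    "session opened for user root",
    "session closed for user root",
    "session opened for user alice",
]
detector = VocabularyNoveltyDetector().fit(good)
familiar = detector.is_novel("session closed for user alice")
strange = detector.is_novel("segfault at address deadbeef in kernel")
```

Running the detector on token shapes rather than raw tokens would stop variable fields such as user names or addresses from counting as novel.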
28. ● Investigator: plain English log search -> multiple
visualizations & recommendations to do next
● Analyzer: data health analysis
● InsightEngines.com
About Insight Engines