This document summarizes Liwei Ren's presentation on binary similarity algorithms. It discusses three existing algorithms (ssdeep, sdhash, and TLSH) and proposes a new algorithm called TSFP. TSFP represents files as bags of blocks and measures similarity by the overlap of blocks between two files. It is suggested as a way to solve similarity search and clustering problems by building an index of files represented as TSFPs. The presentation concludes by inviting questions from the audience.
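The bag-of-blocks idea can be illustrated with a minimal sketch. The summary does not say how TSFP selects or compares blocks, so everything below is an assumption made for illustration: fixed-size blocks, SHA-1 block hashes, and a multiset Jaccard score. The names `bag_of_blocks` and `overlap_similarity` are hypothetical and not taken from the presentation.

```python
import hashlib
from collections import Counter


def bag_of_blocks(data: bytes, block_size: int = 4) -> Counter:
    """Split data into fixed-size blocks and count each block's hash.

    Fixed-size blocking is an assumption; the real TSFP blocking scheme
    is not described in the summary.
    """
    blocks = (data[i:i + block_size] for i in range(0, len(data), block_size))
    return Counter(hashlib.sha1(block).hexdigest() for block in blocks)


def overlap_similarity(a: Counter, b: Counter) -> float:
    """Multiset Jaccard: shared block count divided by combined block count."""
    shared = sum((a & b).values())   # per-block minimum of the two counts
    total = sum((a | b).values())    # per-block maximum of the two counts
    return shared / total if total else 1.0


if __name__ == "__main__":
    x = bag_of_blocks(b"AAAABBBBCCCC")
    y = bag_of_blocks(b"AAAABBBBDDDD")
    print(overlap_similarity(x, y))
```

Two files that share two of their three blocks score 0.5 under this measure, and identical files score 1.0.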
A brief introduction to the attention mechanism and its application in neural machine translation (NMT), especially in the Transformer, where attention was used to remove RNNs from NMT completely.
O.M.GSEA - An in-depth introduction to gene-set enrichment analysis (Shana White)
A comprehensive overview of 'classic' gene-set enrichment analysis that was presented for a Biostatistics/Bioinformatics divisional seminar. Supplemental slides (58+) include details for running GSEA with a variety of options (GUI, R script, R package).
Women Who Code-HSV Event:
'An Introduction to Machine Learning and Genomics'. Dr. Lasseigne will introduce the R programming language and the foundational concepts of machine learning with real-world examples including applications in the field of genomics with an emphasis on complex human disease research.
Brittany Lasseigne, PhD, is a postdoctoral fellow in the lab of Dr. Richard Myers at the HudsonAlpha Institute for Biotechnology and a 2016-2017 Prevent Cancer Foundation Fellow. Dr. Lasseigne received a BS in biological engineering from the James Worth Bagley College of Engineering at Mississippi State University and a PhD in biotechnology science and engineering from The University of Alabama in Huntsville. As a graduate student, she studied the role of epigenetics and copy number variation in cancer, identifying novel diagnostic biomarkers and prognostic signatures associated with kidney cancer. In her current position, Dr. Lasseigne’s research focus is the application of genetics and genomics to complex human diseases. Her recent work includes the identification of gene variants linked to ALS, characterization of gene expression patterns in schizophrenia and bipolar disorder, and development of non-invasive biomarker assays. Dr. Lasseigne is currently focused on integrating genomic data across cancers with functional annotations and patient information to explore novel mechanisms in cancer etiology and progression, identify therapeutic targets, and understand genomic changes associated with patient survival. Based upon those analyses, she is creating tools to share with the scientific community.
Much data is sequential – think speech, text, DNA, stock prices, financial transactions and customer action histories. Modern methods for modelling sequence data are often deep learning-based, composed of either recurrent neural networks (RNNs) or attention-based Transformers. A tremendous amount of research progress has recently been made in sequence modelling, particularly in the application to NLP problems. However, the inner workings of these sequence models can be difficult to dissect and intuitively understand.
This presentation/tutorial will start from the basics and gradually build upon concepts in order to impart an understanding of the inner mechanics of sequence models – why do we need specific architectures for sequences at all, when you could use standard feed-forward networks? How do RNNs actually handle sequential information, and why do LSTM units help longer-term remembering of information? How can Transformers do such a good job at modelling sequences without any recurrence or convolutions?
In the practical portion of this tutorial, attendees will learn how to build their own LSTM-based language model in Keras. A few other use cases of deep learning-based sequence modelling will be discussed – including sentiment analysis (prediction of the emotional valence of a piece of text) and machine translation (automatic translation between different languages).
The goals of this presentation are to provide an overview of popular sequence-based problems, impart an intuition for how the most commonly-used sequence models work under the hood, and show that quite similar architectures are used to solve sequence-based problems across many domains.
Introduction to Next-Generation Sequencing (NGS) Technology (QIAGEN)
The continuous evolution of NGS technology has led to an enormous diversification in NGS applications and dramatically decreased the costs to sequence a complete human genome.
In this presentation, we will discuss the following major topics:
• Basic overview of NGS sequencing technologies
• Next-generation sequencing workflow
• Spectrum of NGS applications
• QIAGEN universal NGS solutions
Slides for the 2016/2017 edition of the Data Mining and Text Mining Course at the Politecnico di Milano. The course is also part of the joint program with the University of Illinois at Chicago.
This presentation walks through the problem of multiple hypothesis testing in Statistics, with a special emphasis on procedures for controlling the False Discovery Rate.
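One widely used procedure for controlling the False Discovery Rate is the Benjamini-Hochberg step-up procedure. The abstract does not name the procedures covered in the slides, so choosing it here is an assumption, and the function name `benjamini_hochberg` is hypothetical. A minimal sketch:

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Step-up rule: find the largest rank k such that p_(k) <= (k / m) * alpha,
    then reject the k hypotheses with the smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank  # keep the largest rank that passes
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    print(benjamini_hochberg([0.03, 0.03, 0.03, 0.9]))
```

The example shows why the procedure is called step-up: no p-value passes its threshold at rank 1 or rank 2, but the third passes at rank 3, so all three of the smallest p-values are rejected.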
The Text Classification slides contain research results on candidate natural language processing algorithms. Specifically, they contain a brief overview of the natural language processing steps, the common algorithms used to transform words into meaningful vectors/data, and the algorithms used to learn from and classify the data.
To learn more about RAX Automation Suite, visit: www.raxsuite.com
Why should you care about Markov Chain Monte Carlo methods?
→ They are in the list of "Top 10 Algorithms of 20th Century"
→ They allow you to make inference with Bayesian Networks
→ They are used everywhere in Machine Learning and Statistics
Markov Chain Monte Carlo methods are a class of algorithms used to sample from complicated distributions. Typically, this is the case of posterior distributions in Bayesian Networks (Belief Networks).
These slides cover the following topics.
→ Motivation and Practical Examples (Bayesian Networks)
→ Basic Principles of MCMC
→ Gibbs Sampling
→ Metropolis–Hastings
→ Hamiltonian Monte Carlo
→ Reversible-Jump Markov Chain Monte Carlo
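As an illustration of one topic from the list above, here is a minimal random-walk Metropolis–Hastings sketch. It is not taken from the slides: the Gaussian proposal, the step size, the fixed seed, and the function name `metropolis_hastings` are all assumptions chosen for a small self-contained example.

```python
import math
import random


def metropolis_hastings(log_density, start, n_samples, step=1.0, seed=0):
    """Sample from an unnormalised 1-D density with a Gaussian random walk.

    log_density returns the log of the target density up to a constant.
    Because the proposal is symmetric, the acceptance ratio reduces to
    p(proposal) / p(current).
    """
    rng = random.Random(seed)
    x = start
    log_p = log_density(x)
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)
        log_p_new = log_density(proposal)
        # Accept with probability min(1, p(proposal) / p(current)).
        if rng.random() < math.exp(min(0.0, log_p_new - log_p)):
            x, log_p = proposal, log_p_new
        samples.append(x)  # a rejected proposal repeats the current state
    return samples


if __name__ == "__main__":
    # Target: standard normal, log density -x^2 / 2 up to a constant.
    draws = metropolis_hastings(lambda x: -0.5 * x * x, 0.0, 20000)
    mean = sum(draws) / len(draws)
    print(round(mean, 2))
```

Only density ratios are needed, which is why the normalising constant of the target (often intractable for posterior distributions) never has to be computed.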
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ... (Simplilearn)
This K-Means clustering algorithm presentation will take you through an introduction to machine learning, types of clustering algorithms, k-means clustering, and how K-Means clustering works, and finally explains K-Means clustering using a real-life use case. This Machine Learning algorithm tutorial video is ideal for beginners who want to learn how K-Means clustering works.
The following topics are covered in this K-Means Clustering Algorithm presentation:
1. Types of Machine Learning?
2. What is K-Means Clustering?
3. Applications of K-Means Clustering
4. Common distance measure
5. How does K-Means Clustering work?
6. K-Means Clustering Algorithm
7. Demo: k-Means Clustering
8. Use case: Color compression
- - - - - - - -
About Simplilearn Machine Learning course:
A form of artificial intelligence, Machine Learning is revolutionizing the world of computing as well as all people’s digital interactions. Machine Learning powers such innovative automated technologies as recommendation engines, facial recognition, fraud protection and even self-driving cars. This Machine Learning course prepares engineers, data scientists and other professionals with the knowledge and hands-on skills required for certification and job competency in Machine Learning.
- - - - - - -
Why learn Machine Learning?
Machine Learning is taking over the world, and with that, there is a growing need among companies for professionals who know the ins and outs of Machine Learning.
The Machine Learning market size is expected to grow from USD 1.03 Billion in 2016 to USD 8.81 Billion by 2022, at a Compound Annual Growth Rate (CAGR) of 44.1% during the forecast period.
- - - - - -
What skills will you learn from this Machine Learning course?
By the end of this Machine Learning course, you will be able to:
1. Master the concepts of supervised, unsupervised and reinforcement learning and modeling.
2. Gain practical mastery over principles, algorithms, and applications of Machine Learning through a hands-on approach which includes working on 28 projects and one capstone project.
3. Acquire thorough knowledge of the mathematical and heuristic aspects of Machine Learning.
4. Understand the concepts and operation of support vector machines, kernel SVM, naive bayes, decision tree classifier, random forest classifier, logistic regression, K-nearest neighbors, K-means clustering and more.
5. Be able to model a wide variety of robust Machine Learning algorithms including deep learning, clustering, and recommendation systems.
- - - - - - -
A Theoretic Framework for Evaluating Similarity Digesting Tools (Liwei Ren任力偉)
Similarity digesting is a class of algorithms and technologies that generate hashes from files while preserving file similarity. They find applications across the security industry: malware variant detection, spam filtering, computer forensic analysis, data loss prevention, etc. A few schemes and tools are available, including ssdeep, sdhash and TLSH. While useful for detecting file similarity, they define similarity from different perspectives. In other words, they take different approaches to describing what file similarity is. To compare and evaluate these tools more rigorously, we introduce a simple mathematical model of similarity that covers all three schemes and beyond. This model enables us to establish a theoretic framework for analyzing essential differences among various similarity digesting tools. The general use cases proposed by NIST are studied. As a result, a few tools are found to be complementary to each other, so that we can use them in a hybrid approach in practice. Data experiment results are provided to support the theoretic analysis.
Bytewise Approximate Match: Theory, Algorithms and Applications (Liwei Ren任力偉)
Byte-wise approximate matching has become an important field in computer science that includes not only practical value but also theoretical significance. This talk will use six cases to define and describe the concept of approximate matching rigorously. They are identicalness, containment, cross-sharing, similarity, approximate containment and approximate cross-sharing. Based on the concept of approximate matching, one can propose a theoretic framework that consists of many problems of approximate matching, searching & clustering. Algorithmic solutions and challenges of the matching problems will be briefed as well as theoretic analysis. This framework also includes some elements of our previous works in both document fingerprinting problem and mathematical evaluation of similarity digest schemes { TLSH, ssdeep, sdhash }. In the end, we will discuss applications in various security disciplines.
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum... (Spark Summit)
In this talk we evaluate Apache Spark for a data-intensive machine learning problem. Our use case focuses on policy diffusion detection across the state legislatures in the United States over time. Previous work on policy diffusion has been unable to make an all-pairs comparison between bills due to computational intensity. As a substitute, scholars have studied single topic areas.
We provide an implementation of this analysis workflow as a distributed text processing pipeline with Spark ML and GraphFrames.
Histogrammar package—a cross-platform suite of data aggregation primitives for making histograms, calculating descriptive statistics and plotting in Scala—is introduced to enable interactive data analysis in Spark REPL.
We discuss the challenges and strategies of unstructured data processing, data formats for storage and efficient access, and graph processing at scale.
Building graphs to discover information by David Martínez at Big Data Spain 2015 (Big Data Spain)
The basic challenge of a data scientist is to unveil information from raw data. Traditional machine learning algorithms have treated “pure” data analytics situations that should comply with a set of restrictions, such as access to labels, a clear prediction objective… However, practice shows that, given how widespread data science is nowadays, the exception is the norm: it is usual to encounter situations that depend on gathering information from raw data which lacks the kind of structure or objective that classic approaches assume. In these situations, building a graph that encodes the information we are trying to unveil is the most intuitive place to start, or even the only feasible one when we lack any field knowledge or previously stated aim. Unfortunately, building a graph from scratch when the number of nodes is huge is computationally challenging and requires some approximations to make it feasible. In this review, we will talk about the most standard way of building those graphs in practice, and how to exploit them to solve data science tasks.
Session presented at Big Data Spain 2015 Conference
15th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/program/thu/slot-11.html#spch11.2
Real-time Data De-duplication using Locality-sensitive Hashing powered by Sto... (DECK36)
Data de-duplication, in our case also called "Entity Matching", is not just about reducing multiple instances of the same item to one in order to save some space. It is a challenging task with many practical applications from health care to fraud prevention: "Is this person the same patient as ten years ago, but has moved in the meantime?", "Is this personalized spam mail the same email that was sent to many others, yet customized in each case?", or "Was there a spike, and what is the base level of similar-looking account creations?". In the age of Big Data, we do have the data to answer such questions, but heuristically comparing each item to all other items quickly becomes technically prohibitive for huge data sets.
In this session, we will have a look at past practices and current developments in the world of data de-duplication. After that, we will look at how to leverage locality-sensitive hashing algorithmically to reduce the number of comparisons to a workable level. A demonstration will feature our implementation of that algorithm on top of Riak and Storm. The session will then finish with an overview of experiments and results using that system on different datasets, including browser fingerprints, tweets, and news articles.
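To show how locality-sensitive hashing can cut down the number of comparisons, here is a minimal in-memory sketch using MinHash signatures with banding. This is an assumption: the abstract does not say which LSH family the speakers use, and their implementation runs on Riak and Storm rather than in a single process. The helper names `shingles`, `minhash` and `candidate_pairs` are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of overlapping k-character substrings of text."""
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash(items, num_hashes=32):
    """One minimum per seeded hash function; similar sets share many minima."""
    def h(seed, item):
        digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big")
    return tuple(min(h(seed, item) for item in items) for seed in range(num_hashes))


def candidate_pairs(signatures, bands=8):
    """Bucket each signature band by band; only items sharing a bucket are compared."""
    if not signatures:
        return set()
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for name, sig in signatures.items():
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(name)
    pairs = set()
    for names in buckets.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs


if __name__ == "__main__":
    docs = {"a": "the quick brown fox", "b": "the quick brown fox", "c": "zzzzzzzzzz"}
    sigs = {name: minhash(shingles(text)) for name, text in docs.items()}
    print(candidate_pairs(sigs))
```

Only pairs that land in the same bucket for at least one band need a full comparison, so the all-pairs step is replaced by hashing each item a fixed number of times.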
Real time analytics with Spark Streaming by Padma at Bangalore I & D meetup (https://www.meetup.com/Bengaluru-Insights-and-Data-Meetup/events/238459154)
Workshop nwav 47 - LVS - Tool for Quantitative Data Analysis (Olga Scrivner)
In a hands-on format, this workshop will introduce participants to the Language Variation Suite (LVS), a user-friendly interactive web application built in R. LVS provides access to advanced statistical methods and visualization techniques, such as mixed-effects modeling, conditional and random tree analyses, and cluster analysis. These advanced methods enable researchers to handle imbalanced data, measure individual and group variation, estimate significance, and rank variables according to their significance.
Detection of Embryonic Research Topics by Analysing Semantic Topic Networks (Angelo Salatino)
Being aware of new research topics is an important asset for anybody involved in the research environment, including researchers, academic publishers and institutional funding bodies. In recent years, the amount of scholarly data available on the web has increased steadily, allowing the development of several approaches for detecting emerging research topics and assessing their trends. However, current methods focus on the detection of topics which are already associated with a label or a substantial number of documents. In this paper, we address instead the issue of detecting embryonic topics, which do not possess these characteristics yet. We suggest that it is possible to forecast the emergence of novel research topics even at such early stage and demonstrate that the emergence of a new topic can be anticipated by analysing the dynamics of pre-existing topics. We present an approach to evaluate such dynamics and an experiment on a sample of 3 million research papers, which confirms our hypothesis. In particular, we found that the pace of collaboration in sub-graphs of topics that will give rise to novel topics is significantly higher than the one in the control group.
The World Wide Web is moving from a Web of hyper-linked documents to a Web of linked data. Thanks to the Semantic Web technological stack and to the more recent Linked Open Data (LOD) initiative, a vast amount of RDF data have been published in freely accessible datasets connected with each other to form the so called LOD cloud. As of today, we have tons of RDF data available in the Web of Data, but only a few applications really exploit their potential power. The availability of such data is for sure an opportunity to feed personalized information access tools such as recommender systems. We will show how to plug Linked Open Data in a recommendation engine in order to build a new generation of LOD-enabled applications.
(Lecture given @ the 11th Reasoning Web Summer School - Berlin - August 1, 2015)
Mathematical Modeling for Practical Problems (Liwei Ren任力偉)
Mathematical modeling is an important step in developing many advanced technologies in domains such as network security, data mining, etc. This lecture introduces a process that the speaker has distilled from his past practice of mathematical modeling and algorithmic solutions in the IT industry as an applied mathematician, algorithm specialist, software engineer, and even entrepreneur. A practical problem from a DLP system will be used as an example of creating math models and providing algorithmic solutions.
Foundations of Machine Learning - StampedeCon AI Summit 2017 (StampedeCon)
This presentation will cover all aspects of modeling, from preparing data to training and evaluating the results. There will be descriptions of the mainline ML methods, including neural nets, SVM, boosting, bagging, trees, forests, and deep learning. Common problems of overfitting and dimensionality will be covered, with discussion of modeling best practices. Other topics will include field standardization, encoding categorical variables, and feature creation and selection. It will be a soup-to-nuts overview of all the necessary procedures for building state-of-the-art predictive models.
Extending Boyer-Moore Algorithm to an Abstract String Matching Problem (Liwei Ren任力偉)
The bad character shift rule of the Boyer-Moore string search algorithm is studied in this paper for the purpose of extending it to more general string matching problems. An abstract string matching problem is defined in general terms. An optimized string matching algorithm based on the bad character heuristic is proposed to solve the abstract matching problem efficiently.
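For readers unfamiliar with the bad character rule, the following minimal sketch shows it in the classic exact-match setting. It illustrates only the standard heuristic, not the paper's extended algorithm for the abstract problem, and the function name `bad_character_search` is hypothetical.

```python
def bad_character_search(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1.

    Uses only the bad character rule: compare right to left, and on a
    mismatch shift the pattern so that its last occurrence of the
    mismatched text character lines up with it (or shift past it if the
    character does not occur in the pattern at all).
    """
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    last = {ch: i for i, ch in enumerate(pattern)}  # rightmost index of each char
    s = 0  # current alignment of the pattern against the text
    while s <= n - m:
        j = m - 1
        while j >= 0 and pattern[j] == text[s + j]:
            j -= 1
        if j < 0:
            return s
        # Always advance by at least one position to guarantee progress.
        s += max(1, j - last.get(text[s + j], -1))
    return -1


if __name__ == "__main__":
    print(bad_character_search("here is a simple example", "example"))
```

When the mismatched text character does not appear in the pattern at all, the alignment moves past that character entirely, which is the source of the algorithm's sublinear behaviour on typical inputs.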
Near Duplicate Document Detection: Mathematical Modeling and Algorithms (Liwei Ren任力偉)
Near-duplicate document detection is a well-known problem in the area of information retrieval. It is an important problem to solve for many applications in the IT industry, and it has been studied extensively in the research literature. This article provides a novel solution to this classic problem. We present the problem with abstract models along with additional concepts such as text models, document fingerprints and document similarity. With these concepts, the problem can be transformed into a keyword-like search problem with results ranked by document similarity. There are two major techniques. The first is to extract robust and unique fingerprints from a document. The second is to calculate document similarity effectively. Algorithms for both fingerprint extraction and document similarity calculation are introduced as a complete solution.
IoT Security: Problems, Challenges and Solutions (Liwei Ren任力偉)
As a novel computing platform on the network, IoT will bring many security challenges to enterprise networks and create new opportunities for the security industry. This talk will provide a general overview of the enterprise network security problems, especially data security problems, caused by IoT. After that, a few existing security technologies are evaluated as necessary elements of a holistic network security approach that covers IoT devices. These technologies include: (a) IoT security monitoring and control; (b) FOTA for firmware vulnerability management; (c) NetFlow-based big data security analysis. In the end, the practice of standard security protocols (such as OpenIoC and IODEF) will be strongly advocated for delivering effective IoT security solutions.
Differential compression (also known as delta encoding) is a special category of data de-duplication. It finds applications in various domains such as data backup, software revision control systems, incremental software updates, and file synchronization over a network, to name just a few. This talk will introduce a taxonomy for categorizing delta encoding schemes in various applications. The pros and cons of each scheme will be investigated in depth.
Overview of Data Loss Prevention (DLP) TechnologyLiwei Ren任力偉
DLP is a technology that detects potential data breach incidents in timely manner and prevents them by monitoring data in-use (endpoints), in-motion (network traffic), and at-rest (data storage). It has been driven by regulatory compliances and intellectual property protection. This talk will introduce DLP models that describe the capabilities and scope that a DLP system should cover. A few system categories will be discussed accordingly with high-level system architecture. DLP is an interesting technology in that it provides advanced content inspection techniques. As such, a few content inspection techniques will be proposed and investigated in rigorous terms.
DLP Systems: Models, Architecture and AlgorithmsLiwei Ren任力偉
DLP is a data security technology that detects and prevents data breach incidents by monitoring data in-use, in-motion and at-rest. It has been widely applied for regulatory compliances, data privacy and intellectual property protection. This talk will introduce basic concepts and security models to describe DLP systems with high level architecture. DLP is an interesting discipline with content inspection techniques supported by sophisticated algorithms. Special investigation will be taken for a few algorithms: document fingerprinting, data record fingerprinting, scalable M-pattern string match and etc..
Binary Similarity : Theory, Algorithms and Tool Evaluation
1. Copyright 2011 Trend Micro Inc. 1
Binary Similarity : Theory, Algorithms and Tool Evaluation
Liwei Ren, Ph.D, Trend Micro™
University of Houston-Downtown, Houston, Texas, October, 2015
2. Agenda
• What is binary similarity ?
• Similarity Digesting: 3 Algorithms
• A Mathematical Model
• Tool Evaluation
• A Novel Fuzzy Hashing
• Summary and Further Research
Classification 10/2/2015 2
3. What Is Binary Similarity?
• Binary similarity or approximate matching.
– What is binary similarity ?
• 4 Use Cases specified by a NIST document:
4. What Is Binary Similarity?
5. Similarity Digesting : 3 Algorithms
• Similarity digesting (aka, fuzzy hashing):
– A class of hash techniques or tools that preserve similarity.
– Typical steps for digest generation:
– Detecting similarity with similarity digesting:
• Three similarity digesting algorithms and tools:
– ssdeep, sdhash & TLSH
6. Similarity Digesting : 3 Algorithms
• ssdeep
– Two steps for digesting:
– Edit Distance: Levenshtein distance
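As a rough illustration of the distance named above, the following sketch computes the textbook Levenshtein distance between two strings. It is not ssdeep's own scoring code; the function name `levenshtein` is chosen here for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b (textbook dynamic programming)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]
```

Two digests that differ in only a few characters get a small distance, which is what lets an edit distance over digests stand in for similarity of the underlying files.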
7. Similarity Digesting : 3 Algorithms
• sdhash by Dr. Vassil Roussev
– Two steps for digesting:
– Edit Distance: Hamming distance
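For reference, Hamming distance between two equal-length bitmaps counts the bit positions where they differ. The sketch below shows only that basic operation; how sdhash actually scores its Bloom filters is not described on the slide, and the function name is an assumption made for illustration.

```python
def hamming_distance(x: bytes, y: bytes) -> int:
    """Number of bit positions in which two equal-length bitmaps differ."""
    if len(x) != len(y):
        raise ValueError("bitmaps must have the same length")
    # XOR leaves a 1 bit wherever the inputs differ; count those bits.
    return sum(bin(a ^ b).count("1") for a, b in zip(x, y))
```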
8. Similarity Digesting : 3 Algorithms
• TLSH
– Two steps for digesting :
– Edit Distance: A diff based evaluation function
9. A Mathematical Model
• Summary of Three Similarity Digesting Schemes:
– Using a first model to describe a binary string with selected features:
• ssdeep model: a string is a sequence of chunks (split from the string).
• sdhash model: a string is a bag of 64-byte blocks (selected with entropy values).
• TLSH model: a string is a bag of triplets (selected from all 5-grams).
– Using a second model to map the selected features into a digest that preserves similarity to a certain degree.
• ssdeep model: a sequence of chunks is mapped into an 80-byte digest.
• sdhash model: a bag of blocks is mapped into one or more 256-byte Bloom filter bitmaps.
• TLSH model: a bag of triplets is mapped into a 32-byte container.
10. A Mathematical Model
• Three approaches for similarity evaluation:
• The 1st model plays a critical role in similarity comparison.
• Let's focus on discussing the various 1st models today.
• Based on a unified format.
• The 2nd model saves space but further reduces accuracy.
11. A Mathematical Model
• Unified format for 1st model:
– A string is described as a collection of tokens (aka, features) organized by a data structure:
• ssdeep: a sequence of chunks.
• sdhash: a bag of 64-byte blocks with high entropy values.
• TLSH: a bag of selected triplets.
– Two types of data structures: sequence, bag.
– Three types of tokens: chunks, blocks, triplets.
• Analogical comparison:
12. A Mathematical Model
• Four general types of tokens from binary strings:
– k-grams, where k is as small as 3, 4, …
– k-subsequences: any subsequence of length k. The triplet in TLSH is an example.
– Chunks: the whole string is split into non-overlapping chunks.
– Blocks: selected substrings of fixed length.
• Eight different models to describe a string for similarity: 2 data structures (sequence, bag) × 4 token types.
• Analogical thinking:
– we define different distances to describe a metric space.
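The four token types can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, `chunks` uses a fixed-size split for simplicity (real tools may place chunk boundaries differently), and `select` stands in for whatever selection function a block-based scheme uses.

```python
from itertools import combinations
from typing import Callable


def k_grams(data: bytes, k: int) -> list[bytes]:
    """Every contiguous substring of length k (overlapping)."""
    return [data[i:i + k] for i in range(len(data) - k + 1)]


def k_subsequences(window: bytes, k: int) -> list[bytes]:
    """Every length-k subsequence of a window, not necessarily contiguous."""
    return [bytes(c) for c in combinations(window, k)]


def chunks(data: bytes, size: int) -> list[bytes]:
    """Split the whole string into non-overlapping chunks.
    Assumption: fixed-size boundaries, chosen here only for simplicity."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def blocks(data: bytes, length: int,
           select: Callable[[bytes], bool]) -> list[bytes]:
    """Fixed-length substrings kept only if the (hypothetical) selection
    function accepts them."""
    return [b for b in k_grams(data, length) if select(b)]
```

For example, `k_subsequences(b"abcd", 3)` yields `abc`, `abd`, `acd` and `bcd`, whereas `k_grams(b"abcd", 3)` yields only the contiguous `abc` and `bcd`.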
13. Tool Evaluation
• Data Structure:
– Bag: a bag ignores the order of tokens. It is good at handling content swapping.
– Sequence: a sequence keeps tokens in order. This makes it weak at handling content swapping.
• Tokens:
– k-grams: Due to the small k (3, 4, 5, …), this fine granularity is good at handling fragmentation.
– k-subsequences: Due to the small k (3, 4, 5, …), this fine granularity is good at handling fragmentation.
– Chunks: This approach takes account of every byte at a coarse granularity. It should be adequate at handling containment and cross sharing.
– Blocks: Depending on the selection function, blocks do not take account of every byte, but they may represent a string more efficiently, which is good for generating similarity digests. Because the blocks have a fixed length, this approach is good at handling containment and cross sharing.
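The bag-versus-sequence point can be checked with a toy example. The sketch below splits two byte strings into fixed-size tokens (an assumption made for brevity) and compares them both ways; the helper names are hypothetical.

```python
from collections import Counter


def tokens(data: bytes, size: int) -> list[bytes]:
    """Split data into fixed-size tokens (simplifying assumption)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def same_as_sequence(a: list[bytes], b: list[bytes]) -> bool:
    """Sequence comparison: order matters."""
    return a == b


def same_as_bag(a: list[bytes], b: list[bytes]) -> bool:
    """Bag comparison: only token multiplicities matter, not order."""
    return Counter(a) == Counter(b)


original = tokens(b"AAAABBBBCCCC", 4)
swapped = tokens(b"CCCCBBBBAAAA", 4)  # same content, sections swapped
```

Here `same_as_bag(original, swapped)` holds while `same_as_sequence(original, swapped)` does not, which is the content-swapping behavior described above.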
14. Tool Evaluation
Tool | Model | Minor Changes | Containment | Cross sharing | Swap | Fragmentation
ssdeep | M1.3 | High | Medium | Medium | Medium | Low
sdhash | M2.4 | High | High | High | High | Low
TLSH | M2.2 | High | Low | Medium | High | High
sdhash + TLSH | Hybrid | High | High | High | High | High
16. Tool Evaluation
• Note: vulnerability is outside the scope of this evaluation, but it is worth mentioning.
• My co-worker Dr. Jon Oliver shows in one of his papers:
– Both ssdeep & sdhash are vulnerable to adversarial attacks.
– TLSH is not!
17. A Novel Fuzzy Hashing
• We would like to design a novel fuzzy hashing scheme based on model M2.4:
– a string is represented by a bag of blocks.
– Two steps: (1) Feature selection; (2) Digest generation.
18. A Novel Fuzzy Hashing
• Continuing:
19. A Novel Fuzzy Hashing
• This is TSFP
– Trend String Fingerprint
• Similarity measurement of TSFP:
– Given two TSFPs H and G, where H = h1h2…hn and G = g1g2…gm.
– Similarity is measured by the function:
• SIMH(H,G) = 200*|S ∩ T| / (|S| + |T|)
– Where S = {h1, h2, …, hn} and T = {g1, g2, …, gm}
– 0 ≤ SIMH(H,G) ≤ 100
• Similarity measurement of two strings s and t:
– SIM(s,t) = SIMH(TSFP(s), TSFP(t))
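The SIMH formula translates directly into code. The sketch below assumes a fingerprint is already available as a list of tokens (how TSFP selects and encodes them is not shown in this transcript), and the handling of two empty fingerprints is an assumption added to avoid division by zero.

```python
def simh(h: list[str], g: list[str]) -> float:
    """SIMH(H, G) = 200 * |S ∩ T| / (|S| + |T|), a score in [0, 100]."""
    s, t = set(h), set(g)
    if not s and not t:
        return 100.0  # assumption: two empty fingerprints count as identical
    return 200 * len(s & t) / (len(s) + len(t))
```

With H = (a, b, c) and G = (b, c, d), two of the tokens are shared, so SIMH = 200 * 2 / 6, roughly 66.7. Identical fingerprints score 100 and disjoint ones score 0.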
20. A Novel Fuzzy Hashing
• Why do we need TSFP?
• We need to solve the following problems:
1. Similarity search problem:
• B is a bag of binary strings {t1, t2, …, tn}. Given δ > 0 and a binary string s, find t ∈ {t1, t2, …, tn} such that SIM(s, t) ≥ δ.
2. Similarity-based clustering problem:
• B is a bag of binary strings {t1, t2, …, tn}. Partition B into groups based on their binary similarity.
• Why not {ssdeep, sdhash & TLSH}?
– With these tools, an obvious solution is to apply a brute-force algorithm.
– NOTE: Jon Oliver uses random forest to solve the search problem without brute force. I will try to prove its feasibility mathematically.
21. A Novel Fuzzy Hashing
• Similarity search problem:
• B is a bag of binary strings {t1, t2, …, tn}. Given δ > 0 and a binary string s, find t ∈ {t1, t2, …, tn} such that SIM(s, t) ≥ δ.
• How does a keyword-based search engine work?
– Extracting keywords from documents.
– Indexing keywords & documents.
– Searching via keywords.
• Solution:
– Given a string s, we get its fuzzy hash TSFP(s) = h1h2…hn.
– Let S = {h1, h2, …, hn}. We treat each token hj of s as a keyword, so we can build the index TSFP-Index(B) over the strings in B.
– The search problem above can then be solved in two steps.
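Treating fingerprint tokens as keywords suggests an inverted index, as in a keyword search engine. The following is a minimal sketch under that reading; the name `build_tsfp_index` and the dictionary layout are assumptions, since the slides do not specify how TSFP-Index(B) is stored.

```python
from collections import defaultdict


def build_tsfp_index(fingerprints: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each fingerprint token to the names of the strings containing it.

    `fingerprints` maps a string's name (e.g. "t1") to its token list.
    """
    index: defaultdict[str, set[str]] = defaultdict(set)
    for name, token_list in fingerprints.items():
        for token in token_list:
            index[token].add(name)
    return dict(index)
```

Looking up a token then returns every string in B that shares it, without scanning B.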
22. A Novel Fuzzy Hashing
• Similarity search problem:
• B is a bag of binary strings {t1, t2, …, tn}. Given δ > 0 and a binary string s, find t ∈ {t1, t2, …, tn} such that SIM(s, t) ≥ δ.
• STEP 1:
– Candidate selection
• Let TSFP(s) = h1h2…hn to create the bag of tokens S = {h1, h2, …, hn}.
• Use this bag of tokens to search the index TSFP-Index(B) so that we retrieve a list of candidates {s1, s2, …, sm} ⊂ {t1, t2, …, tn} ranked by number of common tokens.
• STEP 2:
– Brute-force method at a smaller scale
• For each t ∈ {s1, s2, …, sm}, if SIM(s, t) ≥ δ, t is what we are searching for.
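The two steps can be sketched end to end as follows. Everything here is illustrative: fingerprints are plain token lists, `max_candidates` is a hypothetical cutoff for the ranked candidate list, and the index is a simple token-to-names dictionary rather than the author's actual TSFP-Index.

```python
from collections import Counter, defaultdict


def simh(h: list[str], g: list[str]) -> float:
    """SIMH(H, G) = 200 * |S ∩ T| / (|S| + |T|)."""
    s, t = set(h), set(g)
    if not s and not t:
        return 100.0  # assumption: two empty fingerprints count as identical
    return 200 * len(s & t) / (len(s) + len(t))


def build_index(fingerprints: dict[str, list[str]]) -> dict[str, set[str]]:
    """Inverted index: token -> names of strings whose fingerprint has it."""
    index: defaultdict[str, set[str]] = defaultdict(set)
    for name, token_list in fingerprints.items():
        for token in token_list:
            index[token].add(name)
    return dict(index)


def search(query: list[str], fingerprints: dict[str, list[str]],
           index: dict[str, set[str]], delta: float,
           max_candidates: int = 10) -> list[str]:
    # Step 1: candidate selection, ranked by number of common tokens.
    hits: Counter[str] = Counter()
    for token in set(query):
        for name in index.get(token, ()):
            hits[name] += 1
    candidates = [name for name, _ in hits.most_common(max_candidates)]
    # Step 2: brute force, but only over the candidates.
    return [name for name in candidates
            if simh(query, fingerprints[name]) >= delta]


fingerprints = {
    "t1": ["a", "b", "c", "d"],
    "t2": ["a", "x", "y", "z"],
    "t3": ["p", "q"],
}
index = build_index(fingerprints)
matches = search(["a", "b", "c", "e"], fingerprints, index, delta=50)
```

In this toy bag, `t1` shares three tokens with the query (SIMH = 75) and `t2` shares one (SIMH = 25), so only `t1` passes the threshold of 50. `t3` shares no token and is never compared at all, which is where the saving over plain brute force comes from.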
23. Summary and Further Research
• My practice of academic research in industry:
24. Summary and Further Research
Framework of approximate matching, searching and clustering:
25. Q&A
• Thank you for your interest.
• Any questions?
• My Information:
– Email: liwei_ren@trendmicro.com
– Academic Page: https://pitt.academia.edu/LiweiRen