This document discusses automatically detecting package clones and inferring security vulnerabilities. It proposes using statistical classification techniques to identify cloned code between software packages. Features like common filenames, hashes, and fuzzy content would be used for classification. Packages found to share code could then be checked against known vulnerabilities to see if any vulnerabilities may affect the cloned code. The approach aims to scale the analysis to thousands of packages and help identify vulnerabilities in packages with cloned code that may not otherwise be tracked.
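The classification idea described above can be sketched minimally. This is a hedged illustration only: the helper names (`file_features`, `clone_score`), the toy package contents, and the fixed feature weighting are all assumptions; a real system would add fuzzy-content features and feed the similarity scores into a trained statistical classifier.

```python
import hashlib

def file_features(files):
    """Build feature sets from (name, content) pairs: file names plus content hashes."""
    names = {name for name, _ in files}
    hashes = {hashlib.sha256(content.encode()).hexdigest() for _, content in files}
    return names, hashes

def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def clone_score(pkg_a, pkg_b, w_names=0.5, w_hashes=0.5):
    """Weighted feature similarity; packages above a threshold become clone candidates."""
    names_a, hashes_a = file_features(pkg_a)
    names_b, hashes_b = file_features(pkg_b)
    return w_names * jaccard(names_a, names_b) + w_hashes * jaccard(hashes_a, hashes_b)

# Toy example: two packages that share one identical file.
pkg_a = [("inflate.c", "int inflate(...) {...}"), ("deflate.c", "int deflate(...) {...}")]
pkg_b = [("inflate.c", "int inflate(...) {...}"), ("trees.c", "static void build_tree(...)")]
score = clone_score(pkg_a, pkg_b)
```

Packages whose score exceeds a tuned threshold would then be cross-checked against vulnerability databases for the shared code.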
FooCodeChu - Services for Software Analysis, Malware Detection, and Vulnerabi... (Silvio Cesare)
Bugwise is a tool that detects bugs in binaries using decompilation and data flow analysis. It detects issues like use-after-free bugs, double free bugs, and unsafe calls to getenv(). It has scanned over 123,000 Debian binaries and reported 85 getenv()-related bugs across 47 packages. The measured probability of a binary containing such a bug is 0.00067, and the probability of a package containing at least one affected binary is 0.00255. Bugwise rests on strong theoretical underpinnings like data flow analysis and is extensible to more bug classes. The presenter aims to make more of their research public and to get more people using their tools via their website.
Introduction to source{d} Engine and source{d} Lookout source{d}
Join us for a presentation and demo of source{d} Engine and source{d} Lookout. Combining code retrieval, language-agnostic parsing, and git management tools behind familiar APIs, source{d} Engine simplifies code analysis. source{d} Lookout is a service for assisted code review that enables running custom code analyzers on GitHub pull requests.
This document discusses the role of programming in computational biology. It begins by describing different types of programming languages like imperative, object-oriented, and functional languages. It then discusses how programming can reduce time, money, effort and errors in computational biology applications. Some key applications of programming in computational biology mentioned are data mining, genome annotation, microarray analysis, phylogenetics, and next generation sequencing studies. The document also discusses popular bioinformatics programming languages like Perl and describes concepts in programming like objects, modules, and the common gateway interface.
The document discusses Perl programming and is divided into 5 modules: 1) Introduction to Perl, 2) Regular Expressions, 3) File Handling, 4) Connecting to Databases, and 5) Introduction to Perl Programming. It provides an overview of Perl variables, data types, operators, and basic programming structures. It also covers installing Perl, Perl modules, and interacting with files and databases.
From Tek-X: cross-platform interoperability with PHP, including a history lesson, a bit about each category of operating systems, and PHP-related gotchas.
“What should I work on next?” Code metrics can help you answer that question. They can single out sections of your code that are likely to contain bugs. They can help you get a toehold on a legacy system that’s poorly covered by tests.
At a previous JRubyConf, we talked about Thnad, a fictional programming language. Thnad served as a vehicle to explore the joy of building a compiler using JRuby, BiteScript, Parslet, and other tools. Now, Thnad is back with a second runtime: Rubinius. Come see the Rubinius environment through JRuby eyes. Together, we'll see how to grapple with multiple instruction sets and juggle contexts without going cross-eyed.
This document summarizes a talk given by Ian Dees on writing your own JVM compiler. The talk discusses three main reasons why someone may want to write their own compiler: a hardware background, a computer science background, or following a self-made path of learning computer science topics. It then previews the fictional Thnad programming language that will be used to demonstrate creating a compiler. The talk outlines the main stages of creating a compiler: parsing, transforming, and emitting bytecode. Various Ruby tools like Parslet and BiteScript that will be used are also introduced.
Description of the API concept for engineering and how it can be useful. Particularly how it should be used with respect to genomics data. Finally, an analogy of the API concept in synthetic biology and how evolution allows encapsulation.
This document discusses playfulness in the workplace and how it relates to the Ruby programming language. It begins by thanking the organizers and hosts for the event. It then highlights some Ruby libraries created by the host city's Ruby community. The document goes on to discuss reasons for using Ruby at work, including dealing with data formats, scripting other software, sharing code with coworkers, and deploying software to customers. It provides examples using Ruby libraries and tools like Parslet, FFI, ChunkyPNG, and WinGui to parse data formats, control the mouse through its API, read PNG files, and click on points on the screen. The overall message is that Ruby can be used playfully to get tasks done at work.
The document contains details about various UNIX and shell programming experiments conducted in a lab. It includes shell scripts to perform tasks like displaying lines between given line numbers in a file, deleting lines containing a specified word, checking file permissions, finding number of lines/words/characters in a file etc. It also includes C programs to implement UNIX commands like cat, ls, mv using system calls and to copy contents of one file to another. The document provides scripts/code along with descriptions and expected outputs for each experiment.
This document provides an overview of Python for bioinformatics. It discusses what Python is, why it is useful for bioinformatics, and how to get started with Python. It also covers Python IDEs like Eclipse and PyDev, code sharing with Git and GitHub, strings, regular expressions, and other Python concepts.
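As a small taste of the strings-and-regular-expressions material mentioned above, here is a hedged sketch that parses a UniProt-style FASTA header. The header string and the field names chosen for the named groups are illustrative assumptions, not taken from the slides.

```python
import re

# Parse a FASTA header of the form ">sp|ACCESSION|NAME description".
header = ">sp|P38398|BRCA1_HUMAN Breast cancer type 1 susceptibility protein"

pattern = re.compile(r">sp\|(?P<accession>\w+)\|(?P<name>\w+)\s+(?P<description>.+)")
match = pattern.match(header)

accession = match.group("accession")      # "P38398"
name = match.group("name")                # "BRCA1_HUMAN"
description = match.group("description")  # free-text protein description
```

Named groups keep the extraction readable, which matters when headers grow more fields.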
This document provides an overview and tutorial on streaming jobs in Hadoop, which allow processing of data using non-Java programs like Python scripts. It includes sample code and datasets to demonstrate joining and counting data from multiple files using mappers and reducers. Tips are provided on optimizing streaming jobs, such as padding fields for sorting, handling errors, and running jobs on Hadoop versus standalone.
This is part 1 of fuzzing, an introduction to the subject. This presentation covers some of the theory and thought process behind the subject, as well as an introduction to environment variable fuzzing and file format fuzzing.
This document provides a tutorial on using the argparse module in Python to parse command line arguments. It begins with simple examples of defining positional and optional arguments, and progresses to combining the two and adding more advanced features like different argument types and actions. The goal is to introduce argparse concepts gradually through examples, building up an understanding of how to define, accept, and handle different types of arguments from the command line.
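The progression the tutorial describes, positional arguments, then optional arguments, then types and actions, can be condensed into one small sketch. The argument names here are illustrative, not from the tutorial itself.

```python
import argparse

parser = argparse.ArgumentParser(description="Greet someone a number of times.")
parser.add_argument("name")                          # positional argument
parser.add_argument("-n", "--count", type=int, default=1,
                    help="how many times to greet")  # optional, with a type
parser.add_argument("-v", "--verbose", action="store_true",
                    help="enable extra output")      # flag-style action

# Normally parse_args() reads sys.argv; an explicit list makes the sketch testable.
args = parser.parse_args(["World", "-n", "2", "--verbose"])
for _ in range(args.count):
    print("Hello,", args.name)
```

Running the script as `greet.py World -n 2 --verbose` would produce the same parse.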
This document provides an overview of the Python programming language, including its history, key features, syntax examples, and common uses. It also discusses how Python can be used under Linux and some potential issues.
This document discusses Biopython, a Python package for biological data analysis. It provides concise summaries of key Biopython concepts:
1) Biopython is an object-oriented Python package that consists of modules for common biological data operations like working with sequences.
2) Key Biopython classes include Alphabet for sequence alphabets, Seq for representing sequences, SeqRecord for sequences with metadata, and SeqIO for reading/writing sequences to files.
3) Classes specify attributes (data) and methods (functions) that objects can have. For example, Seq objects have attributes like sequence and alphabet, and methods like translate() and complement().
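The attribute/method design point can be made concrete with a toy stand-in. To be clear, this is not the real Bio.Seq API: the complement map and codon table below are tiny illustrative subsets, and real Biopython Seq objects differ in detail.

```python
# A simplified stand-in for a Biopython-style Seq class (not the real Bio.Seq API).
class Seq:
    COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
    CODON_TABLE = {"ATG": "M", "TGG": "W", "TAA": "*"}  # toy subset of the genetic code

    def __init__(self, sequence, alphabet="DNA"):
        self.sequence = sequence    # attribute: the raw letters
        self.alphabet = alphabet    # attribute: what the letters mean

    def complement(self):
        """Return a new Seq with each base replaced by its complement."""
        return Seq("".join(self.COMPLEMENT[b] for b in self.sequence), self.alphabet)

    def translate(self):
        """Translate non-overlapping codons to amino acids via the toy table."""
        codons = [self.sequence[i:i + 3] for i in range(0, len(self.sequence), 3)]
        return "".join(self.CODON_TABLE.get(c, "X") for c in codons)

dna = Seq("ATGTGG")
comp = dna.complement().sequence  # "TACACC"
protein = dna.translate()         # "MW"
```

The real library layers SeqRecord (metadata) and SeqIO (file I/O) on top of this same object-oriented idea.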
EuroPython 2015 - Big Data with Python and Hadoop (Max Tepkeev)
Big Data - these two words are heard so often nowadays. But what exactly is Big Data? Can we, Pythonistas, enter the wonder world of Big Data? The answer is definitely “Yes”.
This talk is an introduction to big data processing using Apache Hadoop and Python. We’ll talk about Apache Hadoop, its concepts, infrastructure, and how one can use Python with it. We’ll compare the speed of Python jobs under different Python implementations, including CPython, PyPy and Jython, and also discuss what Python libraries are available out there to work with Apache Hadoop.
The document discusses using Python for MapReduce development with Hadoop Streaming. It explains that Hadoop Streaming allows any language to be used as long as mapper and reducer functions are defined that use standard input/output. Examples of Python mapper and reducer code are provided that count word frequencies in a text file using Hadoop Streaming.
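The word-count pattern described above can be sketched as follows. In a real streaming job the mapper and reducer are separate scripts reading sys.stdin, and Hadoop performs the sort between them; here the pipeline is simulated in-process so the sketch is self-contained.

```python
from itertools import groupby

def mapper(lines):
    """Map phase: emit (word, 1) for every word, as tab-separated lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    """Reduce phase: input arrives sorted by key; sum the counts per word."""
    pairs = (line.split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Simulate the streaming pipeline: mapper | sort | reducer
mapped = sorted(mapper(["the quick fox", "the lazy dog"]))
counts = dict(line.split("\t") for line in reducer(mapped))
```

The only contract Hadoop Streaming imposes is this line-oriented stdin/stdout protocol, which is why any language works.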
This presentation is about using the Boost.Python library to create Python extension modules in C++.
Presentation by Andriy Ohorodnyk (Lead Software Engineer, GlobalLogic, Lviv), delivered GlobalLogic C++ TechTalk in Lviv, September 18, 2014.
More details -
http://www.globallogic.com.ua/press-releases/lviv-cpp-techtalk-coverage
This document summarizes the basics of memory management in Python. It discusses key concepts like variables, objects, references, and reference counting. It explains how Python uses reference counting with generational garbage collection to manage memory and clean up unused objects. The document also covers potential issues with reference counting like cyclic references and threads, and how the global interpreter lock impacts multi-threading in Python.
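The mechanics described above, reference counts rising and falling with bindings, and cycles requiring the generational collector, can be observed directly with the standard library:

```python
import gc
import sys

a = []
baseline = sys.getrefcount(a)       # getrefcount itself holds one temporary reference
b = a                               # a second name bound to the same list object
after_alias = sys.getrefcount(a)    # one higher than the baseline
del b                               # dropping the alias decrements the count again
after_del = sys.getrefcount(a)

# A reference cycle keeps an object alive until the cycle collector runs.
class Node:
    pass

x = Node()
x.self_ref = x                      # the object references itself
del x                               # its refcount never reaches zero on its own...
collected = gc.collect()            # ...so the generational collector must reclaim it
```

This is also why CPython's reference counting needs the global interpreter lock: the counts are not updated atomically across threads.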
Language-agnostic data analysis workflows and reproducible research (Andrew Lowe)
This was a talk that I gave at CERN at the Inter-experimental Machine Learning (IML) Working Group Meeting in April 2017 about language-agnostic (or polyglot) analysis workflows. I show how it is possible to work in multiple languages and switch between them without leaving the workflow you started. Additionally, I demonstrate how an entire workflow can be encapsulated in a markdown file that is rendered to a publishable paper with cross-references and a bibliography (with a raw LaTeX file produced as a by-product) in a simple process, making the whole analysis workflow reproducible. For experimental particle physics, ROOT is the ubiquitous data analysis tool, and has been for the last 20 years, so I also talk about how to exchange data to and from ROOT.
The document provides instructions for preparing for a bioinformatics course on December 17th, 2014. It instructs students to install Perl and Java software by specified dates. The outline for the course covers topics like scripting with Perl and Python, working with databases and genome browsers, and using artificial intelligence tools like WEKA for classification and clustering.
This document provides an introduction and overview of Linux shell scripting. It begins by explaining key concepts like the kernel, shell, processes, redirection and pipes. It then covers variables, writing and running scripts, quotes, arithmetic, arguments, exit status, wildcards, and basic programming commands like echo, if/test, loops, case. The document concludes with more advanced commands like functions, I/O redirection, traps and examples.
The document outlines a proposal for a tool called CnP to detect and prevent errors from copy-and-pasted code during software development. It describes how copy-pasted code can lead to inconsistencies when modified. It then details a proof-of-concept tool called CReN that tracks copy-pasted code, automatically renames identifiers consistently across clones when modified, and demonstrates how it could catch errors in examples from literature. It proposes evaluating CReN and exploring using it to detect other types of inconsistencies from copy-pasted code.
Detecting Clones across Microsoft .NET Programming Languages (WCRE 2012) (imanmahsa)
This presentation was given at the Working Conference on Reverse Engineering (WCRE 2012).
The paper title is: "Detecting Clones across Microsoft .NET Programming Languages"
Abstract:
The Microsoft .NET framework and its language family focus on multi-language development to support interoperability across several programming languages. The framework allows for the development of similar applications in different languages through the reuse of core libraries. As a result of such multi-language development, the identification and traceability of similar code fragments (clones) becomes a key challenge. In this paper, we present a clone detection approach for the .NET language family. The approach is based on the Common Intermediate Language, which is generated by the .NET compiler for the different languages within the .NET framework. In order to achieve an acceptable recall while maintaining the precision of our detection approach, we define a set of filtering processes to reduce noise in the raw data. We show that these filters are essential for Intermediate Language-based clone detection, without significantly affecting the precision of the detection approach. Finally, we study the quantitative and qualitative performance aspects of our clone detection approach. We evaluate the number of reported candidate clone-pairs, as well as the precision and recall (using manual validation) for several open source cross-language systems, to show the effectiveness of our proposed approach.
Renaming parts of identifiers inconsistently within code clones can introduce errors. The CReN and LexId tools help address this issue by tracking code clones and consistently renaming all instances of an identifier when one instance is edited. A user study found LexId helped programmers rename identifiers more quickly and consistently compared to performing renames manually without tool support.
Cloning means the use of copy-paste as a method in developing software artefacts. This practice has several problems, such as unnecessary growth of these artefacts, and thereby increased comprehension and change effort, as well as potential inconsistencies. The automatic detection of clones has been a research topic for several years now, and huge progress has been made in terms of precision and recall. This led to a series of empirical analyses we have performed on the effects and the amount of cloning in code, models, and requirements. We continue to investigate the effects of cloning and work on extending clone detection to functionally similar code. This talk gives insights into how clone detection works and the empirical results we have gathered.
"Clone detection in Python": Slides presented at EuroPython 2012
Clone Detection in Python highlights the topic of code duplication detection using Machine Learning techniques.
Some examples on Python code duplications and C-Python implementation duplications are reported as well.
Automated Detection of Software Bugs and Vulnerabilities in Linux (Silvio Cesare)
This document summarizes the key points from a technical presentation about detecting software defects and vulnerabilities. It identifies that the presenter is a PhD student researching malware classification and bug detection. Their approach involves combining decompilation with static analysis to find bugs in Linux binaries. They have found previously unknown bugs and vulnerabilities. Their ongoing work aims to automatically identify embedded third-party packages within software distributions in order to detect shared vulnerabilities.
This document discusses challenges in deploying machine learning models into production and potential solutions. It covers:
1. Issues with reproducibility due to dependencies and environment configurations when models are trained and deployed.
2. Problems with serializing models and transferring them between different versions of libraries and software stacks.
3. How containers can help address these issues by encapsulating the full runtime environment and dependencies of a model.
4. Managing both models and Docker containers is still required when using this approach.
This document provides an introduction to software exploitation on Linux 32-bit systems. It covers common exploitation techniques like buffer overflows, format strings, and ret2libc attacks. It discusses the Linux memory layout and stack structure. It explains buffer overflows on the stack and heap, and how to leverage them to alter control flow and execute arbitrary code. It also covers the format string vulnerability and how to leak information or write to arbitrary memory locations. Tools mentioned include GDB, exploit-exercises, and Python. Overall it serves as a crash course on the basic techniques and concepts for Linux exploitation.
CSCI 132 Practical Unix and Programming .docx (mydrynan)
CSCI 132: Practical Unix and Programming
Adjunct: Trami Dang
Assignment 4, Fall 2018
Assignment 41
This set of exercises will strengthen your ability to write relatively simple shell scripts
using various filters. As always, your goals should be clarity, efficiency, and simplicity. It
has two parts.
1. The background context that was provided in the previous assignment is repeated here
for your convenience. A DNA string is a sequence of the letters a, c, g, and t in any
order, whose length is a multiple of three2. For example, aacgtttgtaaccagaactgt
is a DNA string of length 21. Each sequence of three consecutive letters is called a codon.
For example, in the preceding string, the codons are aac, gtt, tgt, aac, cag, aac,
and tgt.
Your task is to write a script named codonhistogram that expects a file name on the
command line. This file is supposed to be a dna textfile, which means that it contains
only a DNA string with no newline characters or white space characters of any kind; it is
a sequence of the letters a, c, g, and t of length 3n for some n. The script must count the
number of occurrences of every codon in the file, assuming the first codon starts at
position 13, and it must output the number of times each codon occurs in the file, sorted
in order of decreasing frequency. For example, if dnafile is a file containing the dna
string aacgtttgtaaccagaactgt, then the command
codonhistogram dnafile
should produce the following output:
3 aac
2 tgt
1 cag
1 gtt
because there are 3 aac codons, 2 tgt, 1 cag, and 1 gtt. Notice that frequency comes
first, then the codon name.
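A minimal sketch of codonhistogram using standard filters (one possible approach, not an official solution; it assumes a well-formed dna file as described above):

```shell
#!/bin/sh
# codonhistogram: print codon counts in decreasing frequency order.
# Assumes $1 names a file containing only a/c/g/t, length a multiple of 3.
codonhistogram() {
  fold -w3 "$1" |            # split the DNA string into 3-letter codons
    sort |                   # group identical codons together
    uniq -c |                # prefix each codon with its count
    sort -k1,1nr -k2,2 |     # decreasing count; ties in codon order
    awk '{ print $1, $2 }'   # normalize spacing: "count codon"
}

# demo with the sample string from the assignment
printf 'aacgtttgtaaccagaactgt' > dnafile
codonhistogram dnafile
```

On the sample string this prints 3 aac, 2 tgt, 1 cag, 1 gtt, matching the expected output above.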
[1] This is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/.
[2] This is really just a simplification to make the assignment easier. In reality, it is not necessarily a multiple of 3.
[3] Tho.
This document provides instructions for an assignment on MapReduce with Hadoop. It includes setting up the environment, writing a WordCount program as a first example, and estimating Euler's constant using a Monte Carlo method as a second example. The document asks multiple questions at each step to help the reader understand and complete the tasks. It covers downloading and configuring Hadoop, writing Map and Reduce functions, compiling and running jobs on a Hadoop cluster, and analyzing the results.
Plunging Into Perl While Avoiding the Deep End (mostly)Roy Zimmer
This document provides an introduction to the Perl programming language. It discusses Perl nomenclature, attributes, variables, scopes, file input/output, string manipulation, regular expressions, and the DBI module for connecting to databases from Perl scripts. Examples are provided for common Perl programming tasks like reading files, splitting strings, formatting output, and executing SQL queries.
The document provides an interview questions and answers guide for C programming language. It includes questions on topics such as the definition of C language, differences between functions like printf and sprintf, static variables, unions, linked lists, storage classes in C, and hashing. For each question, it provides multiple detailed answers explaining concepts in C programming such as memory allocation, strings, pointers, macros and more.
The document discusses the different states that a package's contents can be stored in, including as a source, bundle, binary, or installed in an R library or online repository. It also lists several functions that can be used to move a package between these states, such as install.packages(), devtools::install(), and library(). The bottom portion provides a cheat sheet on common parts of an R package like the DESCRIPTION file, namespaces, documentation, data, testing, and more.
Malware Classification Using Structured Control FlowSilvio Cesare
This document summarizes a system for classifying malware using control flow graph signatures. It discusses:
1) Using entropy analysis to identify and unpack packed malware through application-level emulation.
2) Generating control flow graph signatures using a "structuring" technique and calculating similarities to signatures in a malware database.
3) Evaluating the system on real malware, showing high similarities between variants and low similarities between unrelated programs.
Reaction StatisticsBackgroundWhen collecting experimental data f.pdffashionbigchennai
Reaction Statistics
Background
When collecting experimental data from chemical reactions, it’s often useful to generate
statistics based on the data. One experimental measure is the reaction rate in moles per second,
representing the amount of product formed per unit time. If we have a set of these reaction rates
collected in a data file, we can calculate summary statistical information, such as the minimum
and maximum values, the arithmetic mean, variance, and standard deviation.
Finding the minimum and maximum are straightforward: we scan through all the data, and keep
track of the smallest and largest values encountered. The arithmetic mean (or average) is defined
as:
m = (x1 + x2 + ... + xn) / n
where n is the number of reaction rates, and xi represents one experimental reaction rate. Once
you have the arithmetic mean, the variance can be calculated as the mean of the squares of the
deviations from the mean:
v = ((x1 - m)^2 + (x2 - m)^2 + ... + (xn - m)^2) / n
where n is the number of reaction rates, xi represents one experimental reaction rate, and m is the
arithmetic mean of the reaction rates. Once you have the variance, you can calculate the
standard deviation as:
s = sqrt(v)
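A quick worked example (numbers chosen only for illustration): for the three reaction rates 1, 2, and 3,

```latex
m = \frac{1+2+3}{3} = 2, \qquad
v = \frac{(1-2)^2 + (2-2)^2 + (3-2)^2}{3} = \frac{2}{3} \approx 0.667, \qquad
s = \sqrt{2/3} \approx 0.816
```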
Assignment
You will develop a C program that reads data from an input text file containing chemical
reaction rates (in moles per second), and computes the minimum, maximum, arithmetic mean,
variance, and standard deviation for that set of data. Your instructor will provide input text files,
which will each contain a series of double values, each on a line of its own within the file. Your
program will read one of these input files into an array of doubles (i.e., it will populate the array
using the data values from the file). Your program will then calculate statistics using that array of
doubles, and will write the results out to a separate output text file.
The goals of this assignment are to provide you with experience reading and writing text data
files, provide you with experience passing an array into a function, and give you more
experience organizing your program into separate C functions.
When defining your C functions, you may either:
Define the functions before they are used by any other functions, OR
Place function prototypes near the top of your code (after all #include directives), and then define
the functions in any order.
Part 1 – Opening Files and Reading Data
Create a new Visual Studio Win32 Console project named reactionstats. Create a new C source
file named project4.c within that project. At the top of the source file, #define
_CRT_SECURE_NO_WARNINGS, and then include stdio.h, math.h, stdlib.h, stdbool.h, and
float.h.
Inside your main function, define the following:
A one-dimensional array of 600 doubles. They do not need to be initialized to anything at this
stage.
An integer variable to hold the number of elements in the array, initialized using the approach
demonstrated in class, using sizeof.
A FILE pointer variable, which will refer to the input data text file.
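The assignment requires a C program, but the expected statistics can be sanity-checked with a short awk sketch (the file name rates.txt and its values are illustrative):

```shell
#!/bin/sh
# reactionstats sanity check: min, max, mean, variance, std dev of
# one-value-per-line data. The variance uses the identity
# mean(x^2) - mean(x)^2, which equals the mean squared deviation
# (fine for a sketch, though less numerically stable than two passes).
printf '1.0\n2.0\n3.0\n' > rates.txt   # illustrative input file

stats=$(awk '
  NR == 1 { min = $1; max = $1 }
  {
    if ($1 < min) min = $1
    if ($1 > max) max = $1
    sum += $1; sumsq += $1 * $1
  }
  END {
    m = sum / NR
    v = sumsq / NR - m * m
    printf "min=%g max=%g mean=%g var=%g sd=%g", min, max, m, v, sqrt(v)
  }' rates.txt)
echo "$stats"
```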
BioMake is a language for specifying build networks of interdependent computational tasks. It allows defining targets with logical patterns that represent tasks. Targets have dependencies on other targets and are built by running actions. This allows automating sequencing analysis pipelines by specifying the execution of tasks like formatting databases and running BLAST alignments in a declarative way.
The document provides an introduction to exploit development. It discusses preparing a virtual lab with tools like Immunity Debugger, Mona.py, pvefindaddr.py and Metasploit. It covers basic buffer overflow exploitation techniques like overwriting EIP and return-oriented programming (ROP). The document demonstrates a basic stack-based buffer overflow exploit against the FreeFloat FTP server as a tutorial, covering steps like generating a cyclic pattern, finding the offset, and using mona to find a JMP ESP instruction to redirect execution. It also discusses using msfpayload to generate Windows bind shellcode and msfencode to avoid bad characters before testing the proof-of-concept exploit.
The Ring programming language version 1.7 book - Part 43 of 196Mahmoud Samir Fayed
The document discusses using the Ring language to develop web applications through a CGI library. It describes how to configure the Apache web server to support Ring CGI scripts by enabling certain options and handlers. It also provides an overview of the CGI library and how to define commands that can be executed through CGI scripts to interface with the web server and return output.
This document discusses how to create a private version of CRAN packages behind a company firewall for internal use. It describes using the miniCRAN R package to selectively download and create a local mirror of specific CRAN packages and their dependencies. This allows controlling the packages available internally. It also discusses using MRAN and the RRT package to reproduce analyses and ensure scripts work across package and R version updates.
The document provides definitions for various computer science and programming terms related to C++ including data types, operators, statements, functions, classes, inheritance, and more. It defines terms such as #include, abstract class, aggregate, alias, allocation, argument, array, assignment, base class, bit, bool, break, byte, call by reference, call by value, case, char, cin, class, class layout, class member, class template, comments, compiler, const, constructor, continue, copy constructor, cout, data structure, debugger, declaration, default argument, definition, delete operator, derived class, destructor, do, double, dynamic memory allocation, else, endl, explicit, expression, expression statement
COMP 2103X1 Assignment 2Due Thursday, January 26 by 700 PM.docxdonnajames55
COMP 2103X1 Assignment 2
Due Thursday, January 26 by 7:00 PM
General information about assignments (important!):
http://cs.acadiau.ca/~jdiamond/comp2103/assignments/General-info.html
Information on passing in assignments:
http://cs.acadiau.ca/~jdiamond/comp2103/assignments/Pass-in-info.html
Information on coding style:
http://cs.acadiau.ca/~jdiamond/comp2103/assignments/C-coding-style-notes
[1] A filter program is a program which reads its input from “standard input” (“stdin”) and writes
its output to “standard output” (“stdout”). Filter programs are useful because they make it easy
to combine the functions they provide to solve more complex problems using the standard shell
facilities. Filter programs are also nice to write, because the programmer doesn’t have to worry
about writing code to open and close files, nor does the programmer have to worry about dealing
with related error conditions. In some respects, filter programs are truly “win-win”.
Write a filter program which uses getchar() to read in characters from stdin, continuing until
end of file (read the man page and/or textbook to see the details on getchar(), or, heaven forbid,
review the class slides). Your program must count the number of occurrences of each character in
the input. After having read all of the input, it outputs a table similar to the one below which, for
each character seen at least once, lists the total number of times that character was seen as well as
its relative frequency (expressed as a percentage). Note that the characters \n, \r, \t, \0, \a, \b,
\f, and \v (see man ascii) must be displayed with the appropriate “escape sequence”. Ordinary
printable characters must be output as themselves. Non-printable characters (see man isprint)
must be printed with their three-digit octal code (see man printf).
You can get input into a filter program (a2p1 in this case) in three ways:
(a) “pipe” data from another program into it, like
$ echo blah blah | a2p1
(b) “redirect” the contents of a file into the program, like
$ a2p1 < some-file
(c) type at the keyboard, and (eventually) type ^D (control-d) at the beginning of a line to
signify end of file.
Your output must look like the following, for this sample case:
$ echo ^Aboo | a2p1
Char Count Frequency
--------------------------
001 1 20.00%
\n 1 20.00%
b 1 20.00%
o 2 40.00%
Note: in the above examples, and from now on in this course’s assignments, text in red is text that
the human types, and a “$” at the beginning of a line like that represents the shell prompt.
1
http://cs.acadiau.ca/~jdiamond/comp2103/assignments/General-info.html
http://cs.acadiau.ca/~jdiamond/comp2103/assignments/Pass-in-info.html
http://cs.acadiau.ca/~jdiamond/comp2103/assignments/C-coding-style-notes
Note that I entered a ^A (control-a, not the circumflex character followed by the capital A) by
typing ^V^A. The ^V tells your shell that you want it to interpret the next character literally, rather
than to use any special meani.
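The counting core of such a filter can be sketched with standard tools (the real assignment must be a C program using getchar(); this sketch silently drops newline characters and skips the escape-sequence/octal formatting rules):

```shell
#!/bin/sh
# Counting core of the a2p1 filter, sketched with standard filters.
# grep -o . emits one character per line but drops newlines, so the
# \n row from the sample output above will not appear here.
counts=$(printf 'boo' | grep -o . | sort | uniq -c |
  awk '{ printf "%s %d\n", $2, $1 }')
echo "$counts"
```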
Here are the key points covered in the essay:
- Exercise 15.1 involves creating a custom backup job in Windows 7 to back up selected files and folders to a hard disk partition.
- The C: system drive does not appear as a backup destination because you cannot back up a drive to itself.
- A warning appears when selecting the X: drive for backup because although it appears as a separate drive letter, it is physically located on the same hard disk as the system drive C:. Backing up to this location would not provide the benefits of an off-site backup if the hard disk failed.
- When selecting folders and files for backup, you must ensure the selected items are not part of an operating system
This PPT file helps IT freshers with basic interview questions, which will boost their confidence before going to an interview. For more details and interview questions, please log in to www.rekruitin.com and click on Job Seeker Tools. Also register on the site and get employed.
By ReKruiTIn.com
This document provides an overview of advanced Perl concepts including:
1. Finer points of looping, including the continue block and multiple loop variables. Subroutine prototypes allow specifying the number and types of arguments.
2. Working with files using filehandles for reading, writing, and appending. Functions like open, close, rename, unlink for file operations.
3. Working with directories using opendir, readdir, rewinddir and related functions.
4. EVAL for evaluating strings and blocks of code. Packages for namespacing and modules. BEGIN and END blocks act as constructors and destructors.
Similar to Clonewise - Automatically Detecting Package Clones and Inferring Security Vulnerabilities (20)
A BEGINNER’S JOURNEY INTO THE WORLD OF HARDWARE HACKINGSilvio Cesare
This document provides an introduction to hardware hacking for software engineers. It outlines several beginner hardware hacking projects, including interfacing with UART to gain serial console access on devices, ripping firmware from chips to analyze code and find passwords/strings, manipulating IR alarm systems by learning codes and repurposing remotes, and building an Arduino-controlled backyard irrigation system networked to a PC. The document explains how to identify important chips, interfaces, and voltages, and techniques for reading serial flash and desoldering chips to extract firmware. It presents hardware hacking as an accessible new hobby that can build skills in electronics and low-level programming.
A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERSSilvio Cesare
This document provides an overview of various academic techniques that can be useful for security researchers, including mathematical objects, comparing objects, similarity searching, classification, clustering, and program analysis. It discusses representing problems as different objects like strings, vectors, and graphs and using techniques like n-grams, vector distances, and graph decomposition. Case studies of projects that applied these techniques are also summarized.
Simseer.com - Malware Similarity and Clustering Made EasySilvio Cesare
Simseer.com provides free web services to analyze malware using program structure as a signature. The services include Simseer, which compares malware samples and visualizes relationships; SimseerCluster, which groups samples into clusters identifying potential families; and SimseerSearch, which finds similar samples to a query. The services leverage control flow graph signatures and machine learning to provide robust malware analysis without traditional string signatures.
Simseer and Bugwise - Web Services for Binary-level Software Similarity and D...Silvio Cesare
This document describes two web services called Simseer and Bugwise for software defect detection and similarity analysis. Simseer performs malware variant and plagiarism detection by generating control flow signatures and comparing similarities. Bugwise detects bugs like double frees through decompilation and data flow analysis. The services are implemented through a PHP frontend and C++ backend called Malwise that performs analysis through plugins. Initial results found the web services had minimal overhead compared to command line usage.
Detecting Bugs in Binaries Using Decompilation and Data Flow AnalysisSilvio Cesare
The document discusses using static analysis techniques like data flow analysis and decompilation to detect bugs in binary files. It describes decompiling binaries into an intermediate representation and then performing intraprocedural and interprocedural data flow analysis on the representation. This allows detecting bugs involving unsafe functions like getenv() and memory issues like use-after-free and double free errors. The approach involves lifting x86 into a RISC-like intermediate language, inferring stack pointers, and decompiling locals and arguments to perform analysis and optimization.
Wire - A Formal Intermediate Language for Binary AnalysisSilvio Cesare
The document describes an intermediate representation (IR) for representing low-level machine code instructions. It defines the syntax and semantics for the IR, including instruction operations, register operands, memory addresses, and condition codes. It also provides examples of x86 assembly instructions translated to the defined IR format.
Silvio Cesare presented an effective approach for flowgraph-based malware variant detection. The approach transforms control flow graphs into strings that are then compared using an assignment problem dissimilarity metric for sets of strings. Evaluation on Roron malware variants showed the approach was more effective at detecting variants than previous exact matching approaches. The system was also shown to have low false positive rates and efficient processing times for malware detection. The techniques developed have also been applied to other software analysis tools for similarity detection, bug finding, and more.
Simseer - A Software Similarity Web ServiceSilvio Cesare
This document summarizes an overview talk on software similarity. It introduces the speaker and their research focus on malware detection and vulnerability detection. It then provides an overview of the core topics of software similarity, how it is approached in academia, and introduces a new web service that identifies software similarity. It discusses how software similarity can be used for malware detection, software theft detection, plagiarism detection, and software clone detection. It also provides taxonomy of different program features that can be analyzed and examples of how features like ASTs and control flow can be represented. Finally, it introduces resources like a wiki, book, and new web service called Simseer for software similarity.
Faster, More Effective Flowgraph-based Malware ClassificationSilvio Cesare
Silvio Cesare is a PhD candidate at Deakin University researching malware detection and automated vulnerability discovery. His current work extends his Masters research on fast automated unpacking and classification of malware. He presented this work last year at Ruxcon 2010. His system uses control flow graphs and q-grams of decompiled code as "birthmarks" to detect unknown malware samples that are suspiciously similar to known malware, reducing the need for signatures. He evaluated the system on 10,000 malware samples with only 10 false positives. The system provides improved effectiveness and efficiency over his previous work in 2010.
Malware Variant Detection Using Similarity Search over Sets of Control Flow G...Silvio Cesare
The document discusses techniques for detecting software similarity, including control flow graphs, birthmarks, and algorithms like q-grams and optimal distance. It also evaluates these techniques on malware samples, showing detection rates and false positives for different algorithms and similarity thresholds. Processing times for analyzing benign and malicious files are presented.
Simple Bugs and Vulnerabilities in Linux DistributionsSilvio Cesare
This talk discusses automated techniques for finding bugs and vulnerabilities in Linux software packages. The techniques were able to find:
- 27+ bug reports submitted to Debian after scanning for memset function bugs
- 741 programs that crashed when passed a null argv[0] parameter in Debian (27% crash rate)
- 3 segmentation faults when fuzzing most SUID/SGID programs in Debian
- 16 vulnerabilities found in Debian packages and 15 in Fedora packages after scanning for signatures of embedded vulnerable libraries
Linux distributions are using the results to improve security testing and patch vulnerabilities.
Fast Automated Unpacking and Classification of MalwareSilvio Cesare
This document summarizes Silvio Cesare's research presentation on fast automated unpacking and classification of malware. The research aims to efficiently and effectively detect and classify malware using static analysis. It involves developing an automated unpacker using emulation and entropy analysis to unpack malware. It then extracts control flow graphs from unpacked malware and uses graph matching techniques to classify malware and identify variants by similarity. The techniques were evaluated on real malware samples and shown to accurately unpack and classify malware with low processing times suitable for real-time systems.
A Fast Flowgraph Based Classification System for Packed and Polymorphic Malwa...Silvio Cesare
We propose an algorithm to identify malware variants by determining program similarity through estimating isomorphic control flow graphs. We implement this approach in a prototype system that demonstrates its ability to detect real malware variants with low false positives and logarithmic performance scalability, making it suitable for endhost adoption. Control flow graphs provide a more invariant characteristic than traditional static features like byte sequences for identifying polymorphic malware variants. Our system generates signatures for control flow graphs to efficiently compare programs and classify unknown samples.
The document discusses using emulation for security applications like reverse engineering Cisco IOS's heap management, tracing program execution to evaluate binaries, implementing dynamic taint analysis, and developing automated unpacking tools. It describes how emulation allows intercepting program execution at the instruction level and adding instrumentation to perform these dynamic analyses, avoiding detection by anti-debugging techniques. Specific tools mentioned include Dynamips, TTAnalyze, Argos, Pandora's Bochs, and the author's own unpacker and emulator.
The document provides an overview of a presentation on kernel auditing research, including:
- Three parts to the presentation covering kernel auditing research, exploitable bugs found, and kernel exploitation.
- Audits were conducted on several open source kernels, finding over 100 vulnerabilities across them.
- A sample of exploitable bugs is then presented from the audited kernels to provide evidence that kernels are not bug-free and vulnerabilities can be relatively simple to find and exploit.
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!SOFTTECHHUB
As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.
Pushing the limits of ePRTC: 100ns holdover for 100 daysAdtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
20 Comprehensive Checklist of Designing and Developing a WebsitePixlogix Infotech
Dive into the world of Website Designing and Developing with Pixlogix! Looking to create a stunning online presence? Look no further! Our comprehensive checklist covers everything you need to know to craft a website that stands out. From user-friendly design to seamless functionality, we've got you covered. Don't miss out on this invaluable resource! Check out our checklist now at Pixlogix and start your journey towards a captivating online presence today.
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Securing your Kubernetes cluster: a step-by-step guide to success!KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionAggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
What do a Lego brick and the XZ backdoor have in common?Speck&Tech
ABSTRACT: At first glance, a Lego brick and the XZ backdoor might appear to have in common only that both are building blocks, or dependencies, of creative and software projects. In reality, a Lego brick and the XZ backdoor case share much more than that.
Join the presentation to immerse yourself in a story of interoperability, standards, and open formats, and then discuss the important role contributors play in a sustainable open source community.
BIO: An advocate of free software and of standard, open formats. She has been an active member of the Fedora and openSUSE projects and co-founded the LibreItalia Association, where she was involved in several LibreOffice-related events, migrations, and training courses. She previously worked on LibreOffice migrations and training for several public administrations and private companies. Since January 2020 she has worked at SUSE as a Software Release Engineer for Uyuni and SUSE Manager, and when not following her passion for computers and for Geeko she cultivates her curiosity about astronomy (the source of her nickname, deneb_alpha).
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slackshyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
A tale of scale & speed: How the US Navy is enabling software delivery from l...sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...Zilliz
Join us to introduce Milvus Lite, a vector database that can run on notebooks and laptops, share the same API with Milvus, and integrate with every popular GenAI framework. This webinar is perfect for developers seeking easy-to-use, well-integrated vector databases for their GenAI apps.
Clonewise - Automatically Detecting Package Clones and Inferring Security Vulnerabilities
1. Automatically Detecting Package Clones and Inferring Security
Vulnerabilities
Silvio Cesare
Deakin University
<silvio.cesare@gmail.com>
2. PhD Candidate at Deakin University, AU.
Research interests:
Malware detection
Automated vulnerability detection
Book author
Software similarity and classification, Springer.
http://www.springer.com/computer/security+and+cryptology/
book/978-1-4471-2908-0
Spoke at Black Hat in 2003 on Open source kernel auditing.
http://www.FooCodeChu.com
3. Developers may “embed” or “clone” code
from 3rd party sources
Static linking
Maintaining an internal copy of a library.
Forking a library.
Lots of examples
XML parsing – libxml in various programs
Image processing – libpng in Firefox
Networking – OpenSSL in Cisco IOS
Compression – zlib everywhere
4. Linux policies generally disallow embedded code copies (image below).
It still happens.
Multiple versions of packages now exist.
Each copy needs patches from upstream.
Copies become insecure over time from
unapplied patches.
5. Scan binaries for version strings.
Done in 2005 on mass scale for zlib in Debian
Linux.
bzlib_private.h:#define BZ_VERSION "1.0.5, 10-Dec-2007"
tiffvers.h:#define TIFFLIB_VERSION_STR "LIBTIFF, Version 3.8.2\nCopyright (c) 1988-1996 Sam Leffler\nCopyright (c) 1991-1996 Silicon Graphics, Inc."
png.h:#define PNG_HEADER_VERSION_STRING " libpng version 1.2.27 - April 29, 2008\n"
6. 10,000 – 20,000 packages in Linux distros.
Debian tracks over 420 libraries (see below).
Most distros don't track at all.
How many vulnerabilities are there?
How to automate?
7. 1. Problem definition and our approach
2. Statistical classification
3. Scaling the analysis
4. Inferring security vulnerabilities
5. Implementation and evaluation
6. Discussion
7. Related work
8. Future work and conclusion
Remember to complete the Black Hat speaker feedback survey.
8. Find package code re-use in sources.
Infer vulns caused by out-of-date code.
[Figure: libpng source embedded inside the Firefox source tree]
9. Consider code re-use detection a binary
classification problem:
Do packages A and B share code? Yes or no?
Features for classification:
Common filenames
Hashes
Fuzzy content
Ask the question: does package A share code with package X, for all X in the repository?
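The feature vector fed to the classifier can be sketched as follows. This is a minimal illustration in Python of the feature ideas listed above (common filenames, identical hashes), not Clonewise's actual C++ implementation; the input representation of a package as a filename-to-hash dict is an assumption made for the example.

```python
def feature_vector(pkg_a, pkg_b):
    """pkg_a, pkg_b: dicts mapping filename -> content hash.
    Returns a small feature vector for the 'do A and B share code?'
    binary classification problem."""
    common_names = set(pkg_a) & set(pkg_b)
    identical_hashes = sum(1 for f in common_names
                           if pkg_a[f] == pkg_b[f])
    return [len(common_names), identical_hashes]

# Toy packages: two shared filenames, one with identical content.
a = {"png.h": "h1", "Makefile": "h2", "png.c": "h3"}
b = {"png.h": "h1", "Makefile": "h9"}
print(feature_vector(a, b))  # [2, 1]
```

A real run would add the fuzzy-content features described later and hand the vectors to a statistical classifier.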
10.
11. Classification assigns classes to objects.
Supervised learning.
Unsupervised learning.
[Figure: an unknown object assigned to Class 1 (Spam) or Class 2 (Not Spam)]
14. Edit distance between filenames.
Similarity >= 85%.
similarity(s, t) = 1 − edit_dist(s, t) / max(len(s), len(t))
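The filename similarity above can be sketched in Python; a minimal illustration of the formula, not Clonewise's implementation:

```python
def edit_dist(s, t):
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (cs != ct)))    # substitution
        prev = cur
    return prev[-1]

def similarity(s, t):
    """1 minus the normalised edit distance, as on the slide."""
    return 1 - edit_dist(s, t) / max(len(s), len(t))

# Filenames are considered a match when similarity >= 0.85.
print(similarity("png43.c", "png44.c"))  # 6/7 ≈ 0.857, above the threshold
```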
15. Use fuzzy hashing (ssdeep).
Number of identical hashes.
Number of > 80% similar hashes.
Number of > 0% similar hashes.
ssdeep,1.0--blocksize:hash:hash,filename
96:KQhaGCVZGhr83h3bc0ok3892m12wzgnH5w2pw+sxNEI58:FIVkH4x73h39LH+2w+sxaD,"config.h"
96:MD9fHjsEuddrg31904l8bgx5ROg2MQZHZqpAlycowOsexbHDbk:MJwz/l2PqGqqbr2yk6pVgrwPV,"INSTALL"
96:EQOJvOl4ab3hhiNFXc4wwcweomr0cNJDBoqXjmAHKX8dEt001nfEhVIuX0dDcs:3mzpAsZpprbshfu3oujjdEN
dp21,"README"
16. README filenames less important.
libpng.c more important.
Score filenames using 'inverse document frequency':
idf(t, D) = log( |D| / |{d ∈ D : t ∈ d}| )
Sum scores of matching filenames.
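The IDF scoring above can be sketched as follows, treating each package's set of filenames as a "document"; a toy example, with made-up package contents:

```python
import math

def idf(term, docs):
    """idf(t, D) = log(|D| / |{d in D : t in d}|)."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else 0.0

# Toy repository: each "document" is one package's filename set.
packages = [
    {"README", "Makefile", "libpng.c", "png.h"},
    {"README", "Makefile", "main.c"},
    {"README", "zlib.c"},
]
print(idf("README", packages))    # 0.0 – present everywhere, uninformative
print(idf("libpng.c", packages))  # log(3) ≈ 1.10 – rare, a strong signal
```

A shared README thus contributes nothing to the clone score, while a shared libpng.c contributes heavily.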
17. Which similar filenames to match?
Each matching has a cost – the filename score.
Choose matchings to maximize the sum of costs:
max Σ_{q ∈ p} Weight(q)
[Figure: candidate matchings between two packages' filenames, e.g. Makefile ↔ Makefile.ca, png44.c ↔ png43.c, png.h ↔ png.h; unmatched filenames: Makefile, README, rules]
18. Given two sets, A and T, of equal size, together with a weight function C: A × T → R, find a bijection f: A → T such that the cost function
Σ_{a ∈ A} C(a, f(a))
is optimal.
Known in combinatorial optimisation as 'the assignment problem.'
Solved optimally in cubic time.
Greedy solution is faster.
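The slide notes that the optimal (cubic-time Hungarian) algorithm can be replaced by a faster greedy approximation. A sketch of that greedy variant, assuming the weights are the filename scores from the previous slides; the pair-dict input format is an assumption made for the example:

```python
def greedy_assignment(costs):
    """costs: dict mapping (a, t) filename pairs to a weight.
    Repeatedly take the highest-weight pair whose endpoints are
    both still unmatched. Not optimal, but fast."""
    matched_a, matched_t = set(), set()
    matching, total = {}, 0.0
    for (a, t), w in sorted(costs.items(), key=lambda kv: -kv[1]):
        if a not in matched_a and t not in matched_t:
            matching[a] = t
            matched_a.add(a)
            matched_t.add(t)
            total += w
    return matching, total

costs = {
    ("png.h", "png.h"): 2.0,
    ("png44.c", "png43.c"): 1.5,
    ("Makefile", "Makefile.ca"): 0.3,
    ("png44.c", "png.h"): 0.1,   # loses: both endpoints already taken
}
matching, score = greedy_assignment(costs)
print(matching, score)
```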
19. Not all features are important.
Feature ranking.
Subset selection.
[Figure: ranking reorders features by score (Feature3: 0.80, Feature1: 0.60, Feature2: 0.01); subset selection keeps only Feature1 and Feature2]
We chose not to use it.
20. Consider feature vectors as N-dimensional points.
Linear classifiers.
Non-linear classifiers.
Decision trees.
[Figure: a decision boundary separating Class A from Class B]
21.
22. Speed up clone detection on a single package.
OpenMP.
Embarrassingly parallel.
[Figure: Clone Detection – Package_X fans out to Classify(Package_X, Package_1) … Classify(Package_X, Package_N)]
23. Open MPI.
A single job is clone detection on one package.
Slaves consume jobs.
Embarrassingly parallel.
[Figure: a master Clone Detection process dispatches Package_1 … Package_N to slaves]
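The work-queue model above, where idle workers pull the next package rather than receiving a static share up front, can be sketched with Python's multiprocessing pool; the clone_detect body is a placeholder, not Clonewise's actual detection logic:

```python
from multiprocessing import Pool

def clone_detect(package):
    """Placeholder for per-package clone detection;
    returns a dummy (package, result) pair."""
    return (package, f"clones-of-{package}")

if __name__ == "__main__":
    packages = [f"Package_{i}" for i in range(1, 101)]
    # imap_unordered hands each worker a new job as it finishes –
    # the work-queue model, rather than static scatter.
    with Pool(4) as pool:
        results = dict(pool.imap_unordered(clone_detect, packages))
    print(len(results))  # 100
```

This is the same dynamic load balancing the evaluation later argues for over MPI_Scatter-style static assignment.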
24. 4 Node Amazon EC2 Cluster
Dual CPU
8 cores per CPU
88 EC2 compute units
60.5G memory per node
Clone detection on embedded libs known by Debian.
Store the results for later use.
25.
26. Summary: Off-by-one error in the __opiereadrec function in readrec.c in libopie in OPIE 2.4.1-test1 and earlier, as used on FreeBSD 6.4 through 8.1-PRERELEASE and other platforms, allows remote attackers to cause a denial of service (daemon crash) or possibly execute arbitrary code via a long username, as demonstrated by a long USER command to the FreeBSD 8.0 ftpd.
29. 1. Take CVE, match CPE name to Debian package.
2. Parse CVE summary and extract vuln filename.
3. Find clones of package with similar filename.
4. Trim dynamically linked clones.
5. Is the vuln-affected clone already being tracked?
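Step 2, extracting the vulnerable source filename from the CVE summary text, can be sketched with a simple regex. The exact heuristics Clonewise uses are not shown in the deck, so this pattern is an illustrative assumption:

```python
import re

def extract_filenames(summary):
    """Pull C-style source filenames (foo.c, bar.h, ...) out of
    free-text CVE summaries."""
    return re.findall(r"\b[\w.-]+\.(?:c|h|cpp|cc)\b", summary)

summary = ("Off-by-one error in the __opiereadrec function in "
           "readrec.c in libopie in OPIE 2.4.1-test1 and earlier ...")
print(extract_filenames(summary))  # ['readrec.c']
```

The extracted filename is then looked up among the clones of the CVE's package, as in steps 3–5.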
32. 3,500 lines of C++ and shell scripts.
Open Source
http://www.github.com/silviocesare/Clonewise
33. Ubuntu Linux
3,077,063 unique filenames.
Follows an inverse power-law distribution.
R-squared value of the regression analysis: 0.928.
34. Debian Linux embedded-code-copies.txt.
Not really machine readable.
Cull entries which we can't match to packages.
761 labelled positives.
Negatives: any package pairs not in the positives.
47,578 generated labelled negatives.
35. Identified 34 previously unknown clones in
Debian.
Lots more to do.
Statistical classification
4 classifiers - Random Forest gave best accuracy.
Increasing the decision threshold reduces FPs.
Predict 3 FPs in 10,000 classifications.
More likely an upper limit.
37. 4 hours on an Amazon HPC cluster.
MPI_Scatter to do static job assignment was
inefficient.
Better to consume from a work queue.
Need to use multicore to balance load.
40. Write access to Debian's security tracker.
Red Hat embedded code copies wiki created.
Debian plans to integrate Clonewise into its infrastructure.
41. Red Hat references CVEs of embedded libs.
Not every vendor does.
It would be nice if CVE supported this.
42. Clonewise detects code reuse.
If zlib is embedded in packages X and Y:
Clonewise detects clones between all of X, Y, and zlib.
What we really want to know is:
X is not cloned in Y.
zlib is cloned in X and Y.
Mitigation:
Clone detection on known embedded libraries.
43. Debian Linux zlib audit in 2005.
Plagiarism detection:
Attribute counting.
Structure-based.
Code clone detection:
Tokenization.
Abstract syntax trees.
[Figure: an example AST for an if/condition/then/else statement with ==, =, return, x, 0, and 1 nodes]
44. Source repositories
Sourceforge
Github
Other OSes – BSD, etc.
Integration into build/packaging systems?
Integration into Debian Linux infrastructure.
45. More than just Clonewise.
Simseer – free flowgraph-based malware similarity and families.
110,000 LOC of C++. Happy to talk to vendors.
46. Vendors have 10,000+ packages.
How to audit for clones?
Clonewise can provide a solution.
And help improve security.
http://www.FooCodeChu.com
Remember to complete the Black Hat speaker feedback survey.