Releases blar.py, a tool that creates a genomic encoding from text files. The encoding is a lossy, highly compressible representation of the original file that can be used for rapid anomaly detection and forensic analysis.
5. Basics
• Letters (nucleotides)
– 4 in DNA: A, G, C, T
• Codons
– Triplets of nucleotides, e.g. GAA
• Genomes have coding regions (proteins) & non-coding regions (other)
• One strand can be read forward, the other in reverse
6. It’s all about the Codons
• The Genetic Code is a dictionary of Codons
• 64 entries (4^3)
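The 64-entry count follows directly from the alphabet size: every ordered triplet over four bases. A one-liner makes the arithmetic concrete:

```python
from itertools import product

# Enumerate every codon: all ordered triplets over the 4-letter DNA alphabet
BASES = "AGCT"
CODONS = ["".join(t) for t in product(BASES, repeat=3)]

print(len(CODONS))    # 4^3 = 64
print(CODONS[:4])     # ['AAA', 'AAG', 'AAC', 'AAT']
```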
8. Analyzing Genomes
• Compare them to each other
– Alignments (e.g. Smith-Waterman)
– Distances
• Levenshtein (edit) distance (metric)
• Longest Common Subsequence distance (metric)
• Normalized Compression Distance (metric)
– Optimal Grammars
• Pisa.c: Optimal sequence grammar search using hyperstring encodings
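Of the distances above, Normalized Compression Distance is the easiest to sketch from the standard library alone. The sketch below uses zlib as the compressor (any real compressor works) and the toy-file lines from a later slide as sample data:

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    # Normalized Compression Distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
    # Approximates information distance using a real compressor (zlib here)
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

a = b"Mary had a little lamb whose fleece was white as snow." * 4
b_ = b"Gary had a little hand whose hair was as white as blow." * 4
c = b"John McAfee was the keynote for Skytalks." * 4

# Similar strings compress well together, so their NCD is smaller
print(ncd(a, b_) < ncd(a, c))
```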
9. Analyzing Genomes
• Look for interesting regions
– Information gain (Kullback-Leibler divergence)
– Coding costs (Kolmogorov complexity)
– Decaying coding costs (lossy Kolmogorov complexity)
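A minimal sketch of the information-gain idea: score a region by the KL divergence of its letter frequencies against a background distribution. The add-one smoothing and the uniform background here are illustrative choices, not blar.py's actual scoring:

```python
import math
from collections import Counter

def kl_divergence(p_seq: str, q_seq: str, alphabet: str = "AGCT") -> float:
    # D(P || Q) over symbol frequencies, with add-one smoothing so
    # letters unseen in one sequence don't cause division by zero
    pc, qc = Counter(p_seq), Counter(q_seq)
    pn = len(p_seq) + len(alphabet)
    qn = len(q_seq) + len(alphabet)
    total = 0.0
    for a in alphabet:
        p = (pc[a] + 1) / pn
        q = (qc[a] + 1) / qn
        total += p * math.log2(p / q)
    return total

# A skewed region diverges from a uniform background...
print(kl_divergence("GGCCGGCC", "AGCTAGCT"))
# ...while identical distributions score exactly zero
print(kl_divergence("AGCTAGCT", "AGCTAGCT"))  # 0.0
```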
14. Don’t say that again
• Sections of DNA that do not repeat are the most important
• Protein-coding genes and RNA-coding genes are non-repetitive
• Higher-order creatures are largely repetitive
16. Putting the squeeze on
• Normal compressors ~ 2-bit codes
• Special genetic compressors exist
• Compressibility equates to sequence predictability for the model in use
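The compressibility-equals-predictability point is easy to demonstrate: a repetitive (predictable) sequence shrinks dramatically under an ordinary compressor, while a random one of the same length barely shrinks at all.

```python
import os
import zlib

repetitive = b"AGCT" * 256      # 1024 bytes, highly predictable
random_seq = os.urandom(1024)   # 1024 bytes, unpredictable

# The repetitive sequence compresses to a tiny fraction of its size;
# the random one stays close to 1024 bytes
print(len(zlib.compress(repetitive)))
print(len(zlib.compress(random_seq)))
```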
18. A Question
If we could convert sequences of logs, packets, etc. to a genomic encoding, could we use genomic analysis to dramatically speed up & improve forensics, incident response and anomaly detection?
20. How?
• Step 1: Convert events into an alphabet
• Step 2: Convert the stream into a string of letters
• Step 3: Money bath
21. A Naïve Solution
• Step 1: Hash each input, use the hash value as a letter
• Step 2: Create a stream of hash values
• Step 3: #fail
Why?
22. Answer
• The alphabet is too big
• The stream will need at least 2^(2^hash_key_size) examples
• The stream is virtually unpredictable
24. WTF is a ‘blarp’?
• Let’s ask Google
• The sound a fat person makes being fat
• The sound of taking big fat data and making it useful & efficient small data
• A cool little Python tool for creating and analyzing genomic encodings
• The last two will not be found on Google…yet
25. Idea
• We want similar events to be represented by a single letter
• Hashes are random projections
• Let’s use geometry instead
26. Position in space
• To precisely locate something in D-dimensional space, you need distances to n = D+1 reference points
• Key notion: to get something’s general area you can use n << D+1 reference points
27. Locality-Sensitive Hashing
• Introduced in the late ’90s (Indyk & Motwani)
• Used within indexing for text lookups on massive data sets
• Many hashes; data-type dependent
• Question: What if you thought about it as a ‘general area’ hash instead?
28. How it works
• Basic type: Random Projection
• Given a numeric vector (e.g. 1, 15, 3, 14.8), calculate its dot product vs. a random vector
• If the result is positive, call it a ‘1’
• If negative, call it a ‘0’
• Repeat
• Concatenate the binary together; the result is the LSH
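The steps above can be sketched in a few lines of NumPy. The dimensions, hash width, and random seed below are illustrative, not blar.py's defaults:

```python
import numpy as np

rng = np.random.default_rng(42)
D, WIDTH = 64, 8                          # input dimension, bits in the hash
planes = rng.standard_normal((WIDTH, D))  # one random vector per output bit

def lsh(v: np.ndarray) -> str:
    # Sign of each dot product against a random vector yields one bit
    bits = (planes @ v >= 0).astype(int)
    return "".join(map(str, bits))

v = rng.standard_normal(D)
near = v + 0.01 * rng.standard_normal(D)  # small perturbation of v
far = rng.standard_normal(D)              # unrelated vector

print(lsh(v))
print(lsh(near))   # usually identical or nearly identical to lsh(v)
print(lsh(far))
```

Nearby vectors rarely cross a random hyperplane, so their bit strings agree on most positions; that is the "general area" property the previous slide asks about.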
30. Vectorizing
• Idea: count things that matter, take measurements, etc. and create an array to hold that information
• Where the rubber meets the road
• Lots of chances for domain expertise
31. Basic Vectorizing in Blar.py
• Basic model: character n-grams
• Also known as Markov chains or Bag of Letters
• Counts up sliding windows of text
• E.g. 2-grams for ‘sassyfrassy’:
sa: 1 as: 2 ss: 2 sy: 2 yf: 1 fr: 1 ra: 1
For a 256^2-length array: (1,0…0,2,0…0,2,0…
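The sliding-window counting above is a three-line function; running it on the slide's example reproduces the counts shown:

```python
from collections import Counter

def char_ngrams(s: str, n: int = 2) -> Counter:
    # Slide a window of n characters across the string, counting each window
    return Counter(s[i:i+n] for i in range(len(s) - n + 1))

counts = char_ngrams("sassyfrassy")
print(sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])))
# [('as', 2), ('ss', 2), ('sy', 2), ('fr', 1), ('ra', 1), ('sa', 1), ('yf', 1)]
```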
32. Let’s Vectorize Better
• Use Feature Hashing, otherwise known as the hashing trick
• Find hash mod length and increment the counter for each model pattern
• Permits lossy counting with graceful random collisions
• Blar.py uses length 64 by default and xxHash
33. Blar.py code
def feature_hash_string(s, window, dim):
    # Generate window-char Markov chains & create feature hashes
    chains = [xxhash.xxh32(s[i:i+window]).intdigest() % dim
              for i in range(len(s) - (window - 1))]
    # Initialize counter array
    counters = numpy.zeros(dim)
    # Count instances of feature hashes
    for chain in chains:
        counters[chain] += 1
    # Return feature hash count vector
    return counters
34. Now let’s find the LSH
# Use random projection for LSH and output a UTF char for the locality-sensitive hash
def locality_hash_vector(v, width):
    bits = numpy.zeros(width, dtype=int)
    for x in range(width):  # range(0, width - 1) would skip the last bit
        projection = numpy.dot(COMP_VECTORS[x], v)
        if projection < 0:
            bits[x] = 0
        else:
            bits[x] = 1
    # Return the Unicode char whose code point equals the LSH bits
    return chr(int(''.join(map(str, bits)), 2))
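Putting the two functions together gives the full event-to-letter pipeline. This self-contained sketch substitutes zlib.crc32 for xxHash (so it needs no third-party hash package), seeds the COMP_VECTORS randomly, and borrows the 4-char window / 4-bit hash / 64-d defaults from a later slide; none of this is blar.py's actual wiring:

```python
import zlib
import numpy as np

DIM, WIDTH = 64, 4
rng = np.random.default_rng(0)
COMP_VECTORS = rng.standard_normal((WIDTH, DIM))  # random projection vectors

def feature_hash_string(s: str, window: int = 4, dim: int = DIM) -> np.ndarray:
    # zlib.crc32 stands in for xxHash; same hash-mod-length counting idea
    counters = np.zeros(dim)
    for i in range(len(s) - (window - 1)):
        counters[zlib.crc32(s[i:i+window].encode()) % dim] += 1
    return counters

def locality_hash_vector(v: np.ndarray, width: int = WIDTH) -> str:
    # One bit per random projection, concatenated into a code point
    bits = ["0" if np.dot(COMP_VECTORS[x], v) < 0 else "1" for x in range(width)]
    return chr(int("".join(bits), 2))

# Identical lines always map to the same letter; a different line often does not
lines = ["Mary had a little lamb", "Mary had a little lamb", "FOO BAR BAS"]
genome = "".join(locality_hash_vector(feature_hash_string(l)) for l in lines)
print([ord(c) for c in genome])
```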
35. Blar.py analysis
• Analyzes 4-character sequences and assigns a decaying version of the optimal coding cost to each line
• Tells you how interesting a certain event is relative to everything else in the genome, accounting for ordering
• Blar.py genomes are extremely compressible, especially using bzip2
36. Blar.py defaults (ATM)
• 4-character sliding windows
• 4-bit hashes
• 64-d feature hashes
• Outputs a list of the most interesting scores
• Outputs a few bad charts
37. Blar.py vs. Toy File
1. Mary had a little lamb whose fleece was white as snow.
2. Mary had a little lamb whose fleece was white as snow.
3. Mary had a little lamb whose fleece was white as snow.
4. Mary had a little lamb whose fleece was white as snow.
5. Mary had a little lamb whose fleece was white as snow.
6. Gary had a little hand whose hair was as white as blow.
7. some more strings
8. some more strings
9. some more strings
10. some more strings
11. some more strings
12. John McAfee was the keynote for Skytalks.
13. John McAfee was the keynote for Skytalks.
14. John McAfee was the keynote for Skytalks.
15. some more strings
16. some more strings
17. some more strings
18. John McAfee was the keynote for Skytalks.
19. John McAfee was the keynote for Skytalks.
20. FOO BAR BAS
39. Blar.py vs. Toy File
(Look Raffy, I’m using the completely inappropriate chart type)
40. Blar.py vs. BlueGene/L
• From the Usenix Computer Failure Data Repository
• 1.2GB combined log file from 131,072 processors over six months
• 119MB compressed with gzip
• 9.4MB blar.py genome
• Blar.py: ~1000 lines/sec
43. TL;DR
• Fast, accurate, free: the Blar.py genomic encoding tool provides very fast, low-noise anomaly detection
• Stop searching in a crisis: a great way to quickly explore data for IR, forensics, etc., especially from unknown sources
• Want it? Follow me @conduit242 for the GitHub posting announcement