InfoArmor, Threat Intelligence &
Data Ingestion
Christian Lees & Steve Olson
What we will be covering today.
1. HOW DID WE GET HERE? A brief history of InfoArmor, and the greatness that got us to where we are today.
2. WHERE ARE WE GOING? A look at the vision and where we see InfoArmor going in the future.
3. HOW DO WE GET THERE? What will it take for us to achieve our vision, and what is our process to get there?
Threat Actors / Dark Web
"The world's most valuable resource is no longer oil, but data" - The Economist
Source: https://www.economist.com/news/leaders/21721656-data-economy-demands-new-approach-antitrust-rules-worlds-most-valuable-resource
Hacked, inside job, poor security, accidental publish, device lost/stolen
The unseen threats.
Dark web monitoring through InfoArmor Advanced Threat Intelligence:
- Forum scraping: programmatic forum scraping with bots, while human operatives gain access to closed forums.
- Human operatives: combat hackers who are using technology and innovating every day.
- Structuring raw data: compromised data files must be formatted, organized, and canonicalized to be fully leveraged (see the sketch below).
- Threat actor profiling: tracking threat actors' moves as we build out profiles, information, and patterns to thwart risks.
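As a rough illustration of the "structuring raw data" step, here is a minimal sketch of canonicalizing one line of a raw email:password combo list into a structured record. The field names and split logic are illustrative assumptions, not InfoArmor's actual parser.

from dataclasses import dataclass
from typing import Optional

@dataclass
class CredentialRecord:
    email: str
    domain: str
    password: str
    source: str

def canonicalize_line(line: str, source: str) -> Optional[CredentialRecord]:
    """Turn one raw dump line into a structured, queryable record."""
    line = line.strip()
    email, sep, password = line.partition(":")
    email = email.strip().lower()              # normalize case so duplicates collapse
    if not sep or "@" not in email or not password:
        return None                            # line doesn't match the expected format
    domain = email.rsplit("@", 1)[1]           # keep domain separately for fast lookups
    return CredentialRecord(email=email, domain=domain, password=password, source=source)

# e.g. canonicalize_line("User@Example.com:hunter2", source="forum-dump-2018")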
60% of companies cannot detect compromised credentials, survey says
Source: https://www.csoonline.com/article/3022066/security/60-of-companies-cannot-detect-compromised-credentials-say-security-pros-surveyed.html
Example dark web listing: "This product will get you 100.000 United Kingdom 'HOTMAIL' Emails Leads"
Source: http[:]//6qlocfg6zq2kyacl.onion/viewProduct?offer=857044.38586
SpamBot
Lessons from 1 billion rows
What I learned that allowed me to sleep again
Bird's eye view of data
- Relational DBs (MariaDB) for the web application and storage of known structured data
- Elasticsearch for unstructured data and full-text searching
- Replication off-site
- MariaDB remote DBAs monitor all InfoArmor databases
Over 2 billion credentials
45 million forum posts
300 GB (and growing) of botnet logs
Pretty much all code is in Python (a minimal storage-routing sketch follows below).
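A minimal sketch of that storage split: structured rows go to MariaDB, free text goes to Elasticsearch for full-text search. The connection strings, table name, and index name are illustrative assumptions; the deck only says SQLAlchemy and Elasticsearch are in the stack.

import requests
from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://app:secret@mariadb-primary/ati")  # hypothetical DSN
ES_URL = "http://elasticsearch:9200"                                      # hypothetical host

def store_credential(email: str, domain: str, password: str) -> None:
    """Known, structured data: insert a row into MariaDB."""
    with engine.begin() as conn:
        conn.execute(
            text("INSERT INTO credentials (email, domain, password) "
                 "VALUES (:email, :domain, :password)"),
            {"email": email, "domain": domain, "password": password},
        )

def store_forum_post(post: dict) -> None:
    """Unstructured text: index the document in Elasticsearch for full-text search."""
    requests.post(f"{ES_URL}/forum_posts/_doc", json=post, timeout=10).raise_for_status()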
Don't Do That!
- Feature worked for some inputs, but not others
- Schema was suboptimal, leading to full table scans
- 4-way joins running for hundreds of thousands of seconds
- Had to kill 'em (see the sketch below)
- With MariaDB assistance, planned out a new schema for credentials
  - More intuitive
  - Meets business needs in API and GUI
- Listen to end users!
Non-tech lesson: Cultivate relationships outside of tech!
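A minimal sketch of how runaway queries like those can be found and killed from Python. It assumes an account with the PROCESS privilege and permission to KILL; the 600-second threshold and connection details are illustrative.

import pymysql

LONG_QUERY_SECONDS = 600   # illustrative threshold

conn = pymysql.connect(host="mariadb-primary", user="dba", password="secret")
with conn.cursor() as cur:
    # Find statements that have been running far too long.
    cur.execute(
        "SELECT ID, TIME, INFO FROM information_schema.PROCESSLIST "
        "WHERE COMMAND = 'Query' AND TIME > %s",
        (LONG_QUERY_SECONDS,),
    )
    for thread_id, seconds, sql_text in cur.fetchall():
        print(f"Killing thread {thread_id} after {seconds}s: {(sql_text or '')[:80]}")
        cur.execute(f"KILL QUERY {int(thread_id)}")   # stop the statement, keep the connection
conn.close()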
Multithreading Mayhem
- Parallelized queries to multiple databases
- In Pyramid, achieved with separate DB sessions
- Sessions weren't closed, leaving connections open
  - Fell outside of the normal Zope/SQLAlchemy flow
- Monyog alerts about maxed-out connections; restarted the application to clear connections
- Found the issue in code, added .close() (see the sketch below)
Lesson: Configuration changes solve and don't solve problems at the same time
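A minimal sketch of the fix: each worker thread opens its own SQLAlchemy session and always closes it, even when the query raises. Engine URLs, table names, and the pool size are illustrative; the real code lives inside Pyramid views.

from concurrent.futures import ThreadPoolExecutor
from sqlalchemy import create_engine, text
from sqlalchemy.orm import sessionmaker

# One session factory per backend database (URLs are illustrative).
Sessions = {
    "credentials": sessionmaker(bind=create_engine("mysql+pymysql://app:secret@db1/ati")),
    "forums": sessionmaker(bind=create_engine("mysql+pymysql://app:secret@db2/ati")),
}

QUERIES = {
    "credentials": text("SELECT COUNT(*) FROM credentials WHERE domain = :d"),
    "forums": text("SELECT COUNT(*) FROM posts WHERE domain = :d"),
}

def count_for_domain(db_name: str, domain: str) -> int:
    session = Sessions[db_name]()      # separate session, outside the managed Pyramid/Zope flow
    try:
        return session.execute(QUERIES[db_name], {"d": domain}).scalar()
    finally:
        session.close()                # the missing .close() that had leaked connections

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(count_for_domain, name, "example.com") for name in Sessions]
    results = dict(zip(Sessions, (f.result() for f in futures)))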
Obviously….
Don't Bring All Groceries in at Once
- Sometimes a ton of rows need to be updated
- Even if something doesn't get committed... log entries and rollbacks still get created
  - Gums up replication
  - Wastes time
- max_allowed_packet
Lesson: Data should be updated in small, programmatic bites (see the sketch below)
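A minimal sketch of "small bites": walk a large table by primary-key ranges and commit each chunk, so every transaction, binlog entry, and potential rollback stays small. The table, column, and batch size are illustrative.

import pymysql

BATCH_SIZE = 5000   # illustrative chunk size

conn = pymysql.connect(host="mariadb-primary", user="app", password="secret", db="ati")
with conn.cursor() as cur:
    cur.execute("SELECT MIN(id), MAX(id) FROM credentials")
    min_id, max_id = cur.fetchone()
    if min_id is not None:
        for start in range(min_id, max_id + 1, BATCH_SIZE):
            cur.execute(
                "UPDATE credentials SET domain = LOWER(domain) "
                "WHERE id BETWEEN %s AND %s",
                (start, start + BATCH_SIZE - 1),
            )
            conn.commit()   # commit per chunk: small transactions, small binlog events
conn.close()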
Same for import parsing scripts
- Where multithreading amplifies binlog size
- Don't get greedy; nothing is worth screwing up replication or your application (see the sketch below)
Non-tech lesson: Add 20 to 200 percent to time estimates for imports.
Process and organization will set you free.
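A minimal sketch of an import parser in that spirit: parse the dump lazily and insert in bounded chunks from a single writer, so the binlog grows in small steps. The file format, chunk size, and table layout are illustrative.

import itertools
import pymysql

CHUNK_SIZE = 1000   # illustrative chunk size

def parsed_rows(path):
    """Yield (email, password) tuples from a raw email:password dump file."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            email, sep, password = line.strip().partition(":")
            if sep and "@" in email:
                yield email.lower(), password

def import_dump(path):
    conn = pymysql.connect(host="mariadb-primary", user="app", password="secret", db="ati")
    with conn.cursor() as cur:
        rows = parsed_rows(path)
        while True:
            chunk = list(itertools.islice(rows, CHUNK_SIZE))
            if not chunk:
                break
            cur.executemany(
                "INSERT IGNORE INTO credentials (email, password) VALUES (%s, %s)",
                chunk,
            )
            conn.commit()   # one small binlog event per chunk, replication keeps up
    conn.close()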
IDS - Intrusion Detection System
Or rather, "Inline Data Shredder"
- Scrapers pick up malicious-looking JavaScript, PHP, Python, and Perl scripts
- These normally get bounced by the IDS on the way in from the scraper
- Replication kept mysteriously stopping
- Engineering team getting "WTF?" alerts from all angles
Found the chunk of code in the database. Replication now runs over SSL (see the sketch below).
Lesson: Coincidence...or degree of separation?
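A minimal sketch of moving a replica onto SSL replication so inline inspection gear no longer sees (or interrupts) the raw stream. Host names, the account, and certificate paths are illustrative, and the same statements can be run directly from the mariadb client.

import pymysql

replica = pymysql.connect(host="replica-offsite", user="repl_admin", password="secret")
with replica.cursor() as cur:
    cur.execute("STOP SLAVE")
    cur.execute(
        "CHANGE MASTER TO "
        "MASTER_HOST='mariadb-primary', "
        "MASTER_SSL=1, "                                      # encrypt the replication stream
        "MASTER_SSL_CA='/etc/mysql/ssl/ca.pem', "
        "MASTER_SSL_CERT='/etc/mysql/ssl/replica-cert.pem', "
        "MASTER_SSL_KEY='/etc/mysql/ssl/replica-key.pem'"
    )
    cur.execute("START SLAVE")
replica.close()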
Final thoughts...
- Data is business, business is data.
- Let remote DBAs handle the nuts and bolts
- Focus on your application and the goal of the data
- Make data available to sales people, but toolify it
- Keep evolving
Fin
Gracias por escuchar (thank you for listening)

M|18 How InfoArmor Harvests Data from the Underground Economy


Editor's Notes

  • #2 Good Morning greeting
  • #3 1. How did we get here: about InfoArmor, founded in 2007; the EPS story; the ATI story. 2. Where are we going: more established credit alerts; more securing alerts such as high-risk transactions or fraud relations; more underground economy; more actionable alerts, near real time. 3. How do we get there: ingestion of large data sets; correlation of large data sets; near real time, high availability.
  • #14 Follow on from Christian's points. About 700 million rows when I took over: 700 million accumulated over 4 years or so, then tripled in less than 2 years. New breaches and repacks of breaches. The ingest process was disrupting normal use, the querying process fell apart, and disk consumption was high due to duplicate data. Clobbered with behind-the-scenes processes and hidden mines from sales people.
  • #15 Forum posts, pastes, and analyst dump files; files include medical records, clinical-trial PDFs, emails, XLS, PDF. Some material is too hot to put into production as queryable data. Botnet logs, organized and unorganized, in different formats. Today: over 2 billion rows of credentials; several single-column indices and covering grouped indices on some columns; RAID 5 NVMe SSDs (#yolo); 40 million+ forum posts with full text via ES. The application is aware of where to read and write. Offsite replication, monitored by remote DBAs. Improved workflow of analyst communication.
  • #16 Long queries running from certain search boxes in the portal or API (LIKE combined with a 4-way join). "The previous guy told me not to search for bigger domains..." Ben Stillman came out as part of the initial consulting engagement and evaluated the schema for the credentials database. Full table scans are the devil. Duplicate data was stored across 4 tables, so almost all business uses of the data required costly joins. Determine the minimum useful unit of data for the business: what constitutes the most useful result set, how to quickly and reliably retrieve it, and how to keep it updated with new data without the new data making old data useless. Determine how closely related tables are: is there a 1-to-1 ratio of rows? Do they describe unique units of data? Find the line between what collection of attributes constitutes a useful record and the cost of updating those records if denormalized too hard. Is there anything you tell an end user not to do? Is there water-cooler talk about something that is slow? SHOW PROCESSLIST; solve the issue. "Don't search for gmail.com", "don't query for yahoo": these cause long queries due to joins using low-cardinality indices, or indices that are too huge, making MySQL just scan the entire tables for the results. All problems can be solved; treat it like a Zelda dungeon or Metroid. Ask for help, research, MariaDB remote DBA...
  • #17 Initial thought was to speed up loading of the dashboard by firing off multi-threaded queries. Random alerts about the application being down despite nearly everything being quiet. Monyog alerts about max connections, so had remote DBAs increase max_connections; that mitigated the issue, but it still happened. Sometimes bugs make it to production; stay calm. Symptoms were immediate 500 errors.
  • #19 Story: the scraper went haywire, not storing the last post properly, causing a flood of data. Could see the disk-usage graph rise and fall; it amplified other export processes. Updated the format of posts and had to update old ones with new data; initially did it in one #yolo query. Huge transaction -> huge log -> huger redo log -> huge... Let the remote DBA be your canary: we have a Slack channel and I'll get pinged if something is about to go off the rails. Programmatically solve problems in your preferred language; don't use the mysql command line to update large chunks of data, or shell scripts that don't go into version control. RemoteDBA will ask WTF if you are doing "yolo" update-everything queries.
  • #21 Consider all aspects of the network. Story: replication to the other datacenter kept stopping and the remote DBA was flummoxed. We were getting IDS alerts, with engineering and security lost as to what a PHP injection was doing with the database replication server. Correlated the ID of the row that contained the code in the body of the pastebin paste. Resolved with an SSL connection.
  • #23 Contact info?