1. How did we get here — About InfoArmor: founded in 2007; the EPS story; the ATI story.
2. Where are we going — More established credit alerts; more security alerts, such as high-risk transactions or fraud relations; more underground-economy coverage; more actionable, near-real-time alerts.
3. How do we get there — Ingestion of large data sets; correlation of large data sets; near real time; high availability.
Follow on from Christian's points. About 700 million credentials when I took over, accumulated over 4 years or so; that tripled in less than 2 years. New breaches, plus repacks of old breaches.
The ingest process was disrupting normal use. The querying process fell apart. Disk consumption was high due to duplicate data.
Clobbered by behind-the-scenes processes, hidden landmines from salespeople.
Forum posts, pastes, and analyst dump files. Files include medical records, clinical-trial PDFs, emails, XLS, and PDF documents. Some material is too hot to put into the production queryable store.
Botnet logs, organized and unorganized, in different formats.
Today:
- Over 2 billion rows of credentials
- Several indices on single columns, plus covering grouped indices on some columns
- RAID 5 NVMe SSDs (#yolo)
- 40 million+ forum posts with fulltext search via ES
- Application aware of where to read and write
- Offsite replication
- Monitored by remote DBAs
- Improved workflow of analyst communication
Long queries were running from certain search boxes in the portal and API (LIKE combined with a 4-way join). "The previous guy told me not to search for bigger domains....." Ben Stillman came out as part of the initial consulting engagement and evaluated the schema for the credentials database. Full table scans are the devil. Duplicate data was stored across 4 tables, and almost all business uses of the data required doing costly joins. Determine the minimum useful unit of data for the business: What constitutes the most useful result set? How do we quickly and reliably retrieve it? How do we keep it updated with new data without new data making old data useless? Determine how closely related tables are: is there a 1-to-1 ratio of rows? Do they describe unique units of data? Find the line between what collection of attributes constitutes a useful record and the cost of updating those records if denormalized too hard.
Is there anything you tell an end user not to do? Is there water-cooler talk about something that is slow? Run SHOW PROCESSLIST; solve the issue. "Don't search for gmail.com." "Don't query for yahoo." These caused long queries due to joins using low-cardinality indices, or indices that are too huge, causing MySQL to just scan the entire tables for the results. All problems can be solved; treat it like a Zelda dungeon or Metroid. Ask for help, do research, use the MariaDB remote DBAs...
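The "check the processlist, find the slow query" step can be sketched in Python. This is a hypothetical helper, not our production code; the row shape mirrors the columns that SHOW PROCESSLIST returns (Id, Command, Time, Info):

```python
# Hypothetical sketch: flag long-running statements from SHOW PROCESSLIST-shaped
# rows so an operator can investigate (or kill) them.

def long_queries(processlist, threshold_secs=300):
    """Return active Query rows that have been running longer than the threshold."""
    return [
        row for row in processlist
        if row.get("Command") == "Query" and row.get("Time", 0) > threshold_secs
    ]

# Example rows shaped like SHOW PROCESSLIST output:
rows = [
    {"Id": 12, "Command": "Query", "Time": 4200,
     "Info": "SELECT ... WHERE email LIKE '%gmail.com%'"},
    {"Id": 13, "Command": "Sleep", "Time": 9000, "Info": None},
]
```

In practice the rows would come from querying `information_schema.PROCESSLIST` with your DB driver; the sleeping connection above is ignored because only active queries matter here.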
Initial thought was to speed up loading of the dashboard by firing off queries in multiple threads.
Random alerts about the application being down despite nearly everything being quiet. Monyog alerted about max connections, so we had the remote DBAs increase max_connections. That mitigated the issue, but it still happened. Sometimes bugs make it to production; stay calm.
Symptoms were immediate 500 errors
Story: A scraper went haywire, not storing the last post properly and causing a flood of data. We could see the disk-usage graph rise and fall. It amplified other export processes. We updated the format of posts and had to update old ones with new data; initially we did it in one #yolo query. Huge transaction -> huge log -> huger redo log -> huge.... Let your remote DBA be your canary. We have a Slack channel, and I'll get pinged if something is about to go off the rails.
Programmatically solve problems in your preferred language; don't use the MySQL command line to update large chunks of data, or shell scripts that don't go into version control. The remote DBA will ask "WTF?" if you are running yolo update-everything queries.
Consider all aspects of the network. Story: Replication kept stopping to the other datacenter, and the remote DBAs were flummoxed. We were getting IDS alerts, and engineering and security were lost as to what a PHP injection was doing with the database replication server. We correlated the ID of the row that contained the code in the body of a Pastebin paste. Resolved with an SSL connection.
M|18 How InfoArmor Harvests Data from the Underground Economy
InfoArmor, Threat Intelligence &
Christian Lees & Steve Olson
What we will be covering today.
HOW DID WE GET HERE?
A brief history of InfoArmor, and the
greatness that got us to where we are
WHERE ARE WE GOING?
A look at the vision and where we see
InfoArmor going in the future.
HOW DO WE GET THERE?
What will it take for us to achieve our
vision, and what is our process to get there.
“The world’s most valuable resource is no longer oil, but data”
- The Economist
The unseen threats.
Dark web monitoring through InfoArmor Advanced Threat Intelligence.
Combat hackers that are using technology and scraping with bots while human operatives gain access to closed forums.
Structuring raw data
Compromised data files must be formatted, organized, and canonicalized to be fully leveraged.
Threat actor profiling
Tracking threat actors' moves as we build out profiles, information, and patterns to thwart risks.
Survey says: 60% of companies cannot detect compromised credentials.
This product will get you 100.000 United Kingdom "HOTMAIL" Emails Leads
Lessons from 1 billion
What I learned that allowed me to sleep
Bird’s eye view of data
- Relational DBs for the web application and storage of known credential data
- Elasticsearch for unstructured and fulltext searching
- Replication off-site
- MariaDB remote DBAs monitor all InfoArmor databases
Over 2 billion credentials
45 million forum posts
300 GB and growing of botnet logs
Pretty much all code is in Python.
Don’t Do That!
- Feature worked for some inputs, but not others
- Schema was suboptimal, leading to full table scans
- 4 way join, hundreds of thousands of seconds
- Had to kill ‘em
- With MariaDB assistance, planned out new schema for the credentials database
- More intuitive
- Meets business needs in API and GUI
- Listen to end users!
Non tech lesson: Cultivate relationships outside of tech!
- Parallelized queries to multiple databases
- In Pyramid, achieved with separate DB Sessions
- Sessions weren’t closed, leaving connections open
- Fell outside of normal Zope/SQLAlchemy flow
- Monyog alerts about max'd connections; restarted application to clear them
- Found issue in code, added .close()
Lesson: Configuration changes solve and don't solve problems at the same time
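The fix above boils down to guaranteeing `.close()` runs on every parallel session. A minimal sketch of that pattern, with a hypothetical `Session` class standing in for an SQLAlchemy session:

```python
from contextlib import contextmanager

class Session:
    """Stand-in for an SQLAlchemy DB session (hypothetical, for illustration)."""
    def __init__(self):
        self.closed = False
    def execute(self, query):
        return f"result of {query}"
    def close(self):
        self.closed = True

@contextmanager
def managed_session(factory):
    # Guarantee close() runs even if the query raises --
    # the missing close was what left connections open.
    session = factory()
    try:
        yield session
    finally:
        session.close()

def run_parallel_query(query):
    """Each worker thread gets its own session and is certain to release it."""
    with managed_session(Session) as session:
        return session.execute(query)
```

Work that happens outside the framework's normal request lifecycle (here, outside the Zope/SQLAlchemy flow) has to do its own cleanup; a context manager makes that hard to forget.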
Don’t Bring All Groceries in at Once
- Sometimes a ton of rows need to be updated
- Even if something doesn’t get committed….
...Log entries and rollbacks get created
- Gums up replication
- Wastes time
- max_allowed_packet
Lesson: Data should be updated in small bites
Same for import parsing scripts
Where multithreading amplifies binlog size
- Don’t get greedy, nothing is worth screwing up replication or your
Non tech lesson: Add 20 to 200 percent to time estimates for imports.
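The "small bites" idea can be sketched as batching plus a commit per batch, so no single transaction bloats the binlog or redo log. Table and column names (`posts`, `fmt`) are made up for illustration; the connection is any DB-API connection:

```python
# Hypothetical sketch: update rows in fixed-size batches, committing after each,
# instead of one giant #yolo transaction.

def chunks(ids, size):
    """Yield successive slices of at most `size` primary keys."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

def batched_update(conn, ids, batch_size=1000):
    """Apply the format update batch by batch, committing per batch."""
    updated = 0
    for batch in chunks(ids, batch_size):
        placeholders = ",".join("?" * len(batch))
        conn.execute(
            f"UPDATE posts SET fmt = 2 WHERE id IN ({placeholders})", batch
        )
        conn.commit()  # small transaction per batch, not one huge one at the end
        updated += len(batch)
    return updated
```

The same shape works for import parsers: commit per chunk, and keep the chunk size small enough that replication never falls far behind.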
Process and organization will set you free
IDS - Intrusion Detection System
Or rather “Inline Data Shredder”
- Will normally get bounced on the way in from the scraper
- Replication kept mysteriously stopping
- Engineering team getting “WTF?” alerts from all angles
Found the chunk of code in the database. Replication now over SSL.
Lesson: Coincidence...or degree of separation?
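A minimal sketch of what moving replication onto SSL can look like on the replica; option names vary across MariaDB/MySQL versions, and the CA path here is a placeholder:

```sql
-- On the replica: restart replication over an encrypted connection.
STOP SLAVE;
CHANGE MASTER TO
  MASTER_SSL = 1,
  MASTER_SSL_CA = '/etc/my.cnf.d/ssl/ca-cert.pem';  -- placeholder path
START SLAVE;
```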
- Data is business, business is data.
- Let remote DBAs handle the nuts and bolts
- Focus on your application and goal of the data
- Make data available to sales people, but toolify it
- Keep evolving