
The Background Noise of the Internet

The last five to ten years have seen massive advancements in open source Internet-wide mass-scan tooling, on-demand cloud computing, and high-speed Internet connectivity. This has led to a massive influx of different groups mass-scanning all four billion IP addresses in the IPv4 space on a constant basis. Information security researchers, cyber security companies, search engines, and criminals scan the Internet for a variety of benign and nefarious reasons (examples of the latter include the WannaCry ransomware and multiple MongoDB, ElasticSearch, and Memcached ransomware variants). It is increasingly difficult to differentiate between scan/attack traffic targeting your organization specifically and opportunistic mass-scan background radiation packets.

Grey Noise is a system that records and analyzes all the collective omnidirectional background noise of the Internet, performs enrichments and analytics, and makes the data available to researchers for free. Traffic is collected by a large network of geographically and logically diverse “listener” servers distributed around different data centers belonging to different cloud providers and ISPs around the world.

In this talk I will candidly discuss motivations for developing the system, a technical deep dive on the architecture, data pipeline, and analytics, observations and analysis of the traffic collected by the system, business impacts for network operators, pitfalls and lessons learned, and the vision for the system moving forward.

Published in: Internet

  1. 1. The Background Noise of the Internet Andrew Morris @Andrew___Morris
  2. 2. • Thank you • Founders • Committee • Staff • Attendees
  3. 3. About Me Andrew Morris Background in offensive cyber stuff, security research Previously: * Endgame R&D * Intrepidus (NCC Group) * KCG (ManTech) Twitter: @Andrew___Morris
  4. 4. Lots of people scan the Internet. I built a system that collects all of the Internet-wide scan traffic. I analyze the data to find weird stuff. I make that data available to researchers for free via an API
  5. 5. Structure • Background • Previous Work • Architecture • Analysis • Roadmap • Conclusion • Questions
  6. 6. Background • Internet-wide mass scanning is easier than ever • Open source tooling: Masscan, ZMap, UnicornScan, etc • Cloud computing • Instant servers • Large number of recyclable IP addresses • High throughput / faster global Internet connections
  7. 7. What is Internet Mass Scanning? • “Mass Scanning” is scanning every single routable IP address on the Internet for something • The IPv4 address space is 0.0.0.0 – 255.255.255.255 • Give or take a few blocks • That’s 4.2 billion IP addresses • Bandwidth-wise, roughly the same as uploading a 240 GB file
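The 240 GB figure on this slide can be sanity-checked with a little arithmetic. The sketch below assumes one probe per address and reads "240 GB" as decimal gigabytes (both are assumptions, not statements from the talk); it only derives the per-probe size that the slide's number implies.

```python
# Back-of-the-envelope check of the "240 GB" figure from the slide.
TOTAL_IPV4_ADDRESSES = 2 ** 32        # 0.0.0.0 through 255.255.255.255
SLIDE_TOTAL_BYTES = 240 * 10 ** 9     # "240 GB", read as decimal gigabytes (assumption)


def implied_bytes_per_probe(total_bytes: int = SLIDE_TOTAL_BYTES,
                            addresses: int = TOTAL_IPV4_ADDRESSES) -> float:
    """Bytes sent per address if exactly one probe goes to every IPv4 address."""
    return total_bytes / addresses


if __name__ == "__main__":
    print(f"{TOTAL_IPV4_ADDRESSES:,} addresses")
    print(f"{implied_bytes_per_probe():.1f} bytes per probe")
```

The result comes out just under 56 bytes per address, which is consistent with a single small probe packet per target.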
  8. 8. What does this mean? • Lots and lots of people scanning the Internet, for lots of different things • From millions of different IP addresses • Benign: Shodan, Censys, Sonar, ShadowServer • Malicious: SSH/Telnet worms (Mirai), IoT worms, Conficker, etc • Internet-wide scanning is busier than ever
  9. 9. This creates a problem When you see an IP scanning your network, are they scanning you specifically or the entire Internet? When you see an IP attacking your network, are they attacking you specifically or the entire Internet?
  10. 10. Solution • Collect all the omnidirectional Internet-wide IPv4 scan/attack traffic • Subtract those IPs/activity from your SIEM • All the remaining activity is targeting you
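The "subtract" step on this slide is essentially a set difference. A minimal sketch, assuming you can export the source IPs seen by your SIEM and a list of known background-noise IPs as plain strings; the function name `targeted_sources` is hypothetical and not part of any GreyNoise tooling.

```python
import ipaddress
from typing import Iterable, Set


def _normalize(ip: str) -> str:
    """Canonicalize an IP string so formatting differences do not hide matches."""
    return str(ipaddress.ip_address(ip.strip()))


def targeted_sources(siem_ips: Iterable[str],
                     background_ips: Iterable[str]) -> Set[str]:
    """Return the SIEM source IPs that are NOT known Internet-wide scanners."""
    background = {_normalize(ip) for ip in background_ips}
    return {_normalize(ip) for ip in siem_ips} - background


if __name__ == "__main__":
    # Documentation-range addresses, used purely for illustration.
    seen = ["198.51.100.7", "203.0.113.9", "192.0.2.44"]
    noise = ["203.0.113.9"]
    print(sorted(targeted_sources(seen, noise)))
```

Whatever remains after the subtraction is the traffic worth a closer look.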
  11. 11. But how? • Stand up a large number of servers in diverse data centers with no business value • No business value means that ANY traffic that hits them is, by definition, opportunistic • Instrument these servers with extremely aggressive logging and small microservices • Stream the logs of the scan/attack traffic to a central place • Analyze the data and convert into a consumable format
  12. 12. Barriers • It is strategically cheaper to ask a question of the Internet than it is to answer a given question • How many computers are running X version of software is easy • How many computers are scanning for X version of software is hard
  13. 13. Byproducts • Observe changes in Internet-scanning over time • Opt-out of omnidirectional scanning altogether • Collect information on malware campaigns and botnets
  14. 14. History • Like three honeypots (2014) • Animus v1 (2015) • Bash and glue (SHMOOCON 2015 “No Budget Threat Intelligence”) • Related work at a previous company (2015-2016) • EPIPHANY (2016) • THE DATA THAT HONEYPOTS COLLECT IS SHITTY THREAT INTELLIGENCE • IT’S LITERALLY THE OPPOSITE OF THREAT INTELLIGENCE • IT’S ANTI THREAT INTELLIGENCE • Animus GOES COMMERCIAL (2017) • Turns out startups are hard • Grey Noise (2018) • I’m not going to stop until I die • ??? • Become a monk
  15. 15. Which leads me to…
  16. 16. GreyNoise • Read about it here: https://greynoise.io • API docs here: https://github.com/grey-noise-intelligence/api.greynoise.io • Visualizer here: http://viz.greynoise.io
  17. 17. ARCHITECTURE
  18. 18. Architecture • Collection • Orchestration • Data Producers / Services • Log Forwarder • Message Bus • Streamd • Analysis • Cache / Database • Enrichments • Analyticsd • Consumption • API • Front End • Operational Security
  19. 19. [Architecture diagram] AWS • DigitalOcean • Azure • RabbitMQ • Long Term Storage • Database • Cache • Analytics Server • Analytics Database • Web API
  20. 20. Step one: Stand up lots of servers in different regions of different cloud providers
  22. 22. Collection: Orchestration • Terraform • Open source tool on GitHub by Hashicorp • Supports lots of different cloud providers • AWS • DigitalOcean • Azure • Google Cloud • Etc
  23. 23. Collection: Orchestration (Lessons) • LESSON: Cloud-init • LESSON: NAT or nah • LESSON: Interface names • Eth0 • Ath1 • whatever
  24. 24. Collection: Data producers / Services • Ridiculously aggressive iptables rules • Log all packets • …on all ports • …on all protocols • SSH • Telnet • HTTP • Others
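To make "log all packets" usable downstream, each log line has to be turned into a structured event. A minimal sketch, assuming the kernel's usual `KEY=value` layout produced by the iptables `LOG` target; the sample line is made up for illustration and the talk does not show the actual rules or parser used.

```python
import re
from typing import Dict, Union

# Upper-case KEY=value pairs; the value may be empty (e.g. "OUT=").
_PAIR = re.compile(r"\b([A-Z]+)=(\S*)")
_INT_FIELDS = {"LEN", "TTL", "ID", "SPT", "DPT"}


def parse_iptables_log(line: str) -> Dict[str, Union[str, int]]:
    """Extract KEY=value fields from one iptables LOG line.

    The first occurrence of a key wins, so the IP-level LEN is kept
    when a transport-level LEN appears later in the same line.
    """
    event: Dict[str, Union[str, int]] = {}
    for key, value in _PAIR.findall(line):
        if key in event:
            continue
        event[key] = int(value) if key in _INT_FIELDS and value.isdigit() else value
    return event


if __name__ == "__main__":
    sample = ("Jan  1 00:00:00 node kernel: [123.456] IN=eth0 OUT= "
              "MAC=00:00:00:00:00:00 SRC=198.51.100.7 DST=203.0.113.5 "
              "LEN=40 TOS=0x00 PREC=0x00 TTL=241 ID=54321 PROTO=TCP "
              "SPT=40000 DPT=23 WINDOW=65535 RES=0x00 SYN URGP=0")
    print(parse_iptables_log(sample))
```

Bare flags such as `SYN` carry no `=` and are ignored by this sketch.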
  25. 25. Collection: Data producers / Services (Lessons) • MISTAKE: Tune your iptables / p0f / sniffers / whatever to ignore garbage / outbound traffic • LESSON: Things will be spoofed (TCP, UDP, and ICMP) • LESSON: Bang for your buck: Iptables, HTTP, Telnet, SSH, and P0f
  26. 26. Step two: Stream the data to a central place
  28. 28. Collection: Message Bus •RabbitMQ •Message Queue •Topic routing
  29. 29. Collection: Message Bus (Lessons) • MISTAKE: Google PubSub • LESSON: Maintain state • LESSON: Meta message envelope • Time • Provider • Region • Node UUID • POSSIBLE: ZeroMQ, Kafka • Streamd
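The "meta message envelope" lesson can be sketched as a thin wrapper around every event before it is published. The field names below mirror the slide's bullets (time, provider, region, node UUID); the function names and the use of JSON are assumptions made to keep the example standard-library only.

```python
import json
import time
import uuid
from typing import Any, Dict, Optional

# Hypothetical: a UUID generated once when the listener node boots.
NODE_UUID = str(uuid.uuid4())


def wrap_event(payload: Dict[str, Any], provider: str, region: str,
               node_uuid: str = NODE_UUID,
               now: Optional[float] = None) -> Dict[str, Any]:
    """Wrap a raw event in an envelope describing where and when it was seen."""
    return {
        "time": time.time() if now is None else now,
        "provider": provider,
        "region": region,
        "node_uuid": node_uuid,
        "payload": payload,
    }


def to_message(envelope: Dict[str, Any]) -> bytes:
    """Serialize the envelope into bytes suitable for a message queue body."""
    return json.dumps(envelope, sort_keys=True).encode("utf-8")


if __name__ == "__main__":
    msg = to_message(wrap_event({"SRC": "198.51.100.7", "DPT": 23},
                                provider="aws", region="us-east-1"))
    print(msg.decode("utf-8"))
```

Carrying the node identity in every message is what later lets the analytics count how many distinct nodes saw a given IP.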
  30. 30. Collection: Log Forwarder •I wrote my own •Python + Pygtail / iNotify / Watchdog •Can also use something that’s already been written •Logstash •Elasticsearch Filebeat •Rsyslog
  32. 32. Step three: Put the data in a database
  33. 33. Analysis: Cache / Database • PostgreSQL • N days of data, rotates • Fast-ish • Robust • Dumpster • Long term storage • You’re going to fuck something up • Retro load is your friend
  34. 34. Analysis: Cache / Database • MISTAKE: Postgres is awesome but too slow for data this big • MISTAKE: Google BigQuery is the shit but it gets expensive if you're doing batch queries on a very short timeline • LESSON: Postgres + Cassandra is the truth
  36. 36. Step four: Enrich the IP data
  37. 37. Analysis: Enrichments • We need: • ASN • rDNS • Organization • Country • City • Maxmind is expensive • Neustar is expensive • Ipinfo is CHEAP • Harvesting it yourself is also CHEAP but requires a lot of effort
  38. 38. Analysis: Enrichments (Lessons) • MISTAKE: Collecting the data yourself is hard and inconsistent and involves a lot of work • LESSON: ARIN has an unauthenticated non rate-limited public API for IP ownership • LESSON: Enrichd • LESSON: Cache rules everything around me
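The caching lesson matters because the same scanner IPs show up over and over, and every uncached lookup costs time or money. A minimal sketch of a caching wrapper, assuming the real lookup (ARIN, ipinfo, or anything else) is passed in as a function; the class name `CachedEnricher` and the returned fields are hypothetical.

```python
from typing import Callable, Dict

Enrichment = Dict[str, str]


class CachedEnricher:
    """Memoize per-IP enrichment so each address is looked up only once."""

    def __init__(self, lookup: Callable[[str], Enrichment]) -> None:
        self._lookup = lookup
        self._cache: Dict[str, Enrichment] = {}
        self.misses = 0  # number of times the underlying lookup was called

    def enrich(self, ip: str) -> Enrichment:
        if ip not in self._cache:
            self.misses += 1
            self._cache[ip] = self._lookup(ip)
        return self._cache[ip]


if __name__ == "__main__":
    def fake_lookup(ip: str) -> Enrichment:
        # Stand-in for a real enrichment source; the values are placeholders.
        return {"ip": ip, "organization": "EXAMPLE-ORG", "country": "ZZ"}

    enricher = CachedEnricher(fake_lookup)
    for address in ["198.51.100.7", "198.51.100.7", "203.0.113.9"]:
        enricher.enrich(address)
    print(f"lookups performed: {enricher.misses}")
```

A production version would also want expiry, since IP ownership changes over time.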
  39. 39. Analysis: Enrichments
  40. 40. Step five: Analyze and categorize/tag the data
  42. 42. Analysis: Analyticsd • Service to analyze some time window of data • E.g. past 4 days of data • Catalogue: • Actors • Shodan • Censys • Sonar • Activity • Scanning for SSH • Scanning for Telnet • LESSON: YOU PROBABLY DON’T NEED REAL TIME ANALYTICS • Batch analytics with small time frames • This is why Postgres will often do the trick • LESSON: Only pay attention to activity that has happened on more than one of your nodes • LESSON: You need to know how many nodes are up collecting data at any point in time to properly do a time-series analysis
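The last two lessons on this slide can be sketched in a few lines: keep only IPs observed on more than one node, and normalize counts by the number of nodes that were up. This assumes events arrive as `(source_ip, node_uuid)` pairs; the function names are illustrative and not taken from Analyticsd.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def ips_seen_on_multiple_nodes(events: Iterable[Tuple[str, str]],
                               min_nodes: int = 2) -> Set[str]:
    """Return source IPs observed on at least `min_nodes` distinct nodes."""
    nodes_by_ip: Dict[str, Set[str]] = defaultdict(set)
    for source_ip, node_uuid in events:
        nodes_by_ip[source_ip].add(node_uuid)
    return {ip for ip, nodes in nodes_by_ip.items() if len(nodes) >= min_nodes}


def events_per_node(event_count: int, nodes_up: int) -> float:
    """Normalize a raw count by the number of nodes collecting at that time."""
    if nodes_up <= 0:
        raise ValueError("nodes_up must be positive")
    return event_count / nodes_up


if __name__ == "__main__":
    window = [
        ("198.51.100.7", "node-a"),
        ("198.51.100.7", "node-b"),
        ("203.0.113.9", "node-a"),
        ("203.0.113.9", "node-a"),
    ]
    print(ips_seen_on_multiple_nodes(window))
    print(events_per_node(len(window), nodes_up=2))
```

Without the normalization, a day on which half the nodes were down would look like a drop in scanning rather than a drop in coverage.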
  44. 44. Step six: Make the data available
  45. 45. Consumption: API • Web API • Tell me about this IP address • Tell me about this analytic • Github • Search “Grey Noise API” • Github.com/Grey-Noise-Intelligence
  46. 46. Consumption: Bindings • Bobby Filar: phyler/greynoise • Tek: PyGreyNoise • Bob Rudis: R bindings • Some mystery Go bindings out there
  47. 47. Consumption: FRONT END • Complete 100% credit to Casey Buto (github.com/cbuto) • Point and click interface • Hosted version at viz.greynoise.io • EXPLORE THE DATA
  48. 48. http://viz.greynoise.io
  50. 50. OpSec (Operational Security) • Hard to fingerprint (mostly custom services) • Encrypt everything • No names • Ops domains • Dockerize • Shift infrastructure constantly • Reduce the oracle surface • IO is hard to opsec • Minimum number / node thresholds • Sleep delays
  51. 51. Cost • AWS: 15 regions • $4.75 per box • Total: $71 • DigitalOcean: 11 regions • $5 per box • Total: $55 • Google: 36 regions • $4.28 per box • Total: $154 • Vultr: 15 regions • $5 per box (they advertise $2.50 but they're never available) • Total: $75 • Linode: 9 regions • $5 per box • Total: $45 • Total: $400 per month
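The per-provider figures on this slide do add up to the quoted $400 per month. The sketch below reproduces the arithmetic, assuming one box per region (an inference from the per-provider totals, not something the slide states outright).

```python
from typing import Dict, Tuple

# provider -> (regions, USD per box per month), copied from the slide.
PROVIDERS: Dict[str, Tuple[int, float]] = {
    "AWS": (15, 4.75),
    "DigitalOcean": (11, 5.00),
    "Google": (36, 4.28),
    "Vultr": (15, 5.00),
    "Linode": (9, 5.00),
}


def monthly_cost(providers: Dict[str, Tuple[int, float]] = PROVIDERS) -> float:
    """Total monthly cost, assuming one box per region."""
    return sum(regions * price for regions, price in providers.values())


def total_boxes(providers: Dict[str, Tuple[int, float]] = PROVIDERS) -> int:
    """Number of listener boxes, assuming one box per region."""
    return sum(regions for regions, _ in providers.values())


if __name__ == "__main__":
    for name, (regions, price) in PROVIDERS.items():
        print(f"{name}: {regions} x ${price:.2f} = ${regions * price:.2f}")
    print(f"{total_boxes()} boxes, ${monthly_cost():.2f} per month")
```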
  52. 52. Cost (notes) • Notes: • No Ops boxes in here (you need these) • This is simply not enough to have complete coverage but it'll give you a good start • You can save money by buying extra IPs, but it complicates engineering
  53. 53. ANALYSIS
  54. 54. Analysis • What am I collecting? • Volume Summary • Data Summary • Actor Summary • Benign • Malicious • Unknown??? • Malware Summary • Hall of Shame (Malware-iest regions of the Internet) • WEIRD SHIT • Misc Lessons
  55. 55. What am I collecting? • Passive • Iptables – Packets on ports • P0f – passive OS fingerprint • JA3 – SSL fingerprint (stick around!) • Active • HTTP • SSH • Telnet • Experimental • RDP • SIP • SMTP • NTP • TFTP • DNS
  56. 56. Data Summary • Iptables: • I don’t have a good way to quantify this yet • HTTP: • Lots of “/”, spoofed user agents, search engines, people looking for JBoss/WordPress/Tomcat/phpMyAdmin • SSH + Telnet • Bots. Default credential attempts. Nothing new here. • P0f • Lots of OS visibility
  57. 57. Volume Summary • With the aforementioned numbers ($400 worth of servers): • 1M – 2M iptables events per day • 700k – 1M SSH logins per day • 1M – 10M telnet logins per day • 10K – 100K HTTP requests per day • 100-200 messages per second through your queue • ~60K IPs per day • 1GB of raw data, msgpacked + compressed per day
  58. 58. Actor Summary • Benign: • Shodan: 27 IPs • Censys: 334 IPs • Sonar: 56 IPs • ShadowServer: 228 IPs • IPIP: 63 IPs • BinaryEdge: 253 IPs • PDRLabs: 25 IPs • Pingdom: 9 IPs • ProbeTheNet: 1 IP • NetCraft: 145 IPs • Others • Malicious • Mirai: 249k IPs • SSH Worms: 92k IPs • Popped Routers / residential IPs attacking people: 590k IPs
  59. 59. Hall of Shame
  60. 60. Hall of Shame (cloud)
  61. 61. WEIRD SHIT
  62. 62. Pretenders • Machines advertising client banners that are false • Mismatches between user agent, p0f OS fingerprint, and JA3 • Is the browser hitting this HTTP server really running Safari on a Linux kernel 3.1 box? Is it? • Why? Idk
  63. 63. Dangling DNS • When you spin a bunch of IPs up and down, it’s not uncommon to inherit an IP address from your cloud provider that still has a domain pointing to it. • CDN.whatever123.acme.com • This traffic is dirty, you don’t want it
  64. 64. “WORM FINDER” • Sometimes when Grey Noise observes an IP address scanning for a given TCP port, I’ll turn around and check to see if that port is open on the source machine. • If the answer is yes, this can be a great indicator of a worm • Why else would a computer search for behavior that it also exhibits? • Average lifespan from start to finish is 4 days
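The check described on this slide amounts to a TCP connect back to the source on the port it was scanning for. A minimal sketch using only the standard library; the talk does not say how the check is actually performed, so the function name, the timeout, and the injectable `connect` parameter (there to allow testing without touching the network) are all assumptions.

```python
import socket
from typing import Any, Callable


def port_open_on_source(ip: str, port: int, timeout: float = 3.0,
                        connect: Callable[..., Any] = socket.create_connection) -> bool:
    """Return True if a TCP connection to (ip, port) succeeds within `timeout`."""
    try:
        conn = connect((ip, port), timeout)
    except OSError:
        # Refused, unreachable, or timed out: treat the port as closed.
        return False
    conn.close()
    return True


def looks_like_worm(scanned_port: int, source_ip: str, **kwargs: Any) -> bool:
    """Heuristic from the slide: the source exposes the same port it scans for."""
    return port_open_on_source(source_ip, scanned_port, **kwargs)
```

A positive result is an indicator and not proof, since a host can legitimately run a service on the port it also scans for.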
  65. 65. ZMap’s hardcoded IP ID • ZMap sets the IP ID field of every packet it creates to “54321”, making it trivial to fingerprint • Go to “github.com/zmap/zmap” and search / grep the repository for “54321” • Shoutout Oliver Gasser @ Technical University of Munich
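Checking for this fingerprint only requires reading the 16-bit identification field at bytes 4 and 5 of the IPv4 header. A minimal sketch operating on raw IPv4 packet bytes; the function names are illustrative, and a match is only a hint, because the value can be changed by the operator and will occasionally occur by chance.

```python
import struct

ZMAP_IP_ID = 54321  # value referenced on the slide


def ip_id(packet: bytes) -> int:
    """Return the identification field of a raw IPv4 packet."""
    if len(packet) < 20 or packet[0] >> 4 != 4:
        raise ValueError("not an IPv4 packet")
    # The identification field is a big-endian unsigned short at offset 4.
    return struct.unpack_from("!H", packet, 4)[0]


def looks_like_zmap(packet: bytes) -> bool:
    """True when the packet carries the IP ID value associated with ZMap."""
    return ip_id(packet) == ZMAP_IP_ID


if __name__ == "__main__":
    # Hand-built 20-byte IPv4 header using documentation-range addresses.
    header = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40, 54321, 0, 64, 6, 0,
                         bytes([198, 51, 100, 7]), bytes([203, 0, 113, 5]))
    print(looks_like_zmap(header))
```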
  66. 66. Still SO MANY WINDOWS WORMS • LOADS of people blasting SMB traffic on TCP port 445 • More and more RDP worms as well, but these aren’t exploiting vulns, just guessing creds • WinRM is next, in my opinion
  67. 67. People do weird stuff through proxies • Airline price scraping data (???) • Also testing stolen credentials • And probably credit card numbers • News sites??? This is a huge rabbit hole…
  68. 68. Lots of robo calls probably come from popped SIP boxes • People try to make calls to India and Russia through open VOIP servers • Like, LOTS of them • Tens of thousands per day
  69. 69. The things people scan for through Tor are interesting
  70. 70. You can neuter/blow up worms by replaying their own traffic back to them • A box is compromised with a Telnet worm • The worm carries a built in wordlist • The compromised box throws the same wordlist at you • You replay the wordlist back to the compromised box • Chances are, depending on the worm, one of those credentials will work
  71. 71. ROADMAP
  72. 72. What does the future hold? • Version 1.1 API coming very soon • Integrate with everything • Badass machine learning opportunities • Explore identifying anti-threat intelligence in other areas • Intranet traffic • DMZ traffic • Files on a filesystem
  73. 73. CONCLUSION
  74. 74. Conclusion • The Internet is a noisy place • Every packet has a story • It’s possible to collect all of this background noise • If you want to explore the data, hit the API. If the API doesn’t give you what you need, email me or hit me up on Twitter
  75. 75. Acknowledgements • Phil Maddox (twitter.com/foospidy) • Bobby Filar (twitter.com/filar) • Rich Seymour (twitter.com/rseymour) • Casey Buto (github.com/cbuto) • Bob Rudis (twitter.com/hrbrmstr) • Tek (twitter.com/tenacioustek) • Mickey Perre (twitter.com/MickeyPerre) • Michel Oosterhof (twitter.com/micheloosterhof)
  76. 76. QUESTIONS?
  77. 77. THANK YOU! andrew@morris.sc @andrew___morris