Leveraging the public internet

Tonimir Kisasondi, mag.inf., EUCIP
$whois tkisason
• Junior researcher @ www.foi.hr
• Head of Open Systems and Security lab
• Likes to build and break things
• firstname.lastname@example.org
• skype:tkisason
What happens when you digitize the whole world?
• Google, Facebook, Twitter
• Is it a bubble or a valid business model?
• The new buzzword is big data
• Storage per capita doubles every three years
• Kryder's law says that storage density doubles every 18 months
• Can you really store the whole world?
What happens when you digitize the whole world?
• Storing 20 Tbps of traffic
• Map/Reduce-like infrastructure to mine and combine data
• Why is this interesting to us now?
  o Storage is cheap
  o Big data is useful everywhere
  o Use the tricks that intel agencies use to enable cool stuff
  o It's not rocket science...
  o Yes, the most interesting applications are in cross-disciplinary fields
First: OSINT
• OSINT: Open Source Intelligence
  o Finding, selecting and acquiring information from open, publicly available sources like newspapers, the internet, books, social networks (Twitter)...
  o Various registries (company registries, open postings, public listings)
  o Metadata
  o Mine those, and you might find a lot of interesting stuff
  o White zone – legal and ethical
  o Black zone – illegal and unethical
  o Gray zone – legal but unethical
First: OSINT
• Not everything is OSINT, but you can actually glean interesting data from almost anything
• Mining machine data worked so well for the guys behind Splunk that they decided to write Splunk
• It works for data mining folks
Data analysis 101
• Data is just data; you have to correlate it or put it in context for it to be useful
  o Find outliers
  o Spot differences
  o Find common attributes
  o Find connections, not answers
  o First identify, then try to interpret
  o Put data into perspective, seek help :)
  o "Data driven design"
• A nice showcase of data driven design:
  o A/B testing
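"Find outliers" from the list above can be this simple: a minimal sketch, with hypothetical data, that flags values more than two standard deviations from the mean using only the standard library.

```python
# Minimal outlier-spotting sketch (hypothetical response times in ms):
# flag anything more than k standard deviations away from the mean.
from statistics import mean, stdev

def outliers(values, k=2):
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) > k * s]

response_times = [120, 130, 125, 118, 122, 950, 127]
print(outliers(response_times))  # the 950 ms request stands out: [950]
```

No scipy, no pandas: for a first look at a dataset, this level of statistics is usually enough to know where to dig.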
Do I need advanced statistics?
• Most of the time: no
• Are statistics awesome? Yup
• Still, don't play with things where you can get hurt :)
• Seek professional help
• grep, Google Refine/Mojo facets, and your favorite scripting languages are just fine...
How can we approach the problem?
• There are many (finished) tools; if they help, great
• Roll your own script
• Duct-tape some finished libraries together
• Most of the time it takes less time than finding a tool
• Cheating and stealing is encouraged ;)
Bad design 101
• If you hack it together, watch out for some gotchas
• Line-per-line analysis
  o Minimal complexity: O(n)
• You can easily kill the speed of your script/parser/*
• The best separator is \t
• .split() is a godsend
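The last two bullets in one picture: a line-per-line pass over a hypothetical tab-separated log, using .split("\t") instead of a regex to pull out fields.

```python
# One O(n) pass over tab-separated log lines (hypothetical format:
# timestamp \t status \t url) using .split() instead of a regex.
log_lines = [
    "2012-05-01 12:00:01\tDENIED\thttp://example.org/a",
    "2012-05-01 12:00:02\tOK\thttp://example.org/b",
]

denied = []
for line in log_lines:
    timestamp, status, url = line.split("\t")
    if status == "DENIED":
        denied.append(url)
print(denied)
```

The next three slides show why this matters: the same filter done three ways, with very different runtimes.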
ignorecase?

#!/usr/bin/python
import re
a = open("access.log")
b = open("test.log", "w")
for line in a:
    if re.search("DENIED", line, re.IGNORECASE):
        b.write(line)
b.close()

$ time ./re-search.py
real    0m4.516s
user    0m4.444s
sys     0m0.056s
simple RE

#!/usr/bin/python
import re
a = open("access.log")
b = open("test.log", "w")
for line in a:
    if re.search("DENIED", line):
        b.write(line)
b.close()

$ time ./re-search.py
real    0m2.520s
user    0m2.456s
sys     0m0.056s
find

#!/usr/bin/python
a = open("access.log")
b = open("test.log", "w")
for line in a:
    c = line.find("DENIED")
    if c >= 0:
        b.write(line)
b.close()

$ time ./testparse.py
real    0m0.781s
user    0m0.728s
sys     0m0.044s
To sum it up...

Python RE ignorecase : 4.516s
Python RE            : 2.520s
Python find          : 0.781s
grep                 : 0.074s
Primer on useful and interesting tools
• ipython
  o http://ipython.org/
• python-nltk
  o http://nltk.org/ (nltk.clean_html(messy_html))
• python-requests
  o www.python-requests.org
• python-graphviz
  o http://code.google.com/p/pydot/
• python-google by Mario Vilas
  o https://github.com/MarioVilas
So, how about a short showcase of some things I did
• Yeah, they are lame and simple
• Works for me
• Available on GitHub
• Hope they can motivate you to do some fun and simple "one afternoon" stuff
• Most of the "hard" stuff is easy once you try to hack it together
mkwordlist - https://github.com/tkisason/gcrack
• Idea: create wordlists from Google results for a set of keywords
• For a keyword, return the top 5 links (or N)
• Scrape and clean with NLTK
• Optional lowercasing for future mutations
  o You can use JtR/Hashcat with a ruleset to mutate the lists
• Result: a nice targeted wordlist generator
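The wordlist-building step can be sketched like this, assuming the pages have already been fetched and stripped down to plain text (e.g. with nltk.clean_html): tokenize, optionally lowercase, and deduplicate while keeping first-seen order. This is an illustration, not the actual mkwordlist code.

```python
# Turn scraped plain text into a deduplicated wordlist.
# Input text here is made-up sample data.
import re

def mkwordlist(text, lowercase=True):
    words = re.findall(r"[A-Za-z0-9]+", text)   # crude tokenizer
    if lowercase:
        words = [w.lower() for w in words]      # normalize for mutation rules
    seen, wordlist = set(), []
    for w in words:
        if w not in seen:                       # dedupe, keep order
            seen.add(w)
            wordlist.append(w)
    return wordlist

print(mkwordlist("FOI lab, FOI security, open systems"))
# → ['foi', 'lab', 'security', 'open', 'systems']
```

Feed the result to JtR or Hashcat with a mutation ruleset and you have the targeted wordlist the slide describes.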
mkwordlist - https://github.com/tkisason/gcrack
• Some other cool things
  o Keywords can be Google dorks
    § site:.bg
    § filetype:txt
    § “”
• Interesting results for targeted attacks
• Broad keywords are also OK
  o If you are pentesting a company or similar
gcrack - https://github.com/tkisason/gcrack
• Idea: most weak password hashes are already cracked and leaked on the public internet
• Google indexes the pages, and the content of these pages contains the plaintext
• Use Google searches for password cracking
• Create a bag of words as a wordlist
• Result: a very effective and fast hash cracker
• Bonus: hash agnostic
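The verification step at the end of that pipeline is trivial: once you have a bag of candidate words scraped from the search results (a made-up list below), hash each one and compare against the target. "Hash agnostic" here just means the digest algorithm is a parameter. A minimal sketch, not the actual gcrack code:

```python
# Check scraped candidate plaintexts against a target hex digest.
# The algorithm name is anything hashlib.new() accepts (md5, sha1, ...).
import hashlib

def crack(target_hex, candidates, algo="md5"):
    for word in candidates:
        if hashlib.new(algo, word.encode()).hexdigest() == target_hex:
            return word
    return None

bag = ["letmein", "password", "secret"]          # hypothetical scraped words
target = hashlib.md5(b"password").hexdigest()
print(crack(target, bag))                        # → password
```

The hard part gcrack automates is building a good bag of words from Google; the comparison loop itself never changes.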
logtool - https://github.com/tkisason/logtool
• Log files are interesting...ish
• Especially if you have a compromised machine and the attackers were noobish enough to leave the log files
• What can you learn:
  o IP addresses (known proxies and Tor exit points)
  o Usernames (are they generic or specific?)
  o IP-GeoIP data
  o Toolmarks (user agents, wordlists used for attacks)
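Pulling the IP addresses out is a one-regex job. A minimal sketch with made-up auth-log lines (the pattern is naive: it will happily match nonsense like 999.1.1.1, which is fine for a quick first pass):

```python
# Extract and rank IPv4 addresses seen in log lines.
import re
from collections import Counter

IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def top_ips(lines, n=5):
    hits = Counter()
    for line in lines:
        hits.update(IP_RE.findall(line))
    return hits.most_common(n)

log = [
    "Failed password for root from 10.0.0.5 port 22",
    "Failed password for admin from 10.0.0.5 port 22",
    "Accepted password for user from 192.168.1.7 port 22",
]
print(top_ips(log))  # → [('10.0.0.5', 2), ('192.168.1.7', 1)]
```

From there, the ranked list can be matched against known proxy/Tor exit lists or fed to a GeoIP lookup.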
linkcrawl and nltk - https://github.com/tkisason/linkcrawl
• Building a simple crawler is easy (or use wget and cURL, man up and write some shell scripts)
• NLTK is awesome!
  o import nltk; nltk.clean_html(data)
• http://orange.biolab.si is also a nice platform
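"A simple crawler is easy" mostly means: fetch a page, pull out its links, repeat. The link-extraction half can be done with the standard library alone; here is a stdlib-only sketch on an inline sample page (real fetching would use requests, wget, or cURL as above):

```python
# Extract href targets from <a> tags using only the standard library.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

page = '<html><body><a href="http://foi.hr">FOI</a> <a href="/lab">lab</a></body></html>'
p = LinkExtractor()
p.feed(page)
print(p.links)  # → ['http://foi.hr', '/lab']
```

Queue the extracted links, fetch each one, and you have the whole crawler; cleaning the fetched pages to plain text is then NLTK's job.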
conclusion
• Well, just have fun
• Problems are all around you, try to solve some :)