The document discusses using publicly available internet data and open source intelligence (OSINT) for data analysis. It describes how digitizing information creates opportunities to mine "big data" with techniques borrowed from intelligence agencies. Specific OSINT tools and techniques are presented, including using search engine results to build wordlists and crack password hashes. It also provides examples of simple Python scripts and libraries for data analysis, visualization and website crawling, and encourages experimenting with publicly available data to gain insights and solve problems.
2. $whois tkisason
• Junior researcher @ www.foi.hr
• Head of Open Systems and Security lab
• Likes to build and break things
• tonimir.kisasondi@foi.hr
• skype:tkisason
3. What happens when you digitize
the whole world?
• Google, Facebook, Twitter
• Is it a bubble or a valid business model?
• The new buzzword is big data
• Storage per capita doubles every three
years
• Kryder's law says that storage density
doubles every 18 months
• Can you really store the whole world?
5. What happens when you digitize
the whole world?
• Storing 20 Tbps traffic
• Map/Reduce like infrastructure to mine and
combine data
• Why is this interesting to us now?
o Storage is cheap
o Big data is useful everywhere
o Use tricks that intel agencies use to enable cool stuff
o It’s not rocket science...
o Yes, the most interesting applications are in
cross-disciplinary fields
6. First: OSINT
• OSINT: Open Source Intelligence
o Finding, selecting and acquiring information from
open, publicly available sources like newspapers,
books, the internet, social networks (Twitter)...
o Various registries (firm, open postings, public listing)
o Metadata
o Mine those, and you might find a lot of interesting
stuff
o White zone – Legal and ethical
o Black zone – Illegal and Unethical
o Gray zone – Legal but unethical
7. First: OSINT
• Not everything is OSINT, but you can
actually glean interesting data from almost
anything
• It worked for the guys that wrote Splunk, so
they decided to write Splunk.
• It works for data mining folks.
8. Data analysis 101
• Data is just data, you have to correlate it or
put it in context for it to be useful
o Find outliers
o Spot differences
o Find common attributes
o Find connections, not answers
o First identify, then try to interpret
o Put data into perspective, seek help :)
o "Data driven design"
• A nice showcase of data driven design:
o A/B Testing
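To make "find outliers" concrete, here is a minimal sketch, not something from the original slides. It flags values that sit far from the median using the modified z-score (based on the median absolute deviation). The function name, the 3.5 threshold and the sample numbers are all assumptions chosen for illustration.

```python
from statistics import median


def find_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold."""
    med = median(values)
    # Median absolute deviation: a spread measure that outliers barely move.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no measurable spread, so nothing can be ranked
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


# Made-up sample: requests per minute with one suspicious spike.
requests_per_minute = [12, 15, 11, 14, 13, 12, 250, 16]
print(find_outliers(requests_per_minute))  # [250]
```

This only identifies the odd value. Interpreting it (attack, cron job, broken client) is the next step, in line with "first identify, then try to interpret".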
9. Do I need advanced statistics?
• Most of the time: No
• Are statistics awesome? Yup
• Well, don’t play with things where you can
get hurt. :)
• Seek professional help
• Grep, Google Refine/Mojo facets, and your
favorite scripting languages are just fine...
10. How can we approach the problem
• There are many (finished) tools, if they help,
great
• Roll your own script
• Duct tape some finished libraries
• Most of the time it takes less time than finding a
tool.
• Cheating and stealing is encouraged. ;)
12. Bad design 101
• If you hack it together, watch out for some
gotchas
• Line per line analysis
o Minimal complexity O(n)
• You can easily kill the speed of your script/
parser/*
• Best separator is \t (tab)
• .split() is a godsend
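A minimal sketch of the line-per-line approach with a tab separator and .split(). The field layout (IP, method, path, status) and the helper names are assumptions for illustration, not the format of any particular log.

```python
def parse_line(line):
    """Split one tab-separated line into its fields."""
    return line.rstrip("\n").split("\t")


def count_field(lines, index):
    """Count how often each value appears in column `index`, in one pass."""
    counts = {}
    for line in lines:  # each line is visited once: O(n)
        fields = parse_line(line)
        if index < len(fields):
            key = fields[index]
            counts[key] = counts.get(key, 0) + 1
    return counts


# Hypothetical sample lines: ip, method, path, status.
sample = [
    "10.0.0.1\tGET\t/index.html\t200\n",
    "10.0.0.2\tGET\t/admin\t403\n",
    "10.0.0.1\tPOST\t/login\t200\n",
]
print(count_field(sample, 3))  # {'200': 2, '403': 1}
```

Because `count_field` accepts any iterable of lines, an open file object can be passed in directly and the whole file never has to be loaded into memory.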
13. ignorecase?
#!/usr/bin/python
import re

a = open("access.log")
b = open("test.log", "w")
# Case-insensitive regex search on every line (slowest variant)
for line in a:
    if re.search("DENIED", line, re.IGNORECASE):
        b.write(line)
b.close()
$ time ./re-search.py
real 0m4.516s
user 0m4.444s
sys 0m0.056s
14. simple RE
#!/usr/bin/python
import re

a = open("access.log")
b = open("test.log", "w")
# Case-sensitive regex search on every line
for line in a:
    if re.search("DENIED", line):
        b.write(line)
b.close()
$ time ./re-search.py
real 0m2.520s
user 0m2.456s
sys 0m0.056s
15. find
#!/usr/bin/python
a = open("access.log")
b = open("test.log", "w")
# Plain substring search, no regex engine involved
for line in a:
    c = line.find("DENIED")
    if c >= 0:
        b.write(line)
b.close()
$ time ./testparse.py
real 0m0.781s
user 0m0.728s
sys 0m0.044s
16. grep
$ time grep DENIED access.log > test
real 0m0.074s
user 0m0.040s
sys 0m0.032s
17. To sum it up...
Python RE ignorecase : 4.516s
Python RE : 2.520s
Python find : 0.781s
grep : 0.074s
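The comparison above can be reproduced without an access.log by timing the same matching strategies on synthetic lines held in memory. This is a rough sketch under assumptions: the line contents and counts are made up, the regexes are precompiled (the slides call re.search with a string pattern), and the `in` operator is added as one more variant. Absolute numbers will differ from the slides, so no timings are claimed here.

```python
import re
import timeit

# Synthetic stand-in for access.log: every tenth line contains DENIED.
LINES = [
    ("TCP_DENIED/403" if i % 10 == 0 else "TCP_MISS/200")
    + " GET http://example.com/%d" % i
    for i in range(20000)
]

PATTERN_IC = re.compile("DENIED", re.IGNORECASE)
PATTERN = re.compile("DENIED")


def count_re_ignorecase(lines):
    return sum(1 for line in lines if PATTERN_IC.search(line))


def count_re(lines):
    return sum(1 for line in lines if PATTERN.search(line))


def count_find(lines):
    return sum(1 for line in lines if line.find("DENIED") >= 0)


def count_in(lines):
    return sum(1 for line in lines if "DENIED" in line)


for func in (count_re_ignorecase, count_re, count_find, count_in):
    seconds = timeit.timeit(lambda: func(LINES), number=5)
    print("%-20s %.3fs" % (func.__name__, seconds))
```

All four functions return the same count, so the only thing that varies is how much machinery runs per line.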
18. Primer on useful and interesting
tools
• ipython
o http://ipython.org/
• python-nltk
o http://nltk.org/ (nltk.clean_html(messy_html))
• python-requests
o www.python-requests.org
• python-graphviz
o http://code.google.com/p/pydot/
• python-google by Mario Vilas
o https://github.com/MarioVilas
21. So, how about a short showcase of
some things I did
• Yeah, they are lame, and simple
• Works for me
• Available on github
• Hope they can motivate you to do some fun
and simple “one afternoon” stuff
• Most of the “hard” stuff is easy once you try
to hack it together
22. mkwordlist -
https://github.com/tkisason/gcrack
• Idea: Create wordlists with google results for
a set of keywords
• For a keyword return top 5 links (or N)
• Scrape and clean with NLTK
• Optional lowercasing for future mutations
o You can use JtR/HashCat with a ruleset to mutate
the lists
• Result: Nice targeted wordlist generator
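The scrape, clean and lowercase steps can be sketched with the standard library alone. This is not the mkwordlist code from the repository, just an illustration of the idea. It assumes the HTML has already been downloaded, and it uses html.parser instead of nltk.clean_html, which newer NLTK releases no longer implement. The class name, function name and `min_len` parameter are hypothetical.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> bodies."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_words(html, lowercase=True, min_len=4):
    """Turn one HTML page into a de-duplicated wordlist (order preserved)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    if lowercase:
        text = text.lower()
    seen = set()
    words = []
    for word in re.findall(r"[^\W\d_]+", text):  # runs of letters only
        if len(word) >= min_len and word not in seen:
            seen.add(word)
            words.append(word)
    return words


# Made-up page standing in for one search result.
page = ("<html><head><style>p{color:red}</style></head><body>"
        "<h1>Acme Corp</h1><p>Welcome to Acme, home of Roadrunner traps.</p>"
        "<script>var secret = 1;</script></body></html>")
print(html_to_words(page))
```

The resulting list can be written out one word per line and handed to JtR or HashCat rules for mutation.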
23. mkwordlist -
https://github.com/tkisason/gcrack
• Some other cool things
o Keywords can be google dorks
▪ site:.bg
▪ filetype:txt
▪ “”
• Interesting results for targeted attacks
• Broad keywords are also ok
o If you are pentesting a company or similar
26. gcrack -
https://github.com/tkisason/gcrack
• Idea: Most of the weak password hashes are
cracked and leaked on the public internet
• Google indexes the pages, and the content
of these pages contains the plaintext
• Use google searches for password cracking
• Create bag of words as a wordlist
• Result: Very effective and fast hash cracker
• Bonus: hash agnostic
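A minimal sketch of the bag-of-words idea, assuming the text of the search result pages has already been fetched. It is not the gcrack implementation. Every token on the page is hashed with each algorithm whose digest length matches the target, which is one way to be hash agnostic. The function names, the algorithm list and the sample page are assumptions.

```python
import hashlib
import re

ALGORITHMS = ("md5", "sha1", "sha256", "sha512")


def bag_of_words(text):
    """Split page text into unique candidate tokens."""
    return set(re.findall(r"[^\s:;,|]+", text))


def crack(target_hex, words, algorithms=ALGORITHMS):
    """Return (algorithm, plaintext) if some word hashes to the target."""
    target = target_hex.strip().lower()
    for name in algorithms:
        # Skip algorithms whose hex digest length cannot match the target.
        if hashlib.new(name).digest_size * 2 != len(target):
            continue
        for word in words:
            digest = hashlib.new(name, word.encode("utf-8")).hexdigest()
            if digest == target:
                return name, word
    return None


# Made-up page text in the "hash:plaintext" style of leaked dumps.
page_text = "leaked dump 5d41402abc4b2a76b9719d911017c592:hello more text"
print(crack("5d41402abc4b2a76b9719d911017c592", bag_of_words(page_text)))
# ('md5', 'hello')
```

The search step is deliberately left out. Any text source works as input, which is why the approach does not care where the plaintext came from.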
27. logtool
https://github.com/tkisason/logtool
• Log files are interesting...ish
• Especially if you have a compromised
machine and the attackers were noobish
enough to leave the log files
• What can you learn:
o IP addresses (known proxies and Tor exit points)
o Usernames (are they generic or are they specific)
o IP-GeoIP data
o Toolmarks (user agents, wordlists for attacks)
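Pulling these items out of a web server log can be sketched in a few lines. This is an illustration rather than the logtool code. It assumes lines in the common "combined" access log format, with the user name in the third field and the user agent as the last quoted field. The sample lines are invented and use documentation IP ranges.

```python
import re
from collections import Counter

IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
UA_RE = re.compile(r'"([^"]*)"\s*$')  # last quoted field on the line


def summarize(lines):
    """Count IPv4-looking tokens, user names and user agents."""
    ips, users, agents = Counter(), Counter(), Counter()
    for line in lines:
        ips.update(IP_RE.findall(line))
        fields = line.split()
        if len(fields) > 2 and fields[2] != "-":
            users[fields[2]] += 1
        match = UA_RE.search(line)
        if match:
            agents[match.group(1)] += 1
    return ips, users, agents


# Invented sample lines.
SAMPLE = [
    '203.0.113.5 - admin [10/Oct/2012:13:55:36 +0200] "GET /admin HTTP/1.1" 404 512 "-" "sqlmap/1.0"',
    '203.0.113.5 - root [10/Oct/2012:13:55:37 +0200] "GET /login HTTP/1.1" 200 1024 "-" "sqlmap/1.0"',
    '198.51.100.7 - - [10/Oct/2012:13:56:01 +0200] "GET / HTTP/1.1" 200 2048 "-" "Mozilla/5.0"',
]

ips, users, agents = summarize(SAMPLE)
print(ips.most_common(1))     # [('203.0.113.5', 2)]
print(agents.most_common(1))  # [('sqlmap/1.0', 2)]
```

The counted IPs could then be checked against proxy or Tor exit lists and a GeoIP database. Those lookups need external data and are left out here.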
28. linkcrawl and nltk
https://github.com/tkisason/linkcrawl
• Building a simple crawler is easy (or use
wget and cURL, man up and write some
shell scripts)
• NLTK is awesome!
o import nltk; nltk.clean_html(data)
• http://orange.biolab.si is also a nice platform
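To show how little a simple crawler needs, here is a sketch built on the standard library. It is not the linkcrawl code. Downloading is abstracted behind a `fetch` callable (a hypothetical parameter), so the example runs against an in-memory fake site. In real use it could wrap urllib.request.urlopen or python-requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop any #fragment.
                absolute = urldefrag(urljoin(self.base_url, value)).url
                self.links.append(absolute)


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl; returns the URLs in the order they were visited."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in extract_links(fetch(url), url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Fake site so the sketch runs without network access.
PAGES = {
    "http://example.com/": '<a href="/a">A</a> <a href="b.html#top">B</a>',
    "http://example.com/a": '<a href="/">home</a>',
    "http://example.com/b.html": "",
}
print(crawl("http://example.com/", lambda url: PAGES.get(url, "")))
```

The text of each fetched page can then go through whatever cleaning step is preferred before analysis.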