
SHA1 collision analysis and resolving a problem of recursive hashing with xrange and long numbers

A friend told me about a job offer that required solving a problem, and I worked on it over the weekend. It was basically about xrange, hashlib, hexdigest and cyclic collisions of SHA1 hashes.

Published in: Software

  1. Recruitingmsa problem: reasoning and resolution of the problem
  2. The problem
     • Given the following Python function:
         import hashlib
         def my_func(r, n):
             for i in xrange(n):
                 r = hashlib.sha1(r[:9]).hexdigest()
             return r
     • Executing the following call raises an overflow exception:
         my_func("0123456789012345678901234567890123456789", 9999999999999999)
  3. Analysis of the problem
     • From the exception message we know that it is raised when Python tries to convert a 'long' to an 'int' and the value is too large to be converted.
     • Of the arguments used in the call, only one is actually a long: the second one, 9999999999999999.
     • As xrange is the one using that parameter, I looked at its documentation in the Python library reference and found: "The C implementation of Python restricts all arguments to native C longs ("short" Python integers), and also requires that the number of elements fit in a native C long. If a larger range is needed, an alternate version can be crafted"
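The "alternate version" mentioned in the documentation can be sketched with a plain generator. This is a minimal sketch and not the recipe from the Python documentation: `big_range` is a hypothetical name, and the code is written for Python 3 (where the built-in `range` already accepts arbitrarily large integers) so that it can be run today. It removes the overflow but not the running time, since looping 9999999999999999 times is still impractical.

```python
def big_range(n):
    """Yield 0, 1, ..., n - 1 for any non-negative integer n, however large."""
    i = 0
    while i < n:
        yield i
        i += 1


# Creating the generator is instant even for a huge bound;
# only the iteration itself costs time.
counter = big_range(9999999999999999)
first_three = [next(counter) for _ in range(3)]
```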
  4. Function logic analysis
     • To find a path to a resolution, I started by analyzing the function's logic with some quick tests.
     • The first thing I noticed is that the hashes seem to be "summable": applying the function a times and then b more times to the result appears to give the same hash as applying it a+b times. I tried to confirm this with some easy tests and variations; if it holds, we could go for a recursive solution.
  5. Logic tests
     • The tests below confirm it is summable. Some concrete examples of this function:
         my_func("0123456789012345678901234567890123456789", 0) = 0123456789012345678901234567890123456789
         my_func("0123456789012345678901234567890123456789", 1) = 9a7149a5a7786bb368e06d08c5d77774eb43a49e
         my_func("0123456789012345678901234567890123456789", 2) = 747c9a467f90021e5d213e2f6d27ccf82e25d0c9
         my_func("9a7149a5a7786bb368e06d08c5d77774eb43a49e", 1) = 747c9a467f90021e5d213e2f6d27ccf82e25d0c9
         # Checking it is summable with even numbers
         my_func("0123456789012345678901234567890123456789", 2) = 747c9a467f90021e5d213e2f6d27ccf82e25d0c9
         my_func("0123456789012345678901234567890123456789", 4) = 09c39ceafeec24479c8598ee622a399b2e753e2b
         my_func("747c9a467f90021e5d213e2f6d27ccf82e25d0c9", 2) = 09c39ceafeec24479c8598ee622a399b2e753e2b # good
         # Looks summable, let's try with odd numbers
         my_func("0123456789012345678901234567890123456789", 2) = 747c9a467f90021e5d213e2f6d27ccf82e25d0c9
         my_func("0123456789012345678901234567890123456789", 5) = 14b51f5ee4250c7f238363b604dd5201ba2bbeb7
         my_func("747c9a467f90021e5d213e2f6d27ccf82e25d0c9", 3) = 14b51f5ee4250c7f238363b604dd5201ba2bbeb7 # good
         # Bigger even numbers
         my_func("0123456789012345678901234567890123456789", 10) = 2dd637caf35019298ca1909e0ea644d5babadbff
         my_func("0123456789012345678901234567890123456789", 98) = 168c666606aa8feb0c91a420e68cbe32d841eb5b
         my_func("2dd637caf35019298ca1909e0ea644d5babadbff", 88) = 168c666606aa8feb0c91a420e68cbe32d841eb5b # good
         # Bigger odd numbers
         my_func("0123456789012345678901234567890123456789", 27) = eef3f00f84909ec5e3177fd86d5e6550ac782d7f
         my_func("0123456789012345678901234567890123456789", 319) = 92b1f1f3e8447874272de50f22ccd99b9baaebb9
         my_func("eef3f00f84909ec5e3177fd86d5e6550ac782d7f", 319-27) = 92b1f1f3e8447874272de50f22ccd99b9baaebb9 # good
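The manual checks on this slide can also be automated. The sketch below assumes a Python 3 port of the function (`my_func_py3` and `is_summable` are hypothetical names, not part of the original deck). It tests that a steps followed by b steps give the same hash as a + b steps, using the same step counts as the slide.

```python
import hashlib

SEED = "0123456789012345678901234567890123456789"


def my_func_py3(r, n):
    """Python 3 port of the slides' my_func: hash the first 9 characters, n times."""
    for _ in range(n):
        # Python 3's sha1 needs bytes, so the hex string is encoded first.
        r = hashlib.sha1(r[:9].encode("ascii")).hexdigest()
    return r


def is_summable(seed, a, b):
    """True if a steps followed by b steps equals a + b steps from the seed."""
    return my_func_py3(my_func_py3(seed, a), b) == my_func_py3(seed, a + b)


# Same (a, b) splits as the slide: 1+1, 2+2, 2+3, 10+88 and 27+(319-27).
checks = [(1, 1), (2, 2), (2, 3), (10, 88), (27, 319 - 27)]
results = [is_summable(SEED, a, b) for a, b in checks]
```

The property holds by construction, because each step depends only on the previous hash. This is also why a repeated hash forces every later hash to repeat.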
  6. Logic tests – Results analysis
     • Running these logic tests showed that solving the problem linearly takes far too long to be acceptable. A recursive solution with big numbers is not acceptable either, as it easily reaches the maximum recursion limit.
     • A new idea then came to mind: only the first nine characters are used to create each hexdigest, so after the first iteration there are at most 16^9 distinct inputs. Because of that, and because SHA1 is used without any custom salt, there is a good chance of finding cyclic collisions.
     • To test this idea, we modify the code so that it saves each hash in a list and checks whether a new hash already exists in that list. If we are lucky, collisions will appear before we reach any limit.
  7. Further investigation
     • While investigating collisions and cycles I found some algorithms, ideas and documentation on why and how they happen:
     • https://en.wikipedia.org/wiki/Cycle_detection
     • http://pythoncentral.io/hashing-strings-with-python/
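One of the algorithms described on the cycle detection page linked above is Floyd's "tortoise and hare", which finds a cycle without storing past values. The sketch below is an illustration only, not the approach taken in the deck (which stores hashes in a list). `floyd` and `short_step` are hypothetical names, and the toy step hashes just the first 3 characters so that a cycle appears within a few thousand steps.

```python
import hashlib


def floyd(f, x0):
    """Return (mu, lam): index of the first element in the cycle, and cycle length."""
    # Phase 1: the hare moves twice as fast until both meet inside the cycle.
    tortoise, hare = f(x0), f(f(x0))
    while tortoise != hare:
        tortoise, hare = f(tortoise), f(f(hare))
    # Phase 2: restart the tortoise; they now meet at the start of the cycle.
    mu, tortoise = 0, x0
    while tortoise != hare:
        tortoise, hare = f(tortoise), f(hare)
        mu += 1
    # Phase 3: walk the hare around the cycle once to measure its length.
    lam, hare = 1, f(tortoise)
    while tortoise != hare:
        hare = f(hare)
        lam += 1
    return mu, lam


def short_step(r):
    """Toy step: hash only the first 3 characters so cycles show up quickly."""
    return hashlib.sha1(r[:3].encode("ascii")).hexdigest()


mu, lam = floyd(short_step, "0123456789012345678901234567890123456789")
```

The trade-off is memory against hash evaluations: Floyd's method keeps only two values but recomputes hashes, while a stored list or dictionary computes each hash once.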
  8. New code for logical tests
         import sys
         import hashlib
         def my_func(r, n):
             temp = []    # temporary list of the hashes seen so far
             strikes = 0  # a counter for repetitions
             if isinstance(n, long):
                 # This was originally used to try out a recursive solution,
                 # but solving it this way takes far too long and it
                 # reaches the max recursion limit
                 print "calling recursion"
                 r = my_func(r, n-sys.maxint)
                 n = sys.maxint
             for i in xrange(n):
                 r = hashlib.sha1(r[:9]).hexdigest()
                 if r in temp:  # checking if it exists in the list
                     # temp.index is used to understand the distance between collisions
                     print str(i) + " - " + r + " index " + str(temp.index(r))
                     strikes += 1  # increasing strikes
                 temp.append(r)  # added to the list just after printing the information
                 if strikes == 3:  # 3 strikes and you are out
                     break
             return r
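The membership test `r in temp` scans the whole list, so the search above slows down as hashes accumulate. A dictionary keyed by hash gives constant-time lookups on average. The following is a minimal Python 3 sketch of the same experiment under that assumption. `find_cycle`, `prefix_len` and `limit` are hypothetical names, and the demo run uses a 3-character prefix so that it finishes instantly.

```python
import hashlib


def find_cycle(seed, prefix_len=9, limit=10**6):
    """Return (start, period) for the first repeated hash, or None within limit.

    start is the number of hash applications after which the repeated hash
    first appeared; period is the distance to its second appearance.
    """
    seen = {}  # hash -> number of applications that produced it
    r = seed
    for count in range(1, limit + 1):
        r = hashlib.sha1(r[:prefix_len].encode("ascii")).hexdigest()
        if r in seen:
            return seen[r], count - seen[r]
        seen[r] = count
    return None


SEED = "0123456789012345678901234567890123456789"
# Toy run: with 3 characters there are only 16**3 possible inputs after the
# first step, so a repeat must show up within a few thousand iterations.
toy_cycle = find_cycle(SEED, prefix_len=3)
```

Calling `find_cycle(SEED)` with the default 9-character prefix corresponds to the experiment on this slide. Its output is not reproduced here.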
  9. Results of new tests
     • We found collisions that appear in an acceptable time: they occur in cycles of 109019, with the first one at iteration 264088 (the hash calculated at iteration 264088 collides with the hash at iteration 155069).
     • We confirmed this distance by making some small "print" and "if" modifications to the original code.
  10. Problem resolution
      • As mentioned in the Python library documentation, xrange only works with int objects, and if we need something bigger than that we have to craft it ourselves. The first thing to evaluate is the largest number considered an int on our platform; the "sys" library (sys.maxint) tells us what that is.
      • That alone is not enough, as solving this iteratively or recursively would take far too long. So I started thinking about other alternatives, and that is how we found that collisions do happen in cycles: cycles of 109019 starting from iteration 155069 (first collision at iteration 155069+109019 = 264088).
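  11. Problem resolution - 2
      • Given the facts mentioned on the previous slide, I modified the original code so that any n of 264088 (155069+109019) or more is reduced to a smaller, equivalent number of iterations, intended to lie between 155069 and 264087, in order to find its matching collision.
      • This is how the code to resolve this issue looks:
          def my_func(r, n):
              if n >= 264088:  # 155069+109019
                  # Python 2 integer division: same as n % 109019 + 109019
                  n = int((n-((n/109019)*109019))+109019)
              for i in xrange(n):
                  r = hashlib.sha1(r[:9]).hexdigest()
              return r
      • Note that the expression evaluates to a value between 109019 and 218037, not between 155069 and 264087. A reduced n below 155069 lands before the cycle begins, so the equivalence is guaranteed only when the reduced value is at least 155069.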
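A reduction that always stays inside the cycle can anchor the modulo at the cycle's start. The sketch below is one way to do that in Python 3 and is not the author's code: `iterate_hash`, `reduce_steps` and `fast_iterate` are hypothetical names. The figures 155069 and 109019 are taken from the slides and assumed to count hash applications. Any `start` at or after the true beginning of the cycle works.

```python
import hashlib


def iterate_hash(r, n, prefix_len=9):
    """Apply the truncated SHA1 step n times (Python 3 port of my_func)."""
    for _ in range(n):
        r = hashlib.sha1(r[:prefix_len].encode("ascii")).hexdigest()
    return r


def reduce_steps(n, start, period):
    """Map n to an equivalent step count in [start, start + period)."""
    if n < start + period:
        return n  # small enough already; nothing to reduce
    return start + (n - start) % period


def fast_iterate(r, n, start, period, prefix_len=9):
    """Same result as iterate_hash(r, n) when start/period describe the cycle."""
    return iterate_hash(r, reduce_steps(n, start, period), prefix_len)


# Figures reported on the slides (an assumption of this sketch).
reduced = reduce_steps(9999999999999999, 155069, 109019)
```

With the slides' figures, the call from the problem statement reduces to 248510 steps. By comparison, `n % 109019 + 109019` gives 139491 for the same n, which is below 155069.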
  12. Execution result
      • The result of running my_func("0123456789012345678901234567890123456789", 9999999999999999) was: 'd82fd2b1b9c82df7c199ad716033aeb33785d2a0'
  13. Performance enhancements
      • Because any n of 264088 or more is reduced to fewer than 264088 iterations, execution times drop drastically, as the examples below show.
      • [Screenshot: execution before modification]
      • [Screenshot: execution after modification]
