Author Topic: Perfectly uniform distribution  (Read 8878 times)

Offline pgimeno

  • Jr. Member
  • **
  • Posts: 11
Perfectly uniform distribution
« on: November 28, 2005, 02:12:31 PM »
Hi,

After examining the current algorithm and there are some security problems related to the distribution of the generated passwords.

Assuming the hash function is a perfect pseudorandom number generator which generates numbers uniformly distributed in the range 0 through 2^HS - 1 (where HS is the hash size in bits), with the current code the resulting password does not, unfortunately, generate digits uniformly distributed in the range 0 through C - 1 (where C is the number of characters in the character set used in passwords).

A compromise solution is to use the zero padding used in version 0.6, but the current CVS version has a strong bias in the first digit. In order to solve it, that first digit must be rejected. The solution is to change in rstr2any(), the lines that read:

Code: [Select]
    full_length = (int)ceil((double)length * 8
  / (log((double)encoding.length()) / log((double)2)));

to the following:

Code: [Select]
    if ((encoding.length() & (encoding.length() - 1)) == 0)
   {
     int log2length = 0;
     int len = encoding.length();
     while (len != 1)
       {
  log2length++;
  len >>= 1;
       }
     full_length = length * 8 / log2length;
   }
    else
   full_length = (int)floor(8.0L * length
  / (log((double)encoding.length()) / log(2.0L)));

(not tested; there may be errors)

The first check tests whether encoding.length() is a power of 2; if so, it uses an integer algorithm to find the log2(encoding.length()), otherwise it uses floating-point math.

This change means using floor() instead of ceil() to reject the first digit; the problem of rounding errors can only arise when encoding.length() is a power of two, which is calculated separately.

Perfectly uniform distribution would require more sophisticated techniques.

-- Pedro Gimeno

Offline Miquel 'Fire' Burns

  • Administrator
  • *****
  • Posts: 1157
  • Programmer
Perfectly uniform distribution
« Reply #1 on: November 29, 2005, 03:46:55 AM »
Important issue to solve before we reintroduce using a padding zero method in rstr2any().
"I'm not drunk, just sleep deprived."

Offline Eric H. Jung

  • grimholtz
  • Administrator
  • *****
  • Posts: 3353
Perfectly uniform distribution
« Reply #2 on: November 29, 2005, 04:32:03 AM »
Pedro,

Thanks for pointing this out. As we've discussed in IRC, if you could write rstr2any() in C/C++ or Javascript (both would be ideal, but me or miquel could probably do a port if you don't have time for both), I'll introduce the change into passwordmaker. There will have to be a period of backwards compatibility, of course.

Please also discuss your ideas about the importance of non-typable delimiters (e.g., \0 or \n) between fields.

Thank you,
Eric

Offline Miquel 'Fire' Burns

  • Administrator
  • *****
  • Posts: 1157
  • Programmer
Perfectly uniform distribution
« Reply #3 on: November 29, 2005, 04:52:53 PM »
I think he gave basic C++ code. I say we figure out if there are infact any errors in the code and include it. But I want more of an understanding on why we need some of the is code in the first place.
"I'm not drunk, just sleep deprived."

Offline Eric H. Jung

  • grimholtz
  • Administrator
  • *****
  • Posts: 3353
Perfectly uniform distribution
« Reply #4 on: November 29, 2005, 05:02:07 PM »
I won't speak for Pedro, but as I recall he explained in an email that the current rstr2any() does not utilize all characters in the character set evenly (uniformly). In other words, some characters are more heavily weighted than others. The 10th character in the character set might be used 5 times more often than the 9th character in the character set (I'm making up those figures, by the way).

In any case, let's wait to see what Pedro writes...this is why I prefer to discuss these topics in the forums rather than one-on-one emails.

Offline pgimeno

  • Jr. Member
  • **
  • Posts: 11
Perfectly uniform distribution
« Reply #5 on: November 30, 2005, 01:46:06 PM »
Very briefly (I hope to be able to explain myself better later):

For example, with SHA-1 hash and character set = 0123456789 (i.e. decimal): the first digit will always be either 0 or 1 and never 2, 3, 4, ... 9 unless that correction is made.

With that correction, the first digit will be 0 through 3 twice more often than it will be 5 through 9 (13.6% of the time vs. 6.8% for each digit, respectively), and will be 4 about 11.6% of the time.

This is not so good as it could be but is better than these digits not appearing at all.

-- Pedro Gimeno

Offline Eric H. Jung

  • grimholtz
  • Administrator
  • *****
  • Posts: 3353
Perfectly uniform distribution
« Reply #6 on: November 30, 2005, 06:42:01 PM »
Thanks for the clarification, Pedro. When you have a moment, could you also explain why the use of non-typable delimiters between fields is important when generating the password? I should have your old emailsso if you'd rather I post snippets of them here, let me know.

Regards,
Eric

Offline pgimeno

  • Jr. Member
  • **
  • Posts: 11
Perfectly uniform distribution
« Reply #7 on: December 16, 2005, 09:18:44 PM »
I'll attempt to outline the problem more precisely.

Each hash generates a certain number of pseudorandom bits HS. The hash SHA-1, for example, generates HS=160 pseudorandom bits which can be interpreted as an integer ranging from 0 to 2^160 - 1. We assume that these bits are never biased, i.e. they pass all statistic tests for randomness. I'll call HR (for Hash Range) the number 2^HS.

What PasswordMaker does with that number is, for the purposes of this explanation, simply a base conversion: if the number of characters in the character set is C, then PasswordMaker computes the output of the hash function converted to base C, then maps each of the digits 0 through C-1 of the result to the corresponding character in the character set. This means that if the character set is for example "0123456789", the output of PasswordMaker will be the number resulting from the hash converted to decimal.

At this moment PasswordMaker does not insert leading zeros in the resulting number; this means that in certain cases the output of the hash can be shorter than the number of characters we've asked for. It also has the side effect that the first digit will 'almost never' (read: never for practical purposes) be zero (or the first character in the character set, whatever it is).

Inserting leading zeros up to the total number of digits, which is what the original C code does, solves the problem but only partially. Why?

Let's examine what HR-1 looks like in decimal:

1461501637330902918203684832716283019655932542975

Let's look at what it looks like in hexadecimal:

FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

The first digit in hexadecimal is F; if zero padding is used then every digit from 0 to F is equally likely for each of the digits. This is a consequence of 2^160 being a perfect power of the base (16 in this case), which is to say that 2^160 can be expressed in hexadecimal as 1 followed by a certain amount of zeros (forty in this case).

However that's not the case in decimal: the maximum value's first digit is 1. If zero padding is applied verbatim, the range of the resulting conversion will be the following:

0000000000000000000000000000000000000000000000000 through
1461501637330902918203684832716283019655932542975.

Note that the first digit can only be either 0 or 1 and can never be 2 through 9, even if we have specified that the range of characters to generate must be 0 through 9. Skipping the first digit would solve this "impossible digit" problem. There's no need to skip the first digit in the case that the hash size is a perfect power of the base, though. The expression ceiling(log_C(HR)) (where log_a(b) means logarithm in base a of b), which is what is currently used, gives the total amount of digits including that first digit; the expresion floor(log_C(HR)), on the other hand, gives the total amount of digits excluding the first one only when it's not safe to use it, which is what I'm using in the fixed version.

In some special cases where log_C(HR) is an integer, the computation of the logarithm can result, due to rounding problems, in a value slightly below that integer; say, 39.9999999999 instead of 40, which the floor() function would truncate to 39. It's possible that this happens only in certain architectures or languages. For that reason I included special code to handle the case where log_C(HR) can be an integer. This happens when C is a power of two (not in all cases where C is a power of two but it will still give the correct result in the other cases).

This does not solve the issue of the uniformity of the digits' distribution, though. That's a bit more complicated to solve. I'll explain the problem and the solution if there's interest. For now I'll just say that in the above case (adding leading zeros, using SHA-1 hash, using decimal and rejecting the first digit) the first character in the resulting password will be 0 through 3 with probability about 14%; will be 4 with probability about 12% and will be 5 through 9 with probability about 7% (see my previous post for more accurate percentages). Of course, ideally every possible character (digit) should have a probability of 10% for base 10.

-- Pedro Gimeno
« Last Edit: December 18, 2005, 12:18:38 PM by pgimeno »

PasswordMaker Forums

Perfectly uniform distribution
« Reply #7 on: December 16, 2005, 09:18:44 PM »