I'll attempt to outline the problem more precisely.

Each hash generates a certain number of pseudorandom bits HS. The hash SHA-1, for example, generates HS=160 pseudorandom bits which can be interpreted as an integer ranging from 0 to 2^160 - 1. We assume that these bits are never biased, i.e. they pass all statistic tests for randomness. I'll call HR (for Hash Range) the number 2^HS.

What PasswordMaker does with that number is, for the purposes of this explanation, simply a base conversion: if the number of characters in the character set is C, then PasswordMaker computes the output of the hash function converted to base C, then maps each of the digits 0 through C-1 of the result to the corresponding character in the character set. This means that if the character set is for example "0123456789", the output of PasswordMaker will be the number resulting from the hash converted to decimal.

At this moment PasswordMaker does not insert leading zeros in the resulting number; this means that in certain cases the output of the hash can be shorter than the number of characters we've asked for. It also has the side effect that the first digit will 'almost never' (read: never for practical purposes) be zero (or the first character in the character set, whatever it is).

Inserting leading zeros up to the total number of digits, which is what the original C code does, solves the problem but only partially. Why?

Let's examine what HR-1 looks like in decimal:

1461501637330902918203684832716283019655932542975

Let's look at what it looks like in hexadecimal:

FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

The first digit in hexadecimal is F; if zero padding is used then every digit from 0 to F is equally likely for each of the digits. This is a consequence of 2^160 being a perfect power of the base (16 in this case), which is to say that 2^160 can be expressed in hexadecimal as 1 followed by a certain amount of zeros (forty in this case).

However that's not the case in decimal: the maximum value's first digit is 1. If zero padding is applied verbatim, the range of the resulting conversion will be the following:

0000000000000000000000000000000000000000000000000 through

1461501637330902918203684832716283019655932542975.

Note that the first digit can *only* be either 0 or 1 and can *never* be 2 through 9, even if we have specified that the range of characters to generate must be 0 through 9. Skipping the first digit would solve this "impossible digit" problem. There's no need to skip the first digit in the case that the hash size is a perfect power of the base, though. The expression ceiling(log_C(HR)) (where log_a(b) means logarithm in base a of b), which is what is currently used, gives the total amount of digits including that first digit; the expresion floor(log_C(HR)), on the other hand, gives the total amount of digits excluding the first one only when it's not safe to use it, which is what I'm using in the fixed version.

In some special cases where log_C(HR) is an integer, the computation of the logarithm can result, due to rounding problems, in a value slightly below that integer; say, 39.9999999999 instead of 40, which the floor() function would truncate to 39. It's possible that this happens only in certain architectures or languages. For that reason I included special code to handle the case where log_C(HR) can be an integer. This happens when C is a power of two (not in all cases where C is a power of two but it will still give the correct result in the other cases).

This does not solve the issue of the uniformity of the digits' distribution, though. That's a bit more complicated to solve. I'll explain the problem and the solution if there's interest. For now I'll just say that in the above case (adding leading zeros, using SHA-1 hash, using decimal and rejecting the first digit) the first character in the resulting password will be 0 through 3 with probability about 14%; will be 4 with probability about 12% and will be 5 through 9 with probability about 7% (see my previous post for more accurate percentages). Of course, ideally every possible character (digit) should have a probability of 10% for base 10.

-- Pedro Gimeno