Using different hash algorithm in purple_util_get_image_checksum()

Wed Jul 1 04:29:12 EDT 2009

On Tue, Jun 30, 2009 at 7:56 PM, Ethan Blanton<elb at pidgin.im> wrote:
> Mark Doliner spake unto us the following wisdom:
>> The purple_util_get_image_checksum() function in libpurple/util.c
>> currently uses SHA-1 to generate a checksum for a chunk of image data.
>>  SHA-1 is a cryptographic hash function, which means it's hard for
>> someone to engineer a chunk of data that matches a given hash.  It
>> also means it's slow.
>>
>> Do we need to be using a cryptographic hash function here?  This hash
>> function is one of the more expensive parts of libpurple.  I think
>> it's called once for each buddy icon we receive.  Adler-32 is much
>> faster when you're not concerned about security (it's maybe 8 times
>> faster than SHA-1).  zlib contains an Adler-32 implementation.  I
>> think GLib's g_string_hash() function is also pretty fast (but not as
>> fast as Adler-32 when hashing image data).  I haven't really
>> investigated what problems we would have switching hash functions... I
>> think we would have to migrate or purge buddy icons from
>> ~/.purple/icons/, because the icon filename is the hash.  And there
>> might be other problems.
>>
>> But, uh, how to people feel about this change?
>
> A 32-bit hash brings collisions into the realm of possibility, as it
> requires only about 2^16 icons in the store for the probability of
> collision to rise to about 50%.
>
> We don't necessarily need a SHA-sized hash, but we need more than
> Adler32.  I suggest, however, that there's not much in between that
> will have *practical* implementations which are faster than SHA.  (SHA
> implementations in crypto libraries are heavily optimized, as are many
> Adler32 implementations.  I am unaware of, for example, a 64-bit
> checksum or hashing function with similar widespread optimization.)
> It's possible that AES-128 would demonstrate benefit, or even DES with
> a 56-bit key; you might want to benchmark.

At Meebo we chose to create two 32-bit hashes, one from one chunk of
the icon data and one from a different chunk, then concat them to
create a 64-bit checksum.  Vijay from Meebo tells me that he seems to
remember the sha-1 implementation in glib 2.16 is much faster than the
one in our cipher.c.

> (In reality, I submit that we *do* need a cryptographically secure hash
>  function.  While this is not always the case for content-addressed
>  storage, in our case we are storing objects which are provided by
>  untrusted users from the network.  It is relatively easy to
>  reverse-collide an Adler32 checksum, and possible for an individual
>  to reverse-collide a DES "hashed" block.  This would mean that a
>  malicious user with a timing advantage could pollute your icon store
>  and prevent legitimate buddy icons from being fetched.  I sketch:
>
>  Mallory sees that Alice has changed her buddy icon, but Bob is not
>  online.  Mallory computes an icon which collides with Alice's in the
>  Pidgin buddy icon CAS.  When Bob comes online, Mallory rapidly sends
>  Bob this new buddy icon information as his own buddy icon, storing it
>  in Bob's CAS.  When Bob engages in conversation with Alice, Mallory's
>  buddy icon is now shown for Alice.
>
>  The saving grace here is ... nobody cares that much about buddy
>  icons.  Particularly now that Hulu is around.)

Yeah, we thought about all of this and didn't deem it important enough
for us to keep using sha-1.  As you point out it is relatively easy to
create a chunk of data with an Adler32 checksum that matches another
chunk of data, and so it would be relatively easy to make someone's
icon stop appearing.  But it would be harder to replace someone's icon
with a different image because you would have to create a chunk of
data that 1. has the same checksum and 2. is a valid image.

I'm hoping a few other devs weight in, but it sounds like we want to
stick with sha-1.  I'm not sure the performance advantage of md5 over
sha1 justifies the code changes that would be required.  And the
checksumming matters a lot less when you're running one instance on
your personal computer than when you have a few thousand users on a
single server.

-Mark