Hash functions

Try a demo of HMAC CBC-MAC

The math and definitions explained

In math we are all familiar with the arithmetic operations such as addition, subtraction, multiplication and division. However, a computer works on bit strings, i.e. strings of 0's and 1's, and therefore needs similar operations when given two bit strings. The most important bitwise operations are NOT, AND, OR and XOR. In the following we give the result of the operations with one or two bits as input.

The NOT operation is often represented as !:

  • \( !0 = 1\)
  • \( !1 = 0\)

The AND operation is represented as \( \wedge \):

  • \( 0 \wedge 0 = 0\)
  • \( 1 \wedge 0 = 0\)
  • \( 0 \wedge 1 = 0\)
  • \( 1 \wedge 1 = 1\)

The OR operation is represented as \( \vee \):

  • \( 0 \vee 0 = 0\)
  • \( 1 \vee 0 = 1\)
  • \( 0 \vee 1 = 1\)
  • \( 1 \vee 1 = 1\)

The XOR (exclusive OR) operation is represented as \( \oplus \):

  • \( 0 \oplus 0 = 0\)
  • \( 1 \oplus 0 = 1\)
  • \( 0 \oplus 1 = 1\)
  • \( 1 \oplus 1 = 0\)

The cryptography explained

A cryptographic hash function make assurance of data integrity by constructing a short fingerprint of the data. If the data is altered the fingerprint will no longer be valid, i.e. the data can be stored at an insecure location and its integrity can be checked by recomputing the fingerprint and verifying that the fingerprint has not changed. Another important application of a hash function is that it takes an arbitrary size input and return a fixed size output which makes it very useful in digital signatures such RSA and ElGamal because instead of signing a large document we sign the short fingerprint. This is also illustrated as \( \mathcal{H} : \{0,1\}^{*} \rightarrow \{0,1\}^{k} \) which means that \( \mathcal{H} \) is a mapping from the set \( \{0,1\}^{*} \) of arbitrary size to the set \( \{0,1\}^{k} \) of size \( k \), which is also the security parameter of the hash function.

A hash function \( \mathcal{H} \) is also called a one-way function because given the message \( m \) it's easy to compute its fingerprint \( \mathcal{H}(m) \), but given the fingerprint it's hard to compute the message. Hash functions can be grouped into two categories:

  • Unkeyed hash functions \( \mathcal{H} \)
  • Keyed hash functions \( \mathcal{H}_{K} \) where \( K \) is some key

Suppose that Samantha sends a message \( m \) through an insecure channel to her friend Victor. Because there is a potential risk of that their adversary Eve alter the message during transit, Samantha also sends the message's fingerprint \( \mathcal{H}(m) \) to Victor. When Victor receives the message \( m \) and the fingerprint \( \mathcal{H}(m) \) he first recomputes a fingerprint of \( m \) and then compares it with the received fingerprint. Only if they are equal he know that Eve has not altered the content of \( m \). If the fingerprint can be sent through a secure channel, Samantha and Victor uses an unkeyed hash function because Eve can only intercept and alter the message. Otherwise if the fingerprint can't be sent through a secure channel they have to use a keyed hash function where they have agreed on a key \( K \) in advance with e.g. the Diffie-Hellman key exchange. Now, Eve can intercept and alter both the message and the fingerprint, but since the hash function also require the correct key \( K \) to compute the correct fingerprint \( \mathcal{H}_{K}(m) \) of the message \( m \), which is only known by Samantha and Victor, Eve can't generate a correct fingerprint.

If \( \mathcal{H} \) is a secure unkeyed hash function the following three problems are hard to solve:

  • Preimage: Given the fingerprint \( \mathcal{H}(m) \) the goal is to compute the message \( m \). If this is hard the hash function is preimage resistant or one-way.
  • Second preimage: Given the message \( m \) the goal is to find another message \( m' \) such that \( \mathcal{H}(m) = \mathcal{H}(m') \). If this is hard the hash function is second preimage resistant.
  • Collision: The goal is to find two messages \( m \) and \( m' \) such that \( \mathcal{H}(m) = \mathcal{H}(m') \). If this is hard the hash function is collision resistant.

It's clear that if a hash function isn't second preimage resistant then it's nor collision resistant. It's also the case that finding a collision is easier than finding a preimage for a hash function. I.e. the highest security is obtained if the hash function is collision resistant and by the famous "birthday paradox" it's possible to find a collision in every \( 2^{k/2}\) evaluations of the hash function which implies that \( k \geq 256 \) with today's standard.

So, why do we want the hash function to be collision resistant? Suppose that Samantha sends the message \( m = \mbox{"Transfer 50 USD to Carla from Samanthas account"} \) and its fingerprint \( \mathcal{H}(m) \) to Victor who work in her Bank. If Eve intercept the message and the hash function is not collision resistant, she can compute the message \( m' = \mbox{"Transfer 50 USD to Eve from Samanthas account"} \) where \( \mathcal{H}(m) = \mathcal{H}(m') \). Eve then sends \( m' \) and \( \mathcal{H}(m) \) to Victor and he has no idea that Eve has altered the message.

Another example is where Samantha has stored some data \( d \) in a database. Before she sends the data to the database she computes the fingerprint \( \mathcal{H}(d) \) of the data and stores it at a secure location, e.g. on a USB stick. Now, the next time she downloads the data \( d \) from the database she recomputes the fingerprint of the data and compares it to the one she had stored on her USB stick. If the hash function is not collision resistant her adversary Eve could alter the data in the database to \( d' \), but because \( \mathcal{H}(d) = \mathcal{H}(d') \) Samantha may not realize that the data is altered.

The most used unkeyed hash function is SHA which is an iterated hash function, i.e. the SHA algorithm only takes a fixed sized input and return a fixed size output, but when repeating the algorithm SHA can take an arbitrary sized input.

When using a keyed hash function \( \mathcal{H}_{K} \), where \( K \) is the key, the fingerprint is often called a MAC (Message Authentication Code). To argue about security of a keyed hash function we set up a "game" with Eve as the only player. If Eve wins the game we say that the keyed hash function is insecure, i.e. our goal is to prevent Eve from winning the game. In the game Eve can send an arbitrary number of messages \( m_{1}, m_{2}, \dots \) of her own choice to an oracle. Every time the oracle receive a message \( m_{i} \) it respond with its corresponding MAC \( \mathcal{H}_{K}(m_{i}) \) under the secret key \( K \) which is only known by the oracle. When Eve is done talking to the oracle she return a message \( m' \), that she did not send to the oracle (i.e. the message \( m' \) is different from the messages \( m_{1}, m_{2}, \dots \)), and its corresponding MAC \( \mathcal{H}_{K}(m') \) (remember that Eve does not know the key \( K \)). If the MAC \( \mathcal{H}_{K}(m') \) is a valid fingerprint of \( m' \) under the key \( K \) Eve wins the game and the hash function is insecure.

One common way of construction a MAC is to incorporate the key into an unkeyed hash function as part of the message to be hashed. An example of this is HMAC with e.g. SHA as the unkeyed hash function. Another type of MAC is CBC-MAC that use DES or AES in CBC mode where the last ciphertext block is the MAC of the message.

The HMAC explained

HMAC is a keyed hash function and was published in 1996 by Mihir Bellare, Ran Canetti and Hugo Krawczyk and adopted as a FIPS (Federal Information Processing Standards) standard in 2002.

The used unkeyed hash function \( \mathcal{H} \) in HMAC is often SHA-1, which implies that the used key \( K \) is 512-bit. HMAC also uses two fixed bit strings called the \( ipad \) and the \( opad \), that both are 512-bits when SHA-1 is used. If we use hexadecimal notation the two strings are defined as \( ipad = 36 \dots 36 \) and \( opad = 5C \dots 5C \) (remember that the hex values "35" and "5C" correspond each to 8 bits, i.e. ipad is "36" repeated 64 times and opad is "5C" repeated 64 times when SHA-1 is used).

First Samantha and Victor agree on the key \( K \) and the hash function \( \mathcal{H} \), e.g. SHA-1. Then Samantha computes the MAC \( HMAC \) of the the message \( m \) by using the key \( K \) and the two strings \( ipad \) and \( opad \) in the hash function: \( HMAC = \mathcal{H}((K \oplus opad) \: \| \: \mathcal{H}((K \oplus ipad) \: \| \: m)) \) where \( \| \) means concatenation and \( \oplus \) is the XOR operation. She then sends the message \( m \) and the MAC \( HMAC \) to Victor. Remember that it doesn't matter if the communication channel is insecure because they use a keyed hash function where the key \( K \) is only known by Samantha and Victor.

Victor then verifies the integrity of the received message \( m \) by computing its MAC \( HMAC' = \mathcal{H}((K \oplus opad) \: \| \: \mathcal{H}((K \oplus ipad) \: \| \: m)) \) just as Samantha and checks that it is equal to the received MAC \( HMAC \) from Samantha.

The protocol
Key exchange
Samantha and Victor agree on the key \( K \) and the hash function \( \mathcal{H} \).
Signing
Samantha:
Uses the two fixed strings \( ipad = 36 \dots 36 \) and \( opad = 5C \dots 5C \) (written in hexadecimal).
Uses the key \( K \) to compute the MAC \( HMAC = \mathcal{H}((K \oplus opad) \: \| \: \mathcal{H}((K \oplus ipad) \: \| \: m)) \) of the message \( m \) where \( \| \) is the concatenation operator and \( \oplus \) the XOR operator.
Samantha sends the MAC \( HMAC \) and the message \( m \) to Victor.
Verification
Victor:
Uses the two fixed strings \( ipad = 36 \dots 36 \) and \( opad = 5C \dots 5C \) (written in hexadecimal).
Uses the key \( K \) to compute the MAC \( HMAC' = \mathcal{H}((K \oplus opad) \: \| \: \mathcal{H}((K \oplus ipad) \: \| \: m)) \) of the message \( m \) where \( \| \) is the concatenation operator and \( \oplus \) the XOR operator.
Verifies that \( HMAC' = HMAC \).

Try a demo of the hash function here.

Example

Victor needs to send some secret information to some specific people. It's not a problem that other people see who these people are, but it is important that nobody else get the secret information. Unfortunately Samantha is the only person who have the email addresses on all these people, and she therefore have to send the document \( m \) containing all the email addresses. To make assurance of the integrity of the document Samantha computes the MAC \( HMAC \) of the document.

First Samantha and Victor agree to use SHA-1 as the hash function \( \mathcal{H} \) and the key \( K = C272 \dots 9D \) (written in hexadecimal). Samantha then uses the key, the two strings \( ipad = 36 \dots 36 \) and \( opad = 5C \dots 5C \) (both written in hexadecimal) to compute the 160-bit MAC:

\( HMAC = \mathcal{H}((C272 \dots 9D \oplus 5C \dots 5C) \: \| \: \mathcal{H}((C272 \dots 9D \oplus 36 \dots 36) \: \| \: m)) = 6F34 \dots 60 \)

The key and the two strings \( ipad \) and \( opad \) are all 512-bit and the MAC is 160-bit because SHA-1 is used as the hash function. She then sends the MAC \( HMAC \) and the document \( m \) to Victor.

Victor verifies the integrity of the received document \( m \) by computing the MAC \( HMAC' = 6F34 \dots 60 \) just as Samantha and because \( HMAC = HMAC' \) he know nobody have altered the document.

The CBC-MAC explained

CBC-MAC is a keyed hash function based on the block cipher DES or AES in CBC mode with the fixed initialization vector \( IV = 0 \dots 0 \), i.e. \( IV \) only contains zeros. Notice that IV is 56-bit in DES and 128-bit in AES.

First Samantha and Victor agree on the key \( K \) and the encryption algorithm \( E \), i.e. whether they want to use DES or AES. Samantha then divides the message \( m \) into plaintext blocks \( m_{1}, m_{2}, \dots, m_{t} \) of 56-bit or 128-bit depending on whether she uses DES or AES. If the last block is too short she pad it until it become 56-bit or 128-bit. She then set the first ciphertext block equal the initialization vector, i.e. \( c_{0} = IV \), and computes the remaining \( t \) ciphertext blocks by encrypting the \( i \)'te plaintext block XORed with the previous ciphertext block, i.e. \( c_{i} = E_{K}(c_{i-1} \oplus m_{i}) \) where \( \oplus \) is the XOR operator. The last ciphertext block \( c_{t} \) is then the MAC of the message \(m \), which she sends to Victor together with the message \( m \).

Victor verifies the integrity of the received message \( m \) by computing its MAC \( c_{t}' \) as Samantha and checks that it is equal to the received MAC \( c_{t} \) from Samantha.

The protocol
Key exchange
Samantha and Victor agree on the key \( K \) and the encryption algorithm \( E \).
Signing
Samantha:
Uses the fixed initialization vector \( IV = 0 \dots 0 \) (only zeros) with same length as the block size.
Uses the key \( K \) with the encryption algorithm \( E \) to encrypt the plaintext blocks by \( c_{0} = IV \) and \( c_{i} = E_{K}(c_{i-1} \oplus m_{i}) \).
The last ciphertext block \( c_{t} \) is the MAC of the message \( m \).
Samantha sends the MAC \( c_{t} \) and the message \( m \) to Victor.
Verification
Victor:
Uses the fixed initialization vector \( IV = 0 \dots 0 \) (only zeros) with same length as the block size.
Uses the key \( K \) with the encryption algorithm \( E \) to encrypt the plaintext blocks by \( c_{0}' = IV \) and \( c_{i}' = E_{K}(c_{i-1}' \oplus m_{i}) \) where the last ciphertext block \( c_{t}' \) is the MAC of the message \( m \).
Verifies that \( c_{t}' = c_{t} \).

Try a demo of the hash function here.

Example

Samantha has just finished writing an article \( m \) and needs to store it in hers works database. The following day she must print the article, but because she doesn't want to check the integrity of the article again by reading the whole article (because somebody may have altered the article), she computes the MAC of the article by using AES. She then stores the article and its MAC in the database and the used key on her USB stick. The next day she then recomputes the MAC of the article and check if it's equal to the one stored in the database.

Because Samantha uses the AES encryption algorithm she chooses a 128-bit key \( K = 01100001 \dots \). She then sets the first ciphertext block equal the fixed initialization vector \( IV = 0 \dots 0 \), i.e. \( c_{0} = IV = 0 \dots 0 \). Suppose that the article \( m \) can be divided into two 128-bit blocks, i.e. \( m = m_{1} \: \| \: m_{2} \) where \( \| \) is the concatenation operation. She then computes the two remaining ciphertext blocks (which both are 128-bit) by: \( c_{1} = E_{K}(c_{0} \oplus m_{1}) = 01001000 \dots \) and \( c_{2} = E_{K}(c_{1} \oplus m_{2}) = 01100101 \dots \) where \( c_{2} \) is the MAC of the article \( m \). She then stores the article \( m \) and the MAC \( c_{2} \) in the database and the key \( K \) on her USB stick.

The next day she recomputes the MAC \( c_{2}' \) of the article \( m \) from the database and if \( c_{2} = c_{2}' \) she know that nobody has altered the article.