- Home
- Hash functions
- The protocols explained
Hash functions
The math and definitions explained
In mathematics, we are all familiar with arithmetic operations such as addition, subtraction, multiplication, and division. However, computers work with bit strings — that is, sequences of 0s and 1s — and therefore require similar operations that act on two bit strings. The most important bitwise operations are NOT, AND, OR, and XOR. Below, we show the results of these operations with one or two bits as input.
The NOT operation is often represented as !:
The AND operation is represented as \( \wedge \):
- \( 0 \wedge 0 = 0\)
- \( 1 \wedge 0 = 0\)
- \( 0 \wedge 1 = 0\)
- \( 1 \wedge 1 = 1\)
The OR operation is represented as \( \vee \):
- \( 0 \vee 0 = 0\)
- \( 1 \vee 0 = 1\)
- \( 0 \vee 1 = 1\)
- \( 1 \vee 1 = 1\)
The XOR (exclusive OR) operation is represented as \( \oplus \):
- \( 0 \oplus 0 = 0\)
- \( 1 \oplus 0 = 1\)
- \( 0 \oplus 1 = 1\)
- \( 1 \oplus 1 = 0\)
The cryptography explained
A cryptographic hash function ensures data integrity by generating a short fingerprint of the
data. If the data is altered, the fingerprint will no longer be valid. This means the data can be stored in an
insecure location, and its integrity can be checked by recomputing the fingerprint and verifying that it
has not changed. Another important application of a hash function is that it takes an
input of arbitrary size and returns a fixed-size output. This property is very useful in digital signatures,
such as RSA and ElGamal, because instead
of signing a large document, we sign the short fingerprint. This is also illustrated as \( \mathcal{H} :
\{0,1\}^{*} \rightarrow \{0,1\}^{k} \), which means that \( \mathcal{H} \) is a mapping from the
set \( \{0,1\}^{*} \) of arbitrary size to the set \( \{0,1\}^{k} \) of size \( k \), which is also
the security parameter of the hash function.
A hash function \( \mathcal{H} \) is also called a one-way function because, given the message \( m \),
it is easy to compute its fingerprint \( \mathcal{H}(m) \), but given the fingerprint, it is hard to
compute the original message. Hash functions can be grouped into two categories:
- Unkeyed hash functions \( \mathcal{H} \)
- Keyed hash functions \( \mathcal{H}_{K} \), where \( K \) is a key
Suppose Samantha sends a message \( m \) through an insecure channel to her friend Victor. Because there
is a potential risk that their adversary Eve could alter the message during transit, Samantha also sends
the message's fingerprint \( \mathcal{H}(m) \) to Victor. When Victor receives the message \( m \) and the fingerprint \( \mathcal{H}(m) \), he first recomputes the fingerprint of \( m \) and then compares it with the received fingerprint.
Only if they are equal does he know that Eve has not altered the content of \( m \).
If the fingerprint can be sent through a secure
channel, Samantha and Victor use an unkeyed hash function because Eve can only intercept and alter the
message. Otherwise, if the fingerprint cannot be sent through a secure channel, they have to use a keyed hash
function, where they have agreed on a key \( K \) in advance, for example using the Diffie-Hellman key exchange. Now, Eve can
intercept and alter both the message and the fingerprint, but since the hash function also requires the
correct key \( K \) to compute the correct fingerprint \( \mathcal{H}_{K}(m) \) of
the message \( m \), which is only known by Samantha and Victor, Eve cannot generate a valid fingerprint.
If \( \mathcal{H} \) is a secure unkeyed hash function, the following three problems are considered hard to
solve:
- Preimage: Given the fingerprint \( \mathcal{H}(m) \), the goal is to compute the message \( m \). If
this is hard, the hash function is preimage resistant or one-way.
- Second preimage: Given the message \( m \), the goal is to find another message \( m' \) such that \(
\mathcal{H}(m) = \mathcal{H}(m') \). If this is hard, the hash function is second preimage resistant.
- Collision: The goal is to find two messages \( m \) and \( m' \) such that \( \mathcal{H}(m) =
\mathcal{H}(m') \). If this is hard, the hash function is collision resistant.
It's clear that if a hash function isn't second preimage resistant, then it's not collision resistant either.
It's also the case that finding a collision is easier than finding a preimage for a hash function. In other words,
the highest level of security is achieved if the hash function is collision resistant. Due to the famous "birthday
paradox," it is possible to find a collision in approximately \( 2^{k/2} \) evaluations of the hash function, which
implies that \( k \geq 256 \) is recommended by today's standards.
So, why do we want the hash function to be collision resistant? Suppose Samantha sends the message \( m =
\mbox{"Transfer 50 USD to Carla from Samantha's account"} \) and its fingerprint \( \mathcal{H}(m) \) to
Victor, who works at her bank. If Eve intercepts the message and the hash function is not collision
resistant, she can create a different message \( m' = \mbox{"Transfer 50 USD to Eve from Samantha's account"} \)
such that \( \mathcal{H}(m) = \mathcal{H}(m') \). Eve then sends \( m' \) and \( \mathcal{H}(m) \) to Victor,
and he has no way of knowing that Eve has altered the message.
Another example is when Samantha has stored some data \( d \) in a database. Before she sends the data
to the database, she computes the fingerprint \( \mathcal{H}(d) \) and stores it in a secure
location, such as on a USB stick. Later, when she downloads the data \( d \) from the database, she
recomputes the fingerprint and compares it to the one she stored on her USB stick. If the
hash function is not collision resistant, her adversary Eve could alter the data in the database to \( d'
\), but because \( \mathcal{H}(d) = \mathcal{H}(d') \), Samantha may not realize that the data has
been altered.
The most commonly used unkeyed hash function is SHA, which is an iterated hash function. This means the SHA algorithm only takes a fixed-size input and returns a fixed-size output, but by repeating the algorithm, SHA can process an input of arbitrary size.
When using a keyed hash function \( \mathcal{H}_{K} \), where \( K \) is the key, the fingerprint is often called a MAC (Message Authentication Code). To analyze the security of a keyed hash function, we set up a "game" with Eve as the only player. If Eve wins the game, we say that the keyed hash function is insecure; our goal is to prevent Eve from winning. In this game, Eve can send any number of messages \( m_{1}, m_{2}, \dots \) of her own choosing to an oracle. Each time the oracle receives a message \( m_{i} \), it responds with the corresponding MAC \( \mathcal{H}_{K}(m_{i}) \) under the secret key \( K \), which is known only to the oracle. When Eve is done interacting with the oracle, she returns a message \( m' \) that she did not previously send to the oracle (i.e., \( m' \) is different from \( m_{1}, m_{2}, \dots \)), along with its corresponding MAC \( \mathcal{H}_{K}(m') \) (remember, Eve does not know the key \( K \)). If the MAC \( \mathcal{H}_{K}(m') \) is a valid fingerprint of \( m' \) under the key \( K \), Eve wins the game and the hash function is considered insecure.
One common way to construct a MAC is to incorporate the key into an unkeyed hash function as part of the message to be hashed. An example of this is HMAC, which uses SHA as the unkeyed hash function. Another type of MAC is CBC-MAC, which uses DES or AES in CBC mode, where the last ciphertext block is the MAC of the message.
The HMAC explained
HMAC is a keyed hash function that was published in 1996 by Mihir Bellare, Ran Canetti, and Hugo Krawczyk, and was adopted as a FIPS (Federal Information Processing Standards) standard in 2002.
The unkeyed hash function \( \mathcal{H} \) used in HMAC is often SHA-1, which means the key \( K \) is 512 bits long. HMAC also uses two fixed bit strings called \( ipad \) and \( opad \), both of which are 512 bits when SHA-1 is used. In hexadecimal notation, these two strings are defined as \( ipad = 36 \dots 36 \) and \( opad = 5C \dots 5C \). Note that the hex values "36" and "5C" each represent 8 bits, so \( ipad \) is "36" repeated 64 times and \( opad \) is "5C" repeated 64 times when SHA-1 is used.
First, Samantha and Victor agree on the key \( K \) and the hash function \( \mathcal{H} \), for example, SHA-1. Then, Samantha computes the MAC \( HMAC \) of the message \( m \) using the key \( K \) and the two strings \( ipad \) and \( opad \) in the hash function: \( HMAC = \mathcal{H}((K \oplus opad) \: \| \: \mathcal{H}((K \oplus ipad) \: \| \: m)) \), where \( \| \) denotes concatenation and \( \oplus \) is the XOR operation. She then sends the message \( m \) and the MAC \( HMAC \) to Victor. Remember, it does not matter if the communication channel is insecure, because they use a keyed hash function where the key \( K \) is known only to Samantha and Victor.
Victor verifies the integrity of the received message \( m \) by computing its MAC \( HMAC' = \mathcal{H}((K \oplus opad) \: \| \: \mathcal{H}((K \oplus ipad) \: \| \: m)) \) in the same way as Samantha, and checks that it is equal to the received MAC \( HMAC \) from Samantha.
The protocol
Key exchange |
Samantha and Victor agree on the key \( K \) and the hash function \( \mathcal{H} \).
|
Signing |
Samantha: Uses the two fixed strings \( ipad = 36 \dots 36 \) and \( opad = 5C \dots 5C \) (written in hexadecimal). Uses the key \( K \) to compute the MAC \( HMAC = \mathcal{H}((K \oplus opad) \: \| \: \mathcal{H}((K \oplus ipad) \: \| \: m)) \) of the message \( m \), where \( \| \) denotes concatenation and \( \oplus \) denotes the XOR operation.
|
|
Samantha sends the MAC \( HMAC \) and the message \( m \) to Victor. |
Verification |
|
Victor: Uses the two fixed strings \( ipad = 36 \dots 36 \) and \( opad = 5C \dots 5C \) (written in hexadecimal). Uses the key \( K \) to compute the MAC \( HMAC' = \mathcal{H}((K \oplus opad) \: \| \: \mathcal{H}((K \oplus ipad) \: \| \: m)) \) of the message \( m \), where \( \| \) denotes concatenation and \( \oplus \) denotes the XOR operation. Verifies that \( HMAC' = HMAC \).
|
Try a demo of the hash function here.
Example
Victor needs to send some secret information to specific people. It is not a problem if others see who these people are, but it is important that nobody else receives the secret information. Unfortunately, Samantha is the only person who has the email addresses of all these people, so she has to send the document \( m \) containing all the email addresses. To ensure the integrity of the document, Samantha computes the MAC \( HMAC \) of the document.
First, Samantha and Victor agree to use SHA-1 as the hash function \( \mathcal{H} \) and the key \( K =
C272 \dots 9D \) (written in hexadecimal). Samantha then uses the key and the two strings \( ipad = 36 \dots 36 \) and
\( opad = 5C \dots 5C \) (both written in hexadecimal) to compute the 160-bit MAC:
\( HMAC = \mathcal{H}((C272 \dots 9D \oplus 5C \dots 5C) \: \| \: \mathcal{H}((C272 \dots 9D \oplus 36
\dots 36) \: \| \: m)) = 6F34 \dots 60 \)
The key and the two strings \( ipad \) and \( opad \) are all 512 bits, and the MAC is 160 bits because SHA-1 is used as the hash
function. She then sends the MAC \( HMAC \) and the document \( m \) to Victor.
Victor verifies the integrity of the received document \( m \) by computing the MAC \( HMAC' = 6F34
\dots 60 \) just as Samantha did. Because \( HMAC = HMAC' \), he knows that nobody has altered the
document.
The CBC-MAC explained
CBC-MAC is a keyed hash function based on the block cipher DES
or AES in CBC mode with a fixed
initialization vector \( IV = 0 \dots 0 \), meaning \( IV \) consists only of zeros. Note that IV is 56 bits
in DES and 128 bits in AES.
First, Samantha and Victor agree on the key \( K \) and the encryption algorithm \( E \), i.e., whether
they want to use DES or AES. Samantha then divides the message \( m \) into plaintext blocks \( m_{1},
m_{2}, \dots, m_{t} \) of 56 bits or 128 bits, depending on whether she uses DES or AES. If the last block
is too short, she pads it until it becomes 56 bits or 128 bits. She then sets the first ciphertext block equal
to the initialization vector, i.e., \( c_{0} = IV \), and computes the remaining \( t \) ciphertext blocks
by encrypting the \( i \)th plaintext block XORed with the previous ciphertext block, i.e., \( c_{i} =
E_{K}(c_{i-1} \oplus m_{i}) \), where \( \oplus \) is the XOR operator. The last ciphertext block \( c_{t} \)
is then the MAC of the message \( m \), which she sends to Victor together with the message \( m \).
Victor verifies the integrity of the received message \( m \) by computing its MAC \( c_{t}' \) as Samantha
did, and checks that it is equal to the received MAC \( c_{t} \) from Samantha.
The protocol
Key exchange |
Samantha and Victor agree on the key \( K \) and the encryption algorithm \( E \). |
Signing |
Samantha: Uses the fixed initialization vector \( IV = 0 \dots 0 \) (all zeros) with the same length as the block size. Uses the key \( K \) with the encryption algorithm \( E \) to encrypt the plaintext blocks as follows: \( c_{0} = IV \) and \( c_{i} = E_{K}(c_{i-1} \oplus m_{i}) \). The last ciphertext block \( c_{t} \) is the MAC of the message \( m \).
|
|
Samantha sends the MAC \( c_{t} \) and the message \( m \) to Victor. |
Verification |
|
Victor: Uses the fixed initialization vector \( IV = 0 \dots 0 \) (all zeros) with the same length as the block size. Uses the key \( K \) with the encryption algorithm \( E \) to encrypt the plaintext blocks as follows: \( c_{0}' = IV \) and \( c_{i}' = E_{K}(c_{i-1}' \oplus m_{i}) \), where the last ciphertext block \( c_{t}' \) is the MAC of the message \( m \). Verifies that \( c_{t}' = c_{t} \).
|
Try a demo of the hash function here.
Example
Samantha has just finished writing an article \( m \) and needs to store it in her works database.
The following day, she must print the article, but because she does not want to check the integrity of the
article again by reading the entire article (since someone may have altered it), she computes the MAC of the article using
AES. She then stores both the article and its MAC in the database, and keeps the key on her USB stick. The next day,
she recomputes the MAC of the article and checks if it matches the one stored in the database.
Because Samantha uses the AES encryption algorithm, she chooses a 128-bit key \( K = 01100001 \dots \).
She then sets the first ciphertext block equal to the fixed initialization vector \( IV = 0 \dots 0 \),
i.e., \( c_{0} = IV = 0 \dots 0 \). Suppose that the article \( m \) can be divided into two 128-bit blocks,
i.e., \( m = m_{1} \: \| \: m_{2} \), where \( \| \) denotes concatenation. She then computes the
next two ciphertext blocks (both 128 bits) as follows: \( c_{1} = E_{K}(c_{0} \oplus m_{1}) =
01001000 \dots \) and \( c_{2} = E_{K}(c_{1} \oplus m_{2}) = 01100101 \dots \), where \( c_{2} \) is the
MAC of the article \( m \). She then stores the article \( m \) and the MAC \( c_{2} \) in the database,
and keeps the key \( K \) on her USB stick.
The next day, she recomputes the MAC \( c_{2}' \) of the article \( m \) from the database, and if \( c_{2}
= c_{2}' \), she knows that nobody has altered the article.