Hash functions

Try a demo of HMAC CBC-MAC

The math and definitions explained

In mathematics, we are all familiar with arithmetic operations such as addition, subtraction, multiplication, and division. However, computers work with bit strings — that is, sequences of 0s and 1s — and therefore require similar operations that act on two bit strings. The most important bitwise operations are NOT, AND, OR, and XOR. Below, we show the results of these operations with one or two bits as input.

The NOT operation is often represented as !:

  • \( !0 = 1\)
  • \( !1 = 0\)

The AND operation is represented as \( \wedge \):

  • \( 0 \wedge 0 = 0\)
  • \( 1 \wedge 0 = 0\)
  • \( 0 \wedge 1 = 0\)
  • \( 1 \wedge 1 = 1\)

The OR operation is represented as \( \vee \):

  • \( 0 \vee 0 = 0\)
  • \( 1 \vee 0 = 1\)
  • \( 0 \vee 1 = 1\)
  • \( 1 \vee 1 = 1\)

The XOR (exclusive OR) operation is represented as \( \oplus \):

  • \( 0 \oplus 0 = 0\)
  • \( 1 \oplus 0 = 1\)
  • \( 0 \oplus 1 = 1\)
  • \( 1 \oplus 1 = 0\)
Operations on Multi-Bit Values

These operations work on each bit position independently when applied to longer bit strings. Let A = 10110110 and B = 11001101:

Operation Bit Values Description
NOT A NOT 10110110 = 01001001 Each bit flipped
A AND B 10110110 AND 11001101 = 10000100 Bitwise AND of each position
A OR B 10110110 OR 11001101 = 11111111 Bitwise OR of each position
A XOR B 10110110 XOR 11001101 = 01111011 Bitwise XOR of each position
Common Notations

These operations are represented differently across contexts:

  • Mathematical notation: ! (NOT), \(\wedge\) (AND), \(\vee\) (OR), \(\oplus\) (XOR)
  • Programming notation: ~ (NOT), & (AND), | (OR), ^ (XOR)
  • Boolean logic: NOT, AND, OR, XOR
Importance in Cryptography

Bitwise operations are fundamental to modern cryptography:

  • XOR is especially important because it forms the basis of the one-time pad and is used extensively in stream ciphers and block cipher modes
  • AND and OR are commonly used in hash functions and S-box implementations
  • Bit permutations (rearrangements) combined with these operations create diffusion and confusion in many algorithms
Key Properties of XOR

XOR has special properties that make it valuable in cryptography:

  • Self-inverse: A \(\oplus\) A = 0 (XORing any value with itself gives zero)
  • Identity element: A \(\oplus\) 0 = A (XORing with zero leaves the value unchanged)
  • Associative: A \(\oplus\) (B \(\oplus\) C) = (A \(\oplus\) B) \(\oplus\) C
  • Commutative: A \(\oplus\) B = B \(\oplus\) A

These properties allow operations like: C = P \(\oplus\) K (encryption), P = C \(\oplus\) K (decryption)

Additional Bitwise Operations

Other important operations in cryptography include:

  • Shift operations (left << and right >>): Move bits to the left or right, filling with zeros
  • Rotate operations (ROL and ROR): Similar to shifts but wrap the bits around
  • Bit permutation: Rearranging bits according to a fixed pattern

These operations are fundamental components in many encryption algorithms and hash functions.

The cryptography explained

A cryptographic hash function ensures data integrity by generating a short fingerprint of the data. If the data is altered, the fingerprint will no longer be valid. This means the data can be stored in an insecure location, and its integrity can be checked by recomputing the fingerprint and verifying that it has not changed. Another important application of a hash function is that it takes an input of arbitrary size and returns a fixed-size output. This property is very useful in digital signatures, such as RSA and ElGamal, because instead of signing a large document, we sign the short fingerprint. This is also illustrated as \( \mathcal{H} : \{0,1\}^{*} \rightarrow \{0,1\}^{k} \), which means that \( \mathcal{H} \) is a mapping from the set \( \{0,1\}^{*} \) of arbitrary size to the set \( \{0,1\}^{k} \) of size \( k \), which is also the security parameter of the hash function.

A hash function \( \mathcal{H} \) is also called a one-way function because, given the message \( m \), it is easy to compute its fingerprint \( \mathcal{H}(m) \), but given the fingerprint, it is hard to compute the original message. Hash functions can be grouped into two categories:

  • Unkeyed hash functions \( \mathcal{H} \)
  • Keyed hash functions \( \mathcal{H}_{K} \), where \( K \) is a key

Unkeyed hash functions are used to compute the fingerprint of a message, while keyed hash functions are used to compute the fingerprint of a message with a key. The latter is often called a MAC (Message Authentication Code). The difference between the two is that unkeyed hash functions can be used to verify the integrity of data, while keyed hash functions can also provide authenticity, meaning that the sender of the message is known.

Ensuring Data Integrity with Hash Functions

Suppose Samantha sends a message \( m \) through an insecure channel to her friend Victor. Because there is a potential risk that their adversary Eve could alter the message during transit, Samantha also sends the message's fingerprint \( \mathcal{H}(m) \) to Victor. When Victor receives the message \( m \) and the fingerprint \( \mathcal{H}(m) \), he first recomputes the fingerprint of \( m \) and then compares it with the received fingerprint. Only if they are equal does he know that Eve has not altered the content of \( m \).

If the fingerprint can be sent through a secure channel, Samantha and Victor use an unkeyed hash function because Eve can only intercept and alter the message. Otherwise, if the fingerprint cannot be sent through a secure channel, they have to use a keyed hash function, where they have agreed on a key \( K \) in advance, for example using the Diffie-Hellman key exchange. Now, Eve can intercept and alter both the message and the fingerprint, but since the hash function also requires the correct key \( K \) to compute the correct fingerprint \( \mathcal{H}_{K}(m) \) of the message \( m \), which is only known by Samantha and Victor, Eve cannot generate a valid fingerprint.

Types of Hash Function Attacks

If \( \mathcal{H} \) is a secure unkeyed hash function, the following three problems are considered hard to solve:

Attack Type Description Security Property
Preimage Attack Given the fingerprint \( \mathcal{H}(m) \), find the message \( m \). If hard, the hash function is preimage resistant or one-way.
Second Preimage Attack Given a message \( m \), find another message \( m' \) such that \( \mathcal{H}(m) = \mathcal{H}(m') \). If hard, the hash function is second preimage resistant.
Collision Attack Find any two different messages \( m \) and \( m' \) such that \( \mathcal{H}(m) = \mathcal{H}(m') \). If hard, the hash function is collision resistant.

It's clear that if a hash function isn't second preimage resistant, then it's not collision resistant either. It's also the case that finding a collision is easier than finding a preimage for a hash function. In other words, the highest level of security is achieved if the hash function is collision resistant.

The Birthday Paradox and Collision Resistance

Due to the famous "birthday paradox," it is possible to find a collision in approximately \( 2^{k/2} \) evaluations of the hash function, which implies that \( k \geq 256 \) is recommended by today's standards.

If you have 23 people in a room, the probability that at least two people share the same birthday exceeds 50%, despite there being 365 possible birthdays.

For hash functions, this means that if the output size is \(n\) bits (giving \(2^n\) possible hash values), an attacker only needs to compute about \(2^{n/2}\) hash values before finding a collision with high probability.

This is why a 128-bit hash function (like MD5) only provides about 64 bits of collision resistance, and why modern applications require at least 256-bit hash functions.

Real-World Security Implications

So, why do we want the hash function to be collision resistant? Suppose Samantha sends the message \( m = \mbox{"Transfer 50 USD to Carla from Samantha's account"} \) and its fingerprint \( \mathcal{H}(m) \) to Victor, who works at her bank. If Eve intercepts the message and the hash function is not collision resistant, she can create a different message \( m' = \mbox{"Transfer 50 USD to Eve from Samantha's account"} \) such that \( \mathcal{H}(m) = \mathcal{H}(m') \). Eve then sends \( m' \) and \( \mathcal{H}(m) \) to Victor, and he has no way of knowing that Eve has altered the message.

Another example is when Samantha has stored some data \( d \) in a database. Before she sends the data to the database, she computes the fingerprint \( \mathcal{H}(d) \) and stores it in a secure location, such as on a USB stick. Later, when she downloads the data \( d \) from the database, she recomputes the fingerprint and compares it to the one she stored on her USB stick. If the hash function is not collision resistant, her adversary Eve could alter the data in the database to \( d' \), but because \( \mathcal{H}(d) = \mathcal{H}(d') \), Samantha may not realize that the data has been altered.

Message Authentication Codes (MAC)

The most commonly used unkeyed hash function is SHA, which is an iterated hash function. This means the SHA algorithm only takes a fixed-size input and returns a fixed-size output, but by repeating the algorithm, SHA can process an input of arbitrary size.

When using a keyed hash function \( \mathcal{H}_{K} \), where \( K \) is the key, the fingerprint is often called a MAC (Message Authentication Code). To analyze the security of a keyed hash function, we set up a "game" with Eve as the only player. If Eve wins the game, we say that the keyed hash function is insecure; our goal is to prevent Eve from winning. In this game, Eve can send any number of messages \( m_{1}, m_{2}, \dots \) of her own choosing to an oracle. Each time the oracle receives a message \( m_{i} \), it responds with the corresponding MAC \( \mathcal{H}_{K}(m_{i}) \) under the secret key \( K \), which is known only to the oracle. When Eve is done interacting with the oracle, she returns a message \( m' \) that she did not previously send to the oracle (i.e., \( m' \) is different from \( m_{1}, m_{2}, \dots \)), along with its corresponding MAC \( \mathcal{H}_{K}(m') \) (remember, Eve does not know the key \( K \)). If the MAC \( \mathcal{H}_{K}(m') \) is a valid fingerprint of \( m' \) under the key \( K \), Eve wins the game and the hash function is considered insecure.

One common way to construct a MAC is to incorporate the key into an unkeyed hash function as part of the message to be hashed. An example of this is HMAC, which uses SHA as the unkeyed hash function. Another type of MAC is CBC-MAC, which uses DES or AES in CBC mode, where the last ciphertext block is the MAC of the message.

Common Hash Functions

Output Size: 128 bits

Status: Broken - should not be used for security applications

MD5 was designed by Ron Rivest in 1991 as a successor to MD4. Collisions were found in 2004, and the algorithm is now considered cryptographically broken. It remains in use only for non-security applications like checksums for data corruption detection.

Output Size: 160 bits

Status: Deprecated - no longer considered secure for cryptographic purposes

Developed by the NSA and published in 1995. The first practical collision was demonstrated in 2017, making it unsuitable for security applications like digital signatures and certificates. Major browsers and certificate authorities have phased out support.

Output Size: 256, 384, or 512 bits

Status: Currently secure and widely used

Designed by the NSA and published in 2001, SHA-2 comprises several hash functions with different output sizes. SHA-256 is widely used in security applications including TLS/SSL, PGP, SSH, and cryptocurrencies like Bitcoin. No practical attacks have been demonstrated against the full versions.

Output Size: 224, 256, 384, or 512 bits

Status: Currently secure and increasingly adopted

Selected through the NIST hash function competition and standardized in 2015, SHA-3 is based on the Keccak algorithm. It was designed as an alternative to SHA-2, using a completely different internal structure (sponge construction). This provides additional security if vulnerabilities are found in SHA-2.

Output Size: Variable (typically 256 or 512 bits)

Status: Secure and gaining popularity

BLAKE2 was designed in 2012 as a faster and more secure replacement for MD5 and SHA-1. BLAKE3, released in 2020, offers even better performance. Both are considered highly secure while being faster than SHA-2 and SHA-3 on modern hardware. BLAKE3 is particularly optimized for parallel computation.

Applications of Hash Functions

Password Storage

Hash functions are used to securely store passwords. Instead of storing the actual password, systems store its hash value. When a user attempts to log in, the system hashes the provided password and compares it to the stored hash.

Modern password storage also employs salting - adding random data to each password before hashing to prevent rainbow table attacks and ensure that identical passwords produce different hashes.

Digital Signatures

Hash functions enable efficient digital signatures. Instead of signing an entire document (which would be inefficient), cryptographic systems sign the document's hash value.

This provides both improved performance and the same security guarantees, as any change to the document would change its hash value, invalidating the signature.

Blockchain Technology

Hash functions are fundamental to blockchain systems like Bitcoin and Ethereum. They secure the chain of blocks through:

  • Creating unique block identifiers
  • Implementing the proof-of-work consensus mechanism
  • Generating addresses from public keys
  • Creating Merkle trees for efficient verification

Data Integrity Verification

Hash functions verify data integrity in various scenarios:

  • File downloads: Verifying downloaded files match their source
  • Data backup: Confirming backed-up data is identical to the original
  • Software distribution: Ensuring software hasn't been tampered with
  • Digital forensics: Creating verifiable snapshots of evidence

The HMAC explained

HMAC (Hash-based Message Authentication Code) is a keyed hash function that was published in 1996 by Mihir Bellare, Ran Canetti, and Hugo Krawczyk, and was adopted as a FIPS (Federal Information Processing Standards) standard in 2002.

HMAC combines a cryptographic hash function with a secret key to produce a secure authentication code. The key advantages of HMAC include:

  • No hash function modifications: HMAC uses standard hash functions like SHA-1, SHA-256, or SHA-3
  • Performance: HMAC is relatively fast compared to other cryptographic operations
  • Security proofs: HMAC has formal security proofs connecting its security to that of the underlying hash function
  • Resistance to length extension attacks: Properly constructed to resist vulnerabilities that affect naive hash-and-key combinations

The unkeyed hash function \( \mathcal{H} \) used in HMAC is often SHA-1, which means the key \( K \) is 512 bits long. HMAC also uses two fixed bit strings called \( ipad \) and \( opad \), both of which are 512 bits when SHA-1 is used. In hexadecimal notation, these two strings are defined as \( ipad = 36 \dots 36 \) and \( opad = 5C \dots 5C \). Note that the hex values "36" and "5C" each represent 8 bits, so \( ipad \) is "36" repeated 64 times and \( opad \) is "5C" repeated 64 times when SHA-1 is used.

First, Samantha and Victor agree on the key \( K \) and the hash function \( \mathcal{H} \), for example, SHA-1. Then, Samantha computes the MAC \( HMAC \) of the message \( m \) using the key \( K \) and the two strings \( ipad \) and \( opad \) in the hash function: \( HMAC = \mathcal{H}((K \oplus opad) \: \| \: \mathcal{H}((K \oplus ipad) \: \| \: m)) \), where \( \| \) denotes concatenation and \( \oplus \) is the XOR operation. She then sends the message \( m \) and the MAC \( HMAC \) to Victor. Remember, it does not matter if the communication channel is insecure, because they use a keyed hash function where the key \( K \) is known only to Samantha and Victor.

Victor verifies the integrity of the received message \( m \) by computing its MAC \( HMAC' = \mathcal{H}((K \oplus opad) \: \| \: \mathcal{H}((K \oplus ipad) \: \| \: m)) \) in the same way as Samantha, and checks that it is equal to the received MAC \( HMAC \) from Samantha.

If the key is longer than the block size, it is first hashed to produce a key of the appropriate length. If it's shorter, it's padded with zeros.

Common security considerations:
  • Use a strong hash function (e.g., SHA-256 or SHA-3) to ensure security against collision attacks.
  • Keep the key \( K \) secret and use a unique key for each application to prevent key reuse vulnerabilities.
  • Ensure the key is of sufficient length (at least 128 bits) to resist brute-force attacks.
  • When verifying HMAC values, always use constant-time comparison functions to prevent timing attacks. Don't compare the hex strings directly with standard equality operators.
The protocol
Key exchange
Samantha and Victor agree on the key \( K \) and the hash function \( \mathcal{H} \).
Signing
Samantha:
  1. Uses the two fixed strings \( ipad = 36 \dots 36 \) and \( opad = 5C \dots 5C \) (written in hexadecimal).
  2. Uses the key \( K \) to compute the MAC \( HMAC = \mathcal{H}((K \oplus opad) \: \| \: \mathcal{H}((K \oplus ipad) \: \| \: m)) \) of the message \( m \), where \( \| \) denotes concatenation and \( \oplus \) denotes the XOR operation.
Samantha sends the MAC \( HMAC \) and the message \( m \) to Victor.
Verification
Victor:
  1. Uses the two fixed strings \( ipad = 36 \dots 36 \) and \( opad = 5C \dots 5C \) (written in hexadecimal).
  2. Uses the key \( K \) to compute the MAC \( HMAC' = \mathcal{H}((K \oplus opad) \: \| \: \mathcal{H}((K \oplus ipad) \: \| \: m)) \) of the message \( m \), where \( \| \) denotes concatenation and \( \oplus \) denotes the XOR operation.
  3. Verifies that \( HMAC' = HMAC \).

Try a demo of the hash function here.

Example

Victor needs to send some secret information to specific people. It is not a problem if others see who these people are, but it is important that nobody else receives the secret information. Unfortunately, Samantha is the only person who has the email addresses of all these people, so she has to send the document \( m \) containing all the email addresses. To ensure the integrity of the document, Samantha computes the MAC \( HMAC \) of the document.

First, Samantha and Victor agree to use SHA-1 as the hash function \( \mathcal{H} \) and the key \( K = C272 \dots 9D \) (written in hexadecimal). Samantha then uses the key and the two strings \( ipad = 36 \dots 36 \) and \( opad = 5C \dots 5C \) (both written in hexadecimal) to compute the 160-bit MAC:

\( HMAC = \mathcal{H}((C272 \dots 9D \oplus 5C \dots 5C) \: \| \: \mathcal{H}((C272 \dots 9D \oplus 36 \dots 36) \: \| \: m)) = 6F34 \dots 60 \)

The key and the two strings \( ipad \) and \( opad \) are all 512 bits, and the MAC is 160 bits because SHA-1 is used as the hash function. She then sends the MAC \( HMAC \) and the document \( m \) to Victor.

Victor verifies the integrity of the received document \( m \) by computing the MAC \( HMAC' = 6F34 \dots 60 \) just as Samantha did. Because \( HMAC = HMAC' \), he knows that nobody has altered the document.

The CBC-MAC explained

CBC-MAC (Cipher Block Chaining Message Authentication Code) is a keyed hash function based on the block cipher DES or AES in CBC mode with a fixed initialization vector \( IV = 0 \dots 0 \), meaning \( IV \) consists only of zeros. Note that IV is 56 bits in DES and 128 bits in AES.

CBC-MAC leverages the security properties of block ciphers to create a message authentication code. The key features of CBC-MAC include:

  • Block cipher based: Uses well-established ciphers like AES
  • Hardware efficiency: Can leverage existing hardware acceleration for block ciphers
  • Simple structure: Built on the familiar CBC mode of operation
  • Wide adoption: Used in various security protocols and financial systems

First, Samantha and Victor agree on the key \( K \) and the encryption algorithm \( E \), i.e., whether they want to use DES or AES. Samantha then divides the message \( m \) into plaintext blocks \( m_{1}, m_{2}, \dots, m_{t} \) of 56 bits or 128 bits, depending on whether she uses DES or AES. If the last block is too short, she pads it until it becomes 56 bits or 128 bits. She then sets the first ciphertext block equal to the initialization vector, i.e., \( c_{0} = IV \), and computes the remaining \( t \) ciphertext blocks by encrypting the \( i \)th plaintext block XORed with the previous ciphertext block, i.e., \( c_{i} = E_{K}(c_{i-1} \oplus m_{i}) \), where \( \oplus \) is the XOR operator. The last ciphertext block \( c_{t} \) is then the MAC of the message \( m \), which she sends to Victor together with the message \( m \).

Victor verifies the integrity of the received message \( m \) by computing its MAC \( c_{t}' \) as Samantha did, and checks that it is equal to the received MAC \( c_{t} \) from Samantha.

For messages that aren't exact multiples of the block size:

  • PKCS#7 padding (add bytes with value equal to the padding length)
  • ISO/IEC 7816-4 padding (add '80' followed by '00's)
Security considerations:
  • Use a strong block cipher (e.g., AES) with a key length of at least 128 bits.
  • Ensure the key \( K \) is kept secret and unique for each application.
  • Standard CBC-MAC is only secure for messages of a fixed length. For variable-length messages, additional techniques like CMAC should be used.
  • Never use the same key for both CBC-MAC and CBC encryption. This can lead to serious security vulnerabilities where an attacker could forge valid MACs.
  • Always use a zero IV with CBC-MAC. Using any other IV can compromise security in certain contexts.
The protocol
Key exchange
Samantha and Victor agree on the key \( K \) and the encryption algorithm \( E \).
Signing
Samantha:
  1. Uses the fixed initialization vector \( IV = 0 \dots 0 \) (all zeros) with the same length as the block size.
  2. Uses the key \( K \) with the encryption algorithm \( E \) to encrypt the plaintext blocks as follows: \( c_{0} = IV \) and \( c_{i} = E_{K}(c_{i-1} \oplus m_{i}) \).
  3. The last ciphertext block \( c_{t} \) is the MAC of the message \( m \).
Samantha sends the MAC \( c_{t} \) and the message \( m \) to Victor.
Verification
Victor:
  1. Uses the fixed initialization vector \( IV = 0 \dots 0 \) (all zeros) with the same length as the block size.
  2. Uses the key \( K \) with the encryption algorithm \( E \) to encrypt the plaintext blocks as follows: \( c_{0}' = IV \) and \( c_{i}' = E_{K}(c_{i-1}' \oplus m_{i}) \), where the last ciphertext block \( c_{t}' \) is the MAC of the message \( m \).
  3. Verifies that \( c_{t}' = c_{t} \).

Try a demo of the hash function here.

Example

Samantha has just finished writing an article \( m \) and needs to store it in her works database. The following day, she must print the article, but because she does not want to check the integrity of the article again by reading the entire article (since someone may have altered it), she computes the MAC of the article using AES. She then stores both the article and its MAC in the database, and keeps the key on her USB stick. The next day, she recomputes the MAC of the article and checks if it matches the one stored in the database.

Because Samantha uses the AES encryption algorithm, she chooses a 128-bit key \( K = 01100001 \dots \). She then sets the first ciphertext block equal to the fixed initialization vector \( IV = 0 \dots 0 \), i.e., \( c_{0} = IV = 0 \dots 0 \). Suppose that the article \( m \) can be divided into two 128-bit blocks, i.e., \( m = m_{1} \: \| \: m_{2} \), where \( \| \) denotes concatenation. She then computes the next two ciphertext blocks (both 128 bits) as follows: \( c_{1} = E_{K}(c_{0} \oplus m_{1}) = 01001000 \dots \) and \( c_{2} = E_{K}(c_{1} \oplus m_{2}) = 01100101 \dots \), where \( c_{2} \) is the MAC of the article \( m \). She then stores the article \( m \) and the MAC \( c_{2} \) in the database, and keeps the key \( K \) on her USB stick.

The next day, she recomputes the MAC \( c_{2}' \) of the article \( m \) from the database, and if \( c_{2} = c_{2}' \), she knows that nobody has altered the article.

Comparison of HMAC and CBC-MAC

Feature HMAC CBC-MAC
Base Algorithm Cryptographic hash function (typically SHA-1, SHA-256) Block cipher (DES, AES)
Security Basis Security of the underlying hash function Security of the underlying block cipher
Performance Generally faster for longer messages Can be more efficient for shorter messages
Implementation Simpler, requires only hash function code More complex, requires block cipher mode implementation
Best Use Cases Web security, API authentication, data integrity verification Banking applications, smart cards, embedded systems
Message Length Handling Naturally handles arbitrary length messages Requires padding for messages not perfectly aligned with block size
Security Considerations Proven security properties when used with secure hash functions Secure only for fixed-length messages unless additional security measures are implemented