#SecurityGuidance

2024-11-27

Beyond Bcrypt

In 2010, Coda Hale wrote How To Safely Store A Password, which began with the repeated phrase, “Use bcrypt”, each instance of the word bcrypt linking to a bcrypt implementation for a different programming language.

This had two effects on the technology blogosphere at the time:

  1. It convinced a lot of people that bcrypt was the right answer for storing a password.
  2. It created a meme for how technology bloggers recommend specific cryptographic algorithms when they want attention from Hacker News.

At the time, it was great advice!

Credit: CMYKat

In 2010, bcrypt was the only clearly good answer for password hashing in most programming languages.

In the intervening almost fifteen years, we’ve learned a lot more about passwords, password cracking, authentication mechanisms beyond passwords, and password-based cryptography.

If you haven’t already read my previous post about password-based cryptography, you may want to give that one a once-over before you continue.

But we’ve also learned a lot more about bcrypt, its limitations, the various footguns involved with using it in practice, and even some cool shit you can build with it.

In light of a recent discussion about switching WordPress’s password hashing algorithm from PHPass (which is based on MD5) to bcrypt, I feel now is the perfect time to dive into this algorithm and its implications on real-world cryptography.

Understanding Bcrypt in 2024

Bcrypt is a password hashing function, but not a password KDF or general-purpose cryptographic hash function.

If you’re using a sane password storage API, such as PHP’s password API, you don’t even need to think about salting your passwords, securely verifying passwords, or handling weird error conditions. Instead, you only need concern yourself with the “cost” factor, which exponentially increases the runtime of the algorithm.

There’s just one problem: bcrypt silently truncates after 72 characters (or rather, bytes, if you’re pedantic and assume non-ASCII passwords, such as emoji).

Here’s a quick script you can run yourself to test this:

<?php
$example1 = str_repeat('A', 72);
$example2 = $example1 . 'B';
$hash = password_hash($example1, PASSWORD_BCRYPT);
var_dump(password_verify($example2, $hash)); // bool(true), due to truncation

This may sound ludicrous (“who uses 72 character passwords anyway?”) until you see security advisories like this recent one from Okta.

The Bcrypt algorithm was used to generate the cache key where we hash a combined string of userId + username + password. Under a specific set of conditions, listed below, this could allow users to authenticate by providing the username with the stored cache key of a previous successful authentication.

(…)

  • The username is 52 characters or longer

The other thing to consider is that many people use passphrases, such as those generated from Diceware, which produce longer strings with less entropy per character.

If you use bcrypt as-is, you will inevitably run into this truncation at some point.

“Let’s pre-hash passwords!”

In response to this limitation, many developers will suggest pre-hashing the password with a general purpose cryptographic hash function, such as SHA-256.

And so, in pursuit of a way to avoid one footgun, developers introduced two more.

AJ

Truncation on NUL Bytes

If you use the raw binary output of a hash function as your password hash, be aware that bcrypt will truncate on NUL (0x00) bytes.

With respect to the WordPress issue linked above, the default for PHP’s hashing API is to output hexadecimal characters.

This is a bit wasteful. Base64 is preferable, although any isomorphism of the raw hash output that doesn’t include a 0x00 byte is safe from truncation.
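To see why raw hash output is dangerous while any NUL-free encoding is safe, here’s a quick Python sketch (using SHA-256 for brevity; the inputs are arbitrary):

```python
import base64
import hashlib

# Find a short input whose raw SHA-256 digest contains a NUL (0x00) byte.
# Roughly 1 - (255/256)**32, or about 12%, of digests contain at least one
# NUL byte, so a small search finds one quickly.
candidate = None
for i in range(10_000):
    digest = hashlib.sha256(str(i).encode()).digest()
    if b"\x00" in digest:
        candidate = i
        break

assert candidate is not None
raw = hashlib.sha256(str(candidate).encode()).digest()
print(b"\x00" in raw)                    # True: bcrypt would truncate here
print(b"\x00" in raw.hex().encode())     # False: hex output is NUL-free
print(b"\x00" in base64.b64encode(raw))  # False: base64 output is NUL-free
```

The same reasoning applies to SHA-512 or any other hash: the raw digest is effectively random bytes, so NUL bytes will occur; any encoding whose alphabet excludes 0x00 sidesteps the truncation.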

Hash Shucking

When a system performs a migration from a fast cryptographic hash function (e.g., MD5) to bcrypt, its developers typically choose to re-hash the existing hashes with bcrypt.

Because users typically reuse passwords, you can often take the fast, unsalted hashes from another breach and use them as your password dictionary for bcrypt.

If you then succeed in verifying a fast hash as the bcrypt password, you can plug that fast hash into software like Hashcat and crack the actual password at a much faster rate (tens of billions of candidates per second, versus thousands per second).

This technique is called hash shucking (YouTube link).

You can avoid hash shucking by using HMAC with a static key–either universal for all deployments of your software, or unique per application.

It doesn’t really matter which you choose; all you really need from it is domain separation from naked hashes.

I frequently see this referred to as “peppering”, but the term “pepper” isn’t rigidly defined anywhere.

One benefit of using a per-application HMAC secret is that it makes your hashes harder to crack for anyone who doesn’t know the secret.

For balance, one downside is that your hashes are no longer portable across applications without managing this static key.

Disarming Bcrypt’s Footguns

Altogether, it’s quite straightforward to avoid bcrypt’s footguns, as I recommended to WordPress last week.

  1. Pre-hash with HMAC-SHA512.
  2. Ensure the output of step 1 is base64-encoded.
  3. Pass the output of step 2 to PHP’s password API.
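Sketched in Python for illustration, steps 1 and 2 look like this. The static key below is a placeholder; step 3 would pass the result to a bcrypt implementation (e.g., the third-party `bcrypt` package) in place of PHP’s password API:

```python
import base64
import hashlib
import hmac

# Placeholder: in a real deployment, load a randomly generated static key
# from configuration or a secrets manager.
STATIC_KEY = b"replace-with-a-random-static-key"

def prehash(password: bytes) -> bytes:
    mac = hmac.new(STATIC_KEY, password, hashlib.sha512).digest()  # step 1
    return base64.b64encode(mac)                                   # step 2

out = prehash("correct horse battery staple".encode())
print(len(out))         # 88 characters (with padding); bcrypt sees the first 72
print(b"\x00" in out)   # False: no NUL-truncation hazard
```

Step 3 (the bcrypt call) sees a fixed-length, NUL-free string regardless of what the user typed, which is the entire point.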

Easy, straightforward, and uncontroversial. Right?

Objections to Bcrypt Disarmament

The linked discussion was tedious, so I will briefly describe the objections raised to my suggestion.

  1. This is “rolling our own crypto”.
    • Answer: No, it’s a well-understood pattern that’s been discussed in the PHP community for well over a decade.
  2. Passwords over 72 characters are rare and not worthy of our consideration.
    • Answer: No, this has bitten people in unexpected ways before (see: Okta).

      When you develop a popular CMS, library, or framework, you cannot possibly know all the ways that your software will be used by others. It’s almost always better to be misuse-resistant.

  3. Pre-hashing introduces a Denial-of-Service attack risk.
    • Answer: No. Bcrypt with a cost factor of 10 is about 100,000 times as expensive as SHA2.
  4. This introduces a risk of hash shucking.
    • As demonstrated above, HMAC doesn’t suffer this problem (assuming the key is reasonably selected).
  5. Base64 encoding reduces entropy.
    • Answer: No, it’s isomorphic.
  6. Base64 with the 72 character truncation reduces entropy.
    • Answer: We’re still truncating SHA-512 to more than 256 bits of its output, so this doesn’t actually matter for any practical security reason.
  7. This would necessitate a special prefix (e.g. $2w$) to distinguish disarmed bcrypt from vanilla bcrypt that PHP’s password API wouldn’t know what to do with.
    • This is a trivial concern, for which the fix is also trivial:
      After password_hash(), modify the prefix with a marker to indicate pre-hashing.
      Before password_verify(), swap the original prefix back in.

There were some other weird arguments (such as “Bcrypt is approved by NIST for FIPS”, which is just plain false).

Why Bcrypt Truncating SHA-512 Doesn’t Matter

If you have a random secret key, HMAC-SHA-512 is a secure pseudorandom function that you can treat as a Random Oracle.

Because it’s HMAC, you don’t have to worry about Length Extension Attacks at all. Therefore, the best known attack strategy is to produce a collision.

The raw binary output of SHA-512 is 64 bytes, but may contain NUL bytes (which would truncate the hash). To avoid this, we base64-encode the output.

When you base64-encode a SHA-512 hash, the output is 88 characters (due to base64 padding). This is longer than the 72 characters supported by bcrypt, so it will truncate silently after 72 characters.

This is still secure, but to prove this, I need to use math.

First, let’s assume you’re working with an extremely secure, high-entropy password, and might be negatively impacted by this truncation. How bad is the damage in this extreme case?

There are 64 possible characters in the base64 alphabet. That’s a tautology, after all.

If you have a string of length 72, for which each character can be one of 64 values, you can represent the total probability space of possible strings as 64^72.

If you know that 64 = 2^6, you can do a little bit of arithmetic and discover this quantity is equal to 2^432.

As I discussed in my deep dive on the birthday bound, you can take the cube root of this number to find what I call the Optimal Birthday Bound.

This works out to 2^144 samples in order to find a 2^-144 probability of a single collision.

This simply isn’t going to happen in our lifetimes.

2^-144 is about 17 trillion times less likely than 2^-100.
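You can sanity-check all of this arithmetic with exact integer math:

```python
# The space of 72-character base64 strings:
space = 64 ** 72
assert space == 2 ** 432          # because 64 = 2**6 and 6 * 72 = 432

# The cube root of 2**432 (the "Optimal Birthday Bound") is exactly 2**144:
assert (2 ** 144) ** 3 == space

# The gap between 2^-144 and 2^-100 is a factor of 2^44:
print(2 ** 44)                    # about 17.6 trillion
```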

The real concern is the entropy of the actual password, not losing a few bits from a truncated hash.

After all, even though the outputs of HMAC-SHA512 are indistinguishable from random when you don’t know the HMAC key, the input selection is entirely based on the (probably relatively easy-to-guess) password.

“Why not just use Argon2 or Scrypt?”

Argon2 and scrypt don’t have the bcrypt footguns. You can hash passwords of arbitrary length and not care about NUL characters. They’re great algorithms.

Several people involved in the Password Hashing Competition (that selected Argon2 as its winner) have since lamented the emphasis on memory-hardness over cache-hardness. Cache-hardness is more important for short run-times (i.e., password-based authentication), while memory-hardness is more important for longer run-times (i.e., key derivation).

As Sc00bz explains in the GitHub readme for his bscrypt project:

Cache hard algorithms are better than memory hard algorithms at shorter run times. Basically cache hard algorithms forces GPUs to use 1/4 to 1/16 of the memory bandwidth because of the large bus width (commonly 256 to 1024 bits). Another way to look at it is memory transactions vs bandwidth. Also the low latency of L2 cache on CPUs and the 8 parallel look ups let’s us make a lot of random reads. With memory hard algorithms, there is a point where doubling the memory quarters a GPU attacker’s speed. There then is a point at which a memory hard algorithm will overtake a cache hard algorithm. Cache hard algorithms don’t care that GPUs will get ~100% utilization of memory transactions because it’s already very limiting.

Ironically, bcrypt is cache-hard, while scrypt and the flavors of Argon2 that most people use are not.

Most of my peers just care that you use a password hashing algorithm at all. They don’t particularly care which. The bigger, and more common, vulnerability is not using one of them in the first place.

I’m mostly in agreement with them, but I would prefer that anyone that chooses bcrypt takes steps to disarm its footguns.

Turning Bcrypt Into a KDF

Earlier, I noted that bcrypt is not a password KDF. That doesn’t mean you can’t make one out of bcrypt. Ryan Castellucci is an amazing hacker; they managed to do just that.

To understand why this is difficult, and why Ryan’s hack works, you need to understand what bcrypt actually is.

Bcrypt is a relatively simple algorithm at its heart:

  1. Run the Blowfish key schedule, several times, over both the password and salt.
  2. Encrypt the string "OrpheanBeholderScryDoubt" 64 times in ECB mode using the expanded key from step 1.

Most of the heavy work in bcrypt is actually done in the key schedule; the encryption of three blocks (remember, Blowfish is a 64-bit block cipher) just ensures you need the correct resultant key from the key schedule.
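As a quick sanity check on the block math:

```python
# The constant string bcrypt repeatedly encrypts is exactly 24 bytes,
# i.e., three 64-bit (8-byte) Blowfish blocks.
magic = b"OrpheanBeholderScryDoubt"
print(len(magic))        # 24
print(len(magic) // 8)   # 3 blocks
```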

“So how do you get an encryption key out of bcrypt?”

It’s simple: we, uh, hash the S-box.

static void BF_kwk(struct BF_data *data, uint8_t kwk[BLAKE2B_KEYBYTES]) {
  BF_word *S = (BF_word *)data->ctx.S;
  BF_htobe(S, 4*256);
  // it should not be possible for this to fail...
  int ret = blake2b_simple(kwk, BLAKE2B_KEYBYTES, S, sizeof(BF_word)*4*256);
  assert(ret == 0);
  BF_betoh(S, 4*256);
}

Using BLAKE2b to hash the S-box from the final Blowfish key expansion yields a key-wrapping key that can be used to encrypt whatever data is being protected.

The only feasible way to recover this key is to provide the correct password and salt to arrive at the same key schedule.

Any attack against the selection of S implies a cryptographic weakness in bcrypt, too. (I’ve already recommended domain separation in a GitHub issue.)
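The shape of the trick is easy to sketch with Python’s stdlib BLAKE2b. The S-box bytes below are a stand-in (the real design hashes the 4×256-entry Blowfish S-box state, 4 KiB, after the bcrypt key schedule), and the personalization string is a hypothetical example of how domain separation could be applied:

```python
import hashlib

# Stand-in for the post-key-schedule Blowfish S-box: 4 * 256 words * 4 bytes.
fake_sbox = bytes(range(256)) * 16   # 4096 placeholder bytes

# Hash the S-box state to derive a 64-byte key-wrapping key.
kwk = hashlib.blake2b(fake_sbox, digest_size=64).digest()
print(len(kwk))   # 64

# One possible domain-separation mechanism: BLAKE2b's personalization
# parameter (up to 16 bytes). The label here is made up for illustration.
separated = hashlib.blake2b(
    fake_sbox, digest_size=64, person=b"bcrypt-kdf-v1"
).digest()
print(kwk != separated)   # True: personalized output differs
```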

CMYKat

It’s worth remembering that Ryan’s design is a proof-of-concept, not a peer-reviewed design ready for production. Still, it’s a cool hack.

It’s also not the first of its kind (thanks, Damien Miller).

If anyone was actually considering using this design, first, they should wait until it’s been adequately studied. Do not pass Go, do not collect $200.

Additionally, the output of the BLAKE2b hash should be used as the input keying material for, e.g., HKDF. This lets you split the password-based key into multiple application-specific sub-keys without running the password KDF again for each derived key.
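A minimal HKDF sketch (RFC 5869, instantiated here with HMAC-SHA-512) shows what that sub-key split looks like. The input keying material is a stand-in for the BLAKE2b output, the info labels are made up, and a vetted library is preferable in practice:

```python
import hashlib
import hmac

def hkdf_sha512(ikm: bytes, info: bytes, length: int, salt: bytes = b"") -> bytes:
    # Extract: default salt is HashLen (64) zero bytes, per RFC 5869.
    prk = hmac.new(salt or b"\x00" * 64, ikm, hashlib.sha512).digest()
    # Expand: iterate HMAC over (previous block || info || counter).
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha512).digest()
        okm += block
        counter += 1
    return okm[:length]

ikm = b"\x42" * 64   # stand-in for the BLAKE2b key-wrapping key
enc_key = hkdf_sha512(ikm, b"app/encryption-key", 32)
mac_key = hkdf_sha512(ikm, b"app/mac-key", 32)
print(len(enc_key), len(mac_key))   # 32 32
print(enc_key != mac_key)           # True: distinct sub-keys per label
```

The expensive bcrypt step runs once; each additional sub-key costs only a couple of HMAC invocations.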

Wrapping Up

Although bcrypt is still an excellent cache-hard password hashing function suitable for interactive logins, it does have corner cases that sometimes cause vulnerabilities in applications that misuse it.

If you’re going to use bcrypt, make sure you use bcrypt in line with my recommendations to WordPress: HMAC-SHA-512, base64 encode, then bcrypt.

Here’s a quick proof-of-concept for PHP software:

<?php
declare(strict_types=1);

class SafeBcryptWrapperPoC
{
  private $staticKey;
  private $cost = 12;

  public function __construct(
    #[\SensitiveParameter]
    string $staticKey,
    int $cost = 12
  ) {
    $this->staticKey = $staticKey;
    $this->cost = $cost;
  }

  /**
   * Generate password hashes here
   */
  public function hash(
    #[\SensitiveParameter]
    string $password
  ): string {
    return \password_hash(
      $this->prehash($password),
      PASSWORD_BCRYPT,
      ['cost' => $this->cost]
    );
  }

  /**
   * Verify password here
   */
  public function verify(
    #[\SensitiveParameter]
    string $password,
    #[\SensitiveParameter]
    string $hash
  ): bool {
    return \password_verify(
      $this->prehash($password),
      $hash
    );
  }

  /**
   * Pre-hashing with HMAC-SHA-512 here
   *
   * Note that this prefers the libsodium base64 code, since
   * it's implemented in constant-time
   */
  private function prehash(
    #[\SensitiveParameter]
    string $password
  ): string {
    return \sodium_bin2base64(
      \hash_hmac('sha512', $password, $this->staticKey, true),
      \SODIUM_BASE64_VARIANT_ORIGINAL_NO_PADDING
    );
  }
}

You can see a modified version of this proof-of-concept on 3v4l, which includes the same demo from the top of this blog post to demonstrate the 72-character truncation bug.

If you’re already using bcrypt in production, you should be cautious with adding this pre-hashing alternative. Having vanilla bcrypt and non-vanilla bcrypt side-by-side could introduce problems that need to be thoroughly considered.

I can safely recommend it to WordPress because they weren’t using bcrypt before. Most of the people reading this are probably not working on the WordPress core.

Addendum (2024-11-28)

More of the WordPress team has chimed in to signal support for vanilla bcrypt, rather than disarming the bcrypt footgun.

The reason?

That would result in maximum compatibility for existing WordPress users who use the Password hashes outside of WordPress, while also not introducing yet-another-custom-hash into the web where it’s not overly obviously necessary, but while still gaining the bcrypt advantages for where it’s possible.

dd32

The hesitance to introduce a custom hash construction is understandable, but the goal I emphasized with bold text is weird and not a reasonable goal for any password storage system.

It’s true that the overwhelming majority of non-WordPress code written in PHP is just using the password hashing API. But that means it isn’t compatible with WordPress today. PHP’s password hashing API doesn’t implement phpass, after all.

In addition to being scope creep for a secure password storage strategy, it’s kind of a bonkers design constraint to expect password hashes to be portable. Why are you intentionally exposing hashes unnecessarily?

CMYKat

At this point, it’s overwhelmingly likely that WordPress will choose to not disarm the bcrypt footguns, and will just ship it.

That’s certainly not the worst outcome, but I do object to arriving there for stupid reasons, and that GitHub thread is full of stupid reasons and misinformation.

The most potent source of misinformation also barked orders at me and then tried to dismiss my technical arguments as the concerns of “the hobbyist community”, which was a great addition to my LinkedIn profile.

If WordPress’s choice turns out to be a mistake–that is to say, that their decision for vanilla bcrypt introduces a vulnerability in a plugin or theme that uses their password hashing API for, I dunno, API keys?–at least I can say I tried.

Additionally, WordPress cannot say they didn’t know the risk existed, especially in a courtroom, since me informing them of it is so thoroughly documented (and archived).

CMYKat

Here’s to hoping the risk never actually manifests. Saying “I told you so” is more bitter than sweet in security. Happy Thanksgiving.

Header image: Art by Jim and CMYKat; a collage of some DEFCON photos, as well as Creative Commons photos of Bruce Schneier (inventor of the Blowfish block cipher) and Niels Provos (co-designer of bcrypt, which is based on Blowfish).

#bcrypt #cryptography #passwordHashing #passwords #SecurityGuidance

2024-11-15

What To Use Instead of PGP

It’s been more than five years since The PGP Problem was published, and I still hear from people who believe that using PGP (whether GnuPG or another OpenPGP implementation) is a thing they should be doing.

It isn’t.

I don’t blame individual Internet users for this confusion. There is a lot of cargo-culting around communication tools in the software community, and the evangelists for the various projects muddy the waters for the rest of us.

Harubaki

The part of the free and open source software community that thinks PGP is just dandy, and therefore evangelizes the hell out of it to unsuspecting people, consists of the same kind of people that happily use XMPP+OMEMO, Matrix, or weird Signal forks that remove forward secrecy and think it’s fine.

Not to mince words: The same people who believe PGP is good are also famously not great at cryptography engineering.

If you’re going to outsource your opinions on privacy technology to someone else, make sure it’s someone who has actually found vulnerabilities in cryptographic software before. Most evangelists have not.

CMYKat

I’m not here to litigate the demerits of PGP. The Latacora article I linked above makes the same arguments I would make today, and is a more entertaining read.

It is my opinion, as a security engineer who specializes in applied cryptography, that nobody should use PGP, because there’s virtually always a better tool for the job you want to use PGP for.

(And for the uncommon use cases, offering a secure, purpose-built replacement is a work-in-progress.)

Note: I’m deliberately being blunt in this post because literally more than a decade of softspokenness from cryptography experts has done nothing to talk users off the PGP cliff. Being direct seems more effective than being tactful.

If you want a gentler touch, ask your cryptographer. If you don’t have a cryptographer, hire one.

If you can accept that every billionaire is the result of a failed system, that’s how cryptographers feel about people using PGP.

Instead, let’s examine the “use cases” of PGP and what you should be using instead. (Some of this is redundant with the Latacora article, but I’m also writing it 5 years later, so some things have changed.)

CMYKat

I’m focusing on the “what” in this post, not the “why”. If you want to know the why, read the Latacora blog, or the Matthew Green blog.

If you’re curious about the credibility of my recommendations, read my other blog posts or ask your cryptographer.

Instead of PGP, Use This

This section contains specific tools to solve the same problems that PGP tries to solve, but better.

What makes these recommendations better than PGP?

Simply, they don’t make cryptographers want to run the other way screaming when they look under the hood. PGP does.

Some people are forced to use PGP because they work for a government that legally requires them to use PGP. In that corner case, your hands are tied by lawyers, so you don’t need to bother with what cryptographers recommend.

CMYKat

Signing Software Distributions

Use Sigstore.

Note that this is an ecosystem-wide consideration, not something that specific individuals must manually opt into for each of their hobby projects. The only downside to Sigstore is it hasn’t been widely adopted yet.

If you’re a Python developer, you can just use PEP 740 to get attestations with Trusted Publishers, which gives you Sigstore for free. For most developers, this is as simple as setting up a GitHub Action to publish to PyPI.

This is a developing trend: Other programming language and package management ecosystems are following suit. I expect to see Sigstore attestations baked into NPM and Maven before the next US presidential election. With any luck, your favorite programming language could be on this list too.

Sigstore doesn’t just give you a signature that you check with a long-lived public key, nor does it require you to do the Web Of Trust rigamarole.

Rather, Sigstore gives you a lot for free. Sigstore was designed around ephemeral signing certificates rather than a long-lived private key. It was purpose-built for preventing supply-chain attacks against open source software.

Combined with Reproducible Builds, Sigstore solves the triangle of secure code delivery.

Alternatively, use minisign. If your package ecosystem doesn’t support Sigstore yet, you can get by with minisign (which is signify-compatible) until they modernize.

You can also use SSH signatures, if you’d prefer. (More on that below.)

CMYKat

Signing Git Tags/Commits

Use SSH Signatures, not PGP signatures.

With Ed25519. Stop using RSA.
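If you haven’t made the switch yet, the one-time Git configuration is small. The key path and email below are placeholders:

```shell
# Generate an Ed25519 key if you don't already have one:
ssh-keygen -t ed25519 -C "you@example.com"

# Tell Git to sign with SSH instead of GPG (requires Git 2.34+):
git config --global gpg.format ssh
git config --global user.signingkey ~/.ssh/id_ed25519.pub

# Then sign as usual:
git commit -S -m "signed commit"
git tag -s v1.0.0 -m "signed tag"
```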

Art by Harubaki

Sending Files Between Computers

Use Magic Wormhole.

You could also use SSH + rsync to do this job. That’s fine too.

CMYKat

Encrypting Backups

Tarsnap is the usual recommendation here.

There are a lot of other encrypted backup tools that work fine, if you don’t want to give Colin Percival your business. I don’t have a financial stake in any of them, nor have I audited them thoroughly.

Borg uses reasonable cryptography, but I haven’t had the time to review it carefully.

Kopia looks fine, but I really hate that they misuse “zero knowledge” to describe an encryption protocol (rather than a proof system). We should not reward this misbehavior by marketers.

The point is: You’ve got options.

Too many options, in my opinion, to settle for PGP.

CMYKat

Encrypting Application Data

Use Tink or libsodium.

Avoid: OpenPGP, OpenSSL and its competitors.

Not a lot to say here. I’ve written a lot about this over the years. Misuse-resistant cryptography libraries–especially ones that make key management less painful for users–are the way to go.

Harubaki

Encrypting Files

Use age.

Age is what PGP file encryption would be if PGP didn’t suck shit.

Age has two modes: Public-key encryption, and password-based key derivation.

Here’s a quick comparison table between what age offers, and what PGP uses in the installed base:

| | age | PGP |
| --- | --- | --- |
| Data encryption mode | AEAD (ChaPoly) | CAST5 (64-bit block cipher) in CFB mode with a strippable SHA1 “MDC” |
| Key-commitment | Yes (via the header) | Pah! You wish! Dream on. PGP isn’t even AEAD. |
| Password KDF memory hard? | Yes, with scrypt. | No. |
| Vulnerable to chosen-ciphertext attacks? | No. | Yes, but PGP proponents stupidly consider this a good thing. |
| Supports 90’s-era cryptography? | No. | Yes. |
| Releases unauthenticated plaintext? | No. | Yes. |
| Uses versioned protocols rather than “cipher agility”? | Yes. | No. See: 90’s era cryptography. |
| Most common implementations are memory-safe? | Yes (Go, Rust). | No (C). |

Like, it’s not even close.

CMYKat

Some PGP proponents will insist that AEAD is possible now, but as long as the installed base of PGP remains backwards compatible with the lowest common denominator, that’s what your software uses.

Just use age. Or rage, if you’re a Rust enthusiast.

(And if you have concerns about “which age key should I trust?”, I’m already planning an age-v1 extension for the Public Key Directory project. More on that below.)

Art by Scruff

Private Messaging

Use Signal.

Security teams around the world insist that they need PGP for bug bounty submissions or security operations, but Signal does this job better than PGP ever did.

Once upon a time, you needed to give people a phone number to use Signal, but that hasn’t been the case for a long time. Still, many people have missed that memo and think it’s a requirement.

My Signal username is soatok.45. Go ahead and message me. You won’t learn my phone number that way.

In the near future, I plan on developing end-to-end encryption for direct messages on the Fediverse (including Mastodon). This is what motivated my work on the Public Key Directory to begin with.

But this is not intended to be a Signal competitor by any measure. It’s a bar-raising activity, nothing more.

CMYKat

I understand some people don’t like or trust Signal for whatever reason, but every single alternative that’s been suggested to Signal has offered inferior cryptography to Signal’s. So I will continue to recommend Signal.

Miscellaneous PGP Alternatives

This section contains things people think they need PGP for.

Identity Verification

I’m actively working on something better!

via XKCD

If you want the ability to vend a transparently verifiable public key for a given user, that’s one of the use cases for the Public Key Directory I’m designing in order to build end-to-end encryption for the Fediverse.

Although this is purpose-built for the Fediverse, I’ve deliberately included support for Auxiliary Data messages, whose formats will be specified by protocol extensions.

Rather than trying to grok the Web-of-Trust, you can simply have your software check that multiple independent Public Key Directories have verified the record, since its inclusion is published in an append-only transparency log, secured by a Merkle tree.

My design doesn’t preclude any manual key verification, or key-signing parties, or whatever other PGP cultural weirdness you want to do with these public keys. It just establishes a baseline trustworthiness even if you’re not a paranoid computer nerd.

My project isn’t finished yet. In the meantime, you can manually check public keys when using the other recommendations on this page.

Harubaki

Encrypted Email

Don’t encrypt email. From the Latacora article:

Email is insecure. Even with PGP, it’s default-plaintext, which means that even if you do everything right, some totally reasonable person you mail, doing totally reasonable things, will invariably CC the quoted plaintext of your encrypted message to someone else (we don’t know a PGP email user who hasn’t seen this happen). PGP email is forward-insecure. Email metadata, including the subject (which is literally message content), are always plaintext.

There isn’t a recommendation for encrypted email because that’s not a thing people should be doing.

Art by AJ

Now, there exists a minority of extremely technical computer users for whom Signal is a nonstarter (because you need a smartphone and valid phone number to enroll in the first place).

Because those people are generally not the highest priority of cryptographers (who are commonly focused on the privacy of common folk–including people in poor and developing countries where smartphones are more common than desktop computers), there presently isn’t really a good recommendation for private messaging that meets their constraints.

Not Matrix.

Not XMPP+OMEMO.

Certainly not PGP, either.

What PGP offers here is security theater: the illusion of safety. But it’s not actually a robust private communication mechanism, as Latacora argues.

CMYKat

“I insist that I need encrypted email!”

If you find someone insisting that they “need” encrypted email, read up on the XY Problem. In a lot of cases, that’s what’s happening here.

Do they ipso facto need email (as in, specifically the email protocols and email software)?

And do they care more about this constraint, or the privacy of their communications?

Because if their goal is just to communicate privately, see above.

If the tool they’re using being email is more important than privacy, they should consider sending empty messages with an attachment, and use age to encrypt the actual message before attaching it.

That’s serviceable, just beware that everything Latacora wrote about encrypted emails still applies to your use case, so expect someone to CC or forward your message as plaintext.

(Unless you’re legally required to use PGP because of a government regulation… in which case, why do you care about my recommendations if you’re chained by the ankle to your government’s bad technology choices?)

Finally, miss me with the “but someone can screenshot Signal” genre of objections.

As Latacora noted, people accidentally fuck up PGP all the time! It’s very easy to do.

Conversely, you have to deliberately leak something from Signal. There is no plaintext mode.

That’s the fucking bar you need to meet to compete with Signal.

PGP fails to be a Signal competitor, in ways that are worse than Threema, Matrix, or OMEMO.

Watch This Space

With all that said, I am actually designing an encrypted messaging protocol that will have an email-like user experience, except:

  1. Everything is always end-to-end encrypted, with forward secrecy.
  2. It’s not backwards compatible with insecure email.
  3. It doesn’t use PGP, or any 1990’s era cryptography.

I can’t promise a release date yet. I’m prioritizing end-to-end encryption for the Fediverse before I write the specification for that project (tentatively called AWOO, but the cryptography underpinning both projects should be similar).

Maybe 2026? We’ll see!

If someone beats me to the punch, and their design is actually good, I’ll update the post and replace this with a specific recommendation.

CMYKat

Against PGP

I don’t know how to get the message out louder or clearer about how cryptographers feel about PGP than what I wrote here.

Latacora wrote their criticism in 2019. As I write this, 2024 is almost over. When will the PGP-induced madness end?

CMYKat

Experts are not divided here. There is no controversy to teach.

Every time a cryptographer has talked about PGP, it’s been to complain about how bad it is and opine that people shouldn’t be using it.

If you’ve read this far, you already know what you should be using instead.

Header art credits: CMYKat and the GnuPG logo.

Update (2024-11-16)

Someone tried to use their Fediverse software to submit an anti-furry comment to this blog post.

Therefore, I’ve added more furry art to it.

loviesophiee

If you’re curious about the cryptography used by other messaging apps, please refer to this page that collects my blogs about this topic.

#alternatives #codeSigning #digitalSignatures #encryption #PGP #security #SecurityGuidance #signing

2023-03-01

Earlier this year, Cendyne wrote a blog post covering the use of HKDF, building partially upon my own blog post about HKDF and the KDF security definition, but moreso inspired by a cryptographic issue they identified in another company’s product (dubbed AnonCo).

At the bottom they teased:

Database cryptography is hard. The above sketch is not complete and does not address several threats! This article is quite long, so I will not be sharing the fixes.

Cendyne

If you read Cendyne’s post, you may have nodded along with that remark and not appreciated the degree to which our naga friend was putting it mildly. So I thought I’d share some of my knowledge about real-world database cryptography in an accessible and fun format, in the hopes that it might serve as an introduction to the specialization.

Note: I’m also not going to fix Cendyne’s sketch of AnonCo’s software here–partly because I don’t want to get in the habit of assigning homework or required reading, but mostly because it’s kind of obvious once you’ve learned the basics.

I’m including art of my fursona in this post… as is tradition for furry blogs.

If you don’t like furries, please feel free to leave this blog and read about this topic elsewhere.

Thanks to CMYKat for the awesome stickers.

Contents

  • Database Cryptography?
  • Cryptography for Relational Databases
    • The Perils of Built-in Encryption Functions
    • Application-Layer Relational Database Cryptography
      • Confused Deputies
      • Canonicalization Attacks
      • Multi-Tenancy
  • Cryptography for NoSQL Databases
    • NoSQL is Built Different
    • Record Authentication
      • Bonus: A Maximally Schema-Free, Upgradeable Authentication Design
  • Searchable Encryption
    • Order-{Preserving, Revealing} Encryption
    • Deterministic Encryption
    • Homomorphic Encryption
    • Searchable Symmetric Encryption (SSE)
    • You Can Have Little a HMAC, As a Treat
  • Intermission
  • Case Study: MongoDB Client-Side Encryption
    • MongoCrypt: The Good
      • How is Queryable Encryption Implemented?
    • MongoCrypt: The Bad
    • MongoCrypt: The Ugly
  • Wrapping Up

Database Cryptography?

The premise of database cryptography is deceptively simple: You have a database, of some sort, and you want to store sensitive data in said database.

The consequences of this simple premise are anything but simple. Let me explain.

Art: ScruffKerfluff

The sensitive data you want to store may need to remain confidential, or you may need to provide some sort of integrity guarantees throughout your entire system, or sometimes both. Sometimes all of your data is sensitive, sometimes only some of it is. Sometimes the confidentiality requirements of your data extend to where within a dataset the record you want actually lives. Sometimes that’s true of some data, but not others, so your cryptography has to be flexible to support multiple types of workloads.

Other times, you just want your disks encrypted at rest so if they grow legs and walk out of the data center, the data cannot be comprehended by an attacker. And you can’t be bothered to work on this problem any deeper. This is usually what compliance requirements cover. Boxes get checked, executives feel safer about their operation, and the whole time nobody has really analyzed the risks they’re facing.

But we’re not settling for mere compliance on this blog. Furries have standards, after all.

So the first thing you need to do before diving into database cryptography is threat modelling. The first step in any good threat model is taking inventory; especially of assumptions, requirements, and desired outcomes. A few good starter questions:

  1. What database software is being used? Is it up to date?
  2. What data is being stored in which database software?
  3. How are databases oriented in the network of the overall system?
    • Is your database properly firewalled from the public Internet?
  4. How does data flow throughout the network, and when do these data flows intersect with the database?
    • Which applications talk to the database? What languages are they written in? Which APIs do they use?
  5. How will cryptography secrets be managed?
    • Is there one key for everyone, one key per tenant, etc.?
    • How are keys rotated?
    • Do you use envelope encryption with an HSM, or vend the raw materials to your end devices?

The first two questions are paramount for deciding how to write software for database cryptography, before you even get to thinking about the cryptography itself.

(This is not a comprehensive set of questions to ask, either. A formal threat model is much deeper in the weeds.)

The kind of cryptography protocol you need for, say, storing encrypted CSV files in an S3 bucket is vastly different from what you need for relational (SQL) databases, which in turn differs significantly from schema-free (NoSQL) databases.

Furthermore, when you get to the point that you can start to think about the cryptography, you’ll often need to tackle confidentiality and integrity separately.

If that’s unclear, think of a scenario like, “I need to encrypt PII, but I also need to digitally sign the lab results so I know it wasn’t tampered with at rest.”

My point is, right off the bat, we’ve got a three-dimensional matrix of complexity to contend with:

  1. On one axis, we have the type of database.
    • Flat-file
    • Relational
    • Schema-free
  2. On another, we have the basic confidentiality requirements of the data.
    • Field encryption
    • Row encryption
    • Column encryption
    • Unstructured record encryption
    • Encrypting entire collections of records
  3. Finally, we have the integrity requirements of the data.
    • Field authentication
    • Row/column authentication
    • Unstructured record authentication
    • Collection authentication (based on e.g. Sparse Merkle Trees)

And then you have a fourth dimension that often falls out of operational requirements for databases: Searchability.

Why store data in a database if you have no way to index or search the data for fast retrieval?

Credit: Harubaki

If you’re starting to feel overwhelmed, you’re not alone. A lot of developers drastically underestimate the difficulty of the undertaking, until they run head-first into the complexity.

Some just phone it in with AES_Encrypt() calls in their MySQL queries. (Too bad ECB mode doesn’t provide semantic security!)

Which brings us to the meat of this blog post: The actual cryptography part.

Cryptography is the art of transforming information security problems into key management problems.

Former coworker

Note: In the interest of time, I’m skipping over flat files and focusing instead on actual database technologies.

Cryptography for Relational Databases

Encrypting data in an SQL database seems simple enough, even if you’ve managed to shake off the complexity I teased from the introduction.

You’ve got data, you’ve got a column on a table. Just encrypt the data and shove it in a cell on that column and call it a day, right?

But, alas, this is a trap. There are so many gotchas that I can’t weave a coherent, easy-to-follow narrative between them all.

So let’s start with a simple question: where and how are you performing your encryption?

The Perils of Built-in Encryption Functions

MySQL provides functions called AES_Encrypt and AES_Decrypt, which many developers have unfortunately decided to rely on in the past.

It’s unfortunate because these functions implement ECB mode. To illustrate why ECB mode is bad, I encrypted one of my art commissions with AES in ECB mode:

Art by Riley, encrypted with AES-ECB

The problem with ECB mode isn’t exactly that “you can see the image through it”: ECB-encrypting a compressed image won’t have that kind of block-level redundancy (and can thus make you feel safer than you actually are).

ECB art is a good visual for the actual issue you should care about, however: A lack of semantic security.

A cryptosystem is considered semantically secure if observing the ciphertext doesn’t reveal information about the plaintext (except, perhaps, the length; which all cryptosystems leak to some extent). More information here.
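To make the leakage concrete, here’s a runnable sketch. The “cipher” below is a deliberately insecure toy Feistel network standing in for AES, because the point is ECB’s block-by-block determinism, not the cipher itself:

```python
# Toy demonstration of why ECB mode is not semantically secure.
# The block cipher is a 4-round Feistel network with HMAC-SHA256
# round functions. It is NOT a secure cipher; it exists only so we
# can run "ECB mode" without third-party libraries.
import hmac
import hashlib

BLOCK = 16  # 128-bit blocks, split into two 8-byte halves

def _round(key: bytes, i: int, half: bytes) -> bytes:
    return hmac.new(key, bytes([i]) + half, hashlib.sha256).digest()[:8]

def toy_encrypt_block(key: bytes, block: bytes) -> bytes:
    left, right = block[:8], block[8:]
    for i in range(4):
        left, right = right, bytes(a ^ b for a, b in zip(left, _round(key, i, right)))
    return left + right

def ecb_encrypt(key: bytes, plaintext: bytes) -> bytes:
    # ECB mode: encrypt each block independently, with no chaining.
    assert len(plaintext) % BLOCK == 0
    return b"".join(
        toy_encrypt_block(key, plaintext[i:i + BLOCK])
        for i in range(0, len(plaintext), BLOCK)
    )

key = b"an example key!!"
# Two identical plaintext blocks...
ct = ecb_encrypt(key, b"ATTACK AT DAWN!!" * 2)
# ...produce two identical ciphertext blocks. An observer learns
# that the blocks are equal without ever touching the key.
print(ct[:BLOCK] == ct[BLOCK:])  # True
```

Equal plaintext blocks yielding equal ciphertext blocks is exactly the property that makes the penguin (or fursona) visible.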

ECB art isn’t to be confused with ECB poetry, which looks like this:

Oh little one, you’re growing up
You’ll soon be writing C
You’ll treat your ints as pointers
You’ll nest the ternary
You’ll cut and paste from github
And try cryptography
But even in your darkest hour
Do not use ECB

CBC’s BEASTly when padding’s abused
And CTR’s fine til a nonce is reused
Some say it’s a CRIME to compress then encrypt
Or store keys in the browser (or use javascript)
Diffie Hellman will collapse if hackers choose your g
And RSA is full of traps when e is set to 3
Whiten! Blind! In constant time! Don’t write an RNG!
But failing all, and listen well: Do not use ECB

They’ll say “It’s like a one-time-pad!
The data’s short, it’s not so bad
the keys are long–they’re iron clad
I have a PhD!”
And then you’re front page Hacker News
Your passwords cracked–Adobe Blues.
Don’t leave your penguins showing through,
Do not use ECB

— Ben Nagy, PoC||GTFO 0x04:13

Most people reading this probably know better than to use ECB mode already, and don’t need any of these reminders, but there is still a lot of code that inadvertently uses ECB mode to encrypt data in the database.

Also, SHOW processlist; leaks your encryption keys. Oops.

Credit: CMYKatt

Application-Layer Relational Database Cryptography

Whether burned by ECB or just cautious about not giving your secrets to the system that stores all the ciphertext protected by said secret, a common next step for developers is to simply encrypt in their server-side application code.

And, yes, that’s part of the answer. But how you encrypt is important.

Credit: Harubaki

“I’ll encrypt with CBC mode.”
If you don’t authenticate your ciphertext, you’ll be sorry. Maybe try again?

“Okay, fine, I’ll use an authenticated mode like GCM.”
Did you remember to make the table and column name part of your AAD? What about the primary key of the record?

“What on Earth are you talking about, Soatok?”
Welcome to the first footgun of database cryptography!

Confused Deputies

Encrypting your sensitive data is necessary, but not sufficient. You need to also bind your ciphertexts to the specific context in which they are stored.

To understand why, let’s take a step back: What specific threat does encrypting your database records protect against?

We’ve already established that “your disks walk out of the datacenter” is a “full disk encryption” problem, so if you’re using application-layer cryptography to encrypt data in a relational database, your threat model probably involves unauthorized access to the database server.

What, then, stops an attacker from copying ciphertexts around?

Credit: CMYKatt

Let’s say I have a legitimate user account with an ID 12345, and I want to read your street address, but it’s encrypted in the database. But because I’m a clever hacker, I have unfettered access to your relational database server.

All I would need to do is simply…

```sql
UPDATE table SET addr_encrypted = 'your-ciphertext' WHERE id = 12345
```

…and then access the application through my legitimate access. Bam, data leaked. As an attacker, I can probably even copy fields from other columns and it will just decrypt. Even if you’re using an authenticated mode.

We call this a confused deputy attack, because the deputy (the component of the system that has been delegated some authority or privilege) has become confused by the attacker, and thus undermined an intended security goal.

The fix is to use the AAD parameter from the authenticated mode to bind the data to a given context. (AAD = Additional Authenticated Data.)

```diff
- $addr = aes_gcm_encrypt($addr, $key);
+ $addr = aes_gcm_encrypt($addr, $key, canonicalize([
+     $tableName,
+     $columnName,
+     $primaryKey
+ ]));
```

Now if I start cutting and pasting ciphertexts around, I get a decryption failure instead of silently decrypting plaintext.

This may sound like a specific vulnerability, but it’s more of a failure to understand an important general lesson with database cryptography:

Where your data lives is part of its identity, and MUST be authenticated.

Soatok’s Rule of Database Cryptography

Canonicalization Attacks

In the previous section, I introduced a pseudocode function called canonicalize(). This isn’t a pasto from some reference code; it’s an important design detail that I will elaborate on now.

First, consider you didn’t do anything to canonicalize your data, and you just joined strings together and called it a day…

```php
function dumbCanonicalize(
    string $tableName,
    string $columnName,
    string|int $primaryKey
): string {
    return $tableName . '_' . $columnName . '#' . $primaryKey;
}
```

Consider these two inputs to this function:

  1. dumbCanonicalize('customers', 'last_order_uuid', 123);
  2. dumbCanonicalize('customers_last_order', 'uuid', 123);

In this case, your AAD would be the same, and therefore, your deputy can still be confused (albeit in a narrower use case).
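Here’s the same broken join, plus one possible fix, sketched in Python (the length-prefixed encoding is my own illustration, not the article’s reference code):

```python
# Why naive string joining collides, and how an unambiguous
# length-prefixed encoding fixes it.
import struct

def dumb_canonicalize(table: str, column: str, pk) -> bytes:
    return f"{table}_{column}#{pk}".encode()

def canonicalize(*parts) -> bytes:
    # Prefix each part with its 8-byte big-endian length, so no
    # combination of inputs can produce the same output bytes.
    out = b""
    for part in parts:
        raw = str(part).encode()
        out += struct.pack(">Q", len(raw)) + raw
    return out

# The confused-deputy collision described above:
assert dumb_canonicalize("customers", "last_order_uuid", 123) == \
       dumb_canonicalize("customers_last_order", "uuid", 123)

# With length prefixes, the same two inputs now differ:
assert canonicalize("customers", "last_order_uuid", 123) != \
       canonicalize("customers_last_order", "uuid", 123)
```

Any unambiguous encoding works here; the important property is that distinct input tuples can never serialize to the same byte string.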

In Cendyne’s article, AnonCo did something more subtle: The canonicalization bug created a collision on the inputs to HKDF, which resulted in an unintentional key reuse.

Up until this point, their mistake isn’t relevant to us, because we haven’t even explored key management at all. But the same design flaw can re-emerge in multiple locations, with drastically different consequences.

Multi-Tenancy

Once you’ve implemented a mitigation against Confused Deputies, you may think your job is done. And it very well could be.

Oftentimes, however, software developers are tasked with building support for Bring Your Own Key (BYOK).

This is often spawned from a specific compliance requirement (such as cryptographic shredding; i.e. if you erase the key, you can no longer recover the plaintext, so it may as well be deleted).

Other times, this is driven by a need to cut costs: storing different users’ data in the same database server, but encrypting it such that each tenant can only decrypt their own records.

Two things can happen when you introduce multi-tenancy into your database cryptography designs:

  1. Invisible Salamanders becomes a risk, due to multiple keys being possible for any given encrypted record.
  2. Failure to address the risk of Invisible Salamanders can undermine your protection against Confused Deputies, thereby returning you to a state before you properly used the AAD.

So now you have to revisit your designs and ensure you’re using a key-committing authenticated mode, rather than just a regular authenticated mode.

Isn’t cryptography fun?

“What Are Invisible Salamanders?”

This refers to a fun property of AEAD modes based on polynomial MACs. Basically, if you:

  1. Encrypt one message under a specific key and nonce.
  2. Encrypt another message under a separate key and nonce.

…Then you can get the same exact ciphertext and authentication tag. Performing this attack requires you to control the keys for both encryption operations.

This was first demonstrated in an attack against encrypted messaging applications, where a picture of a salamander was hidden from the abuse reporting feature because another attached file had the same authentication tag and ciphertext, and you could trick the system if you disclosed the second key instead of the first. Thus, the salamander is invisible to attackers.

Art: CMYKat

We’re not quite done with relational databases yet, but we should talk about NoSQL databases for a bit. The final topic in scope applies equally to both, after all.

Cryptography for NoSQL Databases

Most of the topics from relational databases also apply to NoSQL databases, so I shall refrain from duplicating them here. This article is already sufficiently long to read, after all, and I dislike redundancy.

NoSQL is Built Different

The main thing that NoSQL databases offer in the service of making cryptographers lose sleep at night is the schema-free nature of NoSQL designs.

What this means is that, if you’re using a client-side encryption library for a NoSQL database, the previous concerns about confused deputy attacks are amplified by the malleability of the document structure.

Additionally, the previously discussed cryptographic attacks against the encryption mode may be less expensive for an attacker to pull off.

Consider the following record structure, which stores a bunch of fields encrypted with AES in CBC mode:

```json
{
  "encrypted-data-key": "<blob>",
  "name": "<ciphertext>",
  "address": [
    "<ciphertext>",
    "<ciphertext>"
  ],
  "social-security": "<ciphertext>",
  "zip-code": "<ciphertext>"
}
```

If this record is decrypted with code that looks something like this:

```php
$decrypted = [];
// ... snip ...
foreach ($record['address'] as $i => $addrLine) {
    try {
        $decrypted['address'][$i] = $this->decrypt($addrLine);
    } catch (Throwable $ex) {
        // You'd never deliberately do this, but it's for illustration
        $this->doSomethingAnOracleCanObserve($i);

        // This is more believable, of course:
        $this->logDecryptionError($ex, $addrLine);
        $decrypted['address'][$i] = '';
    }
}
```

Then you can keep appending rows to the "address" field to reduce the number of writes needed to exploit a padding oracle attack against any of the <ciphertext> fields.

Art: Harubaki

This isn’t to say that NoSQL is less secure than SQL, from the context of client-side encryption. However, the powerful feature sets that NoSQL users are accustomed to may also give attackers a more versatile toolkit to work with.

Record Authentication

A pedant may point out that record authentication applies to both SQL and NoSQL. However, I mostly only observe this feature in NoSQL databases and document storage systems in the wild, so I’m shoving it in here.

Encrypting fields is nice and all, but sometimes what you want to know is that your unencrypted data hasn’t been tampered with as it flows through your system.

The trivial way this is done is by using a digital signature algorithm over the whole record, and then appending the signature to the end. When you go to verify the record, all of the information you need is right there.

This works well enough for most use cases, and everyone can pack up and go home. Nothing more to see here.

Except…

When you’re working with NoSQL databases, you often want systems to be able to write to additional fields, and since you’re working with schema-free blobs of data rather than a normalized set of relatable tables, the most sensible thing to do is to append this data to the same record.

Except, oops! You can’t do that if you’re shoving a digital signature over the record. So now you need to specify which fields are to be included in the signature.

And you need to think about how to model that in a way that doesn’t prohibit schema upgrades nor allow attackers to perform downgrade attacks. (See below.)

I don’t have any specific real-world examples here that I can point to of this problem being solved well.

Art: CMYKat

Furthermore, as with preventing confused deputy and/or canonicalization attacks above, you must also include the fully qualified path of each field in the data that gets signed.

As I said with encryption before, but also true here:

Where your data lives is part of its identity, and MUST be authenticated.

Soatok’s Rule of Database Cryptography

This requirement holds true whether you’re using symmetric-key authentication (i.e. HMAC) or asymmetric-key digital signatures (e.g. EdDSA).
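As a sketch of what authenticating the path can look like in the symmetric case, here’s a toy HMAC-based record authenticator in Python. All names are illustrative, not from any real library:

```python
# A toy record authenticator: HMAC over (path, value) pairs in a
# canonical field order, so that where a value lives is authenticated
# along with the value itself.
import hmac
import hashlib
import struct

def _enc(b: bytes) -> bytes:
    # Length-prefix each component to keep the encoding unambiguous.
    return struct.pack(">Q", len(b)) + b

def record_tag(key: bytes, record: dict, prefix: str = "") -> bytes:
    mac = hmac.new(key, digestmod=hashlib.sha256)
    for name in sorted(record):  # canonical field ordering
        path = prefix + name
        value = record[name]
        if isinstance(value, dict):
            # Recurse so nested fields carry their fully qualified path.
            mac.update(record_tag(key, value, path + "."))
        else:
            mac.update(_enc(path.encode()) + _enc(str(value).encode()))
    return mac.digest()

key = b"authentication-key-for-illustration"
rec = {"credit-card": {"number": "4111222233334444", "ccv": "000"}}
tag = record_tag(key, rec)

# Swapping the two values (same fields, different paths) changes the
# tag: the data's location is part of its identity.
moved = {"credit-card": {"number": "000", "ccv": "4111222233334444"}}
assert record_tag(key, moved) != tag
```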

Bonus: A Maximally Schema-Free, Upgradeable Authentication Design

Art: Harubaki

Okay, how do you solve this problem so that you can perform updates and upgrades to your schema but without enabling attackers to downgrade the security? Here’s one possible design.

Let’s say you have two metadata fields on each record:

  1. A compressed binary string representing which fields should be authenticated. This field is, itself, not authenticated. Let’s call this meta-auth.
  2. A compressed binary string representing which of the authenticated fields should also be encrypted. This field is also authenticated. This is at most the same length as the first metadata field. Let’s call this meta-enc.

Furthermore, you will specify a canonical field ordering for both how data is fed into the signature algorithm as well as the field mappings in meta-auth and meta-enc.

```
{
  "example": {
    "credit-card": {
      "number": /* encrypted */,
      "expiration": /* encrypted */,
      "ccv": /* encrypted */
    },
    "superfluous": {
      "rewards-member": null
    }
  },
  "meta-auth": compress_bools([
    true,  /* example.credit-card.number */
    true,  /* example.credit-card.expiration */
    true,  /* example.credit-card.ccv */
    false, /* example.superfluous.rewards-member */
    true   /* meta-enc */
  ]),
  "meta-enc": compress_bools([
    true,  /* example.credit-card.number */
    true,  /* example.credit-card.expiration */
    true,  /* example.credit-card.ccv */
    false  /* example.superfluous.rewards-member */
  ]),
  "signature": /* -- snip -- */
}
```

When you go to append data to an existing record, you’ll need to update meta-auth to include the mapping of fields based on this canonical ordering to ensure only the intended fields get validated.

When you update your code to add an additional field that is intended to be signed, you can roll that out for new records and the record will continue to be self-describing:

  • New records will have the additional field flagged as authenticated in meta-auth (and meta-enc will grow)
  • Old records will not, but your code will still sign them successfully
  • To prevent downgrade attacks, simply include a schema version ID as an additional plaintext field that gets authenticated. An attacker who tries to downgrade will need to be able to produce a valid signature too.

You might think meta-auth gives an attacker some advantage, but this only includes which fields are included in the security boundary of the signature or MAC, which allows unauthenticated data to be appended for whatever operational purpose without having to update signatures or expose signing keys to a wider part of the network.

```
{
  "example": {
    "credit-card": {
      "number": /* encrypted */,
      "expiration": /* encrypted */,
      "ccv": /* encrypted */
    },
    "superfluous": {
      "rewards-member": null
    }
  },
  "meta-auth": compress_bools([
    true,  /* example.credit-card.number */
    true,  /* example.credit-card.expiration */
    true,  /* example.credit-card.ccv */
    false, /* example.superfluous.rewards-member */
    true,  /* meta-enc */
    true   /* meta-version */
  ]),
  "meta-enc": compress_bools([
    true,  /* example.credit-card.number */
    true,  /* example.credit-card.expiration */
    true,  /* example.credit-card.ccv */
    false, /* example.superfluous.rewards-member */
    true   /* meta-version */
  ]),
  "meta-version": 0x01000000,
  "signature": /* -- snip -- */
}
```

If an attacker tries to use the meta-auth field to mess with a record, the best they can hope for is an Invalid Signature exception (assuming the signature algorithm is secure to begin with).

Even if they keep all of the fields the same, but play around with the structure of the record (e.g. changing the XPath or equivalent), so long as the path is authenticated with each field, breaking this is computationally infeasible.
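The compress_bools() helper in these sketches is pseudocode; here’s one plausible implementation (a length-prefixed bit-packing of the flags), written in Python for illustration:

```python
# One possible compress_bools() implementation: pack the flags into
# bytes, eight per byte, preceded by a 4-byte count so trailing False
# flags round-trip correctly.
import struct

def compress_bools(flags: list) -> bytes:
    out = bytearray(struct.pack(">I", len(flags)))
    for i, flag in enumerate(flags):
        if i % 8 == 0:
            out.append(0)          # start a new byte of flags
        if flag:
            out[-1] |= 1 << (i % 8)
    return bytes(out)

def decompress_bools(blob: bytes) -> list:
    (count,) = struct.unpack(">I", blob[:4])
    body = blob[4:]
    return [bool(body[i // 8] & (1 << (i % 8))) for i in range(count)]

flags = [True, True, True, False, True]
packed = compress_bools(flags)
assert decompress_bools(packed) == flags
assert len(packed) == 5  # 4-byte count + 1 byte holding 5 flags
```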

Searchable Encryption

If you’ve managed to make it through the previous sections, congratulations, you now know enough to build a secure but completely useless database.

Art: CMYKat

Okay, put away the pitchforks; I will explain.

Part of the reason why we store data in a database, rather than a flat file, is because we want to do more than just read and write. Sometimes computer scientists want to compute. Almost always, you want to be able to query your database for a subset of records based on your specific business logic needs.

And so, a database which doesn’t do anything more than store ciphertext and maybe signatures is pretty useless to most people. You’d have better luck selling Monkey JPEGs to furries than convincing most businesses to part with their precious database-driven report generators.

Art: Sophie

So whenever one of your users wants to actually use their data, rather than just store it, they’re forced to decide between two mutually exclusive options:

  1. Encrypting the data, to protect it from unauthorized disclosure, but render it useless
  2. Doing anything useful with the data, but leaving it unencrypted in the database

This is especially annoying for business types that are all in on the Zero Trust buzzword.

Fortunately, the cryptographers are at it again, and boy howdy do they have a lot of solutions for this problem.

Order-{Preserving, Revealing} Encryption

On the fun side of things, you have things like Order-Preserving and Order-Revealing Encryption, which Matthew Green wrote about at length.

[D]atabase encryption has been a controversial subject in our field. I wish I could say that there’s been an actual debate, but it’s more that different researchers have fallen into different camps, and nobody has really had the data to make their position in a compelling way. There have actually been some very personal arguments made about it.

Attack of the week: searchable encryption and the ever-expanding leakage function

The problem with these designs is that their leakage is significant enough that they no longer provide semantic security.

From Grubbs, et al. (GLMP, 2019.)
Colors inverted to fit my blog’s theme better.

To put it in other words: These designs are only marginally better than ECB mode, and probably deserve their own poems too.

Order revealing
Reveals much more than order
Softcore ECB

Order preserving
Semantic security?
Only in your dreams

Haiku for your consideration

Deterministic Encryption

Here’s a simpler, but also terrible, idea for searchable encryption: Simply give up on semantic security entirely.

If you recall the AES_{De,En}crypt() functions built into MySQL I mentioned at the start of this article, those are the most common form of deterministic encryption I’ve seen in use.

```sql
SELECT * FROM foo WHERE bar = AES_Encrypt('query', 'key');
```

However, there are slightly less bad variants. If you use AES-GCM-SIV with a static nonce, your ciphertexts are fully deterministic, and you can safely encrypt only a small number of distinct records before you’re no longer secure.

From Page 14 of the linked paper. Full view.

That’s certainly better than nothing, but it still can’t mitigate confused deputy attacks. We can do better than this.

Homomorphic Encryption

In a safer plane of academia, you’ll find homomorphic encryption, which researchers recently demonstrated with serving Wikipedia pages in a reasonable amount of time.

Homomorphic encryption allows computations over the ciphertext, which will be reflected in the plaintext, without ever revealing the key to the entity performing the computation.

If this sounds vaguely similar to the conditions that enable chosen-ciphertext attacks, you probably have a good intuition for how it works: RSA is homomorphic to multiplication, AES-CTR is homomorphic to XOR. Fully homomorphic encryption uses lattices, which enables multiple operations but carries a relatively enormous performance cost.
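The multiplicative homomorphism of RSA is easy to demonstrate with textbook (unpadded) RSA and deliberately tiny parameters. This is a toy for building intuition; unpadded RSA should never be deployed:

```python
# Textbook RSA with toy parameters: Enc(a) * Enc(b) mod n decrypts to
# (a * b) mod n, i.e. RSA is homomorphic to multiplication.
p, q = 61, 53
n = p * q                              # modulus (3233)
e = 17                                 # public exponent
d = pow(e, -1, (p - 1) * (q - 1))      # private exponent

def enc(m: int) -> int:
    return pow(m, e, n)

def dec(c: int) -> int:
    return pow(c, d, n)

a, b = 42, 7
product_ct = (enc(a) * enc(b)) % n
# Decrypting the product of ciphertexts yields the product of plaintexts:
assert dec(product_ct) == (a * b) % n  # 294
```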

Art: Harubaki

Homomorphic encryption sometimes intersects with machine learning, because the notion of training an encrypted model by feeding it encrypted data, then decrypting it after-the-fact is desirable for certain business verticals. Your data scientists never see your data, and you have some plausible deniability about the final ML model this work produces. This is like a Siren song for Venture Capitalist-backed medical technology companies. Tech journalists love writing about it.

However, a less-explored use case is the ability to encrypt your programs but still get the correct behavior and outputs. Although this sounds like a DRM technology, it’s actually something that individuals could one day use to prevent their ISPs or cloud providers from knowing what software is being executed on the customer’s leased hardware. The potential for a privacy win here is certainly worth pondering, even if you’re a tried and true Pirate Party member.

Just say “NO” to the copyright cartels.

Art: CMYKat

Searchable Symmetric Encryption (SSE)

Forget about working at the level of fields and rows or individual records. What if we, instead, worked over collections of documents, where each document is viewed as a set of keywords from a keyword space?

Art: CMYKat

That’s the basic premise of SSE: Encrypting collections of documents rather than individual records.

The actual implementation details differ greatly between designs. They also differ greatly in their leakage profiles and susceptibility to side-channel attacks.

Some schemes use a so-called trapdoor permutation, such as RSA, as one of their building blocks.

Some schemes only allow for searching a static set of records, while others can accommodate new data over time (with the trade-off between more leakage or worse performance).

If you’re curious, you can learn more about SSE here, and see some open source SSE implementations online here.

You’re probably wondering, “If SSE is this well-studied and there are open source implementations available, why isn’t it more widely used?”

Your guess is as good as mine, but I can think of a few reasons:

  1. The protocols can be a little complicated to implement, and aren’t shipped by default in cryptography libraries (e.g. OpenSSL’s libcrypto or libsodium).
  2. Every known security risk in SSE is the product of a trade-off, rather than there being a single winner for all use cases that developers can feel comfortable picking.
  3. Insufficient marketing and developer advocacy.
    SSE schemes are mostly of interest to academics, although Seny Kamara (Brown University professor and one of the luminaries of searchable encryption) did try to develop an app called Pixek which used SSE to encrypt photos.

Maybe there’s room for a cryptography competition on searchable encryption schemes in the future.

You Can Have Little a HMAC, As a Treat

Finally, I can’t talk about searchable encryption without discussing a technique that’s older than dirt by Internet standards, and which has been independently reinvented by countless software developers tasked with encrypting database records.

The oldest version I’ve been able to track down dates to 2006 by Raul Garcia at Microsoft, but I’m not confident that it didn’t exist before.

The idea I’m alluding to goes like this:

  1. Encrypt your data, securely, using symmetric cryptography.
    (Hopefully your encryption addresses the considerations outlined in the relevant sections above.)
  2. Separately, calculate an HMAC over the unencrypted data with a separate key used exclusively for indexing.

When you need to query your data, you can just recalculate the HMAC of your challenge and fetch the records that match it. Easy, right?

Even if you rotate your keys for encryption, you keep your indexing keys static across your entire data set. This lets you have durable indexes for encrypted data, which gives you the ability to do literal lookups at the performance cost of a hash function.
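A minimal sketch of this “blind index” technique, using only Python’s standard library (the key and field values are made up for illustration):

```python
# A separate, static indexing key yields one deterministic HMAC tag
# per plaintext, stored next to the ciphertext so equality queries work.
import hmac
import hashlib

INDEX_KEY = b"static-indexing-key-kept-separate-from-encryption-keys"

def blind_index(plaintext: str) -> str:
    return hmac.new(INDEX_KEY, plaintext.encode(), hashlib.sha256).hexdigest()

# Write path: store blind_index(ssn) alongside the ciphertext of ssn.
# Read path: recompute the tag from the query and match on it.
stored = blind_index("123-45-6789")
assert blind_index("123-45-6789") == stored   # exact-match lookup works
assert blind_index("999-99-9999") != stored   # different plaintext, different tag
```

Without the key, an attacker can’t precompute tags for candidate plaintexts; as discussed below, duplicate plaintexts still leak.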

Additionally, everyone has HMAC in their toolkit, so you don’t have to move around implementations of complex cryptographic building blocks. You can live off the land. What’s not to love?

Hooray!

However, if you stopped here, we regret to inform you that your data is no longer indistinguishable from random, which probably undermines the security proof for your encryption scheme.

How annoying!

Of course, you don’t have to stop with the addition of plain HMAC to your database encryption software.

Take a page from Troy Hunt: Truncate the output to provide k-anonymity rather than a direct literal look-up.

“K-What Now?”

Imagine you have a full HMAC-SHA256 of the plaintext next to every ciphertext record with a static key, for searchability.

Each HMAC output corresponds 1:1 with a unique plaintext.

Because you’re using HMAC with a secret key, an attacker can’t just build a rainbow table like they would when attempting password cracking, but it still leaks duplicate plaintexts.

For example, an HMAC-SHA256 output might look like this: 04a74e4c0158e34a566785d1a5e1167c4e3455c42aea173104e48ca810a8b1ae

Art: CMYKat

If you were to slice off most of those bytes (e.g. leaving only the last 3, which in the previous example yields a8b1ae), then with sufficient records, multiple plaintexts will now map to the same truncated HMAC tag.

Which means if you’re only revealing a truncated HMAC tag to the database server (both when storing records or retrieving them), you can now expect false positives due to collisions in your truncated HMAC tag.

These false positives blur each record into a set of candidate plaintexts (this property is called k-anonymity), which means an attacker with access to your database cannot:

  1. Distinguish between two encrypted records with the same short HMAC tag.
  2. Reverse engineer the short HMAC tag into a single possible plaintext value, even if they can supply candidate queries and study the tags sent to the database.
Art: CMYKat

As with SSE above, this short HMAC technique exposes a trade-off to users.

  • Too much k-anonymity (i.e. too many false positives), and you will have to decrypt-then-discard multiple mismatching records. This can make queries slow.
  • Not enough k-anonymity (i.e. insufficient false positives), and you’re no better off than a full HMAC.

Even more troublesome, the right amount to truncate is expressed in bits (not bytes), and calculating this value depends on the number of unique plaintext values you anticipate in your dataset. (Fortunately, it grows logarithmically, so you’ll rarely if ever have to tune this.)
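The truncation idea can be sketched like this (Python; I truncate to the top 8 bits here, which is far more aggressive than you’d ever want in practice, purely to force collisions in a tiny dataset):

```python
import hmac
import hashlib

def short_tag(index_key: bytes, plaintext: bytes, bits: int = 8) -> int:
    """Truncated HMAC tag for k-anonymous lookups. Tune `bits` to your dataset."""
    digest = hmac.new(index_key, plaintext, hashlib.sha256).digest()
    # Keep only the top `bits` bits of the HMAC output.
    return int.from_bytes(digest, "big") >> (len(digest) * 8 - bits)

index_key = b'\x02' * 32
plaintexts = [f"user-{i}".encode() for i in range(1000)]
tags = [short_tag(index_key, pt) for pt in plaintexts]

# With only 8 bits (256 buckets) and 1000 records, collisions are guaranteed
# by the pigeonhole principle: a given tag maps to several candidate rows,
# which your application must decrypt-then-discard.
bucket_sizes = {}
for t in tags:
    bucket_sizes[t] = bucket_sizes.get(t, 0) + 1
assert max(bucket_sizes.values()) > 1
```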

If you’d like to play with this idea, here’s a quick and dirty demo script.

Intermission

If you started reading this post with any doubts about Cendyne’s statement that “Database cryptography is hard”, they’ve probably been long since put to rest by the time you made it to this point.

Art: Harubaki

Conversely, anyone who specializes in this topic is probably waiting for me to say something novel or interesting, their patience wearing thin as I continue to rehash a surface-level introduction to their field without really diving deep into anything.

Thus, if you’ve read this far, I’d like to demonstrate how everything I’ve covered applies to a real-world case study of a database cryptography product.

Case Study: MongoDB Client-Side Encryption

MongoDB is an open source schema-free NoSQL database. Last year, MongoDB made waves when they announced Queryable Encryption in their upcoming client-side encryption release.

Taken from the press release, but adapted for dark themes.

A statement at the bottom of their press release indicates that this isn’t clown-shoes:

Queryable Encryption was designed by MongoDB’s Advanced Cryptography Research Group, headed by Seny Kamara and Tarik Moataz, who are pioneers in the field of encrypted search. The Group conducts cutting-edge peer-reviewed research in cryptography and works with MongoDB engineering teams to transfer and deploy the latest innovations in cryptography and privacy to the MongoDB data platform.

If you recall, I mentioned Seny Kamara in the SSE section of this post. They certainly aren’t wrong about Kamara and Moataz being pioneers in this field.

So with that in mind, let’s explore the implementation in libmongocrypt and see how it stands up to scrutiny.

MongoCrypt: The Good

MongoDB’s encryption library takes key management seriously: They provide a KMS integration for cloud users by default (supporting both AWS and Azure).

MongoDB uses Encrypt-then-MAC with AES-CBC and HMAC-SHA256, which is congruent to what Signal does for message encryption.

How Is Queryable Encryption Implemented?

From the current source code, we can see that MongoCrypt generates several different types of tokens, using HMAC (calculation defined here).

According to their press release:

The feature supports equality searches, with additional query types such as range, prefix, suffix, and substring planned for future releases.

MongoDB Queryable Encryption Announcement

Which means that most of the juicy details probably aren’t public yet.

These HMAC-derived tokens are stored wholesale in the data structure, but most are encrypted before storage using AES-CTR.

There are more layers of encryption (using AEAD), server-side token processing, and more AES-CTR-encrypted edge tokens. All of this is finally serialized (implementation) as one blob for storage.

Since only the equality operation is currently supported (which is the same feature you’d get from HMAC), it’s difficult to speculate what the full feature set looks like.

However, since Kamara and Moataz are leading its development, it’s likely that this feature set will be excellent.

MongoCrypt: The Bad

Every call to do_encrypt() includes at most the Key ID (but typically NULL) as the AAD. This means that the concerns over Confused Deputies (and NoSQL specifically) are relevant to MongoDB.

However, even if they did support authenticating the fully qualified path to a field in the AAD for their encryption, their AEAD construction is vulnerable to the kind of canonicalization attack I wrote about previously.

First, observe this code which assembles the multi-part inputs into HMAC.

```c
/* Construct the input to the HMAC */
uint32_t num_intermediates = 0;
_mongocrypt_buffer_t intermediates[3];
// -- snip --
if (!_mongocrypt_buffer_concat (
       &to_hmac, intermediates, num_intermediates)) {
   CLIENT_ERR ("failed to allocate buffer");
   goto done;
}
if (hmac == HMAC_SHA_512_256) {
   uint8_t storage[64];
   _mongocrypt_buffer_t tag = {.data = storage, .len = sizeof (storage)};
   if (!_crypto_hmac_sha_512 (crypto, Km, &to_hmac, &tag, status)) {
      goto done;
   }
   // Truncate sha512 to first 256 bits.
   memcpy (out->data, tag.data, MONGOCRYPT_HMAC_LEN);
} else {
   BSON_ASSERT (hmac == HMAC_SHA_256);
   if (!_mongocrypt_hmac_sha_256 (crypto, Km, &to_hmac, out, status)) {
      goto done;
   }
}
```

The implementation of _mongocrypt_buffer_concat() can be found here.

If either the implementation of that function, or the code I snipped from my excerpt, had contained code that prefixed every segment of the AAD with the length of the segment (represented as a uint64_t to make overflow infeasible), then their AEAD mode would not be vulnerable to canonicalization issues.
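A sketch of that length-prefixing defense (Python for brevity, not the actual libmongocrypt code):

```python
import hmac
import hashlib
import struct

def canonical_concat(*parts: bytes) -> bytes:
    """Unambiguous encoding: prefix each part with its length as a big-endian uint64."""
    out = b""
    for part in parts:
        out += struct.pack(">Q", len(part)) + part
    return out

key = b'\x03' * 32

# Naive concatenation: these two distinct multi-part inputs collide...
assert b"AB" + b"C" == b"A" + b"BC"
# ...so a MAC over the naive concatenation would collide, too.
# With length prefixes, the encodings (and thus the MACs) differ:
assert canonical_concat(b"AB", b"C") != canonical_concat(b"A", b"BC")

tag = hmac.new(key, canonical_concat(b"AB", b"C"), hashlib.sha256).digest()
```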

Using TupleHash would also have prevented this issue.

Silver lining for MongoDB developers: Because the AAD is either a key ID or NULL, this isn’t exploitable in practice.

The first cryptographic flaw sort of cancels the second out.

If the libmongocrypt developers ever want to mitigate Confused Deputy attacks, they’ll need to address this canonicalization issue too.

MongoCrypt: The Ugly

MongoCrypt supports deterministic encryption.

If you specify deterministic encryption for a field, your application passes a deterministic initialization vector to AEAD.

MongoDB documentation

We already discussed why this is bad above.
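To make the failure mode concrete, here’s a toy demonstration (Python; the keystream is a SHA-256-based stand-in for illustration only, not MongoDB’s actual AES construction): with a deterministic IV, equal plaintexts encrypt to equal ciphertexts, so equality leaks to anyone who can read the database.

```python
import hashlib

def toy_ctr_encrypt(key: bytes, iv: bytes, msg: bytes) -> bytes:
    """Toy stream cipher (SHA-256 as a keystream PRF). Illustration only."""
    stream = b""
    counter = 0
    while len(stream) < len(msg):
        stream += hashlib.sha256(key + iv + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(m ^ s for m, s in zip(msg, stream))

key = b'\x04' * 32
iv = b'\x00' * 12  # deterministic IV, as described in the quoted documentation

c1 = toy_ctr_encrypt(key, iv, b"SSN: 078-05-1120")
c2 = toy_ctr_encrypt(key, iv, b"SSN: 078-05-1120")
c3 = toy_ctr_encrypt(key, iv, b"SSN: 219-09-9999")

# Equal plaintexts produce equal ciphertexts: equality leaks to the database.
assert c1 == c2 and c1 != c3
```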

Wrapping Up

This was not a comprehensive treatment of the field of database cryptography. There are many areas of this field that I did not cover, nor do I feel qualified to discuss.

However, I hope anyone who takes the time to read this finds themselves more familiar with the subject.

Additionally, I hope any developers who think “encrypting data in a database is [easy, trivial] (select appropriate)” will find this broad introduction a humbling experience.

Art: CMYKat

https://soatok.blog/2023/03/01/database-cryptography-fur-the-rest-of-us/

#appliedCryptography #blockCipherModes #cryptography #databaseCryptography #databases #encryptedSearch #HMAC #MongoCrypt #MongoDB #QueryableEncryption #realWorldCryptography #security #SecurityGuidance #SQL #SSE #symmetricCryptography #symmetricSearchableEncryption

2022-12-29

Ever since the famous “Open Sesame” line from One Thousand and One Nights, humanity was doomed to suffer from the scourge of passwords.

Courtesy of SwiftOnSecurity

Even in a world where we use hardware tokens with asymmetric cryptography to obviate the need for passwords in modern authentication protocols, we’ll still need to include “something you know” for legal and political reasons.

In the United States, we have the Fifth Amendment to the US Constitution, which prevents anyone from being forced to testify against oneself. This sometimes includes revealing a password, so it is imperative that passwords remain a necessary component of some encryption technologies to prevent prosecutors from side-stepping one’s Fifth Amendment rights. (Other countries have different laws about passwords. I’m not a lawyer.)

If you’ve ever read anything from the EFF, you’re probably keenly aware of this, but the US government loves to side-step citizens’ privacy rights in a court of law. They do this not out of direct malice towards civil rights (generally), but rather because it makes their job easier. Incentives rule everything around me.

Thus, even in a passwordless future, anyone who truly cares about civil liberties will not want to dispense with them entirely. Furthermore, we’ll want these password-based technologies to pass the Mud Puddle test, so that we aren’t relying on large tech companies to protect us from government backdoors.

Given all this, most security professionals will agree that passwords suck. Not only are humans bad sources of randomness, but we’re awful at memorizing a sequence of characters or words for every application, website, or operating system that demands a password.

None of what I’ve written so far is the least bit novel or interesting. But here’s a spicy take:

The only thing that sucks worse than passwords is the messaging around password-based cryptographic functions.

Password-Based Cryptographic Functions

That isn’t a term of art, by the way. I’m kind of coining it as a superset that includes both Password-Based Key Derivation Functions and Password Hashing Functions.

You might be wondering, “Aren’t those two the same thing?” No, they are not. I will explain in a moment.

The intersection of human-memorable secrets (passwords) and cryptographic protocols manifests in a lot of needless complexity and edge-cases; explaining any of it conversationally requires me to sound either drunk or like a blue deck player in Magic: The Gathering.

Counterspell!
Art: CMYKat

To that end, what I’m calling Password-Based Cryptographic Functions (PBCFs) includes all of the following:

  • Password-Hashing Functions
    • Bcrypt
  • Password-Based Key Derivation Functions
  • Balanced Password Authenticated Key Exchanges
    • CPace
  • Augmented Password Authenticated Key Exchanges
  • Doubly-Augmented Password Authenticated Key Exchanges

If you tried to approach categorization starting with simply “Key Derivation Functions”, you’d quickly run into a problem: What about HKDF? Or any of the NIST SP 800-108 KDFs for that matter?

If you tighten it to “Password-Based KDFs” or “Key-stretching functions”, that doesn’t include bcrypt. Bcrypt is a password hashing function, but it’s not suitable for deriving secret keys. I’ve discussed these nuances previously.

And then you have some password KDF and/or hashing algorithms that are memory-hard, some that are cache-hard.

And then some Password Authenticated Key Exchanges (PAKEs) are quantum-annoying, while others are not.

Art: CMYKat

To make heads or tails of the different algorithms, their properties, and when and where you should use them, you’d either need to secure postdoc funding for cryptography research or spend a lot of time reading and watching passwords-focused conference talks.

It helps if you’re friends with a Password Hashing Competition judge.
(Selfie taken by Sc00bz (left) at DEFCON 30 in the Quantum Village.)

So let’s talk about these different algorithms, why they exist to begin with, and some of their technical details.

Art: Harubaki

Why Does Any of This Matter?

The intersection of passwords and cryptography is one of the foundations of modern society, and one that most Internet users experience everywhere they go.

You’ll know you have succeeded in designing a cryptographic algorithm when the best way to attack it is to try every possible key. This is called an exhaustive search, or a brute-force attack, depending on the disposition of the author.

Remember earlier when I said passwords suck because humans are bad at creating or remembering strong, unique passwords for every website they visit?

Well, it turns out, that some passwords are extremely common. And even worse, people who choose common passwords tend to reuse them everywhere.

This presented a challenge to early web applications: In order to have authenticated users, you need to collect a password from the user and check it on subsequent visits. If you ever get hacked, then it’s likely that all of your users will be impacted on other websites they used the same passwords for.

Including their bank accounts.

So the challenge for any password-based scheme is simply stated: How can you safely authenticate users based on a secret they know?

Enter: Hashing

Storing a hash of a password is an obvious solution to this predicament. Early solutions involved using the user’s password as an encryption key, rather than building atop cryptographic hash functions.

However, it’s important not to call it “password encryption”.

The reason isn’t merely pedantic; it’s meant to prevent a disaster from repeating: Adobe famously encrypted passwords with Triple-DES instead of hashing them.

In early UNIX-based operating systems, your password hashes were stored in a file called /etc/passwd, along with other user account data. These days, your actual password hashes are kept in a second file, /etc/shadow, which is only readable by root.

Unfortunately, the tool used to create password hashes was called crypt, so the “encryption” lingo stuck.

Art: CMYKat

When Hashing Isn’t Enough

Although encrypting passwords is bad, using a fast cryptographic hash function (e.g. SHA256 or BLAKE2) is also bad.

LinkedIn learned this lesson the hard way in 2012, when an attacker leaked 117 million users’ hashed passwords. It turns out, they used SHA1 as their password hashing function.

Hash functions are deterministic: If you feed them the same input, you will always get the same output.

When your password hashing strategy is simply “just use SHA1”, two users with the same password will have an identical password hash.

People are bad at creating, or remembering, unpredictable passwords.

When you combine these observations, there are some extremely obvious candidates (123456, password, etc.) to prioritize when guessing.

Additionally, you could precompute these hashes once and build a reverse index of the passwords that hash to a specific output. This is called a rainbow table.
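A drastically simplified sketch of the idea (real rainbow tables use time-memory trade-off chains; this is a plain lookup table, which is the same concept without the compression):

```python
import hashlib

# Precompute once: map each unsalted hash back to its password,
# for a list of extremely common passwords.
common = ["123456", "password", "qwerty", "letmein"]
reverse_index = {hashlib.sha1(pw.encode()).hexdigest(): pw for pw in common}

# A leaked, unsalted SHA-1 password hash is now one dictionary lookup
# away from being cracked.
leaked = hashlib.sha1(b"password").hexdigest()
cracked = reverse_index.get(leaked)
assert cracked == "password"
```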

Unlike most of symmetric cryptography, where the goal ends at forcing the attacker to perform an exhaustive brute-force search of every possible key, password-based cryptography has to go the extra mile to prevent weak secrets from being compromised; a very tall order for anyone to contend with.

Towards Modern Password Hashing

None of this was a surprise to security experts in 2012. There have been a couple of generally known best practices since I first learned programming in the 2000s:

  • Salt your passwords (to mitigate rainbow tables)
  • Use an iterated hashing scheme, rather than a fast hash (to make each step of an exhaustive search slower)

PBKDF2, the first widely used password hashing and key derivation function, did exactly that:

  • PBKDF2 was designed to support randomly-generated per-user salts.
  • It used HMAC in a loop to force attackers to spend extra CPU cycles per guess. If an attacker could guess 10 million passwords per second against SHA-256, then using 100,000 iterations of PBKDF2 slowed them down to less than 100 guesses per second (due to HMAC calling the hash function twice).
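A minimal sketch of that recipe, using Python’s built-in PBKDF2 binding and the 100,000-iteration figure from the example above:

```python
import hashlib
import os

# A random per-user salt plus an iterated hash: the PBKDF2 recipe described above.
salt = os.urandom(16)
iterations = 100_000

hash1 = hashlib.pbkdf2_hmac("sha256", b"correct horse battery staple", salt, iterations)
hash2 = hashlib.pbkdf2_hmac("sha256", b"correct horse battery staple", salt, iterations)
other = hashlib.pbkdf2_hmac("sha256", b"correct horse battery staple", os.urandom(16), iterations)

assert hash1 == hash2  # same password + same salt -> same hash (verification works)
assert hash1 != other  # a fresh salt defeats precomputed tables
```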

And that might sound like the end of the story (“PBKDF2 wins”), but the password cracking community is extremely clever, so it’s only just begun.

Parallel Attacks and GPUs

Since your CPU can only guess a few hashes per second, PBKDF2 is a thorny problem to contend with.

However, there’s an easy way for attackers to use commodity hardware to bypass this limitation: Graphics cards are specially designed to perform a lot of low-memory calculations in parallel.

If you can write your password cracking code to leverage the benefits of GPUs instead of CPUs, then PBKDF2 becomes easy potatoes.

The other secure password hashing algorithm in use at the time, bcrypt, had only a linear improvement over PBKDF2’s security against GPU attacks.

Memory-Hard Password Hashing Algorithms

In 2009, Colin Percival proposed scrypt to mitigate this risk. Scrypt (pronounced “ess crypt”) builds atop PBKDF2-SHA256 by hashing pseudorandom regions of memory with Salsa20/8 (hence the S in Scrypt) in a way that makes it difficult to precompute.

Scrypt hashes require large amounts of memory to compute (at least 16 megabytes, in the recommended minimum configuration, versus the few kilobytes of incumbent password hashes).

While GPUs (and FPGAs, and ASICs, and …) are great for cracking password hashes that use CPU-bound algorithms, algorithms that access large amounts of memory are difficult to attack effectively.

It’s still possible for clever programmers to work around this memory/bandwidth limitation, but to do so, you must compute more operations, which makes it not worthwhile.

Cryptographers and the password-cracking community call this expensive time-memory trade-off memory hardness. In 2016, it was determined that scrypt is maximally memory-hard.

Password Hashing Competition

The Password Hashing Competition (PHC) was kicked off by JP Aumasson in 2012 to develop a new standard for password hashing and password-based key derivation.

In 2015, the PHC selected Argon2 as its recommendation, with four finalists receiving special recognition (Catena, Lyra2, Makwa, and yescrypt, which is based on scrypt).

Cache-Hardness

In the years since the PHC concluded, the advances in password cracking techniques have not slowed down.

One of the PHC judges recently proposed a new algorithm called bscrypt, which purports to offer a property called “cache-hardness” rather than memory-hardness. The reasoning is that, for shorter run times, memory bandwidth and small CPU caches are a tighter bottleneck than overall memory requirements.

In fact, there’s an Argon2 variant (Argon2ds) which offers cache-hardness.

Two Hard Choices

So which of the two should you care about, as a defender?

If you’re validating password hashes in an online service, a cache-hard algorithm might make more sense. You’re incentivized to have shorter server-side run times to avoid DoS attacks, so the benefits of cache-hardness seem more impactful than memory-hardness.

If you’re deriving a sensitive cryptography key from a password and can afford to take several seconds of wall-clock time, a memory-hard algorithm is likely to be a better fit for your threat model.

(Or, like, run both a cache-hard and memory-hard algorithm and use a fast KDF, such as HKDF, to mix the two into your final cryptography key.)
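That mixing step might look like the following sketch (Python; the HKDF here is a minimal RFC 5869 implementation since the stdlib lacks one, and the two KDF outputs are hypothetical stand-ins rather than real Argon2/bcrypt calls):

```python
import hmac
import hashlib

def hkdf(ikm: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """Minimal HKDF-SHA256 (RFC 5869): extract, then expand."""
    prk = hmac.new(salt, ikm, hashlib.sha256).digest()
    okm, block = b"", b""
    counter = 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]

# Hypothetical stand-ins for the outputs of a memory-hard KDF (e.g. Argon2id)
# and a cache-hard algorithm, computed over the same password:
memory_hard_out = b'\x05' * 32
cache_hard_out = b'\x06' * 32

# Mix both outputs into one final encryption key with a fast KDF.
final_key = hkdf(memory_hard_out + cache_hard_out, salt=b"\x00" * 32,
                 info=b"example-final-key", length=32)
assert len(final_key) == 32
```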

Of course, the story doesn’t end here. Password Hashing and Password-Based Key Derivation continues to mature as the password cracking community discovers new attacks and engineers defenses against them.

Art: ScruffKerfluff

A Strange Game; The Only Winning Move is to PAKE

Password hashing is a defense-in-depth against a system compromise that enables attackers to perform offline attacks against a cryptographic output to determine a user’s password.

However, for many applications, the bigger question is, “Why even play this game in the first place?”

Password-Authenticated Key Exchanges

A password authenticated key exchange (PAKE) algorithm is an interactive protocol to establish keys based on at least one party’s knowledge of a password.

PAKEs are what enable you to log into encrypted WiFi connections and encrypt your traffic. PAKEs prevent you from having to reveal the password (or a password-equivalent value, such as a password hash) to the other party.

Although there are a lot of proposed PAKE algorithms, the one that most people implemented was SRP (“Secure Remote Password”), which was intuitively a lot like Diffie-Hellman but also not Diffie-Hellman (since SRP used addition).

For a good teardown on SRP, Matthew Green’s blog is highly recommended.

There are a few real-world cryptography problems with using SRP:

  1. You need to use a special kind of prime number for your protocol. The standard Diffie-Hellman moduli are not suitable for SRP; you want a safe prime (i.e. a number of the form N = 2q + 1, where q is also prime).
  2. One of the steps in SRP, client-side, is to compute A = g^a (mod N). Here, A is a public value while a is a secret (usually the SHA1 hash of the username and password).

    If your software implementation of modular exponentiation (g^a mod N) isn’t constant-time, you leak a, which is a password-equivalent value.

Additionally, it’s not possible to use SRP with elliptic curve groups. (Matthew Green’s post I linked above covers this in detail!)

Thus, SRP is generally considered to be deprecated by security experts, in favor of other PAKEs.

IETF, CFRG, PAKE Selection

As early as IETF 104, the Crypto Forum Research Group (CFRG) began exploring different PAKE algorithms to select for standardization.

One of the motivators for this work is that the WiFi alliance had shoehorned a new PAKE called Dragonfly into their WPA3, which turned out to be badly designed.

The CFRG’s candidate algorithms and reviews were publicly tracked on GitHub, if you’re curious.

TL;DR – They settled on recommending OPAQUE as an augmented PAKE and CPace as a balanced PAKE.

Sc00bz has done multiple talks at security conferences over the years to talk about PAKEs:

Quantum Annoyance

The PAKEs we’ve discussed involved a cyclic group (and the newer ones involved an elliptic curve group). The security of these schemes is dependent on the hardness of a discrete logarithm problem.

If a cryptographically relevant quantum computer (CRQC) is developed, discrete logarithm problems will cease to be hard.

Some PAKEs fall apart the moment a discrete log is solvable. This is par for the course for Diffie-Hellman and elliptic curve cryptography.

Others require an attacker to solve a discrete log for every guess in an offline attack (after capturing a successful exchange). This makes them annoying for quantum attackers.

While they don’t provide absolute security like post-quantum cryptography, they do make attackers armed with a CRQC work harder.

OPAQUE isn’t quantum annoying. Simply observe all of the packets from making an online guess, solve the DLP offline, and you’re back to the realm of classical offline guessing.

Art: Swizz

The State of the Art

At this point, you may feel somewhat overwhelmed by all of this information. It’s very tempting to simplify all of this historical context and technical detail.

You might be telling yourself, “Okay, Scrypt, Argon2, CPace, OPAQUE. Got it. That’s all I need to remember. Everything else is questionable at best. Ask a cryptographer. I’m bailing out, yo!”

But the story gets worse on two fronts: Real-world operational requirements, and compliance.

Your Hands Are Tied, Now Swim

If you’re selling a product or service to the US government that uses cryptography, you need to first clear a bare minimum bar called FIPS 140.

FIPS 140 is opinionated about which algorithms you use. For password hashing and password-based key derivation, FIPS defers to NIST SP 800-132, which means you’re only permitted to use PBKDF2 in any FIPS modules (with a SHA2-family hash function). (Update: See below.)

So all of the information about memory-hard and cache-hard password hashing algorithms? This is useless for you if you ever try to sell to the US government.

An open question is whether Scrypt is FIPSable due to it building atop PBKDF2. To be blunt: I’m not sure. Ask a NIST Cryptographic Module Validation Program lab for their opinion.

Update (2023-01-07): A Brief Correction

Thanks to FreakLegion for pointing this out.

According to NIST SP 800-63B, memory-hard hash functions are recommended.

Examples of suitable key derivation functions include Password-based Key Derivation Function 2 (PBKDF2) [SP 800-132] and Balloon [BALLOON]. A memory-hard function SHOULD be used because it increases the cost of an attack.

NIST SP 800-63B

Don’t use Balloon, though. It’s under-specified.

FreakLegion indicates in their Hacker News comment that scrypt and yescrypt are both acceptable.

End Update

In the realm of PAKEs, both CPace and OPAQUE are frameworks that can be used with multiple elliptic curve groups.

Both frameworks support NIST P-256, but CPace supports X25519 and OPAQUE supports Ristretto255; these are currently not FIPSable curves.

“I don’t care about FIPS. Argon2 all the things!”

JavaScript runtimes don’t provide a built-in Argon2 implementation. This poses several challenges.

  • Do you care about iOS users? Lockdown mode prevents you from using WebAssembly, even in a browser extension.
  • Trying to polyfill scrypt or Argon2 in a scripting language is a miserable experience. We’re talking quadratic performance penalties here. Several minutes of CPU time to calculate a hash that is far too weak to be acceptable in any context.

Consequently, not a single password manager I’ve evaluated that provides a browser extension uses a memory-hard algorithm for deriving encryption keys.

Update: I’ve been told Dashlane uses Argon2.

Art: CMYKat

This Is Frustrating

When I write about cryptography, my goal is always to be clear, concise, and helpful. I want whoever reads my writing to, regardless of their level of expertise, walk away more comfortable with the subject matter.

At minimum, I want you to be able to intuit when you need to ask an expert for help, and have a slightly better understanding of how to word your questions to get the answer you need. In an ideal world, you’d even be able to spot the same kind of weaknesses I do in cryptography designs and implementations after reading enough of this blog.

I can’t do that with password-based cryptography. The use-cases and threat models are too varied to make a clear-cut recommendation, and it’s too complicated to parcel out in a way that doesn’t feel reductive or slightly contradictory. This leads to too many caveats or corner cases.

Passwords suck.

If you ask a password cracking expert and describe your use case, you’ll likely get an opinionated and definitive answer. But every time I’ve asked generally about this subject, without a specific use case in mind, I got an opinionated answer followed by a long chain of caveats and alternatives.

With that in mind, here’s a few vague guidelines that are up-to-date as of the end of 2022.

Art: Harubaki

Recommendations for Password-Based Cryptography in 2023

Password Hashing

If you’re authenticating users in an online service (storing the hashes in a database), use any of the following algorithms:

  • bcrypt with a cost factor appropriate to take about 100 milliseconds to compute on your server hardware (typically at least 12)
  • scrypt with N = 32768, r = 8, p = 1 and a random salt at least 128 bits in length (256 preferred)
  • Argon2id with a memory cost of 64 MiB, ops cost of 3, and parallelism of 1
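As one concrete instance of the above (the scrypt option, via Python’s stdlib binding; the wide maxmem ceiling is my own addition, since 128 · r · N for these parameters is exactly 32 MiB and OpenSSL’s default limit sits right at that boundary):

```python
import hashlib
import hmac
import os

# scrypt with the parameters recommended above: N=32768, r=8, p=1,
# and a 256-bit random salt.
salt = os.urandom(32)
stored = hashlib.scrypt(b"correct horse battery staple", salt=salt,
                        n=32768, r=8, p=1, maxmem=64 * 1024 * 1024, dklen=32)

# Verification: recompute with the stored salt, compare in constant time.
candidate = hashlib.scrypt(b"correct horse battery staple", salt=salt,
                           n=32768, r=8, p=1, maxmem=64 * 1024 * 1024, dklen=32)
assert hmac.compare_digest(stored, candidate)
```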

If you’re forced to use PBKDF2 for whatever dumb reason, the parameters you want are at least:

  • SHA-512 with 210,000 rounds (preferred)
  • SHA-256 with 600,000 rounds
  • SHA-1 (if you must) with 1,300,000 rounds

These numbers are based on constraining attackers to less than 10,000 hashes per second, per GPU.

I’m not currently recommending any algorithms deliberately designed for cache-hardness because they need further analysis from other cryptographers. (Bcrypt is minimally cache-hard, but that wasn’t a stated design goal.)

I didn’t evaluate the other PHC finalists nor other designs (Balloon hashing, Ball and Chain, etc.). Ask your cryptographer if those are appropriate instead.

Password-Based Key Derivation

If you’re deriving an encryption key from a password, use any of the following algorithms:

  • scrypt with N = 1048576, r = 8, p = 1 and a random salt at least 128 bits in length (256 preferred)
  • Argon2id with a memory cost of 1024 MiB, ops cost of 4, and parallelism of 1

If you’re forced to use PBKDF2 for whatever dumb reason, the parameters you want should target between 1 and 10 seconds on the defender’s hardware. If this is somehow slower than the numbers given for password hashing above, go with the password hashing benchmarks instead.

Here’s a dumb PHP script you can use to quickly find some target numbers.

```php
<?php
/* Dumb PBKDF2 benchmarking script by Soatok. 2022-12-29 */
$salt = random_bytes(32);
$password = 'correct horse battery staple';
$c = '';
$start = $end = microtime(true);
foreach (['sha1', 'sha256', 'sha512'] as $alg) {
    foreach ([
        1 << 14,
        1 << 15,
        1 << 16,
        1 << 17,
        1 << 18,
        1 << 19,
        1 << 20,
        1200000,
        1 << 21,
        2500000,
        3000000,
        3500000,
        1 << 22,
        5000000,
        1 << 23,
        1 << 24,
        1 << 25,
    ] as $n) {
        $start = microtime(true);
        $c ^= hash_pbkdf2($alg, $password, $salt, $n, 32, true);
        $end = microtime(true);

        $time = $end - $start;
        echo str_pad($n, 12, ' ', STR_PAD_LEFT),
            " iterations ({$alg}) -> \t",
            $time,
            PHP_EOL;

        if ($time > 5.5) {
            echo PHP_EOL;
            break;
        }
    }
}
```

On my laptop, targeting 5.0 seconds means at least 4,000,000 rounds of PBKDF2 with SHA1, SHA256, or SHA512 (with SHA256 coming closer to 5,000,000).

If you’re aiming for less than 1,000 hashes per second per GPU, against an attacker armed with an RTX 4090:

  • SHA-512 with 3,120,900 iterations (preferred)
  • SHA256 with 8,900,000 iterations
  • SHA-1 with 20,000,000 iterations

Sc00bz tells me that this generation of GPU has better performance/cost than previous generations (which had seen diminishing returns), but requires 1.5x the power, 1.78x the price, and 2x the physical space. Sc00bz refers to this as “1.5 GPUs”.

If you adjust for these factors, your final PBKDF2 target becomes:

  • SHA-512 with 2,100,000 iterations (preferred)
  • SHA-256 with 6,000,000 iterations
  • SHA-1 with 13,000,000 iterations

Password Authenticated Key Exchanges

In a client-server model, where the server isn’t meant to see password-equivalent data, you want an augmented PAKE (or perhaps a doubly-augmented PAKE).

You’re overwhelmingly more likely to find OPAQUE support in the immediate future, due to the direction the IETF CFRG is moving today.

In a more peer-to-peer model, you want a balanced PAKE. CPace is the best game in town.

Don’t use SRP if you can help it.

If you can’t help it, make sure you use one of the groups defined in RFC 5054, rather than a generic Diffie-Hellman group. You’ll also want an expert to review your implementation to make sure your BigInt library isn’t leaking your users’ private exponents.

Also, feel free to deviate from SHA1 for calculating the private exponent from their username and password. SHA1 sucks. Nothing is stopping you from using a password KDF (or, at worst, SHA256) here, as long as you do so consistently.

(But as always, ask your cryptography expert first.)

Art: CMYKat

TL;DR

Password-Based Cryptography is a mess.

If you’re not aware of the history of the field and the technical details that went into the various cryptographic algorithms that experts recommend today, it’s very easy to say something wrong that sounds correct to an untrained ear.

At the end of the day, the biggest risk you face with password-based cryptography is not using a suitable algorithm in the first place, rather than using a less optimal algorithm.

Experts have good reasons to hate PBKDF2 (slow for defenders, fast for attackers) and SRP (we have better options today, which also come with actual security proofs), but they certainly hate unsalted SHA1 and Triple-DES more.

My only real contribution to this topic is an effort to make it easier to talk about to non-experts by offering a new term for the superset of all of these password-based algorithms: Password-Based Cryptographic Functions.

Finally, the password cracking community is great. There are a lot of smart and friendly people involved, and they’re all working to make this problem easier to get right. If you’re not sure where to start, check out PasswordsCon.

Art: CMYKat

https://soatok.blog/2022/12/29/what-we-do-in-the-etc-shadow-cryptography-with-passwords/

#cryptographicHashFunction #cryptographicProtocols #OPAQUE #passwordHashing #PBKDF2 #SecureRemotePasswordProtocol #SecurityGuidance

Canonicalization Attacks Against MACs and Signatures
2021-07-30

Canonicalization Attacks occur when a protocol that feeds data into a hash function used in a Message Authentication Code (MAC) or Digital Signature calculation fails to ensure some property that’s expected of the overall protocol.

The textbook example of a canonicalization attack is the length-extension attack against hash functions such as MD5, which famously broke the security of Flickr's API signatures.

But there’s a more interesting attack to think about, which affects the design of security token/envelope formats (PASETO, DSSE, etc.) and comes up often when folks try to extend basic notions of authenticated encryption (AE) to include additional authenticated (but unencrypted) data (thus yielding an AEAD mode).

Let’s start with a basic AE definition, then extend it to AEAD poorly, then break our extension. Afterwards, we can think about strategies for doing it better.

Turning CTR+HMAC into AEAD

Signal uses AES-CBC then HMAC-SHA2 to encrypt messages between mobile devices.

We often refer to this shape as “CBC+HMAC” (although this is a few typos away from being confused with a very different idea called CBC-MAC).

When CBC+HMAC is used with the AES block cipher with 256-bit keys and HMAC-SHA2, it becomes AES-256-CBC+HMAC-SHA256. What a mouthful!

Yuck! Who let a cryptography nerd name anything?
(Art by Lynx vs Jackalope)

In modern designs, AES-CTR is often preferable to AES-CBC, since you don’t have to deal with padding (or padding oracles).

Most systems these days prefer GCM over CBC+HMAC or CTR+HMAC. I don’t like AES-GCM (especially if your use-case is “support platforms without hardware acceleration”), but it’s hardly the worst choice for most applications. AES-GCM is a dedicated AEAD mode, while CBC+HMAC and CTR+HMAC merely provide AE.

Why Does Additional Data Matter?

Art: Harubaki

The main purpose of Additional Data (the AD in AEAD) is to bind an encrypted payload (ciphertext + authentication tag) to a given context. This is extremely helpful in mitigating Confused Deputy attacks.

Critically, this additional data is not encrypted. (At least, on this level; it’s probably wise to communicate over HTTPS!)

Naive CTR+HMAC to AEAD Extension

In a normal CTR+HMAC definition, your algorithm looks something like this:

  1. Generate a random nonce equal to the block size of your block cipher. (16 bytes for AES.)
  2. Encrypt your message with AES-CTR, using the given key and nonce.
  3. Calculate the HMAC-SHA2 output of the nonce followed by the ciphertext from step 2.
  4. Return Nonce, Ciphertext, MAC.

Decryption basically runs steps 3 and 2 in reverse: Recalculate the MAC (in constant-time!), decrypt ciphertext, return plaintext.

The most obvious way to extend this design to support additional authenticated data is to append it to the ciphertext.

This yields the following updated protocol:

  1. Generate a random nonce equal to the block size of your block cipher. (16 bytes for AES.)
  2. Encrypt your message with AES-CTR, using the given key and nonce.
  3. Calculate the HMAC-SHA2 output of the nonce followed by the ciphertext from step 2, then the additional authenticated data.
  4. Return Nonce, Ciphertext, MAC.

Suffice to say, this is not a secure construction.

The Canonicalization Attack

Let’s say you built this extended protocol to encrypt a payload that looks like a URI string, but wanted to bind the token to a given browser session, so you use the HTTP User-Agent header as the AAD.

When you generate a token, you might produce the following:

const crypto = require('crypto');

function splitKey(key) {
    let hmac;
    hmac = crypto.createHmac('sha256', key);
    hmac.update('encrypt-key');
    let Ek = hmac.digest();
    hmac = crypto.createHmac('sha256', key);
    hmac.update('hmac-key');
    let Ak = hmac.digest();
    return [Ek, Ak];
}

function encryptWithContext(plaintext, aad, key) {
    let [encKey, authKey] = splitKey(key);
    console.log(encKey, authKey);
    let nonce = crypto.randomBytes(16);
    const aes = crypto.createCipheriv('aes-256-ctr', encKey, nonce);
    const ciphertext = aes.update(plaintext);
    aes.final();
    // Pay attention to this part:
    const hmac = crypto.createHmac('sha256', authKey);
    hmac.update(nonce);
    hmac.update(ciphertext);
    hmac.update(aad);
    return [nonce, ciphertext, hmac.digest()];
}

let plaintext = [
    'expires=1630223780',
    'access_group=1234',
    'subject=auth-v2.example.com',
    'restrictions=on'
].join('&');
// expires=1630223780&access_group=1234&subject=auth-v2.example.com&restrictions=on

const key = crypto.randomBytes(32);
const userAgent = 'Mozilla/5.0'; // taken from the incoming HTTP request
let [nonce, ciphertext, tag] = encryptWithContext(plaintext, userAgent, key);

So here’s the clever attack: If you can shift bytes from the payload into the prefix of your User-Agent string, they’ll produce the same HMAC tag.

Attackers can truncate as much of the payload as they want by prepending it to the User-Agent included in their HTTP request.

You can even turn this:

 expires=1630223780&access_group=1234&subject=auth-v2.example.com&restrictions=on

…into this:

 expires=1630223780&access_group=1234&subject=auth-v2.example.com

…without invalidating the existing authentication tag, just by ensuring that the last 16 bytes of ciphertext are prepended to your User-Agent and removed from the payload.

More broadly, any time you have a multi-part message being fed into a hash function, if you aren’t careful with how you feed it into the hash function, you can induce trivial collisions.

See also: Iota’s Kerl hash function.

This is obviously true, because hash functions are deterministic: the same input will always produce the same output. If you can control one or more parts of a multi-part message, you can collide the input, thereby creating a collision in the output.

This can affect any protocol that depends on hash functions, but most obviously, HMAC and Digital Signature algorithms are in scope.

But what does “being careful” look like? Let’s look at a safe example.

Pre-Authentication Encoding (PAE)

Earlier I had mentioned PASETO and DSSE. They both have this notion of a “PAE” algorithm which aims to prevent canonicalization attacks.

PASETO’s definition for PAE is as follows:

function LE64(n) {
    var str = '';
    for (var i = 0; i < 8; ++i) {
        if (i === 7) {
            // Clear the MSB for interoperability
            n &= 127;
        }
        str += String.fromCharCode(n & 255);
        n = n >>> 8;
    }
    return str;
}

function PAE(pieces) {
    if (!Array.isArray(pieces)) {
        throw TypeError('Expected an array.');
    }
    var count = pieces.length;
    var output = LE64(count);
    for (var i = 0; i < count; i++) {
        output += LE64(pieces[i].length);
        /*** Soatok Note:
         This JS pseudocode incorrectly assumes strings, rather than buffers.
         It's only meant to illustrate the idea.
         In real implementations, don't join Buffers with +.
         ***/
        output += pieces[i];
    }
    return output;
}

What this does (with all lengths as 64-bit unsigned integers, serialized as 8 bytes):

  1. Prepend the number of parts being hashed.
  2. For each part, first prefix its length, then its value.
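A Buffer-correct sketch of the same algorithm, per the Soatok note in the pseudocode above (assuming piece lengths below 2^53, which Node's BigInt conversion handles safely):

```javascript
// Buffer-based sketch of PASETO's PAE; don't join Buffers with +.
function LE64(n) {
    const buf = Buffer.alloc(8);
    // Clear the MSB of the highest byte for interoperability.
    buf.writeBigUInt64LE(BigInt(n) & 0x7fffffffffffffffn);
    return buf;
}

function PAE(pieces) {
    if (!Array.isArray(pieces)) {
        throw new TypeError('Expected an array of Buffers.');
    }
    // Prepend the count, then length-prefix every piece.
    const parts = [LE64(pieces.length)];
    for (const piece of pieces) {
        parts.push(LE64(piece.length), piece);
    }
    return Buffer.concat(parts);
}
```

For example, PAE([]) is eight zero bytes, and PAE([Buffer.from('test')]) is the count (1), the length (4), and the four bytes of "test".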

This is an obvious mitigation for canonicalization attacks:

  • If you feed in a different number of pieces, the count (the first 8 bytes) will differ.
  • If you try to move data from one piece to another, you’ll produce different lengths for both pieces, which will not yield an input collision to the hash function.

However, it’s important that both mechanisms are in play to guarantee security:

  • Without the length prefixes, we’re no different than the CTR+HMAC extension we defined above.
  • Without the count prefix, it’s possible to drop pieces and then include a dummy “length” in the payload of others to create an input collision.

What’s an Input Collision?

First, you need to know what a collision attack is.

Consider a hash function, H(). If you can identify any two input messages (m1, m2) such that H(m1) = H(m2), you’ve found a collision in the output of the hash function.

An input collision is dumber than that.

If m1 is constructed from multiple segments with different meanings, you don’t need an m2. Just find multiple ways (the collisions) to result in the same m1 value (the input).

That’s what we did earlier when we shifted bytes from the ciphertext to the user agent.

DSSE Leaves Me Dizzy

It should come as no surprise that I find DSSE’s definition of PAE to be somewhat bizarre.

PAE(type, body) = "DSSEv1" + SP + LEN(type) + SP + type + SP + LEN(body) + SP + body
+               = concatenation
SP              = ASCII space [0x20]
"DSSEv1"        = ASCII [0x44, 0x53, 0x53, 0x45, 0x76, 0x31]
LEN(s)          = ASCII decimal encoding of the byte length of s, with no leading zeros

The only thing that saves them from canonicalization attacks is that the number of pieces is constant.

If the number of pieces was variable (e.g. if the KEYID was optionally included in the signature, but they forgot to always include a hard-coded 0 length if it was absent), you could defeat their flavor of PAE by constructing two different messages that produce the same hash in the digital signature algorithm.

This is because the number of pieces isn’t included in the DSSE definition. (If they ever support a variable number of components, and fail to include the count in the signature, they’ll be vulnerable.)

Amusingly, the rationale page for DSSE using PAE states:

  • Why use PAE?
    • Because we need an unambiguous way of serializing two fields, payloadType and payload. PAE is already documented and good enough. No need to reinvent the wheel.

…Yet, they didn’t actually use the “already documented and good enough” definition of PAE from PASETO.

Let’s not use DSSE’s definition.
(Art by Lynx vs Jackalope)

Fixing AES-CTR+HMAC with PAE

This is a pretty straightforward patch:

  function encryptWithContext(plaintext, aad, key) {
      let [encKey, authKey] = splitKey(key);
      console.log(encKey, authKey);
      let nonce = crypto.randomBytes(16);
      const aes = crypto.createCipheriv('aes-256-ctr', encKey, nonce);
      const ciphertext = aes.update(plaintext);
      aes.final();
      // Pay attention to this part:
      const hmac = crypto.createHmac('sha256', authKey);
-     hmac.update(nonce);
-     hmac.update(ciphertext);
-     hmac.update(aad);
+     hmac.update(PAE([nonce, ciphertext, aad]));
      return [nonce, ciphertext, hmac.digest()];
  }

The only conceivable way to attack this design is to aim for an integer overflow, which would require sending at least 2^63 bytes, at which point you’re more likely to DDoS the target than succeed.

Wrapping Up

Canonicalization Attacks broadly aren’t well-understood or widely appreciated risks with cryptography protocol design outside of specialist circles (although many application security folks are at least aware of specific instances, i.e. Length Extension).

Part of the reason for this lack of knowledge transfer is that all of the AEAD modes defend against it by design, and most artisanal authenticated encryption constructions don’t bother with additional authenticated data, and most home-made cryptography protocols don’t even authenticate their ciphertexts correctly, and …

You get the point, I hope. There’s unaddressed concerns all the way down. Expecting people who aren’t specialized experts in this specific field to get all of them right is frankly unreasonable. In practice, outside of cryptography, canonicalization either doesn’t matter or there’s something else wrong that’s far more urgent.

https://soatok.blog/2021/07/30/canonicalization-attacks-against-macs-and-signatures/

#collisionAttacks #cryptographicHashFunction #cryptography #digitalSignatureAlgorithm #DSSE #HMAC #lengthExtensionAttacks #MACs #PASETO #SecurityGuidance
