KNO, the Known Number Oracle

What is the KNO?

The Known Number Oracle (KNO) provides a simple service: you tell it a number, and in return it tells you whether it has been told the same number before. Specifically, the KNO does this for 128-bit binary numbers.

That's... silly. What possible use could that be?

One application of the KNO is for testing of random number generators (RNGs). The idea is that it is extremely unlikely that a properly working RNG will generate the same random 128-bit number twice, so if you keep track of numbers that have been generated in the past and find that the same number is generated again, you know with near certainty that they are not actually random, and that the RNG is flawed. We used this at Araneus as one of the many tests we performed on each of our Alea II TRNGs during production, and we encourage other manufacturers and users of hardware or software RNGs to do likewise.

Since the test becomes more effective the more known numbers you accumulate, it makes sense to create a shared, public database of known numbers. The KNO is intended to serve as such a database.

The KNO can also be seen as a general-purpose public duplicate checking service, and we would not be surprised if people come up with other uses for such a service, unrelated to RNG testing. Feel free to email us your ideas so that we can add them here.

Isn't this rather... crude as RNG tests go?

Yes, but crude can be effective. It is not uncommon to use even cruder tests, such as checking for the RNG output bits being "stuck at zero" or "stuck at one". The KNO will not only detect these failures, but also any other simple repeating pattern, as well as many other types of flaws.

The KNO will not detect the kinds of subtle long-term statistical biases that statistical test suites like Diehard or the NIST STS are designed to find. The KNO is not intended to replace these test suites, but can complement them by finding other types of flaws that they will not detect.

Consider, for example, a program that is supposed to generate random output but that actually outputs the (binary) digits of pi in sequence. This output will pass all known statistical tests for randomness, but will fail a KNO test every time except the first. The same goes for the output of a CSPRNG that has been accidentally left unseeded, or left in a test mode where it is seeded with a test vector instead of actual entropy.

More generally, the KNO makes it possible to detect certain flaws that only become evident when considering an entire population of RNG output sequences rather than each one in isolation. The population can consist of multiple sequences generated by the same RNG at different times, such as the initial sequences generated by the operating system RNG of a single computer across multiple boots, or of outputs from physically distinct RNGs. One well-known real-world example of a flaw where RNGs on different hosts ended up generating identical outputs is the Debian RNG bug of 2008.

The KNO is also useful as a quick sanity check because it only needs a small sample of the RNG output, 128 bits, rather than the multiple megabits typically needed by a full suite of statistical tests. This makes it practical, for example, to test an operating system RNG such as /dev/urandom on every boot, or a web browser RNG on every visit to a web page.

Won't all those numbers fill up the disk in no time?

At least when using a magnetic disk (as opposed to an SSD), they won't. The KNO stores its known numbers indexed by hash, which means its disk accesses are effectively random, and the records it stores are rather small. Disk capacities have grown exponentially over several decades but seek times have remained about the same, with the result that filling a disk by writing small blocks in a random order now takes a very long time.

To see just how long, let's do a back-of-the-envelope calculation. Suppose the KNO uses a 2 TB magnetic disk. Each query requires a random seek to the appropriate part of the disk, half a rotation to find the sector of interest and read it, and a full rotation to rewrite it with new content. This takes about 20 milliseconds in all; this means the KNO can handle about 50 queries per second. For each new number submitted, the KNO will store the 16-byte number itself and an 8-byte timestamp, a total of 24 bytes. This yields an effective maximum write speed of 1200 bytes per second, and filling the disk will take 1.7 × 10⁹ seconds, or about 53 years. In other words, the disk will wear out long before it fills up.

If it turns out the demand for the KNO service exceeds 50 queries per second, we will either have to spread the load across multiple disks, or switch to SSDs. In the unlikely event that the service becomes hugely popular, we might end up with something like the worst-case setup described in the section on false positives.

How do I know the KNO is telling the truth?

There are two ways the KNO could try to deceive you: it could claim to know a number it does not, or it could claim not to know a number it does.

To guard against false claims of knowledge, you interact with the KNO using a two-phase protocol based on hash functions. When you ask the KNO whether it knows a number, you don't send it the actual number at first, but only a hash of it. The KNO keeps its known numbers indexed by their hashes, so if KNO already knows the number, it can quickly find it based on the hash, and will respond with a proof of its knowledge. If it does not, it will be unable to supply a valid proof.

There is one more twist to this. The KNO service is supposed to be a fair trade where you learn the known/unknown status of a number in return for the number itself, but if the KNO doesn't know the number and tells you this based on just the hash, you might not keep your end of the bargain and fail to tell it what the actual number was. To make sure you do, the KNO uses a simple commitment scheme where the proof is split into two parts, an initial commitment that is provided in return for the hash, and a final reveal that is only provided in return for the actual number. For details, see the KNO Protocol Specification.

To guard against false claims of ignorance, you can do random spot checks by resubmitting a number you have already submitted and checking that a valid proof of knowledge is returned. The KNO service is provided on a best-effort basis and there is no guarantee that it won't occasionally lose some data due to hardware failures or other adverse events, but you can at least verify statistically that most of the numbers you submit are retained.

What about false positives?

It is possible that even if the number you send to the KNO is random, someone else has already submitted the exact same number by pure coincidence. We call this a false positive. While it is not impossible for this to happen, the probability is extremely low.

Let's run another back-of-the envelope calculation to see what the chances are in a worst-case scenario. Since the risk of false positives increases as the database of known numbers grows, let's look at a hypothetical scenario where the KNO turns out to be wildly popular, allowing it to build up a very large database.

Suppose the KNO gains 1 billion users, each using it 10 times per day. Then they will be sending us 1.0 × 10¹⁰ numbers per day, totaling 1.3 × 10¹² bits per day. At this rate, the KNO will receive 15 megabits (2 megabytes) of network traffic per second, and that's just counting the numbers themselves, not any protocol overhead. If we are using magnetic disks, we are still limited by the transaction rate rather than storage capacity, and will need some 2315 disks just to handle the 115741 transactions per second. If we use SSDs, we only need a few of them to handle the transaction rate, but need to add 240 gigabytes of SSD capacity per day, which will get somewhat expensive.

If the KNO stays this popular for 50 years, the database will grow to 1.8 × 10¹⁴ 128-bit numbers. Since there are 2¹²⁸ = 3.4 × 10³⁸ possible numbers, the probability that a given newly submitted random number collides with one of those by pure chance is one in 1.9 × 10²⁴. For comparison, the chance of winning the UK National Lottery jackpot is one in 1.4 × 10⁷, so getting a false positive is less likely than winning the jackpot 3 times in a row.

Let's also consider the probability that any of the 1.8 × 10¹⁴ numbers submitted in those 50 years collides with any previously submitted number. Here, the birthday paradox comes into play, and we end up with a probability of one in 2.7 × 10¹². But keep in mind that this is just the chance that someone will get a false alarm, not that it will happen to you.

Any other way I could get a false alarm?

In principle, a malicious KNO (or someone intercepting your connection to the real KNO) could trade storage for computation by pre-calculating hash chains similar to those used in rainbow tables. This would make it possible to prove knowledge of all the numbers it has ever computed hashes for as part of the chains, rather than just the numbers it has stored.

Even so, the risk of a false positive would remain fairly low. Here's yet another back-of-the-envelope calculation: Suppose the malicious KNO pre-calculated the hash chains using repurposed bitcoin mining hardware, which is probably the most cost effective way of calculating SHA-256 hashes currently available. As of 2015, the most efficient miners compute around 2000 megahashes per joule of energy. Each Bitcoin hash consists of two SHA-256 hashes, so this is 4.0 × 10⁹ SHA-256 hashes per joule. At an electricity cost of 0.1 USD/kWh, this comes to 1.4 × 10¹⁷ hashes per USD of electricity cost. An attacker with an electricity budget of 1 million USD could then precalculate 1.4 × 10²³ hashes (ignoring the cost of the hardware). Since there are 3.4 × 10³⁸ possible 128-bit numbers, the risk of a random submitted number being "known" by this method would be one in 2.4 × 10¹⁵.

Any other known weaknesses?

Human factors are probably the greatest obstacle to the success of the KNO as an RNG test. A single failure of the KNO test should statistically be sufficient proof that an RNG is broken and sufficient cause to replace it, but in practice, it is all too likely that no action will be taken unless the failures happen repeatedly.

For one thing, it is human nature to disregard a single failed test among hundreds or thousands of successful ones as a "fluke", or to dismiss it as a problem with the test itself rather than one in the RNG being tested.

Also, even if a failed KNO test does convince the person running it that his/her RNG is broken, it will be difficult to convince a third party such as the OS vendor. The KNO client may leave logs, but any such logs could be forged, and there is no way to prove that the number submitted was in fact generated using a specific RNG.

Finally, a fundamental weakness of the KNO test is that it can only indicate that an RNG has failed, not how or why it has failed.

Using the KNO safely

The KNO is a public database, and submitting a number to the KNO effectively means publishing it. Therefore, you should never submit numbers that need to remain secret.

When testing a random number generator using the KNO, make sure to discard the numbers you submitted and not use them for anything else.

Several people have suggested using the KNO to check passwords for uniqueness. Checking passwords for uniqueness is probably a good idea, but doing it using the KNO is not, because it will expose the password. Even checking a hashed password may be a bad idea, for the same reasons that leaking your database of hashed passwords is generally considered a bad idea.

When generating a public/private key pair, it is probably a good idea to check the public key with the KNO — if it turns out to be a known number, there is a bug either in your RNG or in the key generation procedure. If you do this, make sure to do it only once, and before the key is published, because otherwise someone else could submit it to the KNO before you do, causing it to be incorrectly flagged as known. And of course, you need to be careful not to do a KNO test of the private key.

Why is it called an "oracle"?

The KNO takes its name from the mathematical concept of an oracle, a hypothetical device that provides answers to questions, typically ones that cannot be easily computed by other means. The KNO is not affiliated with Oracle Corporation and does not use database software from that company.

Protocol Specification

The current version of the protocol, version 1, is specified here. For the older version 0 protocol, see here.

Reference Client

A reference KNO client written in Python, kno-query, is available for downloading here: kno-query-1.1.tar.gz.

Feedback

If you have any questions or comments, please email kno ät araneus döt fi.