> (2006)
In the very least,this should feature in the title. In fact, previous submissions from over a decade ago also feature year in the title.
Otherwise it conveys the idea that this is some major epiphany.
Considering how often this happens, I'm surprised HN doesn't have a feature to flag and index a historically resubmitted article, whether externally by users or internally by the admin team.
Then it could have a bot create links to past submissions like the OP did and use the best arrived at title for resubmissions.
It does have the ability to catch multiple submissions of new articles but that probably has a time window of a day or so.
One problem would be that you can't just check the URL, you'd have to check the content. Not only are there many URLs that could point to the same content, but content on a page can obviously change.
I suppose you could compare against wayback, and not sure I'd try to compare with an LLM or RAG.
Also, I didn't know about this specific bug but I spotted it almost immediately while reading the code. This is not because I'm some kind of 10x programmer but because Integer.MAX_VALUE bugs while processing arrays are actually fairly common in java programs in 2025, I fixed one a few weeks back and it's something I look for in code reviews.
I guess it would have been surprising in 2006?
In 2006, if you tried to create an array with 2^30 elements you would just get an OutOfMemoryError almost anywhere.
AMD's 64b CPU were released in 2003, Intel followed up in 2004, and in 2006 launched their first 64b laptop chip (with Core 2). By then the ability to reasonably allocate more than 1GB in one shot was becoming quite widespread (the mid 2007 MBPs could be upgraded to 6GB RAM).
And obviously Google would have been dealing with servers, where it'd be even older as they'd probably have been using PAE and each process would be able to allocate and address a full 4GB.
Not sure if you're a troll... but for general info, in 2006 most desktop computers (laptops were still not so common) had something like 100MB of RAM if you were lucky. Maybe Google already had some huge machines with 1GB+ but that was not at all a common thing outside supercomputers.
I think you’re off by a decade or so.
Common consumer laptops had 2-4 GB of RAM in 2006.
https://everymac.com/systems/apple/macbook_pro/specs/macbook...
Just like today you can buy machines with 128GB of RAM... but that doesn't mean that's what people were using... a lot of people buy machines with 4GB today (just checked the most popular website in my country, lots of the offers only have 4GB: https://www.elgiganten.se/datorer-kontor/datorer/laptop).
I remember pretty clearly that if you had anywhere above 512MB of RAM in 2006 you had very much top-of-the-line.
> I remember pretty clearly that if you had anywhere above 512MB of RAM in 2006 you had very much top-of-the-line.
That’s a different claim than your original statement of having 100MB max.
I am so sorry for not being 100% accurate.
I don't mean to invalidate your experience, as computers and parts prices have varied a lot historically and geographically. My point was to give some context with what was then a decent laptop with the understanding that some setups might be even more high end, and that legacy systems would have lower specs, similar to what you mentioned.
You are not "not being 100% accurate", you are 100% inaccurate while accusing people of trolling.
Eh, 2006 was right on the cusp of the 32-bit/64-bit flip. 32-bit was still pretty common, but my first task at my first job that year was migrating a search app to compile as a 64-bit binary, for servers that had (I believe) 8GB. They were good-sized servers for the time, but not particularly exotic, commodity hardware at a startup.
The MacBook came out in 2006, it had 512MB. I suppose it could have contained an array of 2^30 bits, though that's unlikely to occur. The point is that most machines weren't going to run into this at the time it was published. I'd be curious to see how many machines would run into it today, single sets of that size just aren't that common.
> The MacBook came out in 2006, it had 512MB.
Apple released multiple different models in 2006, "early 2006" had either 512 or 1GB base (1GB for T2600, the 17" and the high-end 15) expandable to 2GB; and "late 2006" had 1GB to 2GB base expandable to 4.
The MacBook Pro I linked came out in 2006, and allows for 4GB, though there are some limitations to using all of it beyond 3GB.
You think Windows Vista ran on 100MB of RAM? I wish!
I was still on Windows 98 :D, that could easily run on 64MB.
Why are they common in java? In .net or node.js I dont think I have ever had one.
I don't know, maybe I review code that uses arrays more?
[flagged]
The problem with this article's name is that we're in an era where actually checking whether "nearly all sorts are broken" against all publicly available implementations of binary searches would be almost feasible, but that's not actually what the article is claiming.
> Moreover, to be really certain that a program is correct, you have to test it for all possible input values, but this is seldom feasible.
This is incorrect. Generally it's just a little excessive to try to solve the halting problem in a library's unit tests ;). You don't have to test a program for all possible inputs, only for all materially unique state transitions. In a binary search, each variable can be zero, one, some(*), max, overflowed. The combination of these is not infinite as the values of some that cause different behavior is much more reasonable when grouped into chunks of the same behavior.
It was certainly possible to run the algorithm on 16GB of array before the moment when it happened but did the original developer have that sort of space on their desktop at the time? Possibly not.
If a unit test only runs on a server and not on the laptops of the developers then its not going to be written, whereas ideally someone should write a test that is tagged to only run on the server but that is a lot of extra work if that isn't a common thing on a project. Even now I would be quite wary of producing a max size input test for an array of ints and especially objects, that is going to raise some questions due to being a slow test and a highly resource consuming one.
If I was worrying about the same in aerospace programming however then no question that test would get written and we would ensure it got run on the hardware that it was designed to run on. In typical business software its less common to run potentially very long running tests and simulations of the machines states everyone wants to go faster than that especially for the sake of a max input test.
Also you can have checked math, like in Rust, and automatically just crash when you overflow a variable.
In this case, it's not a bug (cannot get incorrect result) but an unsupported input.
If you're worried about being able to make an array where a.length == int.max, (which is reasonable in an algorithm which does any math on the length), replace the array with another substitute which you can mock the size of (e.g. in Java that would be possible against an ArrayList). You can test the algorithm separately from the execution of the algorithm.
Small nitpick: you don't need 16GiB, 2^31 bytes is "just" 2GiB. Doesn't contradict your point though.
Each element of the input array is at least 4 bytes, bringing it to 8GiB.
Not necessary. But, still, my work laptop from 2020 has 32Gb of memory. So, not that implausible.
Swap in an ArrayList and mock size() and you need a few bytes...
You might not need 16GB of memory. There are systems where int is only 16 bits, and overflowing 16 bits is not that difficult.
But maybe it was uncommon to have arrays larger than 64K in the 80s due to segmented memory?
The core of Java was being written in the late 1990s. I had a machine in 1995 that had 16MB of memory but 8MB was more typical for Pentium machines. By 2000 the AMD Athlon and Pentium 3 were the latest and greatest and Anandtech was testing with 128MB of memory [1].
Java defines an int as 32 signed, it doesn't do anything else it calls 16 bit ints shorts. So it definitely has to be 8GB for the array.
Sun was using Solaris machines rather than PCs and they were higher spec on things like memory but still I doubt they had 2GB let alone 8GB+ needed to run this test. That sort of memory didn't become routine for another decade where Sandy Bridge was being tested with 8GB in 2011 [2].
Also goes to show how much things have stagnated, a desktop computer with 16GB has been standard for a long time. The preceding decade we went from 128MB to 8GB as normal and the next 15 years to today normal is 16-32GB which is no where near the same place of progress of memory density.
[1] https://www.anandtech.com/show/566/6 [2] https://www.anandtech.com/show/4083/the-sandy-bridge-review-...
The correct way to test that code is to write a version that doesn't take an actual array but an index -> int function, then you wouldn't need to instantiate the array at all.
> In a binary search, each variable can be zero, one, some(*), max, overflowed. The combination of these is not infinite as the values of some that cause different behavior is much more reasonable when grouped into chunks of the same behavior.
You are presuming that the algorithm is correct in the first place. Contrived example.
binary_search(array, key) {
if (key == 453) explode_world()
//Do correct binary search.
}
So you also need to prove explode_world is not called or something similar but less contrived is not in there.No. I'm presuming that you're able to look at the branches in code and write white box tests. Here the relevant two branches are 453 and anything other than 453. For addition of unsigned numbers, the relevant branches are 0 + 0, 0 + 1, 0 + max, 1 + 0 (usually can be skipped), 1 + 1, 1 + max (which overflows), etc.
The number of relevant material states is bounded and knowable. It's neither unbounded (1,2,3,4,5,6 ... ) or unknowable as the algorithm is available.
That is easily caught by a model checker that tries the equivalence classes. Or just normal coverage tooling.
> each variable can be zero, one, some(*), max, overflowed
These are the value ranges that I test in my unit tests, plus at and before some powers of 2 (e.g. 65535 and 65536), and -1. That is because -1 is uniquely used in some systems to indicate error, and thus trips some bugs.Are you quantizing information?
You mean property testing ? QuickCheck would be the go to library ( the original is in Haskell, however most languages have one ).
This bug, and many others, can be detected with a trivial amount of formal verification. I really think formal verification should see much wider use for things that see as much use as standard libraries, even if it's just for the trivial things like overflow and out-of-bounds access.
In the below code we can see a formal verification tool (GNATProve) detect both the original error and the out-of-bounds access that it causes. Doing the arithmetic using a larger type clears both reported errors without the need for any additional annotations for GNATProve.
function Search (A : A_Type; Target : Integer) return Integer is
Left : Integer := A'First;
Right : Integer := A'Last;
begin
while Left <= Right loop
declare
Mid : Integer := (Left + Right) / 2;
begin
if A (Mid) = Target then
return Mid;
elsif A (Mid) < Target then
Left := Mid + 1;
elsif A (Mid) > Target then
Right := Mid - 1;
end if;
end;
end loop;
end Search;
GNATProve output: Phase 1 of 2: generation of Global contracts ...
Phase 2 of 2: flow analysis and proof ...
wrapper.adb:12:36: medium: overflow check might fail, cannot prove lower bound for Left + Right
12 | Mid : Integer := (Left + Right) / 2;
| ~~~~~~^~~~~~~~
reason for check: result of addition must fit in a 32-bits machine integer
wrapper.adb:12:45: info: division check proved
wrapper.adb:14:19: medium: array index check might fail
14 | if A (Mid) = Target then
| ^~~
reason for check: value must be a valid index into the array
can also be spotted with experience.
these kinds of problems are present in very many standard treatments of algorithms and don't survive their first contact with real world use in some cases.
"needs a carry bit or a wider type" is common for arithmetic operations that actually use the range.
It makes the case that we need Math.mean or Math.avg function that we can use in these cases rather than than reinventing it.
Basically we should favor using built in functions for as much as possible because those should be reliable and tested more than ad hoc code we write. And compilers should optimize those built in functions well so there is no extra cost in using them.
C++ added std::midpoint() to the standard library: https://en.cppreference.com/w/cpp/numeric/midpoint
Another fun case, besides integer overflow, is negative values: in Java and C/C++ (i + j)/2 will round towards j, but i + (j - i)/2 will round towards i instead. Sometimes the difference matters.
I was thinking the same. And the code would be clearer too.
> The bug is in this line:
> In Programming Pearls Bentley says that the analogous line "sets m to the average of l and u, truncated down to the nearest integer." On the face of it, this assertion might appear correct, but it fails for large values of the int variables low and high. Specifically, it fails if the sum of low and high is greater than the maximum positive int value (231 - 1). The sum overflows to a negative value, and the value stays negative when divided by two. In C this causes an array index out of bounds with unpredictable results. In Java, it throws ArrayIndexOutOfBoundsException.
At some point we have to draw an arbitrary line. Even an "arbitrary precision" calculation is bounded by system memory.
"bug" is not well-defined, or perhaps "bug" is more of a continuous label as opposed to discrete.
Why do we need to draw that line somewhere? Fixed solution works for a full range of Java int.
int mid =(low + high) / 2;
> Specifically, it fails if the sum of low and high is greater than the maximum positive int value (2^31 - 1).I would really challenge calling this kind of effects "bug" or "breakage".
It's like calling Newtons law of gravity broken, because it's not accurate at predicting how galaxies move.
Things are engieered and tested for a certain scale.
Knowing which tools to use at which sacle is part of the craft of engineering.
Sometimes they’re engineered and tested for a certain scale.
More often they’re engineered and tested for an arbitrary scale. The limits aren’t considered, behavior at the edges isn’t accounted for, and it’s assumed it will be good enough for real world inputs.
The use of `int` tends to be a dead giveaway. There are some cases where it’s clearly correct: where the spec says so (like argv), where you’re starting from a smaller type and it’s impossible for the calculations to overflow in an int (like adding two uint8), that sort of thing. And there are cases where it’s subtly correct, because you know the range of the value is sufficiently limited, either by mechanics or by spec.
But most of the time, int gets chosen because it’s the apparent default and it’s easy to type. No analysis has been done to see if it’s correct or if you want to declare your code to only support inputs in a certain range.
It’s really clear if you’ve written a binary search (or anything else that works on general arrays) in C and you use int as the index type. There’s pretty much no scenario where that makes sense. In theory you could analyze the entire program and prove that over-large arrays are never passed in, and keep doing it to make sure it stays that way, but that’s not realistic. If the programmer actually took one second to think about the appropriate data types, they’d use size_t rather than int.
You can still have this bug with size_t, of course. But it won’t be “this falls apart with arrays over 1G elements on 64-bit systems that can easily handle them.” If you declare that you wrote the obvious midpoint calculation with size_t because you didn’t intend to support byte arrays larger than half the address space, it’s at least plausible.
i write c++, but i had to teach myself and always wondered why others use imprecise types. portability is one possibility, but then you can't know if your datastructure will break for a given input
History and tradition at this point. Bit-sized integers and the other “meaningful” integer types like size_t weren’t added to the languages themselves until C99 and C++11. A lot of us learned those languages before that, and lots of code still exists from that time, or at least code bases that have evolved from that time.
I think it actually comes from the opposite of portability. Access to different kinds of systems wasn’t common then. If you were learning and working on a system where int is 32 bits and pointers are 32 bits, and other possibilities are just vague mentions in whatever books you’re learning from, it’s very easy to get into the habit of thinking that int is the right type for a 32-bit quantity and for something that can hold a pointer.
The lack of explicitly sized ints is actually a pro-portability feature but it prioritizes speed and ease of implementation of arithmetic operations over bitwise operations. The minimum ranges for each type can be used as a guide for average users to write correct and portable arithmetic and carefully-written bitwise operations. But most people would rather not think about the number of bits being variable at all.
Sort of. It was kind of handy when int would be the natural machine size, 16-bit on 16-bit hardware, 32 on 32. But making int be 64-bit leaves a gap, so it’s generally stuck at 32-but even on 64-bit hardware. And then people couldn’t agree on whether long should be 32 or 64 on 64-bit platforms, so now none of the basic integer types will typically give you the natural machine size on both 32 and 64-bit targets. At this point, if you want the “biggest integer that goes fast on this hardware” then your best bet is probably intptr_t or size_t.
There were/are machines where the char size is not 8 bits, and the ints are not sized in powers of 2. These machines are now rare but I think they still exist. This references some historical examples: https://retrocomputing.stackexchange.com/questions/12794/wer...
Oh wow, I didn't know size_t was so recent.
At least for C++, it's older than C++11; a lot of us used for a long time the "C++0x" pseudo-standard (which is mostly the draft of what later became C++11; as the C++0x name indicates, it was originally intended to be finished before 2010), and on most C++ compilers headers and types from C99 were available even when compiling C++ code (excluding MSVC, which really dragged their feet in implementing C99, and which AFAIK to this day still hasn't fully implemented all mandatory C99 features).
I believe it was somewhat older as part of typical C and C++ implementations, but don’t get standardized for a while. A big part of the older C and C++ standards are about unifying and codifying things that implementations were already doing.
I'm not sure what you mean by "imprecise types", but if you mean something like using an `int` for an array index instead of `size_t` or something, I can tell you why I do it. Using `int` lets you use -1 as an easy invalid index, and iterating backwards is a straightforward modification of the normal loop: `for (int i = max; i >= 0; --i)`. That loop fails if using `size_t`, since it is never negative. Actually `size_t` may not even be correct for STL containers, it might be `std::vector::size_type` or something. Also, I don't think I've encountered an array with more than 2 billion items. And some things, like movie data, are usually iterated over using pointers. As you say `int` is easy to type.
Also, for something like half my programming life, a 2+GB array was basically unobtainable.
By precise, I meant more the byte width (uint32_t vs uint64_t etc). The other kinds of types help you track what the purpose of something is, but don't really assist with correctness at the machine level.
In my work, I have a lot of data that is > 2GB, so int32_t vs uint32_t is very meaningful, and usually using a uint32_t is just delaying upgrading to int64_t or uint64_t.
Going in the other direction, a point cloud can usually be represented using 3 uint16_t and that saves a lot of memory vs using uint32_t or uint64_t.
If you want an index that can go negative, then the right type is ssize_t, not int.
> certain scale
Make it explicit. If the array is too large, throw an IllegalArgumentException. Document the limit. Then I agree with you.
Otherwise, if an allowed input crashes the program at runtime with a random exception, I respectfully disagree.
IllegalArgumentException is OK in your book, but OverflowException is not? It's very rare that I actually care which exception type is thrown, but it's a little more polite to throw more clear and reliable exception types. But saying "this could be a little more polite and therefore YOUR CODE HAS A BUG" would make me never want to work with you again.
Explict OverflowException could be ok but it may not tell me the root cause. A random index error is not ok at any time.
Then you should absolutely stay away from C :)
Words to live by!
I do.
But the example was Java.
The problem is that it's not that much harder to make it work for all the valid inputs.
Not doing that is not good enough.
Another example is summing lots of floats naively instead of using Kahan's algorithm.
It's like we had a theory of gravity that doesn't work on Mars because we have unnecessary division by (m-Mars' mass) in our laws :) It wouldn't be good physics.
It's not much harder in this toy example. In real examples what this looks like is a ton of bikeshedding and arguing about minutiae instead of improving the system in ways that are orders of magnitude more valuable to everyone. The truth is it's far easier to waste time on this mostly meaningless crap, exchanging one exception for another, than it is to think deeply about the problem you're actually trying to solve. It's lazy.
In real world you're not implementing binary search but using one implemented already.
Hopefully implemented by someone who cared enough to avoid bugs or at least document the range of the arguments.
In the real world I don't have an array with over 2^30 elements 99.75% of the time. If I did need to process an array with over 2^30 elements, I'd have dozens of reasons to worry that standard techniques might start failing, and would therefore carefully consider and test everything to a much higher degree. I'd want a safety margin of at least 10x, which means I'm designing for 2^33+ and already know I need to avoid int32 everywhere I can. The type signature of a binary sort returning int32 in such a case should already be triggering alarm bells.
This is a bit like arguing that every building humans have built is flawed because it wouldn't withstand a significant meteor strike. That's not how we do it. We consider maximum likely use cases and then build in safety margins of at least 10^2, sometimes up to 10^6. If you think there's a possibility you're dealing with hundreds of millions of elements, you stop using int32 entirely. You don't wait for 100 million to turn into 4 billion and crash.
Is it worth making the algorithm slower just to have it work on extreme edge cases?
> Is it worth making the algorithm slower just to have it work on extreme edge cases?
Yes! If an API has no documented (and preferably enforced) limits on its arguments, the caller has every right to assume the full range of the argument types is supported.
Else it is very much a bug.
“Be reasonable, dude” is ok when you order 1,000 beers. It’s not an appropriate response when calling a binary search API.
Is it worth to make the algorithm faster at the cost of throwing surprise OutOfBounds exceptions in some extreme edge cases?
Maybe, but only if you - and only you,and not an attacker can control the case you are in.
If an attacker can somehow make sure that there is an array with 2³⁰ elements, you have worse problems than a binary search crashing.
Why do you think this algorithm only applies to arrays? Why do you think this algorithm doesn't apply to this sine lookup table that the compiler placed at the end of the memory in the microcontroller?
Because the article is about binary searching an array. Obviously algorithms must be adapted to what they are used for.
Sure, because my pre-computed table isn't an array. It's just a bunch of numbers in consecutive positions in memory. That has never been an array.
Also, what makes you think I'm not adapting the algorithm? I mean, I set the search indexes at the address of the beginning and end of the totally-not-an-array array so as to not have to do a relative memory read, because those are not supported in my microcontroller, or if they are supported, it's an extra cycle, and my timing budget is quite tight. Also, I don't do a div by 2, but a shift right by 1, same reason.
It's as simple as replacing one NULL character with something else. Or 1 length field.
If that happens, you’re going to get an out-of-bounds error no matter how you implement your binary search.
In Java. In C you just access random memory values (or segfault).
Yes, which doesn’t contradict my point.
Well it's either that or document the domain.
Nice example with the 1/(m-m_mars)!
+1 for mentioning Kahan.
> Certain scale
Or just put the algorithm in a 16-bit microcontroller, put some table that needs to be looked up (think precomputed sine table), put that table near the end of the memory range, and just make the mistake to call binary search specifying the start and end memory positions of the table.
These implementations are definitely broken when the specification goes like "you just pass in your array here and it will perform binary search on it for your value." Yes, you could constrain this spec, but come on...
It's such a blatant example for a bug, that I struggle to believe how can anyone even remotely conceive and pivot to the idea that "nuh-uh, this is a bad spec not a bad implementation!".
I would disagree. The inputs to these functions are user controlled and so can be forced to break, whereas humans cannot change how gravity works.
But in what realistic scenario would a user be able to put in ≥2^31 while not being able to put in 2^32-1 (which probably breaks many more things from innocent increments or similar) or 2^100?
And further this all assumes they used int vs. long. It can be “wrong” in that it only works for arrays of under 2^63 elements without that ever being a possibility.
Production code is often filled with edge case bugs that simply never come up. Works for 100x the expected use case is generally good enough if you’re approaching the limits where 2^31 is a potential issue then you are also approaching the case where 2^32 definitely will be.
Hacking, of course. Overflows are one of the primary ways that hackers gain control of systems.
That’s irrelevant for majority of software.
This mindset is why hackers are able to exploit most systems.
Which isn’t a problem as exploiting most software is meaningless. Wow someone can hack software they already have root access to the machine for whatever will we do.
It’s management and developers not treating software where it is meaningful for someone to hack as a different category that’s an actual problem.
So a coworker sends me this CAD file via email, I open it and my computer gets controlled by a botnet, and the file immediately sends itself to all my outlook contacts.
Nah, that sounds impossible. I'm sure it has never ever happened.
Opening 3rd party files is one of those risks I was just talking about.
There’s a user moving around in a game, and there’s opening a workplace file these are inherently different kinds of risks. If your building a flappy bird clone you can know exactly how it’s going to interact with the world.
2^32-1 is almost always a possibility when 2^32 is, but there are many cases where those are possible but 2^100 is not. Basically anything where the value is a count of something rather than raw input fits the bill. How many characters, lines, or files do you support? 2^32 is a totally feasible number in many contexts. 2^100 is physically impossible, there isn’t enough matter.
If you accept 2^32, then code using 32-bit ints is definitely broken on it and thus the OP question of the issue on half that is irrelevant. Which is my point - widening the acceptable input range from 2^31 to 2^32 (or in the case of signed integers, from 2^30 to 2^31; give or take 1 of course) just "fixes" one small case of the actual core issue of nearly any arithmetic anywhere being wrong if you don't explicitly constrain input sizes.
I agree on there not being much difference between 2^30/31/32. But it’s not “nearly any arithmetic.” If your size is an actual data size, then 2^64 is fine.
Right, with 64-bit ints things are a lot nicer. Though you can still run into some issues on "generate this much data" tasks as opposed to "operate over existing data of this size", though perhaps less exploitably so.
Nearly all binary searches and mergesorts are broken in languages with bounded integers.
Doing any math with bounded integers considered harmful.
At some point during my undergrad I realized this and tried to be really careful when implementing algorithms, but it's stupidly hard to do in a tidy way, at least in old C. It's not practical and people just rather close their eyes and live in blissful avoidance.
And on machines with finite memory, right? Which would be every actual computer ever built?
Well I would posit that it would be hard to get to this code in a language with unbounded integers where (low n + high n) causes an OOM error, because in order to run this code, you first need an array n units wide.
You could argue that the array itself could take up most of the space leaving no room for the indices, but that's hardly a fault with the algorithm, as now you've got a computer that basically can't do anything due to overloading. Whereas overflowing a 32 bit integer is a much more likely occurrence that arguably the algorithm should account for.
Why does everyone talk about binary search in terms of arrays? Binary search works with any monotonic function. Looking up a value in a sorted array is just a special case of a monotonic function.
Because the 'bug' as presented in the article applies to binary search over an array that has a natural maximum length. If you weren't using an array, there'd be nothing constraining the magnitude of the indices, so you might as well go straight to bigints.
Of course there's a natural constraint: the type of the function. You can't pass a bigint to a function that expects an int. And if you are just blind casting, it turns out you have a similar bug: you are calling the function with different arguments than you think you are. It's the same underlying problem with a different expression in some cases.
So previously it worked for arrays up to length 2³⁰ and now it works for arrays up to length 2³¹. Is it really that much of a difference? Wouldn’t it be better to change the indexing type to a 64-bit integer?
The difference is that we know for a fact that the proper implementation works for integers up to 2^31, whereas the improper implementation deceived us into thinking that the code would work in situations where the code actually doesn't work.
I find it valuable to understand when my code will crash.
Article was written in 2006 when 32-bit architectures were dominant.
I suppose the issue is moot now on 64-bit architectures, as the difference between 2^62 and 2^63 isn't relevant. (2^62 bytes is 4611686 terrabytes.)
Spelling it out like that sure gives some perspective - It's a frighteningly low number! They sell 2tb microsd cards nowadays. I bet you could fit 2^64 bytes in a shoebox. So I think uint64 might eventually be insufficient as a size type.
Edit: Well not quite, more like a small closet.
Almost makes you think RISC-V was right with 128-bit extension. On the other hand, exabyte-scale memories might one day be possible, but would it still be useful to maintain single-address-space illusion for these?
RV128I is not an extension, but a (non-ratified) base ISA.
Independent from RV64I, which in turn is independent from (and predates) RV32I.
>Spelling it out like that sure gives some perspective - It's a frighteningly low number!
Yeah, it's very common for computers to have byte addressable 4 exabytes of storage...
Well I used to think 64 bits would be enough forever basically, I guess that's why I was a little shocked that it actually might become insufficient at some point even if its far off.
I'd say it's more likely that in the future we'll regress to 8-bit machines, losing the ability to sustain fab technology, than we'll exhaust 64 bits address space.
Having said that, with the advances of Electron apps, you might very well have a point...
I do not look forward to the future where common software requires exabytes of memory.
The use of:
int high = a.length - 1;
tells us that a.length-1 is supposed to fit into an int, so there is no need to handle 2³¹ or larger.Yep. No story here, feel free to move on...
Not really. Arrays are limited to an int in size. You would be using more memory for the calculation and have to cast the value down to a 32 bit value to use as an array index.
Or you could just write the code so it isn't vulnerable to integer overflow.
Which is sad (array size limited to an int) and has always annoyed me, coming back from Ada where the index of an array is just another discrete type (including boolean, enum type, and ranged integers).
Ideally the index should support int64 with int being a synonym on a 64bit platform by default.
If not yes frankly you're into such unexpected behaviour territory that you should check your whole codebase rather than rely on stuff working just because it compiled.
And we all know how everyone loves writing and understanding integration tests... (I personally do but most problems in the industry wouldn't be a thing if more people stopped to write them)
size_t, you mean?
Well yes by inference
Java doesn’t have a type that corresponds to size_t. It only has signed integer types, so the closest match is ssizet_t (but even then you need to figure out how many bits that is on your architecture).
J8 has unsigned integer arithmetic at least.
A cynical takeaway from this is that it's hard to write good code and it doesn't really matter if you do or not.
Most code at every company I've worked at and project I've built is littered with technical incorrectness and buggy half-measure implementations. It's human nature or time pressure or something like that, but the application continues providing business value despite that.
Because it's sort of like premature optimization. If your business case will never be dealing with billion-element arrays, it's a waste of time to make sure your code can handle them.
No they're not. If you're using an array with length over a billion in Java, your code stinks already before you start using binary search.
That's not only wrong in itself, but totally orthogonal.
A binary search implementation should still work, regardless of the array length, or have the limitation documented.
And of course an array "with length over a billion" can be totally valid, depending on the use case, your tradeoffs, available memory, etc. It could even be the optimal data structure for some use cases.
I'm not a Java programmer, but how would you load of a 1GB file into memory? I assume read returns some kind of an array.
Also big arrays being (supposedly) a coffeee smell doesn't mean that code handling them improperly is not buggy.
If you really needed it in memory you’d use one of the file APIs that will map it and present a direct byte buffer view over that memory.
Those APIs use long as their offset unlike the 32 ints used by arrays, and would avoid having to copy the data into some other object.
There has been some discussion over the years about how arrays could be changed in the JVM to support longer lengths, but doing so without breaking existing code and while providing truly useful functionality without providing obvious footguns isn’t as easy as you might think.
Java's arrays use a signed 32-bit int as their length, so the longest they can be is about 2 billion elements.
If your code has arrays over a billion elements, then it will fall over the moment someone inputs slightly larger data
Relational databases often require searching and sorting gigabytes of data to answer queries (sometimes larger than RAM if e.g. k-way merge sort is used) so it doesn't seem that far-fetched, especially given that there are database systems written in Java.
not doing much scientific programming eh?
Title needs updating with the year 2006
I often think AI is mostly crap, wasting a lot of energy for very questionable benefits. But could/should this repetitive task of reminding submitters to follow the submission guidelines and add the year to submissions of old articles be automated?
I would agree, though why would you need AI for that is an open question.
A simple crawler would have been able to detect it’s from 2006. Perhaps a reminder should be added if the year is not recent
Even simpler, just check if the url or title has been submitted before. That would also take care of all the duplicate entries that pop up once per day for a week after a viral story is emerging.
In this instance, the url is slightly different from previous submissions so some more clever fuzzy matching or using only the title would be needed.
Yes, I have always wondered why the simple duplicate checker within the same couple of days does not exist. Or does it exist and the duplicates are actually sligt variations of the URL.
What algorithm would you suggest to find the year in an arbitrary submission? Of course AI is not a very clearly defined term, more difficult problems certainly exist. I was just thinking of the case the submission contains several dates or none at all and still several hints a human would take into consideration get checked.
Of course some minimal implementation without AI techniques could already handle many cases. My AI suggestion was not death-serious ;)
Google's research blog does not seem to provide this, but many blogs include the Open Graph metadata[0] around when the article was published or modified:
article:published_time - datetime - When the article was first published.
article:modified_time - datetime - When the article was last changed.
For example, I pulled up a random article on another website, and found these <meta> tags in the <head>: <meta property="article:published_time" content="2025-01-11T13:00:00.000Z">
<meta property="article:modified_time" content="2025-01-11T13:00:00.000Z">
For pages that contain this metadata, it would be a cheaper/faster implementation than using an LLM, but using an LLM as a fallback could easily provide you with the publication date of this Google article.[0]: https://ogp.me/
>What algorithm would you suggest to find the year in an arbitrary submission?
In the submission title, a simple regex for the presence of a date with a standard format (e.g. %Y) would suffice.
Matching it to the article might or might not be possible, but that would already be enough (assuming having the date is a good thing, which I'm not certain at all)
As another comment suggested, you can scan for previous submissions by URL -- Algolia is very helpful with that.
Outside that, no clue, been a long time since I last wrote crawlers, admittedly. Though it can't be too difficult to crowd-source origin date parsers per domain?
But hey, if any LLM's free tier can achieve it, then why not. My point was that many people worked on that particular problem historically. It would be a shame if we can't use any of their hard work.
I think adding the year is mostly crap. What exactly information would it give, except perhaps the false impression that this article is "antiquated information", when it pretty much holds true, and describes a perrenial issue?
It gives a cue about how many times I've probably seen the article before. Quite useful, IMO. I read this particular article when it came out in 2006... it's convenient to know we're not discussing a novel finding on the same topic.
Write this in a Leetcode interview and I suspect the average interviewer will fail you and not believe your reasoning.
https://rust-lang.github.io/rust-clippy/master/index.html#ar... is your friend.
Isn't the larger point of this article that errors like this can sneak in and remain dormant for years? Even if this example is old, isn't this lesson still relevant? Expecting that we are now immune from this class of errors because it is not 2025 and not 2006 is hubris. Hubris rarely ends well.
The simplest fix is to use 64-bit indices; that way you couldn't possibly allocate enough memory for a sum of valid index values to overflow (even if your array stores single-byte types, there are necessarily other things in memory besides the array).
(Also, what brabel said: https://news.ycombinator.com/item?id=42665755.)
(Or you could use a language with arbitrary-sized integers. Python surged in popularity in 2005, as I recall.)
This article when I first read it over a decade ago made me internalise the fact that “int” types are not mathematical integers but are instead rings.
If you look at every algorithm through that lens, interpreting them with the actual type instead of an idealised one, then bugs like this will jump out at you. (Similarly, compiler optimizer passes and the like all need to account for this.)
How you can fix it is by using an integer type that is the same width as the pointer type.
Then, there is no way your array can be so big that high + low overflows, unless it's an array of bytes over more than half the entire bit virtual space, which is a fiction.
(If you're binary searching some abstract space that is not an array, I retract my ignorant statements.)
C++20 introduced std::midpoint() [0]
Interestingly, calculating the midpoint of doubles seems quite complex (according to gcc at least) [1]
> It is not sufficient merely to prove a program correct; you have to test it too.
Well... If your proof made the (false) assumption that int is an unbounded integral type, then you didn't prove the program is correct at all. What you proved was than an algorithm in some ideal universe is correct. But your program is a different beast that lives in a specific programming language.
size_t is 64 bits on 64-bit CPUs
I think most of the language implementers know about it.
Here is a relevant line from Go, even with a comment that it's about overflow:
https://github.com/golang/go/blob/19e9231/src/sort/search.go...
At which point does this article talk about Merge Sort?
Anyway... Merge sort doesn't even need to be recursive in the first place. It's always taught that way in CS classes, but it can just as easily be written with nested for loops. On the outermost for loop, you double the sublist size until you exceed the whole size.
Wikipedia gives an example merge sort that does nearly exactly what you describe.
That algorithm has some weird performance drops at 2^n+1 sorted elements. https://bruceediger.com/posts/mergesort-investigation-8/
What's funny is that the safer alternative:
int mid = low + ((high - low) / 2);
is probably what most of us originally came up with before we saw the shorter, more elegant, but overflow-prone approach.I was once penalised for doing this in a leetcode problem given during an interview. I was told to rethink things about calculating a mid-point from first principles.
I wrote one before I read the article to see if I would hit the bug and yep, I wrote it the safer way.
For me, that is the most readable way to do it because it lines up both conceptually and visually with how I'm thinking of the problem.
I love how wonderfully simple the solution is and how it incurs no performance/memory penalty. Literally just changing the arithmetic shift right instruction to the logical shift right instruction.
Ok let me say it. The implementation of ints in computers is plain stupid. I am not sure why we have been trying to convince ourselves otherwise for decades.
So many programs are wrong because of this and they are just ticking bombs.
I think you'd have better responses if you said the implentation of ints in most programming languages is plain stupid.
Processors are doing a reasonable job, but there's certainly things they could do differently. Some languages have pretty specific integer types. And then you have C. Java at least makes all the integer widths consistently sized on all platforms, but neither provides good ways to express and manage overflow or carry. And the architects of Java decided that developers aren't qualified to use unsigned types, which makes a lot of tasks much harder.
What would you have done? I don't disagree, but I'm not sure what would have been better either.
Int as a type serves different purposes. Counting, indexing, rounding floats, exact arithmetic etc.
These have completely different requirements and edge cases. It was a mistake trying to tackle all of these with a singular type.
Not to mention bools, bitarrays, error codes...
while it's unlikely you'd allocate a memory array[2^30] on a 32-bit machine, I also use binary search on mmap'd files.
Specifically, text files. Binary search doesn't need the exact midpoint, just somewhere nearby, so I pick the midpoint and look around for line boundaries.
On normal sorted text files, this works fairly well. I was able to search a list of a billion IP addresses while touching only a few pages.
Not true for Javascript. The maximum array length is 2^32-1, but the maximum safe integer value for numbers is 2^53 – 1.
Wait a second. Is it really your bug if the fix is to make it work around the hardware's limits?
My personal experience is yes. Hardware provides a substrate of reality, and software developers are asked to implement aspirations.
If you did a lot of stuff on 8 bit systems you ran into this very early and often.
Not in assemly, where it's trivial to shift right with carry.
I'm puzzled by the inclusion of merge sort. Merge sort is referred to by the title, and in passing in the text, but no examples are given.
maybe dumb, but what happens when low and high > unitMAX/2
shift left is worse in that case right?
It's not possible, because low and high are of type int, which is always lesser than uint_nax/2
> mid = ((unsigned int)low + (unsigned int)high)) >> 1;
Also not correct if the sum overflows.
It can't overflow - max_int+max_int < max_uint
This reasoning is correct but still has to assume that both indices are non negative, which is reasonable.
So does the bitshift cast the result of the addition to an unsigned int? I am not so familiar with Java.
"Nearly all"? Really? This is an obvious, common category of bug that's been well known for decades. Surely nowadays a decent proportion of binary searches and mergesorts are written in languages that are less stupid (such as Python) and prevent this class of bug.
> This is an obvious, common category of bug that's been well known for decades.
TFA is actually the article which made it common knowledge, it's from 2006.
Notably, this is impossible in well-designed languages.
Do you mean that the language would either halt or throw an exception due to overflow on "high+low"?
I'll make sure to consider this the next time I'm working with a billion length array lmao.
Just because such a thing is beyond your experience doesn’t mean it’s rare or unimportant. I see this attitude in students who have worked mostly on toy (assigned) problems, but at least they are aware this is the case.
yawn this is old news
[dead]
Meh. Thats not the algo but int type used. I did not like this title, it was a scamm by ai.