SQL NULLs aren't weird, they're just based off of Kleene's TRUE-FALSE-UNKNOWN logic!
If you want you can read NULL as UNKNOWN and suddenly a whole bunch of operations involving them become a lot more intuitive:
1. TRUE OR UNKNOWN = TRUE, because you know you have at least one TRUE already.
2. TRUE AND UNKNOWN = UNKNOWN, because you don't know whether you have two TRUEs or not. It's just out there.
3. UNKNOWN XOR UNKNOWN = UNKNOWN, because it could darn near be anything: TRUE XOR TRUE, TRUE XOR FALSE, FALSE XOR FALSE, FALSE XOR TRUE... Internalizing this is where SQL's use of NULL / UNKNOWN really becomes intuitive.
4. (TRUE AND FALSE) XOR (TRUE OR UNKNOWN) = (FALSE) XOR (TRUE) per #1 = TRUE. See, it's consistent, you just need to keep in mind that if you have a lot of known UNKNOWNs they're quite parasitic and your final outcome is likely to be, itself, an UNKNOWN. Just like in real life!
This is the correct way of thinking about things. Null is one of the hardest things for traditional software engineers in my experience as a guy who came up as a data admin.
That doesn't address anything in the second half of the post though, starting with this pull quote:
> The fact that NULLs are distinct for UNIQUE columns but are indistinct for SELECT DISTINCT and UNION continues to be puzzling. It seems that NULLs should be either distinct everywhere or nowhere. And the SQL standards documents suggest that NULLs should be distinct everywhere. Yet as of this writing, no SQL engine tested treats NULLs as distinct in a SELECT DISTINCT statement or in a UNION.
This is confusing when you know that NULLs are not comparable, but it makes some sense if you consider the result of distinct/union as the output of a GROUP BY. You can consider everything that's NULL to be part of the same group, all the values are unknown.
So NULLs are not comparable but they are part of the same set.
MS SQL Server treats NULLs as indistinct for UNIQUE constraints, SELECT DISTINCT and for UNION.
Indeed, the sqlite page the pull quote is from says as much.
If only it had a name that was more indicative of that, like UNKNOWN, or UNDEFINED or INDERTIMINATE or something.
It's unfortunate to namesquat on 'boolean' if your elements have three possible values. Just call it 'ternary logic' and let individuals decide which system to use.
Who's name squatting boolean? The bool column is exactly what it claims to be, you just have the option of introducing unknowability if you define it allow nulls.
The distinction is that not all formal logic systems are Boolean. Meaning that it is nonsensical and confusing to use "Boolean" as a generic synonym for "truth value" in the same way that it's nonsensical to use "Pantone" as a generic synonym for "color value", including when the specific kind of color value you're talking about is CMYK or HSV and definitely not Pantone.
But it is a boolean value, there's only two possible values TRUE and FALSE. But because it's SQL you can define any column as TYPE | NULL.
You could say that a boolean column with a NULL value is FALSE like how a lot of programming languages coerce it but if you wanted that you would just make a default of FALSE. The meaning of NULL in general being "value not specified" lends itself pretty nicely to "either true or false."
What I want is for e.g. "x OR y" where y is NULL (and/or of nullable type) to be an error rather than silently giving surprising results. Just like in a decent programming language I can't do x || y where x and y are of type boolean?, I have to explicitly handle the case where one or other of them is null (or e.g. write x!! || y!! - and that will at still error if either is null rather than silently evaluating to a funny result).
The meaning of NULL in general being "value not specified" lends itself pretty nicely to "either true or false."
You mean neither true or false?
I think I mean either. So yes NULL is a distinct value from true and false so I think it's also right to say it's neither true nor false. But the value NULL does represent is roughly "could be true or false, we don't know yet."
It could mean anything or nothing depending on context, which is part of the problem. (Plenty of people think the meaning of NULL is clear and obvious and consistent, unfortunately they give different answers about what that meaning is)
It could also be neither. It's whatever you define it to be. Null could mean you don't know if it's true or if it's false, or it could mean you know it's neither true nor false.
It could be true or false, but it’s unknown. For example. a user doing a survey is yet to fill in the answer. That doesn’t mean there is no answer, it’s just unrecorded.
Maybe GP was edited, but it doesn't use the word "boolean" anywhere.
Correct, I edited "boolean" out prior to ^^P's comment. My apologies.
> quite parasitic
This is the mental model that I use and teach: a NULL is a virus that will likely make everything it touches NULL... unless cordoned off by a COALESCE or IFNULL or something else that explicitly speaks about its NULL-sensitive behavior in its documentation.
So if you see a bug where a WHERE or FILTER clause is unexpectedly failing, and you're in a time crunch, rather than pounding your head against the wall, start sprinkling COALESCE statements around (while being thoughtful about the domain meaning of what you're doing) and it's more likely than not to solve your problem!
Except that NULL is not the same as UNKNOWN! NULL is a data value (like integers) that can appear in data expressions (like NULL + 1) and comparisons (like NULL = 1) whereas UNKNOWN is a truth value that can appear in boolean/logical expressions constructed from logical connectives like AND, OR, NOT.
A data expression always evaluates to a data value, and usually whenever any part of the expression is NULL, the entire expression evaluates to NULL.
A comparison evaluates to a truth value, and usually when a comparison invovles a NULL it returns UNKNOWN. This leads to weird behaviors where both `SELECT 3 WHERE NULL = NULL;` and `SELECT 3 WHERE NULL <> NULL;` returns nothing (because the query engine does not output a row if the predicate returns UNKNOWN on it).
What you listed above only comes into play for boolean/logical connectives like AND, OR, NOT, and in that case we follow 3-valued logic.
And there's more annoying corner cases when you deal with DISTINCT. The situation is so hopeless that SQLite has a whole table documenting divergent behaviors of NULL in different systems: https://www.sqlite.org/nulls.html
Indeed, they're not identical - that's why I just said "based on", and that's likely why the word UNKNOWN itself isn't used in SQL.
Nevertheless I find it a useful intuition pump. I wager that most people reading `UNKNOWN = UNKNOWN` or `UNKNOWN <> UNKNOWN` and thinking about the examples above would stop and say, "Wait, I actually don't know the value of that statement for sure either, since the LHS and the RHS could be completely different things," and would then double check what their SQL dialect would actually do in this situation.
> the word UNKNOWN itself isn't used in SQL
That leads to an even more confusing point, that some systems (at least SQLite) overloads NULL to mean UNKNOWN, for example `SELECT 1 WHERE NULL = (NULL = NULL)`.
And the dangerous thing about NULLs is not when they are explicitly used in the query as a literal (as I did for brevity), but when they appear in tables. It's perfectly reasonable to assume `SELECT COUNT( * ) FROM t;` should be the same as `SELECT COUNT( * ) from t WHERE t.x = t.x OR t.x <> t.x`, but they are not the same because the latter does not return NULL rows. This has lead to real query optimizer bugs. For more examples see this paper https://dl.acm.org/doi/10.14778/3551793.3551818
> select null = null; returns NULL, because each NULL is basically a placeholder representing any “unknown value”. Two unknown values are not necessarily the same value; we can’t say that they are equal, because we don’t know the value of either of them.
Agreed with all of this, it would probably have been better if they were named `unknown` instead of reusing the `null` keyword.
Note also that since Postgresql 15, you can use `NULLS NOT DISTINCT` when creating a unique index [0]. I'm less familiar with other databases.
The problem is that in practice in a database NULL is a placeholder for a missing value, not an unknown value.
SQL NULL doesn’t behave like “unknown” in all contexts. That’s one issue of NULL, that it doesn’t really have consistent semantics.
Furthermore if null only means unknown then we need a value for “known absent”, there’s a reason why null is so often used as that.
From a purely relational perspective, if some piece of data can be absent, it's a 1:N relation where N<=1, and ought to be encoded as such.
(Of course, this is rather awkward in practice, and when NULL is there, it's inevitably going to be used for that instead.)
It is encoded as such. That's why most columns are made nullable. It's crazy to say you need to use the full power of a 1:N relation with some child table when you know N cannot be greater than 1, when a nullable column already exactly encodes a 1:(0..1) relation. I'm not trying to shill for null here: one of null's great problems is exactly the fact that null can represent "unknown", "known absent", "not applicable", or even some sentinel-ish "other" escape condition, each with their own subtle differences in how they should be handled.
Null has tons of problems, of course. But it's patently absurd to claim that you "ought to be" making a different 1-column table with a unique foreign key or shared primary key for every column that might be absent, because of some vague appeal to the fact that you can write it as a 1:N relation with N<=1. You can just as easily claim that every non-nullable column is a 1:N relation where N==1 and "should be encoded as such". It is encoded as such! That's what a column is!
Do you actually need that in a Boolean context? It would only be useful for evaluating self-referent claims like "this sentence is false".
Your questions might be relevant if null were limited to boolean contexts.
It’s not.
The result of comparisons involving NULL values can result[1][2] in UNKNOWN, and in PostgreSQL for example you can test[3] for this using IS UNKNOWN.
That said, as someone self-taught in SQL, I agree NULL was not a good choice.
Replacing NULL with UNKNOWN and the third boolean value as INDETERMINATE for example would have been better.
[1]: https://stackoverflow.com/a/79270181
[2]: https://learn.microsoft.com/en-us/sql/t-sql/language-element...
[3]: https://www.postgresql.org/docs/current/functions-comparison...
SQL was developed in the 1970s, there’s no way they’d waste all those bytes to spell out UNKNOWN and INDETERMINATE.
Also self-taught SQLer and I don't have an issue with NULL.
I also don't use UNIQUE constraints, so maybe that has something to do with it.
[delayed]
Introducing “unknown” feels like another kind of hell like undefined in JavaScript.
Just to clarify, I'm not advocating to introduce a new `unknown` keyword. I'm saying that the existing `null` in SQL was not named properly and that the name `unknown` would have been more fitting. SQL's `null` already has the semantics of `unknown` as explained in the part of the article that I quoted.
SQL's use of "null" is probably one of the oldest instances of that concept in computing. It's exactly equivalent to unknown. That is its definition.
SQL NULL is not "exactly equivalent" to unknown. E.g. in an outer join, there's nothing unknown about the result that is missing a row from one side, yet SQL semantics is to fill it with nulls.
In practice, it behaves as "unknown" in some contexts, as "missing value" in other contexts, and sometimes it's just plain WTF like SUM() returning NULL rather than 0 if there are no rows.
Really? I know that SQL is old but I would have expected `null` to refer to pointers at first.
Going by Wikipedia, I see that SQL is from 1974 and C from 1972. Were there earlier uses/drafts where `null` is "unknown" instead of "unset"?
You'd also have to ask when NULL came into common use in C (to which I do not know the answer). AFAIK NULL was not considered to be standard until C89. As far as I'm aware, all C compilers implement it as a #define in a standard header somewhere; it's not actually part of the core language itself.
I wonder who first added that macro? Was it there from the beginning?
Just random thoughts...
I wouldn't necessarily define `null` as "unknown" -- it's just "no value" -- which is really the same thing and also somewhat equivalent to "unset". But null pointers aren't unset as pointers aren't initialized to null in C and you can explicitly set a pointer to null.
E.F. Codd added nulls to relational model in 1970 so that does pre-date C. The concept is even older than that I imagine.
In nth normal form, you can't have 'no value'. That would mean your model is wrong. In academic relational data books, null does mean "unknown". There is a value, we just don't know what it is (yet).
If there might actually not be such a value, you're supposed to change your schema to reflect that.
>also that since Postgresql 15, you can use `NULLS NOT DISTINCT` when creating a unique index [0]. I'm less familiar with other databases.
Why would anyone want to use another database?
SQLite has its own use cases. And MySQL was all-around better than Postgres in the past, so it still has more usage in some areas. Nowadays I'll take Postgres over MySQL by default, but it's not a huge difference.
Simplicity. PG is often overkill for simple apps, where MySQL/Maria/et al is easier to maintain, or even SQLite for very simple apps where zero DB maintenance is preferable.
Also the reverse — MySQL et al support much more complex replication topologies out of the box, including multi master for the very rare use case when you need it.
It's also much easier to tune, most database instances require setting innodb_buffer_pool_size, and that's basically it. Newer versions can even set it automatically if you're fine with consuming all memory on that machine, thus requiring no tuning at all.
Why would you say MySQL/Maria/et al are easier to maintain for simple apps than PG?
The main pain point for smaller apps is that every major Postgres version requires an explicit migration of the underlying data representation. I get why it's there, but for simpler apps I would appreciate a flag to do it transparently.
I'm not sure what you mean. I have migrated versions without having to update any applications that connects to it?
Maybe it is a driver specific issue? I have used Python/Java, and haven't updated any of my code or dependencies because of a major Postgre update
It's not client side, it's server side. The binary format between Postgres major versions is generally not compatible so you need to do a pg_dump from the old database and do a pg_restore into the new one. pg_upgrade [1] can let you bypass this by migrating the binary data directly, but still requires having both the new and old postgres versions installed. There's also things you can do with replication, but since we're talking simplicity I don't think that really applies.
Personally I think upgrades are the one thing MySQL has on Postgres at this point.
It requires manual interventions because the upgrade process is basically dump + restore. MySQL and MariaDB upgrade between major versions automatically — you simply install the next version (or change the version tag in your container definition) and restart the server.
Usually it takes almost no time, altought might be just as slow as PG when major changes to the data format are introduced. The only example I can remember is 8.0 when oracle completely rewrote the data format (making things like atomic ddl possible).
You need to dump the database on the old PostgreSQL version and then load the dump on the new PostgreSQL version. Some other database servers can just use the old table data or migrate the data files on the fly transparently.
Usually its a skill issue
VACUUM
If your data's large and changing enough that you have to care about vacuuming, any reasonable database is going to require some tuning, tending and management.
I'd posit that only a tiny fraction of PostgreSQL uses have to know or care that vacuuming is a thing because the autovacuum default handle it for them.
Sure, it's never going to be plug and play, but it doesn't mean that all the issues will be equivalent. Vacuuming doesn't really have an equivalent in say, MySQL. It's something you don't have to worry about if you use the latter.
For example, HA and clustering will always be challenging to deploy/maintain, but you will still have a harder time doing that with postgres than with MySQL. Postgres also has a lot of benefits obviously, though.
* legacy applications or vendor lock-in
* use of a cloud provider that favours another database (like SQL server)
* some people claim mysql/maria is faster for them, simpler, or has a better support for replication
* use of sqlite for lightweight or portable apps
* domain specific databases that still use sql as their query language
* someone may want to use another database for fun or to learn something new
Replication
In Object Oriented Context "null" is useful to indicate that some object doesn't have value for that property.
What's interesting is, do we mean that in our data that attribute has no value? Or do we mean the real-world object represented by the data does not have that attribute?
Does null mean
a) We don't know the value of this attribute for this object, or
b) We do know that there is no value for this attribute in the real-world object represented by our data.
In JavaScript because there is both null and undefined it is easy to assume that undefined means we don't know the value and null means we do know it has no value.
EXAMPLE: The attribute 'spouse'. Some people have a spouse some don't. So what does it mean if the value of the field 'spouse' is null? That we know there is no spouse, or that we don't know who the spouse is if any.
In practical terms we can say null means "We don't know" which includes the case that there is no spouse.
How about Option<bool> for that? In SQL you could have a nullable Boolean column
I remember from my databases course at university that NULL means that the database doesn't contain that data, and empty string means that it is known to be empty.
That's your professor's opinion, and probably one that does not come from industry experience. Look in 4 different databases and you'll see 9 different conventions. A common one is to have all strings non-null with a default value of empty string. And not all columns are strings; there is no "obviously empty" integer or boolean.
What is the type is something other than a string?
age: null? married: null?
But in a _relational_ database lack of spouse would not be modeled with a nullable column "spouse" but rather an absence of a spouse row/relation. Which is very real-world-like.
> ... and this is even less obvious if you’re used to using ORMs.
Which is why I continue to be such an ORM skeptic. I agree that they're convenient. But I do worry that we've now got an entire generation of engineers who regularly interact with relational databases, but have largely been spared the effort of learning how they actually work.
As another commenter pointed out, if you've learned basic relational algebra then the way SQL nulls behave seems obvious and logically consistent. The logic is the same as the logic behind the comparison rules for NaN in IEEE floats. It's the behavior of C-style nulls that is, always and forever, a billion-dollar mistake.
Stop thinking of ORMs as trying to hide the details of SQL and you'll stop hating them. Instead think of them as a way to compose relational queries dynamically, with the full power of your primary language, instead of inside of database stored procedures in a language totally devoid of any support for basic software engineering best practices. They shouldn't be hiding SQL from your primary language, they should be exposing the relational model to it! SQL is not the only possible implementation of the relational model, and it's not even a particularly good one. Even SQL's founders don't think it implements EF Codd's relational model very faithfully. Good ORMs act as a domain-specific language for the relational model embedded inside the parent language.
If you can avoid learning SQL or the underlying DBMS then great, I have no problem with people ignoring things that are ignore-able. Problem is you can't. You will need to learn the DBMS, and the ORM will end up just getting in the way of what you really want to do.
My experience with ORMs is that most of the time you end up needing to write some queries by hand, in raw SQL. Usually these are the most complex, that you can't express in your ORM's DSL. My point being, I don't think using an ORM really shields you from having to learn how it works behind the scenes.
It's not even about having to write SQL by hand. In an ORM like Django that's exceedingly rare. But you still need to understand what's going on underneath. In other words, it's the most leaky abstraction there is. I think the popularity is mostly aesthetic and convenience. Most people into ORMs like Django don't really know about layered architecture and that you can keep all your SQL in one place in the data access layer. They just scatter that stuff everywhere in the codebase.
I don't know Django specifically but I'm always floored by how people talk about ORMs. They're only a leaky abstraction if you believe their point is to shield terrified junior devs of the inner workings of the scary relational database. That's an awful way to use ORMs, and the source of most of the flak they get. To be fair, some are designed that way, or at least strongly push you toward it.
Stop thinking of ORMs as trying to hide the details of SQL and you'll stop hating them. Instead think of them as a way to compose SQL dynamically, with the full power of your language. SQL is an awful language to write application logic in, because it has horrible support for abstraction, composition, encapsulation, dependency injection, etc. The ORM gives you a way to produce SQL in an environment that actually supports basic software engineering principles. Scattering ORM logic everywhere in the codebase is the point: putting all your SQL in one data access layer is like putting all your arithmetic in one calculation layer. Why would you ever do that? What's wrong with seeing a plus sign in more than one file? What's wrong with seeing language-encoded relational logic in more than one file?
I can guarantee you the popularity is not "aesthetic". And convenience is a real thing that actually does reduce costs. People complain about ORMs, but have you seen the absolute horse-shit-level code that people jam into SQL functions and procedures to do the utterly most basic things? The standard for what ends up in SQL Stored procedures is the most unmaintainable garbage in the entire software engineering ecosystem.
NULLs are weird because they are basically two different types under the same name. The 3-value logic type is useful for representing "missing" foreign keys, but 2-value logic type is arguably more useful when searching/sorting/aggregating.
I think we would have been better-off by treating FKs (and maybe outer JOINs) as a special case, and using 2-value logic everywhere else.
I actually like how NULLs behave in SQL. They mean "I don't know" In the modern programming language we all care about Null safety. But no matter how you model your data, you will always run into the situations when you don't know everything. So I believe NOT NULL is not very practical. NULLs in SQL handle these case very well - when the input is unknown your output is unknown
Except they don't consistently behave that way. If NULL means "unknown", why do they show up in outer joins, or when you SUM an empty table?
I feel like the same, Null equal null is null is totally right
Agreed. If SQL didn't have NULL, we'd have other special values meaning "I don't know" or "no data" all over the place.
Too many newbies hear that NULL is bad, so they declare all columns as NOT NULL and end up inserting ad hoc values like 0, -1, '', or {} when they inevitably come across cases where they don't have data. Which is even worse than NULL.
SQL NULL is also coerced to different boolean values depending on context e.g. in a WHERE clause NULL is coerced to false, whereas in a CHECK constraint NULL is coerced to true
I expected the article to mention how in Oracle NULLs are equal to empty strings. Now that is weird.
Oh man. I had a talk with a DBA about how oracle could not deal with an adress with no street name - literally a tiny village with 10 houses on 1 nameless town square. It was unsearchable in parts of the app because street='' was interpreted as street is null. DBA kept claiming oracle was right and the town should adapt their naming to our software.
This attitude was so prevalent at the time, I sometimes wonder if the rise of noSQL was simply people sick of dealing with Oracle DBAs
> This attitude was so prevalent at the time, I sometimes wonder if the rise of noSQL was simply people sick of dealing with Oracle DBAs
That was definitely one part; another part was sharp corners in MySQL (at least as of 20 years ago; I would be surprised if many of them haven't been rounded off in the meantime). The last part was places with no DBA with developers unaware of how to handle schema migrations.
Domain-embedded nulls are the bane of my existence.
SQL NULLs are not weird once you consider how you want relational logic to work when they is a record with non-existent values.
Exactly this. SQL is based on the relational algebra and that's well-defined, NULL along with other features of SQL work in an entirely regular and predictable way. The only time it's weird is when a developer decides that it should work the way Javascript (or whatever) NULLs work because that's the last time they saw the same word used in a programming language, in which case it's the assumption that's weird.
That's not the only time it is weird. There's even a whole book by one of the pioneers of the relational DB model, Date's "Database Technology: Nulls Considered Harmful" [1], covering many of the ways it is weird.
[1] https://www.amazon.com/Database-Technology-Nulls-Considered-...
The part that’s weird with nulls is that it’s a trinary logic stuffed into a boolean algebra. The use of x = NULL instead of x IS NULL is pretty much always a mistake.
More importantly, x = value instead of (x = value and x IS NOT NULL) is almost always a mistake, and a stupidly subtle one at that. And for this curse, we get… nothing particularly useful from these semantics.
Also the x != NULL case is completely cursed
Agreed.
I will die on the hill that regular C-like nulls are the actual thing that's weird.
The real billion dollar mistake [1] was the damage it made on the minds of developers.
Even null in programming languages isn't so bad if it's a distinct type. The problem with null in languages like Java is that null is part of every reference type (C's pointers are another world of broken, null being basically just another unsafe invalid address).
Most languages nowadays do get nulls right, even PHP of all things.
>unsafe invalid address
Ironically NULL is probably the safest pointer value in C, as any dereferences from it (and thousands of pages that follow it) are guaranteed to crash.
We should start adjusting that billion for inflation.
SQL nulls in some ways behave in similar to floating point nans. Of course nans are also weird in their own way, but it is a bit comforting that its not so completely singularly weird.
Also similar to the bottom value in haskell and exceptions in other languages.
NaN is cool because it's almost like a type that constrains uncertainty. What do we know about this entity? not much! but it's definitely not a number. Calling it anything else would have been a mistake.
Null is more confusing because it means different things in different languages. Sometimes it's a more constrained uncertainty, eg this definitely doesn't exist. But in sql it's a less constrained uncertainty, like "undefined" in math. The value of this thing couldn't make sense in this context, but we can't make assertions about its existence.
There's another comment in here that talks about thinking of NULL as UNKNOWN, and I quite like that. It makes a lot more sense, and I think it would've been a better choice to standardize on. UNDEFINED would also be an improvement.
UNDEFINED would not be accurate. If your signup form has an optional field for a full name which I don’t fill in, I still have a name. Just because a value is not known by your database doesn’t mean it isn’t defined.
E. F. Codd thought about this issue.[0]
> Codd indicated in his 1990 book The Relational Model for Database Management, Version 2 that the single Null mandated by the SQL standard was inadequate, and should be replaced by two separate Null-type markers to indicate why data is missing. In Codd's book, these two Null-type markers are referred to as 'A-Values' and 'I-Values', representing 'Missing But Applicable' and 'Missing But Inapplicable', respectively. Codd's recommendation would have required SQL's logic system be expanded to accommodate a four-valued logic system. Because of this additional complexity, the idea of multiple Nulls with different definitions has not gained widespread acceptance in the database practitioners' domain. It remains an active field of research though, with numerous papers still being published.
> If your signup form has an optional field for a full name which I don’t fill in, I still have a name. Just because a value is not known by your database doesn’t mean it isn’t defined.
There's the Closed World Assumption in a database.
> UNDEFINED would not be accurate. If your signup form has an optional field for a full name which I don’t fill in, I still have a name.
If your signup form has an optional field for middle name which I don’t fill, it can absolutely be because I don’t have a middle name. It’s undefined and known to be so.
> E. F. Codd thought about this issue.[0]
And because four value logic was left out, nulls have to fulfil multiple incompatible roles, and ends up being weird.
> If your signup form has an optional field for middle name which I don’t fill, it can absolutely be because I don’t have a middle name.
The fact that some NULLs are indeed undefined doesn’t contradict my point that it would be inaccurate to refer to all NULLs as UNDEFINED.
> It’s undefined and known to be so… And because four value logic was left out, nulls have to fulfil multiple incompatible roles, and ends up being weird.
You know it to be undefined, but if you just leave that field blank on the form, the database still can’t know whether that’s because you don’t have one or because you didn’t provide the one you have, unless there was a field on the form that allowed you to explicitly indicate that you don’t have one.
Except it is of type number, at least in JavaScript.
Weird as they seem at first, SQL null handling ends up being convenient the way it is. Part of this is because left/right join give you nulls.
Surprised that this doesn't mention "IS DISTINCT FROM" ( https://modern-sql.com/caniuse/is-distinct-from )
(Although in rare cases that is even weirder: https://stackoverflow.com/a/58998043 )
I'm glad SQL Server finally got this, but I wish the syntax was nicer. It's a multi-word infix operator that gets tough to read. I've been using Snowflake SQL recently and I like that they just made it a function called EQUAL_NULL
If you want equality testing with nulls, you want to use `is (not) distinct from` instead of `=` and `<>` / `!=`.
`1 is not distinct from NULL` => false
`NULL is not distinct from NULL` => true
`0 is not distinct from 1` => false
Having that is much better than not having it, but man is it verbose and confusing.
For Postgres specific approach, you may refer to
https://blog.rustprooflabs.com/2022/07/postgres-15-unique-im...
Practically speaking, I go with not null, and always set default value.
This has always made queries unpredictable in many scenarios and it should be a feature to turn nulls off entirely and swap them out with Option<T> instead.
How would you handle unmatched outer joins?
By having a default value (non-null) for each declared type of those columns.
Or, the user must define a default value in the query itself.
Yes, tedious; but, precise and forces the programmer to really prepare for the "unknown" scenario.
a left outer join b yields tuples of (A, Option<B>), a full outer join b yields tuples of (Option<A>, Option<B>)
null != null is pretty bizarre at first, until you understand the reason the did it was to try to make sense of null-able indexed columns. Not sure why we couldnt have our cake and eat it, but instead we got IS NOT NULL is not the same as != NULL
FWIW, you can explicitly change this behavior in Postgres as of version 15 - include "NULLS NOT DISTINCT" when creating the unique index.
Weirder still are floating point numbers in SQL.
There's another good, technical write-up on NULL behavior in SQL at modern-sql.com
https://modern-sql.com/concept/null
(Note: I am not affiliated with that bloh/website in any way, shape, or form.)
The NULLs in unique constraints quirk actually works differently in ORACLE databases, which is infuriating to say the least. Apparently this comes from some ambiguity in some sql standard, anyone know more about this?
NULL is the absence of a value. If you try and treat it as a value, you're going to have a bad time. So an attempted UNIQUE(email_address, deleted_at) constraint is fundamentally flawed. If you treated NULL as a value that could be unique, you're going to break foreign keys.
But let's continue the logic of deleted_at being NULL indicating an active account, which seems to the intent here. You end up doing things like:
SELECT /* ... */
FROM accounts
WHERE email_address = '...'
AND deleted_at IS NOT NULL
Depending on your database, that may or may not index well. More problematic, you may end up with privacy leaks if someone forgets the last conditional.If anything, you want to reverse this so someone has to go out of their way to explicitly select deleted accounts. There are multiple strategies for this eg using an active_accounts view or table.
Lastly, there are lots of potential reasons for an account to be disabled or otherwise not visible/accessible. Takedowns, court orders, site safety, hacked accounts and so on.
Overloading deleted_at to have a semantic meaning for an active account is just fundamentally bad design.
> Overloading deleted_at to have a semantic meaning for an active account is just fundamentally bad design.
Then don't do that. It's kind of a leap to say soft deletes are categorically bad because someone might confuse "deleted" with "inactive". My users table does the super-advanced thing of having both columns. The ORM also doesn't forget to add the not-null criterion. There's also zero databases in active use where it poses a problem to indexing.
Soft deletes suck in their own way, but none of the alternatives are perfect either.
That's interesting - I believe this is exactly how Sequelize implements soft-deletion.
You put the "is not null" on the index itself and then simply don't use it for the much rarer queries that are on deleted accounts. Or just use a view for active accounts.
Overloading timestamps to carry a boolean on null is awesome as long as you decide that's what you're doing and use one of the several standard techniques to dodge the easily avoided potential downside.
This isn't a valid security concern, more than any other incorrect sql query would be anyway. A dev can always write a bad Q, you need another way to address that it's not more likely here because of the null.