Migrating to Postgres (engineering.usemotion.com)
Submitted by shenli3514 19 hours ago
  • luhn 15 hours ago

    > By Jan 2024, our largest table had roughly 100 million rows.

    I did a double take at this. At the onset of the article, the fact they're using a distributed database and the mention of a "mid 6 figure" DB bill made me assume they have some obscenely large database that's far beyond what a single node could do. They don't detail the Postgres setup that replaced it, so I assume it's a pretty standard single primary and a 100 million row table is well within the abilities of that—I have a 150 million row table happily plugging along on a 2vCPU+16GB instance. Apples and oranges, perhaps, but people shouldn't underestimate what a single modern server can do.

    • thehappyfellow 6 hours ago

      It’s incredible how much Postgres can handle.

      At $WORK, we write ~100M rows per day and keep years of history, all in a single database. Sure, the box is big, but I have beautiful transactional workloads and no distributed systems to worry about!

      • rastignack 3 hours ago

        At $WORK, we are within the range of 2 billion rows per day on one of our apps. We do have beefy hardware and ultra fast SSD storage though.

        • wvh an hour ago

          Two days ago, I'd have said the same. Yesterday, the big box went down, and because it was so stable, failover was a joint that rarely got oiled: the spare chickened out at the wrong time and apparently even managed to mess up the database timeline. Today was the post-mortem, and it was rough.

          I'm just saying, simple is nice and fast when it works, until it doesn't. I'm not saying to make everything complex, just to remember life is a survivor's game.

        • hliyan 14 hours ago

          Call me old fashioned, but when records start reaching the 100 million range, it's usually an indication that either your dataset is too wide (consider sharding) or too deep (consider time-based archival) to fit into a monolithic schema. For context, I dealt with multiple systems that generated this volume of data between 2003 and 2013 (mostly capital markets, but also some govt/compliance work) with databases and hardware from that era, and we rarely had an issue that could not be solved by either query optimization, caching, sharding or archival, usually in that order.

          Secondly, we did most of these things using SQL, Bash scripts, cron jobs and some I/O logic built directly into the application code. They were robust enough to handle some extremely mission critical systems (a failure could bring down a US primary market and if it's bad enough, you hear it on the news).

          • hylaride 4 hours ago

            It obviously depends on how you use your data, but it really is surprising how far one can go with large tables when you implement sharding, caching, and read replicas.

            For tables with a lot of updates, Postgres used to fall over with data fragmentation, but that's mostly been moot since SSDs became standard.

            It's also easier than ever to stream data to separate "big data" DBs for those separate use cases.

            • hliyan 2 hours ago

              Thanks, I knew I forgot something: read replicas

            • paulddraper 2 hours ago

              Depends on the read/write workload and row size, but yeah after 100-200m rows PostgreSQL vacuums can take a while. And index rebuilding (which you have to do on an active table) too.

              It all depends though; sometimes 1B is passé.

              But 100m is a good point to consider what comes next.
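
              For the index-rebuild part, this is roughly how it's done on a live table without blocking writes these days (a sketch, assuming a hypothetical orders table and index):

                  -- Postgres 12+: rebuild in place while still allowing reads and writes
                  REINDEX INDEX CONCURRENTLY orders_customer_id_idx;

                  -- Older equivalent: build a replacement alongside, then swap it in
                  CREATE INDEX CONCURRENTLY orders_customer_id_idx_new ON orders (customer_id);
                  DROP INDEX CONCURRENTLY orders_customer_id_idx;
                  ALTER INDEX orders_customer_id_idx_new RENAME TO orders_customer_id_idx;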

            • sgarland 3 hours ago

              Agreed. Devs usually do a double take when I tell them that their table with 100K rows is not in fact big, or even medium. Everyone’s experiences are different, of course, but to me, big is somewhere in the high hundreds of millions range. After a billion it doesn’t really matter; the difference between 5 billion and 1 billion isn’t important, because it’s exceedingly unlikely that (a) your working set is that large, or (b) your server could possibly cope with all of it at once. I hope you have partitions.
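
              A minimal sketch of the time-based partitioning this points at, assuming a hypothetical events table:

                  CREATE TABLE events (
                      event_id   bigint NOT NULL,
                      created_at timestamptz NOT NULL,
                      payload    jsonb
                  ) PARTITION BY RANGE (created_at);

                  CREATE TABLE events_2024_01 PARTITION OF events
                      FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

                  -- Archival becomes detaching (and eventually dropping) whole partitions
                  -- instead of running huge DELETEs:
                  ALTER TABLE events DETACH PARTITION events_2024_01;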

              • icedchai 15 hours ago

                You don't even need to be that "modern." Back in 2010 I was working on a MySQL 5.x system with about 300 million rows on a dual Xeon box with 16 gigs RAM and a few hundred gigs of RAID 10. This was before SSDs were common.

                The largest table was over 100 million rows. Some migrations were painful, however. At that time, some of them would lock the whole table and we'd need to run them overnight. Fortunately, this was for an internal app so we could do that.

                • luhn 15 hours ago

                  The improvements to migrations have been the biggest boon for running even modestly-sized Postgres DBs. It wasn't that long ago that you couldn't add a column with a default value without rewriting the whole table, or adding NOT NULL without an exclusive lock while the whole table was scanned. That becomes unfeasible pretty quickly.

                  • nasretdinov 8 hours ago

                    Does adding a default value into a column finally work without locking up an entire table now at least?

                    • luhn 34 minutes ago

                      Like most ALTER TABLE subcommands, you need an exclusive lock on the table while the catalog is updated. But no table scan or rewrite is required, so that lock is sub-second and can usually be done without disrupting a live application.
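
                      For illustration, roughly what that looks like on Postgres 11+ with a hypothetical orders table (the default is stored in the catalog, so there is no rewrite; a lock_timeout keeps the brief exclusive lock from queueing behind long-running transactions):

                          SET lock_timeout = '2s';  -- give up rather than queue behind a long transaction

                          -- Catalog-only change since Postgres 11: no table rewrite, lock held very briefly
                          ALTER TABLE orders ADD COLUMN status text NOT NULL DEFAULT 'pending';

                          -- Adding NOT NULL to an existing column without a long scan under the lock:
                          ALTER TABLE orders ADD CONSTRAINT orders_email_not_null
                              CHECK (email IS NOT NULL) NOT VALID;                      -- instant, no scan
                          ALTER TABLE orders VALIDATE CONSTRAINT orders_email_not_null; -- scans, but under a weaker lock
                          ALTER TABLE orders ALTER COLUMN email SET NOT NULL;           -- Postgres 12+ reuses the validated constraint, no rescan
                          ALTER TABLE orders DROP CONSTRAINT orders_email_not_null;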

                      • williamdclt 7 hours ago

                        Yes, it has for a few years :) There's almost always non-locking ways to migrate things with Postgres now

                  • throwaway7783 14 hours ago

                    Yeah, we have 300m+ rows in a table as well. It's partitioned by time and chugs along with no issues. Granted, it's a 30 vCPU, 100GB RAM machine, but it hosts billions of rows in aggregate.

                    • SchemaLoad 13 hours ago

                      Last app I worked on had a few tables in the billions of rows. Seemed to work fine as we were only really accessing it by unique keys which seems to remain fast no matter how large the table is.

                      • lelanthran 6 hours ago

                        > we were only really accessing it by unique keys which seems to remain fast no matter how large the table is.

                        Even a naive B-tree index has a logarithmic curve, which means that the time to find a record asymptotically flattens out as the number of records increases.
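
                        Back-of-the-envelope, assuming each index page fans out to roughly 500 keys:

                            \lceil \log_{500} 10^{8} \rceil = 3 \text{ page reads at 100M rows}, \qquad \lceil \log_{500} 10^{10} \rceil = 4 \text{ at 10B rows}

                        So a 100x jump in row count typically costs a single extra level, and the upper levels are usually cached anyway.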

                    • williamdclt 7 hours ago

                      Yeah, 100mil is really not that much. I worked on a 10B rows table on an rds r6g.4xl, and Postgres handled it fine, even with 20+ indexes. Really not ideal and I'd rather have fewer indexes and sharding the table, but postgres dealt with it.

                      • thomasfromcdnjs 15 hours ago

                        Does mid six figure mean ~$500k?

                        That sounds insane for a crud app with one million users.

                        What am I missing?

                        • gbear605 14 hours ago

                          I’ve seen startups with a thousand active users paying $50k/month (though that’s overall costs, not just db). It’s really easy to waste a lot of money doing nothing.

                          • bigfatkitten 10 hours ago

                            It’s especially easy to waste money on databases.

                            People just throw more compute power (ie money) at performance problems, rather than fixing their queries or making better use of indices.
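
                            The cheap first step before buying a bigger instance is usually just reading the plan and adding the index it is begging for (a sketch with hypothetical names):

                                EXPLAIN (ANALYZE, BUFFERS)
                                SELECT * FROM orders
                                WHERE customer_id = 42
                                ORDER BY created_at DESC
                                LIMIT 20;

                                -- If the plan shows a sequential scan over millions of rows, an index usually fixes it:
                                CREATE INDEX CONCURRENTLY orders_customer_created_idx
                                    ON orders (customer_id, created_at DESC);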

                            • mplanchard an hour ago

                              You can see this in the article here, where they are just using whatever garbage queries Prisma spits out.

                              I’ve contended for a long time that ORMs and their ilk automatically building queries is an antipattern for anything but small data scale. At any reasonable size of db, you’re going to need to know sql well enough to write optimized queries anyway. There’s essential complexity in the DB queries, which ORMs can only hide for so long.

                              • cies 7 hours ago

                                And they don't even consider owning the hardware. See what hardware you can buy for that money.

                                • aitchnyu 4 hours ago

                                  How many servers are needed to bounce back from a server failure in a few minutes? Should we consider 3 VMs instead of 1 physical?

                                  • sgarland 3 hours ago

                                    Two. Run one in warm/hot standby, optionally with synchronous commits if you want (nearly) guaranteed zero data loss, and can tolerate the increased write latency.

                                    Technically you’ll need a third server to perform the failover, but it doesn’t need to be nearly as big, as it’s just watching heartbeats and issuing commands.
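
                                    On the primary, that setup is mostly two settings (a sketch; 'standby1' is whatever application_name the replica uses in its primary_conninfo):

                                        -- COMMIT only returns once the named standby has confirmed the WAL:
                                        ALTER SYSTEM SET synchronous_standby_names = 'standby1';
                                        ALTER SYSTEM SET synchronous_commit = 'on';
                                        SELECT pg_reload_conf();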

                                    • graemep 3 hours ago

                                      Server failures are rare. It's still usually going to be cheaper to have physical servers, even with spare capacity for failures.

                                      • cies 3 hours ago

                                        I think these days it's more: where do we find the grey old Unix guy who will confidently host your DB for you on bare metal?

                                • ies7 14 hours ago

                                  $500k for a DB with only 100 million rows also sounds crazy.

                                  • wbercx 13 hours ago

                                    The largest table was 100 million rows. They could have had hundreds more tables.

                                  • imhoguy 10 hours ago

                                    I bet it is cost of query processing (CPU) and traffic (network throughput) plus ofc provider markup.

                                  • h1fra 6 hours ago

                                    I also screamed in my head: I have way more rows than that in a Postgres right now and I'm paying less than $500!

                                    • sethammons 6 hours ago

                                      I missed that number, but caught they migrated all data in 15 minutes and I blinked: "wait, how little data are we talking about for how much money!?"

                                      • ringeryless 10 hours ago

                                        OTOH they are admittedly using an ORM (Prisma, known for its weight)

                                        • monero-xmr 10 hours ago

                                          It is truly amazing how mature developers always wind up at the same result - old tech that has stood the test of time. Betting the company on alpha solutions of dubious quality keeps the devs employed at least.

                                          • cies 7 hours ago

                                            > It is truly amazing how mature developers ...

                                            ...use ORMs!

                                            They are always bad, but especially bad when the code base and/or query complexity grows.

                                            https://dev.to/cies/the-case-against-orms-5bh4

                                            • KronisLV 4 minutes ago

                                              > They make the simple queries slightly simpler, but they do not help you for the harder queries. — me

                                              I don't think it's possible to say they're always bad while offering an example of when they make things better.

                                              In my eyes, it’s pretty simple: use the ORM for the simple CRUD and simple data querying, because they will make that simple and you’ll have to do a lot of it in most codebases.

                                              For everything else, there is in-database processing or dropping down to raw SQL. Even when using an ORM, nobody is preventing you from making an entity mapping against a DB view so you can keep the complex querying logic in pure SQL but get an object that’s easy to work with out of it. Plus, you know, any ORM worth their salt will let you point it at a live DB that you’ve run your (hopefully versioned, for example, with dbmate) migrations against and will let you output the backend code for the entity mapping for everything in a single command.

                                              It’s kind of unfortunate that a lot of ORMs do kinda suck and have jagged edges, as well that most attempts at dynamic SQL also remain pretty niche, like: https://mybatis.org/mybatis-3/dynamic-sql.html
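
                                              A sketch of the view approach with hypothetical orders/customers tables; the ORM then maps a read-only entity onto customer_order_summaries like any other table:

                                                  CREATE VIEW customer_order_summaries AS
                                                  SELECT c.id                                       AS customer_id,
                                                         c.name,
                                                         count(*) FILTER (WHERE o.status = 'open')  AS open_orders,
                                                         sum(o.total_cents) / 100.0                 AS lifetime_total
                                                  FROM customers c
                                                  LEFT JOIN orders o ON o.customer_id = c.id
                                                  GROUP BY c.id, c.name;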

                                        • casper14 15 hours ago

                                          Nice! What optimizations have you put in place to support 150 mil? Just some indexing or other fancy stuff?

                                          • luhn 15 hours ago

                                            You don't need to optimize anything beyond appropriate indices, Postgres can handle tables of that size out of the box without breaking a sweat.

                                            • gopalv 14 hours ago

                                              > Postgres can handle tables of that size out of the box

                                              This is definitely true, but I've seen migrations from other systems struggle to scale on Postgres because of decisions that worked better in a scale-out system but don't do so well in PG.

                                              A number of well-meaning indexes, a very wide row to avoid joins, and a large number of state-update queries on a single column can murder Postgres performance (the UPDATE ... SET last_visited_time = ... sort of madness; mutable/immutable column-family classifications, etc.).
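
                                              One common mitigation for that hot single-column pattern (sketched here with hypothetical tables) is keeping the frequently mutated column out of the wide row, so updates churn a narrow table instead of rewriting wide tuples; a lower fillfactor leaves room for HOT updates:

                                                  -- Wide, mostly-immutable data stays put:
                                                  CREATE TABLE users (
                                                      id      bigint PRIMARY KEY,
                                                      email   text NOT NULL,
                                                      profile jsonb
                                                  );

                                                  -- The hot, frequently updated column lives in its own narrow table:
                                                  CREATE TABLE user_activity (
                                                      user_id           bigint PRIMARY KEY REFERENCES users (id),
                                                      last_visited_time timestamptz
                                                  ) WITH (fillfactor = 70);

                                                  UPDATE user_activity SET last_visited_time = now() WHERE user_id = 42;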

                                              There were scenarios where I'd have liked something like zHeap or Citus, to be part of the default system.

                                              If something was originally conceived in postgres and the usage pattern matches how it does its internal IO, everything you said is absolutely true.

                                              But a migration could hit snags in the system, which is what this post celebrates.

                                              The "order by" query is a good example, where a bunch of other systems do a shared boundary variable from the TopK to the scanner to skip rows faster. Snowflake had a recent paper describing how they do input pruning mid-query off a TopK.

                                              • sgarland 3 hours ago

                                                That’s not the fault of the DB, though, that’s bad schema design. Avoiding JOINs is rarely the correct approach.

                                            • williamdclt 7 hours ago

                                              You really don't need anything special. 150M is just not that much, postgres has no problem with that.

                                              Obv it depends on your query patterns

                                          • esafak 16 hours ago

                                            I read it as: Why You Shouldn't Use Prisma and How Cockroach Hung Us Out To Dry

                                            I already knew about prisma from the infamous https://github.com/prisma/prisma/discussions/19748

                                            • vvpan 12 hours ago

                                              I am in a startup that's using Prisma and we really wish we had not:

                                              - The query objects can become hard to read with anything more or less complex.

                                              - If you need an unsupported Postgres extension you are out of luck.

                                              - One large file in a schema, impossible to shard.

                                              - We have many apps in a monorepo and they cannot have separate prisma connections cause the schema gets baked into "@prisma/client"

                                              Basically the only thing useful about it are the TS types which is something SQL-builder libraries solve better. Long story short, use Kysely, Prisma provides no value that I see.

                                              • pier25 16 hours ago

                                                Prisma is so bad... can you believe it's by far the most downloaded ORM in NPM?

                                                • seer 13 hours ago

                                                  I don’t understand the hate. The only truly limiting factor for Prisma right now is its poor support for polymorphism. Apart from that, it has quite good support for complicated index setups, and if you need anything more performant, just drop to typed raw SQL queries. It also supports views (materialized or otherwise) out of the box.

                                                  I recently wanted to check it out and wrote a small app that had good use of pgvector for embeddings, custom queries with ctes for a few complex edge cases, and it was all quite smooth.

                                                  Now it might not be at the level of active record, ecto or sqlalchemy but it was quite decent.

                                                  If you know your sql at any point it gave me options to drop down a level of abstraction, but still keep the types so as not to break the abstraction too much for the rest of the code.

                                                  • sitharus 12 hours ago

                                                    I don't hate prisma - it's just a tool - but that's far from the only limiting factor.

                                                    I recently looked at migrating a legacy project with basic SQL query generation to a modern ORM. Prisma came up top of course so I tried it.

                                                    We use Postgres built-in range types. Prisma does not support these, there's no way to add the type to the ORM. You can add them using "Unsupported", but fields using that aren't available in queries using the ORM, so that's pretty useless.

                                                    It also requires a binary to run, which would require different builds for each architecture deployed to. Not a big thing but it was more annoying than just switching the ORM.

                                                    That coupled with their attitude to joins - which has truth to it, but it's also short-sighted - eliminated Prisma.

                                                    The final decision was to switch to Kysely to do the SQL building and provide type-safe results, which is working well.

                                                  • vvpan 12 hours ago

                                                    How do you do typed raw queries?

                                                • VWWHFSfQ 15 hours ago

                                                  Every ORM is bad. Especially the "any DB" ORMs. Because they trick you into thinking about your data patterns in terms of writing application code, instead of writing code for the database. And most of the time their features and APIs are abstracted in a way that basically means you can only use the least-common-denominator of all the database backends that they can support.

                                                  I've sworn off ORMs entirely. My application is a Postgres application first and foremost. I use PG-specific features extensively. Why would I sacrifice all the power that Postgres offers me just for some conveniences in Python, or Ruby, or whatever?

                                                  Nah. Just write the good code for your database.

                                                  • pier25 15 hours ago

                                                    I use PG with Entity Framework in .NET and at least 90% of my queries don't need any PG-specific features.

                                                    When I need something PG specific I have options like writing raw SQL queries.

                                                    Having most of my data layer in C# is fantastic for productivity and in most cases the performance compared to SQL is negligible.

                                                    • theshrike79 10 hours ago

                                                      Coming from Javaland to C#, Entity Framework is a breath of fresh air.

                                                      The Npgsql driver automatically applies PG-specific tricks without me having to do anything special

                                                      The only part I had to handle myself was the data ingress point, which had some race condition issues; everything else seems to perform pretty well out of the box.

                                                      • romanhn 15 hours ago

                                                        Entity Framework really is such a time saver. The PG adapter makes it a breeze not just with common queries, but also more recent stuff, like working with embeddings for vector search.

                                                      • evantbyrne 14 hours ago

                                                        Nah. The most prolific backend frameworks are all built on ORMs for good reason. The best ones can deserialize inputs, validate them, place those objects directly into the DB, retrieve them later as objects, and then serialize them again, all from essentially just a schema definition. Just to name a few advantages. Teams that take velocity seriously should use ORMs. As with any library choice you need to carefully vet them though.

                                                        • marcus_holmes 12 hours ago

                                                          The "good reason" is that modern web devs do not consider SQL a core skill, and plain do not understand databases. To be a competing modern web framework you have to include an ORM so these people will consider you.

                                                          Trying to explain to a modern web dev that the optimum data storage structure is not the same as the optimum application layer data structure, so you can't just take one and map them across 1:1 to the other, is really painful.

                                                          Developing without an ORM is just as quick as developing with one (because the time you save on routine queries you will more than lose on the tricky edge cases that the ORM completely screws up on). But you need to know SQL and databases to do it.

                                                          • evantbyrne 3 hours ago

                                                            Could this be selection bias? I've never worked with a backend engineer that couldn't write SQL. I've worked on plenty of projects where there were bugs in hand-written SQL though.

                                                          • wredcoll 12 hours ago

                                                            ORMs are pretty much the definition of technical debt.

                                                            Sometimes debt is worth it. Sometimes the interest rate is too high.

                                                            • hombre_fatal 5 hours ago

                                                              On the other hand, ORMs insulate you from database integrity since ORMs have limited access to underlying database features.

                                                              In Postgres that usually means you're not locking rows, you're not using upsert, you might not be writing table DDL yourself. It often means you aren't even using database transactions.

                                                              While these things might be extraneous fluff for an all-nighter hackathon, you really have to figure out a sweet spot so that data integrity violations aren't killing your velocity when your service's rubber begins hitting the road.
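
                                                              For a sense of what that looks like in plain Postgres, a minimal sketch with hypothetical accounts/ledger tables (explicit transaction, row lock, and upsert; assumes a unique constraint on (account_id, day)):

                                                                  BEGIN;
                                                                  -- Lock the row so concurrent writers queue up instead of clobbering each other:
                                                                  SELECT balance FROM accounts WHERE id = 42 FOR UPDATE;

                                                                  UPDATE accounts SET balance = balance - 100 WHERE id = 42;

                                                                  -- Upsert: insert or update atomically in one statement:
                                                                  INSERT INTO ledger (account_id, day, total)
                                                                  VALUES (42, current_date, 100)
                                                                  ON CONFLICT (account_id, day) DO UPDATE
                                                                      SET total = ledger.total + EXCLUDED.total;
                                                                  COMMIT;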

                                                              • evantbyrne 3 hours ago

                                                                These are important features for a database toolkit to consider. I don't think that it is fair to dismiss an entire category of libraries on the grounds of some implementations being less complete than desired though. If we applied that same standard more generally, then we wouldn't use anything at all, because most software libraries kind of stink.

                                                                • hombre_fatal an hour ago

                                                                  Fair enough. Do you have a favorite ORM that makes what you feel is a decent set of trade-offs, all things considered?

                                                                  Admittedly, most of my experience with ORMs was with Ruby on Rails' Active Record + Rails' generated SQL tables + the culture that ensued from it, like large production Rails applications that didn't use a single db transaction (and often no indexes). Though I reckon things could have changed in 15 years.

                                                                  I can imagine that an ORM might be the best option for most people. It wasn't until I worked at one specific company that I learned how to really use Postgres. Before that, an ORM and its abstractions probably made more sense than expecting me to figure out how to use a database directly on my own.

                                                                  • evantbyrne 7 minutes ago

                                                                    I've been building one for Go in my free time, but it's not ready for general use. Historically, I've used Django despite being imperfect, because I can just install Wagtail and have a nice admin interface for free. It does have some nice convenience features though and transactions are easy enough. At my day job we use a Java framework with a terrible codegen-based ORM. Laravel has a decent database toolkit if you are into PHP. Unfortunately, excellent database toolkits are rare, and I have historically found myself dipping into SQL frequently. All decent ones will at least allow you to do so though.

                                                            • CuriouslyC 14 hours ago

                                                              SQLAlchemy is pretty good because it's mostly a SQL engine with an ORM bolted on top, and the docs actively point users towards using the SQL engine rather than the ORM for everything.

                                                              • reillyse 14 hours ago

                                                                Every ORM except Active Record is awful. Active Record is amazing.

                                                                • MrPapz 6 hours ago

                                                                  I still dream about a JS version of Rails' Active Record.

                                                                  • irjustin 13 hours ago

                                                                    I moved from Rails -> Django and man my life is so painful. The Django ORM is an exercise in patience.

                                                                    To be fair, Prisma's `OR` clause looks so good. Way better than ActiveRecord.

                                                                    • globular-toast 9 hours ago

                                                                      Except active record can barely be considered an ORM IMO. Doing a literal one to one mapping between records and objects is not that impressive. A real data mapper ORM at least gets you true entities that are decoupled from the db. That way you could totally swap out your data layer without affecting your domain layer. Active record leads to big ball of mud architecture.

                                                                    • globular-toast 9 hours ago

                                                                      It's actually even worse than this, many Django applications are straight up Postgres applications. They use Postgres specific bits of the ORM without hesitation. So they're learning these weird ORM incantations instead of just learning the underlying SQL, which would be knowledge you could apply anywhere.

                                                                      People just hate embedding SQL into other languages. I don't know why.

                                                                      • cess11 6 hours ago

                                                                        That's a tradeoff that sometimes makes sense. MICROS~1 SQL Server heavily leans into the 'use specific features extensively', and countless applications on it consist mainly of stored procedures. It does however cause a lock-in that might not be attractive, your customers might be sensitive to what database engine you run their stuff on and then you need to figure out the common ground between two or more alternatives and build your application in that space.

                                                                        It's not as uncommon as one might think, one of the big products in public sector services where I live offers both SQL Server and Oracle as persistence layer so they can't push logic into stored procedures or similar techniques.

                                                                        But just sketching out some schemas and booting PostgREST might be good enough for forever, if that's the case, go for it. As for ORM:s, I kind of like how Ecto in Elixir settings does things, it solves a few tedious things like validation and 'hydration', and has a macro DSL for generating SQL with concise expressions.

                                                                    • etblg 16 hours ago

                                                                      > It's true that Prisma currently doesn't do JOINs for relational queries. Instead, it sends individual queries and joins the data on the application level.

                                                                      ..........I'm sorry, what? That seems........absurd.

                                                                      edit: Might as well throw in: I can't stand ORMs, I don't get why people use it, please just write the SQL.

                                                                      • jjice 16 hours ago

                                                                        I believe it’s either released now or at least behind a feature flag (maybe only for some systems). It’s absolutely absurd it took so long. I can’t believe it wasn’t the initial implementation.

                                                                        Funny relevant story: we got an OOM from a query that we used Prisma for. I looked into it - it was a simple SELECT DISTINCT. Turns out (I believe it was changed like a year ago, but I’m not positive), even distincts were done in memory! I can’t fathom the decision making there…

                                                                        • etblg 16 hours ago

                                                                          > even distincts were done in memory! I can’t fathom the decision making there…

                                                                          This is one of those situations where I can't tell if they're operating on some kind of deep insight that is way above my experience and I just don't understand it, or if they just made really bad decisions. I just don't get it, it feels so wrong.

                                                                          • MonkeyClub 4 hours ago

                                                                            > if they're operating on some kind of deep insight

                                                                            If one's getting OOM errors from a SELECT DISTINCT, then there's no deep insight behind the choice, it's just a mistake.

                                                                            • wredcoll 12 hours ago

                                                                              It really gives me flashbacks to the early days of mongodb.

                                                                              Which, frankly, is a good lesson that marketing and docs and hype can make up for any amount of technical failure, and if you live long enough, you can fix the tech issues.

                                                                              • Tadpole9181 15 hours ago

                                                                                > I can't tell if they're operating on some kind of deep insight that is way above my experience and I just don't understand it

                                                                                This is answered at the very top of the link in the post you replied to, in no uncertain terms. Direct link here: https://github.com/prisma/prisma/discussions/19748#discussio...

                                                                                > I want to elaborate a bit on the tradeoffs of this decision. The reason Prisma uses this strategy is because in a lot of real-world applications with large datasets, DB-level JOINs can become quite expensive...

                                                                                > The total cost of executing a complex join is often higher than executing multiple simpler queries. This is why the Prisma query engine currently defaults to multiple simple queries in order to optimise overall throughput of the system.

                                                                                > But Prisma is designed towards generalized best practices, and in the "real world" with huge tables and hundreds of fields, single queries are not the best approach...

                                                                                > All that being said, there are of course scenarios where JOINs are a lot more performant than sending individual queries. We know this and that's why we are currently working on enabling JOINs in Prisma Client queries as well. You can follow the development on the roadmap.

                                                                                Though this isn't a complete answer still. Part of it is that Prisma was, at its start, a GraphQL-centric ORM. This comes with its own performance pitfalls, and decomposing joins into separate subqueries with aggregation helped avoid them.

                                                                                • evanelias 13 hours ago

                                                                                  It's a completely ridiculous answer though. They're linking to High Performance MySQL's 2nd edition, which came out in June 2008, and was written for users of MySQL 5.0 running on 2008-era hardware.

                                                                                  My take, as a MySQL expert: that advice is totally irrelevant now, and has been for quite some time. It's just plain wrong in a modern context.

                                                                                  • setr 9 hours ago

                                                                                    I’m not even sure it was correct for its time? The whole point of an RDBMS is to execute join operations. The only reason I’d suspect an RDBMS to be bad at its one fundamental job, at any point in time, is the N+1 query scenario or multiple left joins with unrelated dependencies, but that’s triggered by bad ORM abstractions to begin with.

                                                                                    • evanelias 2 hours ago

                                                                                      It was absolutely valid advice for its time, but only in the highly specific cases/reasons outlined in the book. The 2nd edition was written by the top Percona folks, who pretty much had more experience scaling databases for large websites than anyone else.

                                                                                      The Prisma answer just does not summarize correctly what the book was saying.

                                                                                      It mainly boiled down to sharding and external caching. Storage and memory were much smaller back then, so there was a lot of sharding and functional partitioning, and major reliance on memcached; all of those are easier if you minimize excessive JOINs.

                                                                                      The query planner in MySQL wasn't great at the time either, and although index hints could help, huge complex queries sometimes performed worse than multiple decomposed simpler queries. But the bigger issue was definitely enabling sharding (cross-shard joins had to be handled at the application level) and enabling external caching (do a simple range scan DB query to get a list of IDs/PKs, then do point lookups in memcached, then finally do point lookups in the DB for any that weren't in memcached).

                                                                              • seer 13 hours ago

                                                                                Tbh, I once dabbled in building an ORM myself (in PHP) and I did find that in some situations it was faster to do individual queries and then join in code, to solve the N+1 problem.

                                                                                Granted, my SQL knowledge was much weaker then, and Postgres/MySQL had severe limitations in their query planners, so I can see how something like this could have happened. If they support multiple DBs, and even one has this problem, it might be better (for them) to do it application-side.

                                                                                The specific issue was doing a join with a table for a one-to-many: if you do the join the naive way you get a lot more data from the DB than you normally need, and if the join is nested you get exponentially more data.

                                                                                It was faster to do a query for each table separately and then stitch the results.

                                                                                Now it is easy to solve in PG with nested selects and JSON aggregation, which the PG query planner rewrites to efficient joins, and you still get only the bytes you requested, without duplication.
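
                                                                                A sketch of that pattern with hypothetical articles/comments tables; each article comes back once, with its children as a JSON array instead of duplicated join rows:

                                                                                    SELECT a.id,
                                                                                           a.title,
                                                                                           (SELECT coalesce(json_agg(c ORDER BY c.created_at), '[]'::json)
                                                                                            FROM comments c
                                                                                            WHERE c.article_id = a.id) AS comments
                                                                                    FROM articles a
                                                                                    WHERE a.author_id = 42;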

                                                                              • lesuorac 3 hours ago

                                                                                Can't speak about Prisma (or Postgres much).

                                                                                But I've found that in a _few_ situations you can get better performance with application-level joins than with SQL joins: when the SQL join causes a table lock, you trade (somewhat slower) parallel application-level joins for sequential MySQL joins. (The lock also blocking other parallel DB queries is generally a bigger deal than whether this one endpoint is faster or not.)

                                                                                I do reach for the SQL join first, though; if something is slow, then metrics and optimization are necessary.

                                                                                • hermanradtke 3 hours ago

                                                                                  In what cases is your join causing a table lock?

                                                                                • pier25 16 hours ago

                                                                                  > I can't stand ORMs, I don't get why people use it, please just write the SQL.

                                                                                  I used to agree until I started using a good ORM. Entity Framework on .NET is amazing.

                                                                                  • bob1029 15 hours ago

                                                                                    > Entity Framework on .NET is amazing.

                                                                                    I disagree. It is probably one of the less terrible ORMs, but it is far from amazing. The object-relational impedance mismatch will always dominate for anything that isn't trivial business. EF works great until you need different views of the model. It does support some kind of view mapping technique, but it's so much boilerplate I fail to see the point.

                                                                                    Dapper + SqlConnection is goldilocks once you get into the nasty edges. Being able to query a result set that always exactly matches your view models is pretty amazing. The idea of the program automagically upgrading and migrating the schemas is something that was interesting to me until I saw what you could accomplish with Visual Studio's SQL Compare tool & RedGate's equivalent. I feel a lot more comfortable running manual schema migrations when working with hosted SQL providers.

                                                                                    • cyral 15 hours ago

                                                                                      > It does support some kind of view mapping technique

                                                                                      Can you call .Select(entity => new SomeSmallerModel { Name = entity.Name }) or something like that to select what you need? If I am understanding your issue correctly.

                                                                                      I also agree that it's one of the least worst, but there are still things that annoy me.

                                                                                      • pier25 12 hours ago

                                                                                        > I feel a lot more comfortable running manual schema migrations

                                                                                        Me too. I use a DB-first approach. Then EF simply rebuilds the application models automatically with a single command.

                                                                                        • neonsunset 15 hours ago

                                                                                          > EF works great until you need different views of the model

                                                                                          You can easily project or use views with SQL then projected onto objects. It's very convenient with `.FromSql`:

                                                                                          https://learn.microsoft.com/en-us/ef/core/querying/sql-queri...

                                                                                        • tilne 16 hours ago

                                                                                          Doesn’t entity framework have a huge memory footprint too?

                                                                                          • homebrewer 6 hours ago

                                                                                            If you don't do stupid things like requesting everything from the database and then filtering data client side, then no.

                                                                                            We have one application built on .NET 8 (and contemporary EF) with about 2000 tables, and its memory usage is okay. The one problem it has is startup time: EF takes about a minute of 100% CPU load to initialize on every application restart, before it passes execution to the rest of your program. Maybe it is solvable, maybe not, I haven't yet had the time to look into it.

                                                                                            • tilne 5 hours ago

                                                                                              Yeah makes sense. I should have used my language more carefully: I have very little hands-on experience with entity framework, and thus don’t have objective reasons to say it has a “huge” memory footprint.

                                                                                              Does the startup time/CPU usage cost vary depending on the size of the dataset you’re interacting with?

                                                                                              • neonsunset 2 hours ago

                                                                                                Hmm, it should not be taking this much time unless you're running into an edge case - we have a fairly complex application that needs to pre-load a lot of data from persistence (no EFC though, it's an embedded DB) and over the wire at startup, and it still takes 15 seconds on a rainy day, mostly waiting for requests to finish. If you can gather a profile it will likely shed some light on why this takes place. Worst case, if you have a repro, it's worth to file an issue at https://github.com/dotnet/efcore. When you have time that is.

                                                                                              • neonsunset 15 hours ago

                                                                                                Do you have any links that note memory usage issues with any of the semi-recent EF Core versions?

                                                                                                • tilne 5 hours ago

                                                                                                  No. To be clear: I wasn’t trying to say it was bad. Just repeating what I had read in a (fairly old) .net book. Should have chosen my words more carefully.

                                                                                                  • neonsunset 3 hours ago

                                                                                                    To be fair, old Entity Framework was on the heavier side. Still much faster than e.g. ActiveRecord, but enough for Dapper to be made back then. The gap between them is mostly gone nowadays; .NET itself has become massively faster, and more efficient alternatives have been introduced since (Dapper AOT: its main goal is NAOT compatibility, but it also uses the opportunity to further streamline the implementation).

                                                                                                    • tilne 3 hours ago

                                                                                                      Thanks for the context. I’m new to the ecosystem, so it’s valuable to hear thoughts like these from people with more experience with it.

                                                                                            • ketzo 15 hours ago

                                                                                              Not 100% parallel, but I was debugging a slow endpoint earlier today in our app which uses Mongo/mongoose.

                                                                                              I removed a $lookup (the mongodb JOIN equivalent) and replaced it with, as Prisma does, two table lookups and an in-memory join

                                                                                              p90 response times dropped from 35 seconds to 1.2 seconds

                                                                                              • lelanthran 5 hours ago

                                                                                                > I removed a $lookup (the mongodb JOIN equivalent)

                                                                                                There is no "MongoDB JOIN equivalent" because MongoDB is not a relational database.

                                                                                                It's like calling "retrieve table results sequentially using previous table's result-set" a JOIN; it's not.

                                                                                                • nop_slide 13 hours ago

                                                                                                  Maybe because mongo isn’t ideal for relational data?

                                                                                                  • merek 9 hours ago

                                                                                                    I believe a lot of Mongo's criticisms come from people modelling highly relational data on a non-relational DB.

                                                                                                    • rwyinuse an hour ago

                                                                                                      I'm not sure what is the point of using MongoDB these days, when you can as easily store and query jsonb in postgres.

                                                                                                    • wredcoll 12 hours ago

                                                                                                      Does mongodb optimize joins at all? Do they even happen server side?

                                                                                                  • compton93 16 hours ago

                                                                                                    It is. But wait... it doesn't join the data at the level of your application. You have to deploy their proxy service, which joins the data at its own application level.

                                                                                                    • Tadpole9181 15 hours ago

                                                                                                      It's pretty obvious when somebody has only heard of Prisma, but never used it.

                                                                                                      - Using `JOIN`s (with correlated subqueries and JSON) has been around for a while now via a `relationLoadStrategy` setting.

                                                                                                      - Prisma has a Rust service that does query execution & result aggregation, but this is automatically managed behind the scenes. All you do is run `npx prisma generate` and then run your application.

                                                                                                      - They are in the process of removing the Rust layer.

                                                                                                      The JOIN setting and the removal of the middleware service are going to be defaults soon; they're just in preview.

                                                                                                      • compton93 15 hours ago

                                                                                                        They've been saying that for 3 years. We actually had a discount for being an early adopter. But hey its obvious Ive never used it and only heard of it.

                                                                                                        • Tadpole9181 15 hours ago

                                                                                                          The JOIN mode has been in preview for over a year and is slated for GA release within a few months. Which has been on their roadmap.

                                                                                                          The removal of the rust service is available in preview for Postgres as of 6.7.[1]

                                                                                                          Rewriting significant parts of a complex codebase used by millions is hard, and pushing it to defaults requires prolonged testing periods when the worst case is "major data corruption".

                                                                                                          [1]: https://www.prisma.io/blog/try-the-new-rust-free-version-of-...

                                                                                                          • compton93 14 hours ago

                                                                                                            They've had flags and work arounds for ages. Not sure what point you are trying to make? But like you said I've never used it, only heard of it lol.

                                                                                                            • williamdclt 7 hours ago

                                                                                                              Honestly, everything you say makes me want to stay far from prisma _more_.

                                                                                                              All this complexity, additional abstractions and indirections, with all the bugs, footguns and gotchas that come with it... when I could just type "JOIN" instead.

                                                                                                              • paulddraper 14 hours ago

                                                                                                                It is hard.

                                                                                                                Harder than just doing joins.

                                                                                                      • coverj 15 hours ago

                                                                                                        I didn't mind Prisma for managing the schema etc., but I've also seen your linked GitHub issue. I found other people recommend combining Prisma with Kysely. I have only used this in toy projects, so take this with a grain of salt.

                                                                                                        https://kysely.dev/ https://github.com/valtyr/prisma-kysely

                                                                                                        • frollogaston 16 hours ago

                                                                                                          I'm not the most experienced in huge DBs and can't write anything off, but I've never seen a horizontally sharded DBMS work well, even Citus which allegedly does. There's always been a catch that seems worse than manually doing sharding at a higher level than your DB, not that that's easy either.

                                                                                                          • caffeinated_me 16 hours ago

                                                                                                            I'd argue that horizontally sharded databases can work well, but they do tend to have significant non obvious tradeoffs that can be pretty painful.

                                                                                                            There's a handful of companies that have scaled Citus past 1PB for production usage, but the examples I'm aware of all had more engineering to avoid capability or architecture limitations than one might like. I'd love to see someone come back with a fresh approach that covered more use cases effectively.

                                                                                                            Disclaimer: former Citus employee

                                                                                                            • banashark 16 hours ago

                                                                                                              Vitess and planetscale seem to have quite a number of high profile users who have lauded its capabilities. A search through hn history pops up a few.

                                                                                                              As someone who has primarily worked with Postgres for relational concerns, I’ve envied the apparent robustness of the MySQL scaling solutions.

                                                                                                            • ScalaHanSolo 15 hours ago

                                                                                                              Author here. Yeah, that's not a bad takeaway either. I've also been really vocal in Prisma issues for all sorts of things. We are about to embark on a big migration away from Prisma and onto Drizzle once the Drizzle team lands 1.0.

                                                                                                              We will absolutely share our findings when that migration happens!

                                                                                                              • redcobra762 10 hours ago

                                                                                                                Yes, moving to a freshly 1.0 tool/library is often the best way to gain stability...

                                                                                                              • cess11 6 hours ago

                                                                                                                "Instead, it sends individual queries and joins the data on the application level. However, this doesn't mean that Prisma's relational queries are per se slower"

                                                                                                                Wow, what the fuck.

                                                                                                                "Also, this chapter about Query Performance Optimization from the High Performance MySQL book has some great insights. One of the techniques it mentions is JOIN decomposition:

                                                                                                                    Many high-performance web sites use join decomposition. You can decompose a join by running multiple single-table queries instead of a multitable join, and then performing the join in the application."
                                                                                                                
                                                                                                                This belief that they can write JavaScript that outperforms decades of bare-metal executed optimisations in mainstream database engines is just astonishing.
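                                                                                                              For anyone unfamiliar with the term, join decomposition trades one query for several plus app-side stitching. A hypothetical sketch (table names invented):

                                                                                                                  -- one round trip; the planner does the join
                                                                                                                  SELECT u.id, u.name, p.title
                                                                                                                  FROM users u JOIN posts p ON p.user_id = u.id
                                                                                                                  WHERE u.id = 42;

                                                                                                                  -- "decomposed": two queries, with the app stitching rows back together
                                                                                                                  SELECT id, name FROM users WHERE id = 42;
                                                                                                                  SELECT title FROM posts WHERE user_id = 42;
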
                                                                                                              • gilbetron 4 hours ago

                                                                                                                The answer to the question, "what database should I use?" is "postgres". If you are in a situation where postgres actually won't work, then you already would know exactly why postgres won't work.

                                                                                                                In other words: [Postgres -> exotic solution] is the path everyone should take (and 99% will just stay in postgres), and not [exotic solution -> postgres].

                                                                                                                • abraxas an hour ago

                                                                                                                Yes, the NoSQL fad that swept the industry was nearly as insufferable as the SPA craze that followed alongside. Now everyone's back to tried and true. Most data once more sits in an RDBMS and most HTML gets rendered on the server.

                                                                                                                  Us grizzled genX devs saw this coming a decade ago.

                                                                                                                • eftpotrm 7 hours ago

                                                                                                                  I can't help thinking more startups need greybeards around. (Of which, realistically, I'm now one.)

                                                                                                                Largest table 100 million rows and they were paying 6 figures for database services annually? I have one now that sits happily enough on an 8yo laptop. I've worked on systems with similar-scale tables chugging along on MSSQL 2000 boxes that were very average even 20 years ago. There just isn't a need for cloud-scale systems and cloud-scale bills for that data volume.

                                                                                                                  The problems they're describing should never have got that far without an experienced hand pointing out they didn't make sense, and if they'd hired that greybeard they'd have spotted it long before.

                                                                                                                  • abraxas 6 hours ago

                                                                                                                    > and they were paying 6 figures for database services annually?

                                                                                                                    Might have been monthly.

                                                                                                                    100,000,000 rows is what I handled on a single Sun server in 2001 with Sybase, no problemo.

                                                                                                                  • sreekanth850 14 hours ago

                                                                                                                It's wild and hilarious how often startups and companies go for distributed databases like CockroachDB/TiDB/Yugabyte before they actually need distribution; this trend sucks. 100 million rows is nothing that a well-tuned Postgres or MySQL instance (or even a read-replicated setup) can't handle comfortably. Scale when you hit the wall.

                                                                                                                    • simianwords 5 hours ago

                                                                                                                      I don't buy this! Startups do need high availability. If you start having replicas you are already in distributed territory!

                                                                                                                      • GaryNumanVevo 3 hours ago

                                                                                                                  It's much simpler to have a single-master, multi-replica setup than a multi-master one

                                                                                                                      • Spivak 14 hours ago

                                                                                                                        100M isn't much even for not super well tuned postgres.

                                                                                                                        • sreekanth850 14 hours ago

                                                                                                                          Yes, there are multiple steps to consider before jumping to a distributed database and only when you actually hit bottlenecks, like read replication, CQRS, etc. But I guess it's often just about chasing fancy stuff.

                                                                                                                      • robertlagrant 5 hours ago

                                                                                                                        For all the Prisma-haters: I salute you. But I want to reply to numerous comments with the following:

                                                                                                                        ORMs come in two main types, that I'm aware of: Active Record (named after the original Ruby one, I think) and Data Mapper (think Hibernate; SQLAlchemy).

                                                                                                                  Active Record ORMs are slightly more ergonomic at the cost of doing loads of work in application memory. Data Mapper ORMs look slightly more like SQL in your code but are much more direct wrappers over things you can do in SQL.

                                                                                                                        Data Mapper also lets you keep various niceties such as generating migration code, that stem from having your table definition as objects.

                                                                                                                        Use Data Mapper ORMs if you want to use an ORM.

                                                                                                                      • crowcroft 3 hours ago

                                                                                                                        Why does Postgres get so much love, and MySQL/MariaDB get nothing?

                                                                                                                        I'm assuming it's largely because Postgres has more momentum, and is much more extensible, but if you're just trying to do 'boring DB stuff' I find it's faster for most use cases. Development has slowed, but it would be hard to argue that it's not battle tested and robust.

                                                                                                                        • sgarland 3 hours ago

                                                                                                                          Because MySQL got a (rightfully so) bad rap before it adopted InnoDB as the default storage engine, and then tech influencers happened. I love Postgres, but I also love MySQL, and 99% of the time I see people gushing about Postgres, they aren’t using any features that MySQL doesn’t have.

                                                                                                                          The single biggest thing for MySQL that should be a huge attraction for devs without RDBMS administration experience is that MySQL, by and large, doesn’t need much care and feeding. You’re not going to get paged for txid wraparound because you didn’t know autovacuum wasn’t keeping up on your larger tables. Unfortunately, the Achilles heel of MySQL (technically InnoDB) is also its greatest strength: its clustering index. This is fantastic for range queries, IFF you design your schema to exploit it, and don’t use a non-k-sortable PK like UUIDv4. Need to grab every FooRecord for a given user? If your PK is (user_id, <some_other_id>) then congrats, they’re all physically colocated on the same DB pages, and the DB only has to walk the B+tree once from the root node, then it just follows a linked list.
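                                                                                                                    A rough sketch of the kind of schema that exploits this (hypothetical table, assuming InnoDB):

                                                                                                                        CREATE TABLE foo_records (
                                                                                                                          user_id   BIGINT UNSIGNED NOT NULL,
                                                                                                                          record_id BIGINT UNSIGNED NOT NULL,
                                                                                                                          payload   JSON,
                                                                                                                          -- the clustering key: one user's rows sit on adjacent leaf pages
                                                                                                                          PRIMARY KEY (user_id, record_id)
                                                                                                                        ) ENGINE=InnoDB;

                                                                                                                        -- one descent of the clustered B+tree, then a sequential scan of the leaves
                                                                                                                        SELECT * FROM foo_records WHERE user_id = 42;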

                                                                                                                          • abraxas an hour ago

                                                                                                                    To the contrary, when the PK has to be a B-tree it already ties my hands, because I can't have a good disk layout for, say, time series data, where I might use a lighter index like BRIN at the cost of somewhat slower queries but much better index update rates.

                                                                                                                            • sgarland an hour ago

                                                                                                                              I would not personally build a TSDB atop MySQL, though I worked at a place that did, and it worked OK. I don't remember their schema, though.

                                                                                                                              If I had to, I'd probably do something like this (haven't tested it beyond validating that it creates):

                                                                                                                                  mysql> SHOW CREATE TABLE ts\G
                                                                                                                                  *************************** 1. row ***************************
                                                                                                                                         Table: ts
                                                                                                                                  Create Table: CREATE TABLE `ts` (
                                                                                                                                    `metric_id` smallint unsigned NOT NULL,
                                                                                                                                    `ts` datetime NOT NULL,
                                                                                                                                    PRIMARY KEY (`metric_id`,`ts`)
                                                                                                                                  ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
                                                                                                                                  /*!50100 PARTITION BY RANGE (dayofmonth(`ts`))
                                                                                                                                  SUBPARTITION BY HASH (`metric_id`)
                                                                                                                                  SUBPARTITIONS 3
                                                                                                                                  (PARTITION p1 VALUES LESS THAN (2) ENGINE = InnoDB,
                                                                                                                                   PARTITION p2 VALUES LESS THAN (3) ENGINE = InnoDB,
                                                                                                                                   PARTITION p3 VALUES LESS THAN MAXVALUE ENGINE = InnoDB) */
                                                                                                                                  1 row in set (0.012 sec)
                                                                                                                              
                                                                                                                              Obviously there would be more partitions for different days (or whatever other date chunk you wanted – months, weeks, etc.), and the sub partitions number would depend on how many metrics you were tracking. You could also simplify this at the expense of more tables by having a table per metric.
                                                                                                                              • crowcroft an hour ago

                                                                                                                                Postgres is a lot more flexible so if you're making a TSDB, handling geospatial data etc. etc. it is usually better (not to say MySQL can't be used effectively for a lot of these use cases still).

                                                                                                                                I just see lots of people making CRUD web apps and choosing these new Postgres solutions, and that seems like the one thing that MySQL is almost always better at.

                                                                                                                            • ahartmetz 3 hours ago

                                                                                                                              Funny that it was the other way around 20 years ago. Everybody was using MySQL, but there were many blog posts and such about the looseness of MySQL with the SQL standard and other issues. And that people should use Postgres unless they need the replication features of MySQL. AFAIU, replication is still the main (good) reason to use MySQL, though there are some semi-proprietary(?) solutions for Postgres.

                                                                                                                              • crowcroft 3 hours ago

                                                                                                                                Yea, Postgres really came up when Node and Mongo did. During the PHP/RoR era MySQL was a very clear winner.

                                                                                                                                I still think MySQL is a better choice for most web apps due to performance, but more general use cases I can understand the debate.

                                                                                                                              • venantius 3 hours ago

                                                                                                                                I am not an expert, but I have worked somewhere MariaDB/MySQL was being used at scale.

                                                                                                                                My preference today for Postgres comes down to the fact that its query planner is much easier to understand and interface with, whereas MySQL/Maria would be fine 29/30 times but would then absolutely fuck you with an awful query plan that you needed a lot of experience with to anticipate.

                                                                                                                                • evanelias 30 minutes ago

                                                                                                                                  On the other hand, at least MySQL/MariaDB has built-in support for index hints. Postgres does not, and you can absolutely still get bitten by unexpected query plan changes. It's rarer than bad plans in MySQL, but it's worse when it happens in pg -- without index hint support, there's no simple built-in solution to avoid this.
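                                                                                                                        For reference, the MySQL hints look roughly like this (table and index names invented):

                                                                                                                            -- suggest an index to the optimizer
                                                                                                                            SELECT * FROM orders USE INDEX (idx_orders_created_at) WHERE created_at >= '2024-01-01';

                                                                                                                            -- or override its choice entirely
                                                                                                                            SELECT * FROM orders FORCE INDEX (idx_orders_created_at) WHERE created_at >= '2024-01-01' AND status = 'shipped';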

                                                                                                                              • etler 15 hours ago

                                                                                                                                I've lost count of how many "Migrating from X to Postgres" articles I've seen.

                                                                                                                                I don't think I've once seen a migrating away from Postgres article.

                                                                                                                                • delish 15 hours ago

                                                                                                                                  Related: Oxide's podcast, "Whither CockroachDB," which reflects on experience with postgres at Joyent, then the choice to use cockroach in response to prior experiences with postgres.

                                                                                                                                  https://www.youtube.com/watch?v=DNHMYp8M40k

                                                                                                                                  I'm trying to avoid editorializing in my above summary, for fear of mischaracterizing their opinions or the current state of postgres. Their use of postgres was 10 years ago, they were using postgres for a high-availability use case -- so they (and I) don't think "postgres bad, cockroach good." But like Bryan Cantrill says, "No one cares about your workload like you do." So benchmark! Don't make technical decisions via "vibes!"

                                                                                                                                  • yakkomajuri 4 hours ago

                                                                                                                                    I think your point still stands, and I'm a big Postgres advocate/user myself btw.

                                                                                                                                    But yeah we did migrate our _analytics_ data to ClickHouse (while still keeping Postgres for more transactional stuff) back when I was at PostHog.

                                                                                                                                    Writeup: https://posthog.com/blog/how-we-turned-clickhouse-into-our-e...

                                                                                                                                    • mplanchard an hour ago

                                                                                                                                      We also did this, using change data capture and kafka to stream data to clickhouse as it gets written to postgres.

                                                                                                                                      Clickhouse is incredible tech. We’ve been very pleased with it for OLAP queries, and it’s taken a lot of load off the postgres instance, so it can more easily handle the very high write load it gets subjected to.

                                                                                                                                    • betaby 15 hours ago
                                                                                                                                      • speed_spread 14 hours ago

                                                                                                                                        It's a very Uber thing to do to enter a one way from the wrong end.

                                                                                                                                      • psionides 15 hours ago

                                                                                                                                        Yeah so there's basically just that one ;)

                                                                                                                                        • hu3 13 hours ago
                                                                                                                                          • wredcoll 12 hours ago

                                                                                                                                            You're not wrong, but this is the tldr:

                                                                                                                                            > The DB used is PostgreSQL which is not used anywhere else at Yelp, which meant that only a small rotation of long-tenured employees knew Postgres well enough to do outage response. This caused issues in maintenance, visibility, and outage response times. The teams working on the Restaurants products are not infra teams, and the Yelp-wide infra teams (understandably) focus on Yelp-standard infrastructure. As a result, when we did see issues with Postgres it was often a scramble to find people with relevant expertise.

                                                                                                                                            > So, we switched out this DB in-place with a Yelp-standard MySQL DB.

                                                                                                                                      • evanelias 10 hours ago

                                                                                                                                        Not an article, and I have no direct knowledge of this either way, but I would strongly suspect that Instagram migrated off Postgres a while back. Probably to fb-mysql + myrocks, or some other RocksDB based solution.

                                                                                                                                        The compression level is vastly superior to any available Postgres-based solution, and at Instagram’s scale it amounts to extremely compelling hardware cost savings.

                                                                                                                                        Also if they were still primarily on pg, it would be one of the largest pg deployments in existence, and there would be obvious signs of the eng impact of that (conference talks, FOSS contributions, etc).

                                                                                                                                        Bigger-picture: Postgres is an amazing database, and it’s often the right choice, but nothing in tech is always the best choice 100% of the time. There’s always trade-offs somewhere.

                                                                                                                                        • rakejake 10 hours ago

                                                                                                                                          Probably a corollary of the fact that most usecases can be served by an RDBMS running on a decently specced machine, or on different machines by sharding intelligently. The number of usecases for actual distributed DBs and transactions is probably not that high.

                                                                                                                                          • sa46 14 hours ago

                                                                                                                                            I helped with the initial assessment for a migration from Postgres with Citus to SingleStore.

                                                                                                                                            https://www.singlestore.com/made-on/heap/

                                                                                                                                            • notTooFarGone 11 hours ago

                                                                                                                            We migrated from Postgres to ADX based on a cost analysis of the managed version on Azure.

                                                                                                                            Now we have lovely KQL queries and are pretty much starting over with Postgres again...

                                                                                                                                              • yen223 14 hours ago

                                                                                                                                                I have participated in a Postgres -> Clickhouse migration, but I haven't bothered writing an article about it.

                                                                                                                                                • I_am_tiberius 13 hours ago

                                                                                                                            The entire database? Isn't that very limiting due to slow write speeds in ClickHouse? I saw CH more as a DB mainly for read activities.

                                                                                                                                                  • jacobsenscott 13 hours ago

                                                                                                                                                    CH excels at extremely high volume writes. You probably can't throw enough data at it.

                                                                                                                                                    • I_am_tiberius 13 hours ago

                                                                                                                                                      Sorry, meant writes in terms of update/delete.

                                                                                                                                                      • nasretdinov 2 minutes ago

                                                                                                                                                        Many use cases can be addressed by using ReplacingMergeTree and such, so even when you need updates you can get them relatively cheaply
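                                                                                                                              A minimal sketch of that pattern (made-up table): an "update" is just an insert with a higher version, and duplicates collapse during background merges or at read time with FINAL.

                                                                                                                                  CREATE TABLE events
                                                                                                                                  (
                                                                                                                                      id      UInt64,
                                                                                                                                      payload String,
                                                                                                                                      version UInt64
                                                                                                                                  )
                                                                                                                                  ENGINE = ReplacingMergeTree(version)
                                                                                                                                  ORDER BY id;

                                                                                                                                  -- "update" row 1 by inserting a newer version
                                                                                                                                  INSERT INTO events VALUES (1, 'new payload', 2);

                                                                                                                                  -- read the latest surviving version per key
                                                                                                                                  SELECT * FROM events FINAL WHERE id = 1;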

                                                                                                                                                        • hodgesrm 11 hours ago

                                                                                                                                                          There are many use cases that do not alter data. Observability is a canonical example.

                                                                                                                                                  • dev_l1x_be 9 hours ago

                                                                                                                                                    Roughly the same count as migrating from Postgres to X.

                                                                                                                                                  • optician_owl 6 hours ago

                                                                                                                              It's an awkward article. To answer why a query is slow you need a bit more detail than just the query text. Also, I reread the part about timeouts and didn't get it: what was the database, was it a database issue, and how was it related to the migration?

                                                                                                                              The only information I could extract was that the company made bad architectural decisions, believes in ORMs (looking at the queries, there are serious doubts that the data layouts in the DB are adequate), and cannot clearly explain situations. But this is only interesting to their candidates or investors.

                                                                                                                                                    It may sound rude, so I apologise.

                                                                                                                                                    • from-nibly 16 hours ago

                                                                                                                              Feels like postgres is always the answer. I mean, there's gotta be some edge case somewhere where postgres just can't begin to compete with other, more specialized databases, but I'd think that going from postgres to something else is much easier than the other way around.

                                                                                                                                                      • graealex 10 hours ago

                                                                                                                                                        It's not like PostgreSQL hasn't been in development for close to 30 years, covering basically every use case imaginable just through millions of deployments.

                                                                                                                                                        In addition, SQL in itself is a proven technology. The reality is that most problems you might think about solving with specialized databases (Big Data TM etc) could probably easily be solved with your run-of-the-mill RDBMS anyway, if more than five minutes are spent on designing the schema. It's extremely versatile, despite just being one layer above key-value storage.

                                                                                                                                                        • jacobsenscott 13 hours ago

                                                                                                                                PG requires a lot of expertise to keep running when you get to a billion rows or massive ingest. It can do it, but it doesn't just do it out of the box running the defaults.
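                                                                                                                                For example, big hot tables usually end up with per-table autovacuum settings along these lines (values are illustrative, not a recommendation):

                                                                                                                                    -- visit this table far more often than the default 20%-dead-rows threshold
                                                                                                                                    ALTER TABLE events SET (
                                                                                                                                        autovacuum_vacuum_scale_factor  = 0.01,
                                                                                                                                        autovacuum_analyze_scale_factor = 0.02,
                                                                                                                                        autovacuum_vacuum_cost_limit    = 2000
                                                                                                                                    );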

                                                                                                                                                          • graealex 10 hours ago

                                                                                                                                                            Hopefully at 1B records, you have a business model that allows you to spend some money on either hardware or talent to solve this problem.

                                                                                                                                                            • sgarland 2 hours ago

                                                                                                                                                              Unfortunately, most places go with hardware first, so the problem grows larger and larger. When they do finally hire someone who knows how to address it, it’s a monstrous undertaking that dev teams are reluctant to do, because it’s a lengthy pause on feature development (so really, PMs are reluctant), and also future velocity is typically somewhat slower – turns out proper RDBMS data modeling is rigid, and doesn’t suffer fools.

                                                                                                                                                          • 999900000999 15 hours ago

                                                                                                                                                            Depends.

                                                                                                                                If you want to fully embrace the vibe, tables are difficult.

                                                                                                                                                            Even before LLMs, I was at a certain company that preferred MongoDB so we didn’t need migrations.

                                                                                                                                                            Sometimes you don’t care about data structure and you just want to toss something up there and worry about it later.

                                                                                                                                                            Postgres is the best answer if you have a solid team and you know what you’re doing.

                                                                                                                                  If you want to ride solo and get something done fast, Firebase and its NoSQL cousins might be easier.

                                                                                                                                                            • pojzon 14 hours ago

                                                                                                                                                              I really enjoy this comment.

                                                                                                                                                              > Postgres is the best answer if you have a solid team and you know what you’re doing.

                                                                                                                                  Not every type of data simply fits into the relational model.

                                                                                                                                                              Example: time series data.

                                                                                                                                                              So depending on your model - pick your poison.

                                                                                                                                                              But for relational models, there is hardly anything better than postgres now.

                                                                                                                                  It makes me happy coz I always rooted for the project from the early 2000s.

                                                                                                                                                              • 999900000999 an hour ago

                                                                                                                                                                Don't get me wrong, Postgres is awesome when things work.

                                                                                                                                  But, for example, I was working on a .NET project and Entity Framework decided it couldn't migrate Postgres tables correctly.

                                                                                                                                                                I'm not a SQL person, at this point my options are to drop tables, and let .net recreate them or try and write my own migrations.

                                                                                                                                                                This just isn't an issue with Firebase. I can add all the fields I want directly on the client.

                                                                                                                                                                • freilanzer 6 hours ago

                                                                                                                                                                  Even for timeseries there is https://github.com/timescale/timescaledb. Haven't used it, just knew it existed.

                                                                                                                                                                  • abraxas an hour ago

                                                                                                                                                                    It's very good. Postgres by itself can handle a very high volume of inserts (I did over 100,000 rows/s on very modest hardware). But timescale makes it easier to deal with that data. It's not strictly necessary but it's very time series friendly (good compression, good indexing and partitioning etc). Nothing a pg expert can't accomplish with a vanilla postgres but very, very handy.
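                                                                                                                                    The ergonomics are roughly this (a sketch from memory; check the Timescale docs for the exact arguments):

                                                                                                                                        -- turn a plain table into a hypertable chunked on the time column
                                                                                                                                        SELECT create_hypertable('metrics', 'ts');

                                                                                                                                        -- enable native compression and compress chunks older than a week
                                                                                                                                        ALTER TABLE metrics SET (timescaledb.compress);
                                                                                                                                        SELECT add_compression_policy('metrics', INTERVAL '7 days');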

                                                                                                                                                                    • mplanchard an hour ago

                                                                                                                                                                      I haven’t tried timescale, but I have found postgres with time-based partitions works very well for timeseries data. Unless you’ve got really heavy indexes, the insert speed is phenomenal, like you said, and you’ve got the freedom to split your partitions up into whatever size buckets makes the most sense for your ingestion and query patterns.

                                                                                                                                                                      A really nice pattern has been to use change data capture and kafka to ship data off to clickhouse for long-term storage and analytics, which allows us to simply drop old partitions in postgres after some time.
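                                                                                                                                    For anyone curious, the vanilla-Postgres version of that setup is just declarative partitioning (illustrative names):

                                                                                                                                        CREATE TABLE events (
                                                                                                                                            ts      timestamptz NOT NULL,
                                                                                                                                            payload jsonb
                                                                                                                                        ) PARTITION BY RANGE (ts);

                                                                                                                                        CREATE TABLE events_2024_05 PARTITION OF events
                                                                                                                                            FOR VALUES FROM ('2024-05-01') TO ('2024-06-01');

                                                                                                                                        -- retention is dropping an old partition: no bulk DELETE, no vacuum debt
                                                                                                                                        DROP TABLE events_2024_01;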

                                                                                                                                                                      • abraxas 18 minutes ago

                                                                                                                                                                        I think timescale will compress them heavily on your schedule so if that's acceptable to your use case you might be able to do away with clickhouse. Hard to say of course, without knowing details around your insertion and query patterns, retention requirements and aggregations you need. But timescale can do a lot of that with pretty straightforward syntax.

                                                                                                                                                                • SchemaLoad 13 hours ago

                                                                                                                                    What situations do you encounter where you don't care about the structure of the data? The only ones I've ever encountered have been logging, where its only purpose is to be manually text-searchable, and something like OpenStreetMap, where everything is just a key-value store and the structure is loosely community-defined.

                                                                                                                                                                  As soon as you have a loosely defined object you can't access any specific keys which makes it useless for 99% of times you want to store and retrieve data.

                                                                                                                                                                  • 999900000999 13 hours ago

                                                                                                                                                                    You define the data schema client side.

                                                                                                                                                                    That's the entire idea behind Firebase. It makes prototyping much faster. I don't know how well it scales, but it works for most smaller projects.

                                                                                                                                                                    • graealex 10 hours ago

                                                                                                                                                                      Wait until you hear about ORMs.

                                                                                                                                                                      • 999900000999 36 minutes ago

                                                                                                                                      I still have to create tables. I still have to migrate tables. These are all things I don't need to worry about with Firebase.

                                                                                                                                                                        It all depends on what you need to actually do. The only real weakness of Firebase is the Google lock in.

                                                                                                                                                                • mdaniel 14 hours ago

                                                                                                                                                                  There's a gist that shows up in these threads https://gist.github.com/cpursley/c8fb81fe8a7e5df038158bdfe0f...

                                                                                                                                                                  But while digging that up it seems there is one with more colors: https://postgresforeverything.com/

                                                                                                                                                                  And one for the AI crowd https://github.com/dannybellion/postgres-is-all-you-need#pos...

                                                                                                                                                                  • coolcase 15 hours ago

                                                                                                                                                                    I hear MySQL can be better for some workloads?

                                                                                                                                                                    • JohnCClarke 4 hours ago

                                                                                                                                                                      <Not intending to start a flamewar...> MySQL shines for simple OLTP. My PostgreSQL friends all have "paged at 3am to vacuum the database" war stories. I like simple and dumb for transactional data and MySQL delivers.

                                                                                                                                                                      • graealex 10 hours ago

                                                                                                                                                                        Till about yesterday, MySQL was little more than a CSV file on wheels, and even current popularity is mostly driven by two decades of LAMP.

                                                                                                                                      You can be sure that PostgreSQL will be applicable to any workload that MySQL can handle, but not necessarily the other way round, so if you actually have the freedom to make a choice, go with PostgreSQL.

                                                                                                                                      In particular, because PostgreSQL has a lot more features, people assume that the other one, lacking those features and the associated complexity, must automatically be faster. That isn't true, either in general or in special cases, since it can go either way - your particular query might just run 4x faster on PostgreSQL. There is also no universal way to benchmark this, since every database and query will have completely different characteristics.

                                                                                                                                                                        • evanelias 40 minutes ago

                                                                                                                                                                          > Till about yesterday, MySQL was little more than a CSV file on wheels

                                                                                                                                                                          That's utter nonsense. What are you even referring to here?

                                                                                                                                                                          InnoDB is a rock-solid OLTP storage engine, and it's been MySQL's default since 2010. A very large percentage of the global economy relies on InnoDB, and has for quite some time.

                                                                                                                                                                    • moonikakiss 16 hours ago

                                                                                                                                        Great blog. It seems like you might benefit from columnar storage in Postgres for that slow query that took ~20 seconds.

                                                                                                                                                                      It's interesting that people typically think of columnstores for strict BI / analytics. But there are so many App / user-facing workloads that actually need it.

                                                                                                                                        ps: we're working on pg_mooncake v0.2: a columnstore in Postgres that's always consistent with your OLTP tables.

                                                                                                                                                                      It might help for this workload.

                                                                                                                                                                      • I_am_tiberius 15 hours ago

                                                                                                                                          That sounds awesome. Are you saying you still use your normal OLTP table for writing data and the columnstore table is always in sync with that OLTP table (that's fantastic)? I read it works with DuckDB - how does it work? I guess there's no chance this is going to be available on Azure Flexible Server anytime soon.

                                                                                                                                                                        • moonikakiss 14 hours ago

                                                                                                                                            exactly. we take the CDC output / logical decoding from your OLTP tables and write it into a columnar format with sub-second freshness.

                                                                                                                                                                          We had to design this columnstore to be 'operational' so it can keep up with changing oltp tables (updates/deletes).

                                                                                                                                                                          You'll be able to deploy Mooncake as a read-replica regardless of where your Postgres is. Keep the write path unchanged, and query columnar tables from us.

                                                                                                                                                                          --- v0.2 will be released in preview in ~a couple weeks. stay tuned!

                                                                                                                                                                          • I_am_tiberius 14 hours ago

                                                                                                                                                                            Ah, I see. So there's a replication process similar to ClickHouse's MaterializedPostgres. Ideally, there would be functionality allowing a columnstore query to wait until all writes to the OLTP tables — up to the query's execution time — are available. This would make the system truly Postgres-native and address issues that no other system currently solves.

                                                                                                                                                                            • moonikakiss 13 hours ago

                                                                                                                                                                              yep exactly. we can wait for replay LSN. So you're only reading once all writes to OLTP are complete.

                                                                                                                                                                          • I_am_tiberius 14 hours ago

                                                                                                                                                                            A follow up question: You can't join columnar tables with OLTP tables, right?

                                                                                                                                                                            • moonikakiss 14 hours ago

Yes, you can. Even if the columnar tables are in the read replica, you'll be able to do joins with your OLTP tables.

                                                                                                                                                                              • I_am_tiberius 14 hours ago

                                                                                                                                                                                That's great, thanks.

                                                                                                                                                                          • compton93 16 hours ago

What are your thoughts on Fujitsu's VCI? I typically work on ERPs, but I'm always advocating for offloading the right queries to columnar DBs (not for DB performance but for end-user experience).

                                                                                                                                                                          • sgarland 17 hours ago

                                                                                                                                                                            It is forever enraging to me that ORMs turn SELECT * into each individual column, mostly because people then post the whole thing and it’s obnoxiously large.

                                                                                                                                                                            Similarly maddening, the appalling lack of normalization that is simply taken for granted. “It’s faster, bro.” No, no, it is not. Especially not at the hundreds of millions or billions of rows scale. If you store something low-cardinality like a status column, with an average length of perhaps 7 characters, that’s 8 bytes (1 byte overhead assumed, but it could be 2). Multiply that by 2 billion rows, and you’re wasting 16 GB. Disk is cheap, but a. Memory isn’t b. Don’t be lazy. There’s a right way to use an RDBMS, and a wrong way. If you want a KV store, use a damn KV store.
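
(A minimal sketch of the normalization being argued for here; table and column names are invented.)

    -- Low-cardinality status values live once in a lookup table...
    CREATE TABLE statuses (
        id   smallint PRIMARY KEY,
        name text NOT NULL UNIQUE
    );

    -- ...and the big table stores a 2-byte key instead of ~8 bytes of text per row.
    CREATE TABLE events (
        id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        status_id smallint NOT NULL REFERENCES statuses (id)
    );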

                                                                                                                                                                            Finally, I’d be remiss if I failed to point out that Prisma is an unbelievably immature organization who launched without the ability to do JOINS [0]. They are forever dead to me for that. This isn’t “move fast and break things,” it’s “move fast despite having zero clue what we’re doing but convince JS devs that we do.”

                                                                                                                                                                            [0]: https://github.com/prisma/prisma/discussions/19748

                                                                                                                                                                            • bastawhiz 17 hours ago

                                                                                                                                                                              > ORMs turn SELECT * into each individual column

This is a safety feature. If my code expects columns A, B, and C, but the migration to add C hasn't run yet and I'm doing something that would otherwise `SELECT *`, my query should fail. If the ORM _actually_ does `SELECT *` I'll get back two columns instead of three and things can get spooky and bad real fast (unless the ORM manually validates the shape of the query response every time, which will come with a real runtime cost). If there are columns that the ORM doesn't know about, you could end up with _far more_ data being returned from the database, which could just as easily cause plenty of spooky issues—not the least of which being "overwhelming your network by flooding the client connections with data the application doesn't even know exists".

                                                                                                                                                                              • to11mtm 2 hours ago

                                                                                                                                                                                > (unless the ORM manually validates the shape of the query response every time, which will come with a real runtime cost).

Semi-related to this: the ORM explicitly specifying columns ensures that the shape is consistent, which both makes for a faster parse of the rows coming back and, again, eliminates surprises for the parser.

                                                                                                                                                                              • compton93 16 hours ago

I worked for a startup that did all of these things on CockroachDB. We could have used a single m5.xlarge PostgreSQL instance (1000 basic QPS on 150GB of data) if we had optimized our queries and gone back to basics; instead we had 1TB of RAM dedicated to Cockroach.

I added about 4 indexes and halved the resources overnight. But Prisma, SELECT *, GraphQL, and whatever other resume-building shit people implemented were the bane of my existence; typically engineers did this believing it would be faster. I remember one engineer got a standing ovation in Slack for his refactor that was supposedly going to save us $$$$$, except our DB CPU went up 30% because he decided to validate every company every second in every session. In his defense, he added one line of code that caused it, and it was obscured through Prisma and GraphQL into an inefficient query.

                                                                                                                                                                                FWIW; I love CockroachDB but the price is directly linked to how much your software engineers shit on the database.

                                                                                                                                                                                • reissbaker 16 hours ago

                                                                                                                                                                                  Eh, I've run applications on RDBMSes with multi-billion-row tables, and I've never found normalization/denormalization to be particularly impactful on performance except for in a few rare cases. The biggest impact came from sensible indexing + query patterns. Normalization vs denormalization had a big impact on convenience, though (not always favoring one way or the other!).

But I'm no fan of Prisma either. Drizzle has its own pain points (e.g. sequential numbers for its auto-generated migrations mean annoying merge conflicts if multiple people iterate on the schema at the same time), but it's much better than Prisma at sticking close to the metal and allowing good query performance and table design.

                                                                                                                                                                                  • rrr_oh_man 12 hours ago

                                                                                                                                                                                    > I've never found normalization/denormalization to be particularly impactful on performance

                                                                                                                                                                                    Really?

                                                                                                                                                                                  • amazingamazing 17 hours ago

I don't disagree with your point, but over-normalization and joining everywhere also isn't necessarily the answer, even with an index. There's no easy answer to this; it really depends on the performance characteristics the critical user journeys need.

With a little pain, if I had to pick an extreme, I'd pick extreme normalization with materialized views that are queried directly (i.e. no joins at query time), rather than joining all of the time.

                                                                                                                                                                                    • sroussey 16 hours ago

I typically go for 3rd normal form, and selectively denormalize where it has true performance value.

                                                                                                                                                                                      • spudlyo 15 hours ago

                                                                                                                                                                                        “Normalize ’til it hurts, denormalize ’til it works.”

                                                                                                                                                                                        • rrr_oh_man 12 hours ago

                                                                                                                                                                                          Just saying this feels like it elevates you above 80% of people who work with databases.

                                                                                                                                                                                        • williamdclt 7 hours ago

                                                                                                                                                                                          > materialized views

                                                                                                                                                                                          how though? Postgres doesn't support auto-refreshing materialized views when the underlying data changes (with good reasons, it's a really hard problem)
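
(Sketch of what Postgres does offer: a materialized view is a snapshot you refresh yourself, e.g. from cron; names are illustrative.)

    CREATE MATERIALIZED VIEW order_totals AS
    SELECT customer_id, sum(amount) AS total
    FROM orders
    GROUP BY customer_id;

    -- A unique index is required for CONCURRENTLY, which lets reads continue
    -- while the snapshot is rebuilt.
    CREATE UNIQUE INDEX ON order_totals (customer_id);
    REFRESH MATERIALIZED VIEW CONCURRENTLY order_totals;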

                                                                                                                                                                                        • HappyJoy 16 hours ago

                                                                                                                                                                                          Spelling out columns can help the query optimizer too

                                                                                                                                                                                          • thr0w 16 hours ago

                                                                                                                                                                                            > Disk is cheap, but a. Memory isn’t

                                                                                                                                                                                            This isn't said enough.

                                                                                                                                                                                          • compton93 16 hours ago

                                                                                                                                                                                            I'm curious about Motion's experience with "Unused Indices". They suggest Cockroach's dashboard listed used indexes in the "Unused Indices" list.

I think the indexes they suspected were being used really were unused, but Motion didn't realize CockroachDB was doing zigzag joins on other indexes to accomplish the same thing, leaving the indexes that would obviously be used genuinely unused.

It's a great feature, but CRDB's optimizer would prefer a zigzag join over a covering index; getting around this required writing indexes in a way that persuades the optimizer not to plan a zigzag join.
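
(A hedged sketch of that workaround: make one index cover the whole query so the optimizer has no reason to combine two narrower indexes with a zigzag join. Table and column names are invented; CockroachDB spells the covering clause STORING, Postgres 11+ spells it INCLUDE.)

    -- CockroachDB:
    CREATE INDEX ON tasks (workspace_id, status) STORING (title, due_at);

    -- Postgres equivalent:
    CREATE INDEX ON tasks (workspace_id, status) INCLUDE (title, due_at);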

                                                                                                                                                                                            • CoolCold 7 hours ago

                                                                                                                                                                                              > First, the support portal is a totally different website that doesn’t share auth with the main portal. Second, you have to re-input a lot of data they already know about you (cluster ID, etc). And by the time they respond it’s typically been a week.

I was about to ask what the main constraint for CockroachDB was - e.g. iostats and atop info for CPU/disk drives - but realized that is probably something offloaded to some SaaS, so I'm still curious.

                                                                                                                                                                                              • frollogaston 16 hours ago

                                                                                                                                                                                                Did I miss something, or does the article not mention anything about sharding in Postgres? Was that just not needed?

                                                                                                                                                                                                Also, query planner maturity is a big deal. It's hard to get Spanner to use the indexes you want.

                                                                                                                                                                                                • monkeyelite 16 hours ago

                                                                                                                                                                                                  There are probably fewer than 100 websites that couldn’t be a single Postgres instance on nice server hardware, with good caching.

                                                                                                                                                                                                  • maz1b 9 hours ago

What makes you say that? AFAIK, the largest single dedicated servers you can buy on the market go up to hundreds of cores and terabytes of RAM, plus around a PB of storage if you stack NVMe/SSD/HDD as well - at least as of when I last checked.

                                                                                                                                                                                                    • monkeyelite an hour ago

                                                                                                                                                                                                      Yes. What do you have in mind?

                                                                                                                                                                                                    • ScalaHanSolo 15 hours ago

                                                                                                                                                                                                      Yeah, this is our read with Postgres here at Motion. I believe that Motion will easily be able to 10x on modern hardware along with various optimizations along the way.

                                                                                                                                                                                                      • kevml 15 hours ago

                                                                                                                                                                                                        Not everything on the internet is a “website” and then there are several website hosting platforms that aggregate the individual concerns.

                                                                                                                                                                                                        • monkeyelite 15 hours ago

                                                                                                                                                                                                          All true points. I guess I just want to hear more about why they think sharding is important to them.

                                                                                                                                                                                                    • mmiao 15 hours ago

A 100 million row table is fairly small and you just don't need a distributed database. But you will need one if you hit 10 billion rows.

                                                                                                                                                                                                      • mplanchard 44 minutes ago

                                                                                                                                                                                                        Depends on what you’re doing with them. We’ve currently got a postgres DB with >100b rows in some tables. Partitioning has been totally adequate so far, but we’re also always able to query with the partition keys as part of the filters, so it is easy for the query planner to do the right thing.

                                                                                                                                                                                                        • jacobsenscott 13 hours ago

                                                                                                                                                                                                          You can partition that over 20 or 30 or more tables on one PG instance and have good performance - assuming a good partitioning key exists. If you need to query all 10B rows you'll have a bad day though.
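
(Rough sketch of that pattern, assuming an invented events table partitioned by a key every query filters on, so the planner can prune partitions.)

    CREATE TABLE events (
        id         bigint NOT NULL,
        created_at timestamptz NOT NULL,
        payload    jsonb
    ) PARTITION BY RANGE (created_at);

    CREATE TABLE events_2025_05 PARTITION OF events
        FOR VALUES FROM ('2025-05-01') TO ('2025-06-01');

    -- The created_at filter lets the planner touch only the matching partition.
    SELECT count(*) FROM events
    WHERE created_at >= '2025-05-01' AND created_at < '2025-05-08';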

                                                                                                                                                                                                        • coolcase 15 hours ago

                                                                                                                                                                                                          Why not optimise the bad queries first?

Aside: the job section says not 9-5. What does that mean? Long hours? Or a not-9-to-5 attitude?

                                                                                                                                                                                                          • ScalaHanSolo 15 hours ago

                                                                                                                                                                                                            Author here. Optimizing bad queries was absolutely part of the issues with the performance. The issue with cockroach was that the visibility into those bad queries was not great. It wasn't until we had the superior tooling from the Postgres ecosystem that we were able to track them down more efficiently.

                                                                                                                                                                                                            • compton93 15 hours ago

When you get a chance, can you take a look at my reply here: https://news.ycombinator.com/item?id=43990502

When I first stepped into a DBA role with CockroachDB I was confused why indexes we obviously needed were listed as unused. It wasn't until I did an EXPLAIN on the queries that I learned the planner was doing zig-zag joins instead.

                                                                                                                                                                                                          • hobs 17 hours ago

It still makes me sad when half the queries I see are json_* - I know it's far too late, but a big sad trombone in query performance is constantly left joining to json_* calls that the planner is going to estimate at 100 rows forever.

                                                                                                                                                                                                            • bastawhiz 16 hours ago

                                                                                                                                                                                                              If the queries are sensible, you can always create indexes that index on the queried expressions.

                                                                                                                                                                                                              https://www.postgresql.org/docs/current/indexes-expressional...
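
(Small sketch of such an expression index, with invented names; besides enabling index scans, ANALYZE also gathers statistics on the indexed expression, which helps the row estimates mentioned above.)

    -- Note the extra parentheses around the expression.
    CREATE INDEX orders_status_idx ON orders ((payload->>'status'));

    -- This predicate can now use the index:
    SELECT count(*) FROM orders WHERE payload->>'status' = 'shipped';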

                                                                                                                                                                                                              • panzi 16 hours ago

Not sure why those are json_agg() instead of array_agg() in that example. Why would you use a JSON array instead of a native, properly typed array? Yes, if you have complex objects for some reason you can use JSON objects, but those were all just arrays of IDs. Also, why was it json_agg() and not jsonb_agg()? Is there any reason to use JSON over JSONB in PostgreSQL?
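
(For reference, the three aggregates side by side on an invented orders table.)

    SELECT user_id,
           array_agg(order_id) AS ids_native,  -- properly typed bigint[]/int[]
           jsonb_agg(order_id) AS ids_jsonb,   -- binary JSONB array
           json_agg(order_id)  AS ids_json     -- plain-text JSON array
    FROM orders
    GROUP BY user_id;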

                                                                                                                                                                                                                • renhanxue 15 hours ago

                                                                                                                                                                                                                  If you, for whatever obscure reason, need to preserve whitespace and key ordering, that is you want something that is effectively just a text column, then you should use JSON over JSONB.

                                                                                                                                                                                                                  I can't think of any case at all, no matter how contrived, where you'd want to use the non-B versions of the JSON aggregate functions though.

                                                                                                                                                                                                                  • paulryanrogers 15 hours ago

The non-B JSON can take up less space on disk and costs less at write time.

                                                                                                                                                                                                                    • panzi 3 hours ago

I wonder whether, when it's just used in a query like that, the generated data structure is only in memory anyway and then serialized to JSON to send to the client - i.e. JSON vs. JSONB would not make any difference?
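
(One way to check the size question for a concrete value: json keeps the original text, jsonb stores a parsed binary form, and for small values either one can come out larger.)

    SELECT pg_column_size('{"ids": [1, 2, 3]}'::json)  AS json_bytes,
           pg_column_size('{"ids": [1, 2, 3]}'::jsonb) AS jsonb_bytes;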

                                                                                                                                                                                                                • NegativeLatency 17 hours ago

Hoping for easier columnar support in databases, which is one of the things that can lead you to storing JSON in database columns (if your data is truly columnar).

Currently, vendor lock-in or the requirements for installing plugins make it hard to do with cloud SQL providers. It's especially hard since, by the time it's a problem, you're probably at enough scale that switching DBs/vendors is hard or impossible.

                                                                                                                                                                                                                  • moonikakiss 14 hours ago

                                                                                                                                                                                                                    great point.

with pg_mooncake v0.2 (launching in ~a couple of weeks), you'll be able to get a columnar copy of your Postgres that's always synced (sub-second freshness).

                                                                                                                                                                                                                    Keep your write path unchanged, and keep your Postgres where it is. Deploy Mooncake as a replica for the columnar queries.

                                                                                                                                                                                                                    • hobs 17 hours ago

How does columnar = JSON? JSON isn't columnar at all... If you just want to have a schema in JSON instead of SQL, use a NoSQL DB; Postgres' NoSQL features are strong, but its relational DB features are actually much stronger.

                                                                                                                                                                                                                  • vivzkestrel 13 hours ago

Did you try using the native pg library, or the postgres or pg-promise libraries, and scrapping the ORM completely to see what effect it has? If you're looking explicitly for migrations, you can simply use node-pg-migrate https://www.npmjs.com/package/node-pg-migrate and scrap the rest of the FLUFF that ORMs come with. ORMs in general are horribly bloated, and their performance for anything more than select from table where name = $1 is very questionable.

                                                                                                                                                                                                                    • Inviz 16 hours ago

WHERE CONDITION AND 1=1 results in scanning the whole table? I don't think so...

                                                                                                                                                                                                                      • niwtsol 17 hours ago

                                                                                                                                                                                                                        That was an interesting read, seemed like an overwhelming amount of data for why they should move off cockroach. All of my db work has been read heavy and I’ve never had a need for super fast multi-region writes. Is a multi-region write architecture possible in Postgres? I’m trying to understand if GDPR was the requirement that resulted in cockroach or if the lackluster multi region write was the bigger driver.

                                                                                                                                                                                                                        • sgarland 17 hours ago

                                                                                                                                                                                                                          There are multi-master Postgres options like BDR (I think it’s since renamed; whatever EnterpriseDB calls it now), yes. Most people don’t need it, even if they think they do, and they also usually are in no way capable of dealing with the operational complexity it involves.

                                                                                                                                                                                                                          If you’ve ever administered Postgres at scale, multiply it by 10. That’s what dealing with multi-master is like. It’s a nightmare.

                                                                                                                                                                                                                          • sroussey 16 hours ago

                                                                                                                                                                                                                            Most people don’t need multi-region read architecture for that matter. SaaS app devs at 5 person companies really want to do “Facebook” scale problems.

                                                                                                                                                                                                                            • ketzo 15 hours ago

                                                                                                                                                                                                                              I think this is kinda reductive. If you’ve got users all over the world, then global DB replicas are a sizable performance improvement, whether you’ve got 1 engineer or 1,000.

                                                                                                                                                                                                                              • sgarland 3 hours ago

                                                                                                                                                                                                                                Yes, global read replicas can be helpful, though the replica lag needs to be thoughtfully considered.

                                                                                                                                                                                                                                IMO, one of the worst offenders is places with microservices, who absolutely do not need microservices, and are doing reads-after-writes.

                                                                                                                                                                                                                                For every RDBMS I know of other than MySQL (even MariaDB has this), you can tack a RETURNING clause onto the write, and get the entire row back. This shouldn’t be necessary for most cases anyway, because by definition, you have (almost) all of the row data already. The only times you wouldn’t is DB-generated values like an auto-incrementing ID, UUID from the DB, or a timestamp from the DB. MySQL can actually handle the first case; the cursor holds LAST_INSERT_ID. So if you had a monolith, it’s trivial to pass what you wrote to the next function that needs it – boom, no concern about stale data, and one fewer DB round trip.

                                                                                                                                                                                                                                Even with separate services, this is still possible, though somewhat slower, but I’ve not seen anyone do it.
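
(Sketch of the RETURNING pattern described above, with an invented tasks table: the write itself hands back the DB-generated values, so there's no read-after-write against a possibly lagging replica.)

    INSERT INTO tasks (title)
    VALUES ('write the post-mortem')
    RETURNING id, created_at;

    UPDATE tasks SET status = 'done' WHERE id = 42
    RETURNING *;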

                                                                                                                                                                                                                        • badmonster 9 hours ago

                                                                                                                                                                                                                          indeed interesting

                                                                                                                                                                                                                          • liampulles 9 hours ago

                                                                                                                                                                                                                            I wonder increasingly with tools like ChatGPT whether ORMs make sense anymore? The argument I've always heard for ORMs is that they make it quick and easy to make the initial repository method, and that they make migrating to a new DB easier (hypothetically).

                                                                                                                                                                                                                            It seems to me LLMs now help fill both those roles, but with the benefit that you can now tune the SQL as needed. I also always prefer actually knowing exactly what queries are going to my DB, but YMMV.