Upgrading Uber's MySQL Fleet (uber.com) — submitted by benocodes 5 hours ago
  • remon 2 hours ago

    Impressive numbers at a glance, but that boils down to ~140 qps, which is one to two orders of magnitude below what you'd typically expect a normal MySQL node to serve. Obviously average execution time is mostly a function of query complexity, but based on Uber's business I can't really see what sort of non-normative queries they'd run at volume (e.g. for their customer-facing apps). Uber's infra runs on Amazon AWS afaik, and even taking some level of volume discount into account, they're burning many millions of USD on some combination of overcapacity and suboptimal querying/caching strategies.

    • aseipp an hour ago

      Dividing the fleet QPS by the number of nodes is completely meaningless, because it assumes that queries are distributed evenly across every part of the system and that every part of the system is uniform (e.g. it is unclear what the read/write patterns are, what proportion of these nodes are read replicas or hot standbys, or whether their sizing and configuration are the same). That isn't realistic at all. I would guess it is extremely likely that hot subsets of these clusters, depending on the use case, see anywhere from 1 to 4 orders of magnitude higher QPS than your estimate, probably on a near-constant basis.

      Don't get me wrong, a lot of people have talked about Uber doing overengineering in weird ways, maybe they're even completely right. But being like "Well, obviously x/y = z, and z is rather small, therefore it's not impressive, isn't this obvious?" is the computer programming equivalent of the "econ 101 student says supply and demand explain everything" phenomenon. It's not an accurate characterization of the system at all.

      • Twirrim 18 minutes ago

        They're not on AWS. They use on-prem and are migrating to Google and Oracle clouds.

        https://www.forbes.com/sites/danielnewman/2023/02/21/uber-go...

        • Jgrubb 2 hours ago

          See, the problem is that the people who care about cost performance and the people who care about UX performance are rarely the same people, and often neither side is empowered with the data or experience they need to bridge the gap.

          • bushbaba an hour ago

            Hardware is cheap relative to salaries. It might take 1 engineer 1 quarter to optimize. Compare that to a few thousand per server.

            • sgarland 37 minutes ago

              It might take an engineer with no prior RDBMS knowledge a quarter to be able to optimize a DB for their use case, but then it’s effectively free. You found the optimal parameters to use for writer nodes? Great, roll that out to the fleet.

          • nunez 2 hours ago

            Didn't realize their entire MySQL data layer runs in AWS. Given that they went with basically a blue-green update strategy, this was essentially a "witness our cloud spend" kind of post.

            • pocket_cheese 2 hours ago

              They're not. Almost all of their infra was on prem when I worked there 3 years ago.

              • remon 2 hours ago

                It's neither. I remember them moving to the cloud, but apparently they moved to Google/Oracle (the latter making this article particularly interesting, btw). As per the relevant press release: "It’s understood that Uber will close down its own on-premises data centers and move the entirety of its information technology workloads to Oracle and Google Cloud."

          • remon 2 hours ago

            It's sort of funny how you can immediately tell it's LLM sanitized/rewritten.

            • jdbdndj 2 hours ago

              It reads like any of those tech blogs, using big words where not strictly necessary but also not wrong

              Don't know about your LLM feeling

              • est31 2 hours ago

                It contains the word "delve", a word that got way more popular in use since the introduction of LLMs.

                Also, this paragraph sounds a lot like it was written by an LLM; it's over-expressive:

                    We systematically advanced through each tier, commencing from tier 5 and descending to tier 0. At every tier, we organized the clusters into manageable batches, ensuring a systematic and controlled transition process. Before embarking on each stage of the version upgrade, we actively involved the on-call teams responsible for each cluster, fostering collaboration and ensuring comprehensive oversight.
                
                The paragraph uses "commencing from" together with "descending to". People would probably write something like "starting with". It shows how the LLM has no spatial understanding: tier 0 is not below or above tier 5, especially as the text has not introduced any such spatial ordering previously. And it gets worse: there is no prior mention of the word "tier" in the blog post. The earlier text speaks of stages and lists 5 steps (without giving them any name, though the standard term would be "step" rather than "tier").

                There are more signs, like "embark", or that specific use of "fostering collaboration", which goes beyond corporate-speak; it also sounds a lot like what an LLM would say. Apparently "safeguard" is also a word LLMs write very often.

                • wongarsu an hour ago

                  It doesn't get much better if you translate that paragraph from corpo speak to normal language: "We did the upgrade step by step. We did each step in batches. After we had already decided how we were going to upgrade the clusters, but before actually doing it, we asked the teams responsible for keeping the clusters running for their opinion. This helped create an environment where we work together and helped us monitor the process"

                  I'm sure there are people who write like that. LLMs have to get it from somewhere. But that part especially is mostly empty phrases, and the meaning that is there isn't all that flattering

                  • zx76 2 hours ago

                    Relevant pg thread on twitter: https://x.com/paulg/status/1777030573220933716

                  • maeil 2 hours ago

                    This [1] is a good piece on it. Here's [2] another good one.

                    We don't just carry out a MySQL upgrade, oh no. We embark on a significant journey. We don't have reasons, but compelling factors. And then, we use compelling again soon after when describing how "MySQL v8.0 offered a compelling proposition with its promise of substantial performance enhancements", just as any human meatbag would.

                    [1] https://www.latimes.com/socal/daily-pilot/opinion/story/2024...

                    [2] https://english.elpais.com/science-tech/2024-04-25/excessive...

                    • remon 2 hours ago

                      Nah this isn't a big word salad issue. The content is fine. It's just clearly a text written by humans and then rewritten by an LLM, potentially due to the original author(s) not being native speakers. If you feel it's natural English that's fine too ;)

                      • exe34 2 hours ago

                        I always thought 90% of what management wrote/said could be replaced by an RNN, and nowadays LLMs do even better!

                      • aprilthird2021 2 hours ago

                        Let's delve into why you think that

                        • fs0c13ty00 2 hours ago

                          It's simple. Human writing is short and to the point (either because they're lazy or want to save the reader's time), yet still manages to capture your attention. AI writing tends to be too elaborate and lacks a sense of "self".

                          I feel like this article challenges my patience and attention too much; there is really no need to focus on the pros of upgrading here. We readers just want to know how they managed to upgrade at that scale, the challenges they faced, and how they solved them. Not to mention any sane tech writer who values their time wouldn't write this much.

                          • peppermint_gum 29 minutes ago

                            >It's simple. Human writing is short and to the point (either because they're lazy or want to save the reader's time), yet still manages to capture your attention. AI writing tends to be too elaborate and lacks a sense of "self".

                            Corporate (and SEO) writing has always been overly verbose and tried to sound fancy. In fact, this probably is where LLMs learned that style. There's no reliable heuristic to tell human- and AI-writing apart.

                            There's a lot of worry about people being fooled by AI fakes, but I'm also worried about false positives, people seeing "AI" everywhere. In fact, this is already happening in the art communities, with accusations flying left and right.

                            People are too confident in their heuristics. "You are using whole sentences? Bot!" I fear this will make people simplify their writing style to avoid the accusations, which won't really accomplish anything, because AIs can already be prompted to avoid the default word-salad style.

                            I miss the time before LLMs...

                            • vundercind 2 hours ago

                              > Not to mention any sane tech writers that value their time wouldn't write this much.

                              This is a big part of why the tech is so damn corrosive, even in well-meaning use, let alone its lopsided benefits for bad actors.

                              Even on the “small” and more-private side of life, it’s tempting to use it to e.g. spit out a polished narrative version of your bullet-point summary of your players’ last RPG session, but then do you go cut it back down to something reasonable? No, by that point it’s about as much work as just writing it yourself in the first place. So the somewhat-too-long version stands.

                              The result is that the temptation to generate writing that wasn’t even worth someone’s time to write—which used to act as a fairly effective filter, even if it could be overcome by money—is enormous. So less and less writing is worth the reader’s time.

                              As with free long distance calls, sometimes removing friction is mostly bad.

                              • remon 2 hours ago

                                This. Thank you for verbalizing what I struggled to.

                              • Starlevel004 an hour ago

                                every section is just a list in disguise, and gpts LOVE lists

                              • msoad 2 hours ago

                                Yeah, I kinda stopped reading when I felt this. Not sure why; the substance is still interesting and worth learning from, but knowing an LLM wrote it made me feel a little icky.

                                • greenavocado an hour ago

                                  Scroll to the bottom to see a list of those who claimed to have authored it

                                • l5870uoo9y 2 hours ago

                                  AI has a preference for dividing everything into sections, especially "Introduction" and "Conclusion" sections.

                                • whalesalad 5 hours ago

                                  So satisfying to do a huge upgrade like this and then see the actual proof in the pudding with all the reduced latencies and query times.

                                  • hu3 4 hours ago

                                    Yeah some numbers caught my attention like ~94% reduction in overall database lock time.

                                    And to think they never have to worry about VACUUM. Ahh the peace.

                                    • InsideOutSanta 4 hours ago

                                      As somebody who has always used MySQL, but always been told that I should be using Postgres, I'd love to understand what the issues with VACUUM are, and what I should be aware of when potentially switching databases?

                                      • mjr00 2 hours ago

                                        Worth reading up on Postgres' MVCC model for concurrency.[0]

                                        Short version is that VACUUM is needed to clean up dead tuples and reclaim disk space. For most cases with smaller amounts of data, auto-vacuum works totally fine. But I've had issues with tables with 100m+ rows that are frequently updated where auto-vacuum falls behind and stops working completely. These necessitated a full data dump + restore (because we didn't want to double our storage capacity to do a full vacuum). We fixed this by sharding the table and tweaking auto-vacuum to run more frequently, but this isn't stuff you have to worry about in MySQL.

                                        Honestly if you're a small shop without database/postgres experts and MySQL performance is adequate for you, I wouldn't switch. Newer versions of MySQL have fixed the egregious issues, like silent data truncation on INSERT by default, and it's easier to maintain, in my experience.

                                        [0] https://www.postgresql.org/docs/current/mvcc-intro.html
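
                                        For concreteness, the kind of per-table autovacuum tweaking described above is done with storage parameters. A sketch (table name and values hypothetical, not from the original post):

```sql
-- Hypothetical per-table autovacuum tuning for a large, frequently-updated table.
-- A scale factor of 0 plus a fixed threshold makes autovacuum trigger after a set
-- number of dead rows instead of a percentage of 100M+ rows; a lower cost delay
-- lets the worker do more I/O per cycle so it can keep up.
ALTER TABLE big_hot_table SET (
    autovacuum_vacuum_scale_factor = 0.0,
    autovacuum_vacuum_threshold    = 100000,
    autovacuum_vacuum_cost_delay   = 2
);
```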

                                        • williamdclt an hour ago

                                          As much as I have gripes with the autovac, I'm surprised at the idea of getting to such a broken state. 100M rows is not small, but not huge; how frequent is "frequent updates"? How long ago was that (there have been a lot of changes in autovac since v9)?

                                          “Stops working completely” should not be a thing, it could be vacuuming slower than the update frequency (although that’d be surprising) but I don’t know of any reason it’d just stop?

                                          That being said I’ve also had issues with autovac (on aurora to be fair, couldn’t say if it was aurora-specific) like it running constantly without vacuuming anything, like there was an old transaction idling (there wasn’t)

                                          • sgarland 27 minutes ago

                                            On decently-sized tables (100,000,000 is, as you say, not small but not huge), if you haven’t tuned cost limiting and/or various parameters for controlling autovacuum workers, it’s entirely possible for it to effectively do nothing, especially if you’re in the cloud with backing disks that have limited IOPS / throughput.

                                            It continues to baffle me why AWS picks some truly terrible defaults for parameter groups. I understand most of them come from the RDBMS defaults, but AWS has the luxury of knowing precisely how many CPUs and RAM any given instance has. On any decently-sized instance, it should allocate far more memory for maintenance_work_mem, for example.

                                            • mjr00 28 minutes ago

                                              It's been a while, but IIRC it was on pg12. "Stopped working completely" I'm basing on the vacuum statistics saying the last auto-vacuum started weeks ago for these tables and never actually finished. Frequent updates means regularly rewriting 10 million rows (at various places) throughout the table. I also should mention that there were 100+ materialized views built off this table which I'm sure had an impact.

                                              In any case, this got resolved but caused a huge operational headache, and isn't something that would have been a problem with MySQL. I feel like that's the main reason VACUUM gets hated on; all of the problems with it are solvable, but you only find those problems by running into them, and when you run into them on your production database it ends up somewhere between "pain in the ass" and "total nightmare" to resolve.

                                            • InsideOutSanta 2 hours ago

                                              Thanks for that, that's valuable information.

                                            • evanelias an hour ago

                                              For an in-depth read on the differences in MVCC implementations, this post is pure gold: https://www.cs.cmu.edu/~pavlo/blog/2023/04/the-part-of-postg...

                                              • djbusby 4 hours ago

                                                VACUUM and VACUUM FULL (and/or with ANALYZE) can lock tables for a very long time, especially when the table is large. The incantation may also require 2x the space of the table being operated on. In short: it's slow.

                                                • sgarland 3 hours ago

                                                  `VACUUM` (with or without `ANALYZE`) on its own neither locks tables nor requires additional disk space; it's what the autovacuum daemon is doing. `VACUUM FULL` does both, as it does a tuple-by-tuple rewrite of the entire table.

                                                  • gomoboo 3 hours ago

                                                    pg_repack gets rid of the need to lock tables for the duration of the vacuum: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Appen...

                                                    It is an extension though so downside there is it not being included in most Postgres installs. I’ve used it at work and it felt like a superpower getting the benefits of a vacuum full without all the usual drama.

                                                    • take-five 2 hours ago

                                                      pg_repack can generate a lot of WAL, so much that standby servers can fall behind too much and never recover.

                                                      We've been using https://github.com/dataegret/pgcompacttable to clean up bloat without impacting stability/performance as much as pg_repack does.

                                                    • williamdclt an hour ago

                                                      Only FULL takes a serious lock (normal vacuum only takes a weak lock preventing things like other vacuums or table alterations iirc).

                                                      Aside: I wish Postgres forced to make explicit the lock taken. Make me write “TAKE LOCK ACCESS EXCLUSIVE VACUUM FULL my_table”, and fail if the lock I take is too weak. Implicit locks are such a massive footgun that have caused countless incidents across the world, it’s just bad design.
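
                                                      Postgres won't force you to name the lock, but you can at least bound how long you'll wait for one. A common mitigation, sketched with a hypothetical table:

```sql
-- Fail fast instead of queueing behind live traffic while waiting for
-- the ACCESS EXCLUSIVE lock this DDL implicitly takes.
SET lock_timeout = '2s';
ALTER TABLE my_table ADD COLUMN note text;
```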

                                                      • luhn 21 minutes ago

                                                        `TAKE LOCK ACCESS EXCLUSIVE VACUUM FULL` is just an incantation that will be blindly copy-pasted. I don't see how it would stop anyone from shooting themselves in the foot.

                                                      • cooljacob204 an hour ago

                                                        This is sorta mitigated by partitioning or sharding though right?

                                                        Too bad it's sorta annoying to do on plain old pg.

                                                      • tomnipotent 2 hours ago

                                                        MySQL stores table data in a b+ tree where updates modify the data directly in place as transactions are committed, and overwritten data is moved to a secondary undo log to support consistent reads. MySQL indexes store primary keys and queries rely on tree traversal to find the row in the b+ tree, but it can also contain references to rows in the undo log.

                                                        PostgreSQL tables are known as heaps, which consist of slotted pages where new data is written to the first page with sufficient free space. Since the heap is not a b-tree, you can't resolve a row with just a primary key without a table scan, so Postgres uses the physical location of the row, called a tuple ID (TID, or item pointer), which contains the page and position (slot) of the row within that page. The TID (10, 3) tells Postgres the row is in block 10, slot 3, which can be fetched directly from the page buffer or disk without a tree traversal.

                                                        When PostgreSQL updates a row, it doesn’t modify the original data directly. Instead, it:

                                                          1) Writes a new version of the row to a new page
                                                          2) Marks the old row as outdated by updating its tuple header and relevant page metadata
                                                          3) Updates the visibility map to indicate that the page contains outdated rows
                                                          4) Adjusts indexes to point to the new TID of the updated row
                                                        
                                                        This means that indexes need to be updated even if the column value didn't change.

                                                        Old rows continue to accumulate in the heap until the VACUUM process permanently deletes them, but this process can impact normal operations and cause issues.

                                                        Overall this means Postgres does more disk I/O for the same work as MySQL. The upside is Postgres doesn't have to worry about page splits, so things like bulk inserts can be much more efficient.
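
                                                        You can watch this happen via the `ctid` system column. A sketch (table and values illustrative only; exact TIDs depend on the page layout):

```sql
-- Each UPDATE writes a new row version, visible as a new tuple ID;
-- the old version lingers in the heap until VACUUM reclaims it.
CREATE TABLE t (id int PRIMARY KEY, v text);
INSERT INTO t VALUES (1, 'a');
SELECT ctid FROM t WHERE id = 1;   -- e.g. (0,1)
UPDATE t SET v = 'b' WHERE id = 1;
SELECT ctid FROM t WHERE id = 1;   -- e.g. (0,2): a new version in a new slot
```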

                                                        • sgarland 18 minutes ago

                                                          > The upside is Postgres doesn't have to worry about page splits, so things like bulk inserts can be much more efficient.

                                                          Not in the heap, but if you have any index on the table (I know, don’t do that for bulk loads, but many don’t / it isn’t feasible sometimes) then you’re still dealing with a B+tree (probably).

                                                          Also, MySQL still gets the nod for pure bulk load speed via MySQLShell’s Parallel Import Utility [0]. You can of course replicate this in Postgres by manually splitting the input file and running multiple \COPY commands, but having a tool do it all in one is lovely.

                                                          [0]: https://dev.mysql.com/doc/mysql-shell/8.0/en/mysql-shell-uti...

                                                          • InsideOutSanta an hour ago

                                                            That's a perfect explanation, thank you very much!

                                                        • brightball 3 hours ago

                                                          There are always tradeoffs.

                                                          • anonzzzies 4 hours ago

                                                            Yeah, until vacuum is gone, i'm not touching postgres. So many bad experiences with our use cases over the decades. I guess most people don't have our uses, but i'm thinking Uber does.

                                                            • RedShift1 4 hours ago

                                                              Maybe just vacuum much more aggressively? Also there have been a lot of changes to the vacuuming and auto vacuuming process these last few years, you can pretty much forget about it.

                                                              • anonzzzies 3 hours ago

                                                                Not in our experience; for our cases it is still a resource hog. We discussed it less than a year ago with core devs and with a large Postgres consultancy; they said Postgres doesn't fit our use case, which was already our own conclusion, no matter how much we want it to. MySQL is smooth as butter. I have nothing to gain from picking MySQL other than that it works; I'd rather use Postgres for its features (and for not being Oracle), but...

                                                                Edit: also, as can be seen in the responses here and elsewhere on the web when this is discussed, the fans say it's no problem, but many less religious users feel it's a massive design flaw (perfectly logical at the time, not so logical now) that sometimes stops people from using it, which is a shame

                                                                • yeswecatan 3 hours ago

                                                                  What is your use case?

                                                                  • anonzzzies 2 hours ago

                                                                    We have 100000s of tables per database (and 1000s of databases); think sensor/IoT data with some magic sauce that none of our competitors offer, and heavy on changes. And yes, maybe it's the wrong tool for the job (is it, though, if it works without hiccups?), but migrating would be severe, so we would only attempt that if we were 100% sure it would work and the end result would be cheaper; remember, we are talking decades here, not a startup. MySQL has been taking this without any issues for decades (including the rapid growth of the past decade), while far smaller setups with Postgres have been really painful, all because of vacuum. We were on Postgres in 1999 when we ran many millions of records through it, but that was when we could do a full vacuum at night without anyone noticing. The internet grew a little bit, so that's not possible anymore. Vacuum improved too, like everyone says here, and I'm not spreading the gospel or whatever; it's just that fans (...what other word is there?) blindly stating it can handle loads 'now' that they never considered is, well, weird.

                                                                    • dhoe an hour ago

                                                                      I'd generally call this amount of tables an antipattern; doing this basically implies that there's information stored in the table names that should be in rows instead, like IDs etc. But I'll admit that sensor-related use cases have a tendency to stress the system in unusual ways, which may have forced this design.

                                                              • leishman 3 hours ago

                                                                Postgres 17 tremendously improves vacuum performance

                                                                • mannyv 3 hours ago

                                                                  Vacuuming is a design decision that may have been valid back in the day, but is really a ball and chain today.

                                                                  In a low-resource environment deferring work makes sense. But even in low-resource environment the vacuum process would consume huge amounts of resources to do its job, especially given any kind of scale. And the longer it's deferred the longer the process will take. And if you actually are in a low-resource environment it'll be a challenge to have enough disk space to complete the vacuum (I'm looking at you, sunos4) - and don't even talk about downtime.

                                                                  I don't understand how large pgsql users handle vacuuming in production. Maybe they just don't do it and let the disk usage grow unbounded, because disk space is cheap compared to the aggravation of vacuuming?

                                                                  • wongarsu 2 hours ago

                                                                    You run VACUUM often enough that you never need a VACUUM FULL. A normal VACUUM doesn't require any exclusive locks or a lot of disk space, so usually you can just run it in the background. Normally autovacuum does that for you, but at scale you transition to running it manually at low traffic times; or if you update rows a lot you throw more CPUs at the database server and run it frequently.

                                                                    Vacuuming indices is a bit more finicky with locks, but you can just periodically build a new index and drop the old one when it becomes an issue

                                                                    • sgarland an hour ago

                                                                      People not realizing you can tune autovacuum on a per-table basis is the big one. Autovacuum can get a lot done if you have enough workers and enough spare RAM to throw at them.

                                                                      For indices, as you mentioned, doing either a REINDEX CONCURRENTLY (requires >= PG12), or a INDEX CONCURRENTLY / DROP CONCURRENTLY (and a rename if you’d like) is the way to go.

                                                                      In general, there is a lot more manual maintenance needed to keep Postgres running well at scale compared to MySQL, which is why I’m forever upset that Postgres is touted as the default to people who haven’t the slightest clue nor the inclination to do DB maintenance. RDS doesn’t help you here, nor Aurora – maintenance is still on you.
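
                                                                      The concurrent swap mentioned above, with hypothetical names (on PG 12+ a single `REINDEX INDEX CONCURRENTLY my_idx` achieves the same):

```sql
-- Rebuild without blocking reads/writes, then swap names.
CREATE INDEX CONCURRENTLY my_idx_new ON my_table (my_col);
DROP INDEX CONCURRENTLY my_idx;
ALTER INDEX my_idx_new RENAME TO my_idx;
```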

                                                              • tomnipotent 2 hours ago

                                                                MySQL indexes can contain references to rows in the undo log, and MySQL has a periodic VACUUM-like process to remove those references, though it's nowhere near as impactful.

                                                            • m4r1k 4 hours ago

                                                              Uber's collaboration with Percona is pretty neat. The fact that they've scaled their operations without relying on Oracle's support is a testament to the expertise and vision of their SRE and SWE teams. Respect!

                                                              • tiffanyh 3 hours ago

                                                                Aren't they using Percona in lieu of Oracle?

                                                                So it's kind of the same difference, no?

                                                              • sandGorgon 3 hours ago

                                                                So how does an architecture like "2100 clusters" work? Do the write APIs go to the database that contains the relevant data?

                                                                How is this done? A user would have history, payments, etc. Are all of them colocated in one cluster (which would mean the sharding is based on user ID)?

                                                                Is there then a database router service that routes the db query to the correct database?

                                                                • ericbarrett an hour ago

                                                                  A query for a given item goes to a router*, as you said, that directs it to a given shard which holds the data. I don't know Uber's schema, but usually the data is "denormalized" and you are not doing too many JOINs etc. Probably a caching layer in front as well.

                                                                  If you think this sounds more like a job for a K/V store than a relational database, well, you'd be right; this is why e.g. Facebook moved to MyRocks. But MySQL/InnoDB does a decent job and gives you features like write guarantees, transactions, and solid replication, with low write latency and no RAFT or similar nondeterministic/geographically limited protocols.

                                                                  * You can also structure your data so that the shard is encoded in the lookup key so the "routing" is handled locally. Depends on your setup
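                                                                  For intuition, here's a toy sketch (in Python, entirely hypothetical, not Uber's actual scheme) of both approaches: a hash-based router function, and encoding the shard in the lookup key so routing is handled locally:

```python
# Toy sketch of key-based shard routing. Hash the user id to pick
# one of N shards, so the "router" is just a pure function any
# client can run locally. Shard count and key format are made up.
import hashlib

NUM_SHARDS = 4  # hypothetical number of clusters

def shard_for(user_id: str) -> int:
    """Map a user id deterministically to a shard."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def routed_key(user_id: str) -> str:
    """Encode the shard in the key itself, as the footnote suggests,
    so later lookups need no separate router service."""
    return f"shard{shard_for(user_id)}:{user_id}"

# All of a user's rows (history, payments, ...) land on one shard,
# since the shard depends only on the user id.
print(routed_key("user-42"))
```

                                                                  With the shard embedded in the key, any client can route a lookup without asking a separate service; the trade-off is that resharding means rewriting keys.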

                                                                  • bob1029 3 hours ago

                                                                    I imagine it works just like any multi-tenant SaaS product wherein you have a database per customer (region/city) with a unified web portal. The primary difference being that this is B2C and the ratio of customers per database is much greater than 1.

                                                                  • denysonique 3 hours ago

                                                                    Why didn't they move to MariaDB instead? It's a drop-in replacement that's faster than MySQL 8.

                                                                    • evanelias an hour ago

                                                                      While it is indeed often faster, it isn't drop-in. MySQL and MariaDB have diverged over the years, and each has some interesting features that the other lacks.

                                                                      I wrote a summary of the DDL / table design differences between MySQL and MariaDB, and that topic alone is fairly long: https://www.skeema.io/blog/2023/05/10/mysql-vs-mariadb-schem...

                                                                      Another area with major differences is replication, especially when moving beyond basic async topologies.

                                                                    • candiddevmike 5 hours ago

                                                                      Does Uber still use Docstore? I'd imagine having built an effectively custom DB on top of MySQL made this upgrade somewhat inconsequential for most apps.

                                                                      • geitir 2 hours ago

                                                                        Yes

                                                                      • xyst 3 hours ago

                                                                        I wonder if an upgrade like this would be less painful if the db layer was containerized?

                                                                        The migration process they described would be less painful with k8s. Especially with 2100+ nodes/VMs

                                                                        • __turbobrew__ 2 minutes ago

                                                                          I can tell you that k8s starts to have issues once you get over 10k nodes in a single cluster. There has been some work in 1.31 to improve scalability but I would say past 5k nodes things no longer “just work”: https://kubernetes.io/blog/2024/08/15/consistent-read-from-c...

                                                                          The current bottleneck appears to be etcd, boltdb is just a crappy data store. I would really like to try replacing boltdb with something like sqlite or rocksdb as the data persistence layer in etcd but that is non-trivial.

                                                                          You also start seeing issues where certain k8s operators do not scale either, for example cilium cannot scale past 5k nodes currently. There are fundamental design issues where the cilium daemonset memory usage scales with the number of pods/endpoints in the cluster. In large clusters the cilium daemonset can be using multiple gigabytes of ram on every node in your cluster. https://docs.cilium.io/en/stable/operations/performance/scal...

                                                                          Anyways, the TL;DR is that at this scale (16k nodes) it is hard to run k8s.

                                                                          • meesles an hour ago

                                                                            A pipe dream. Having recently interacted with a modern k8s operator for Postgres, it lacked support for many features that had been around for a long time. I'd be surprised if MySQL's operators are that much better. Also consider the data layer, which is going to need to be solved regardless. Of course at Uber's scale they could write their own, I guess.

                                                                            At that point, if you're reaching in and scripting your pods to do what you want, you lose a lot of the benefits of convention and reusability that k8s promotes.

                                                                            • remon 2 hours ago

                                                                              Their entire setup seems somewhat suspect. I can't think of any technical justification for needing 21k instances for their type of business.

                                                                              • zemo 2 hours ago

                                                                                upgrade clients and testing the application logic, changes to the queries themselves as written, the process of detecting the regression and getting MySQL patched by percona, changes to default collation ... all of these things have nothing to do with whether the instances are in containers and whether the containers are managed by k8s or not.

                                                                                • shakiXBT 2 hours ago

                                                                                  running databases (or any stateful application, really) on k8s is a mess, especially at that scale

                                                                                • tiffanyh 5 hours ago

                                                                                  Why upgrade to v8.0 (old LTS) and not v8.4 (current LTS)?

                                                                                  Especially given that end-of-support is only 18 months from now (April 2026) … when end-of-support of v5.7 is what drove them to upgrade in the first place.

                                                                                  https://en.m.wikipedia.org/wiki/MySQL

                                                                                  • hu3 4 hours ago

                                                                                    The upgrade initiative started somewhere in 2023 according to the article.

                                                                                    MySQL 8.4 was released on April 30, 2024.

                                                                                    Their criteria for a "battle tested" MySQL version is probably much more rigorous than the average CRUD shop.

                                                                                    • paulryanrogers 4 hours ago

                                                                                      Considering several versions of 8.0 had a crashing bug if you renamed a table, waiting is probably the right choice.

                                                                                  • johannes1234321 4 hours ago

                                                                                    Since a direct upgrade to 8.4 isn't supported, they had to go to 8.0 first.

                                                                                    Also: 8.0 is old and most issues have been found. 8.4 probably has more unknowns.

                                                                                    • pizza234 4 hours ago

                                                                                      I suppose they opted for a conservative upgrade policy, as v8.4 probably includes all the functional additions/changes of the previous v8.1+ versions, and moving to it would have been a very big step.

                                                                                      MySQL is very unstable software - hopefully that's now in the past - and it's very reasonable to take the smallest upgrade steps possible.

                                                                                      • hu3 8 minutes ago

                                                                                        > MySQL is very unstable software

                                                                                        I've worked on 20+ projects using MySQL over my consulting career. Not once was stability a concern. Banking clients would even routinely shut down random MySQL nodes in production to ensure things continued running smoothly.

                                                                                        As I'm sure users like Uber and Youtube would agree. And these too: https://mysql.com/customers

                                                                                        Unless you know something we don't and we've all just been lucky.

                                                                                      • PedroBatista 4 hours ago

                                                                                        MySQL 8 and beyond have been riddled with bugs and performance regressions; it was a huge rewrite from 5.7.

                                                                                        8 has nice features and I think they evaluated it as stable enough to upgrade their whole fleet to it. I'm pretty sure from 8 to 8.4 the upgrades will be much simpler.

                                                                                        • cbartholomew 4 hours ago

                                                                                          They started in 2023. v8.0 was the current LTS when they started.

                                                                                          • EVa5I7bHFq9mnYK 4 hours ago

                                                                                            With all the migration code already written and experience gained, I imagine upgrading 8->8.4 would take 1/10 of effort of 5.7->8.0.

                                                                                            • tallanvor 4 hours ago

                                                                                              According to the article they started the project in 2023. Given that 8.4 was released in April 2024, that wasn't even an option when they started.

                                                                                              • gostsamo 5 hours ago

                                                                                                > Several compelling factors drove our decision to transition from MySQL v5.7 to v8.0:

                                                                                                Edit: for the downvoters, the parent comment was initially a question.

                                                                                              • donatj 3 hours ago

                                                                                                Interestingly we just went through basically the same upgrade just a couple days ago for similar reasons. We run Amazon Aurora MySQL and Amazon is finally forcing us to upgrade to 8.0.

                                                                                                We ended up spinning up a secondary fleet and binlog-replicating from our 5.7 master to the to-be 8.0 master until everything made the switchover.

                                                                                                I was frankly surprised it worked, but it did. It went really smoothly.
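                                                                                                For anyone who hasn't done this dance: on self-managed MySQL the cutover described above looks roughly like the following sketch (hostnames and credentials are placeholders; Aurora wraps the same mechanics in its own tooling):

```sql
-- On the to-be 8.0 primary: follow the 5.7 primary's binlog.
-- (CHANGE MASTER TO still works in 8.0; newer versions prefer
-- CHANGE REPLICATION SOURCE TO.)
CHANGE MASTER TO
  MASTER_HOST = 'old-57-primary.example.internal',
  MASTER_USER = 'repl',
  MASTER_PASSWORD = '...',
  MASTER_AUTO_POSITION = 1;  -- GTID-based positioning
START SLAVE;

-- Watch replication lag until it reaches zero:
SHOW SLAVE STATUS\G

-- Then: stop writes on 5.7, wait for the replica to catch up,
-- promote the 8.0 side, and repoint the application.
```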

                                                                                                • takeda 2 hours ago

                                                                                                  AFAIK 8.0 is the release where Oracle breaks compatibility. So anyone considering MariaDB needs to switch before going to 8.0; otherwise the switch will be much more painful.

                                                                                                • edf13 5 hours ago

                                                                                                  3 million queries/second across 16k nodes seems pretty heavy on redundancy?

                                                                                                  • sgarland 3 hours ago

                                                                                                    I was going to say, that's absolutely nothing. They state 2.1K clusters and 16K nodes; if you divide those, assuming even distribution, you get 7.6 instances/cluster. Round down because they probably rounded up for the article, so 1 primary and 6 replicas per cluster. That's still only ~1400 QPS / cluster, which isn't much at all.

                                                                                                    I'd be interested to hear if my assumptions were wrong, or if their schema and/or queries make this more intense than it seems.
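                                                                                                    For anyone redoing the arithmetic, using the headline figures of 3M QPS, 2.1K clusters, and 16K nodes:

```python
# Back-of-envelope check of the figures discussed above,
# assuming a perfectly even distribution (which it almost
# certainly isn't).
total_qps = 3_000_000
clusters = 2_100
nodes = 16_000

nodes_per_cluster = nodes / clusters    # ~7.6 instances per cluster
qps_per_cluster = total_qps / clusters  # ~1429 QPS per cluster
qps_per_node = total_qps / nodes        # 187.5 QPS per node

print(nodes_per_cluster, qps_per_cluster, qps_per_node)
```

                                                                                                    Whether those numbers are "nothing" depends entirely on how uneven the real distribution is, as the replies note.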

                                                                                                    • pgwhalen 2 hours ago

                                                                                                      > assuming even distribution

                                                                                                      I don't work for Uber, but this is almost certainly the assumption that is wrong. I doubt there is just a single workload duplicated 2.1K times. Additionally, different regions likely have different load.

                                                                                                    • withinboredom 5 hours ago

                                                                                                      That's under 200 qps per node, assuming perfect load balancing.

                                                                                                      • 620gelato 3 hours ago

                                                                                                        2100 clusters, 16k nodes, and data is replicated across every node "within a cluster" with nodes placed in different data centers/regions.

                                                                                                        That doesn't sound unreasonable, on average. But I suspect the distribution is likely pretty uneven.

                                                                                                      • John23832 4 hours ago

                                                                                                        Anyone else get a "Not Acceptable" response?

                                                                                                        • internetter 2 hours ago

                                                                                                          I did but it worked on a private tab

                                                                                                        • jauntywundrkind 3 hours ago

                                                                                                          Having spent a couple of months doing a corporate-mandated password rotation on our services - a number of which weren't really designed for password rotation - I'm happy to see the dual password thing mentioned.

                                                                                                          Being able to load in a new password while the current one is still active is where it's at! Trying to coordinate a big bang where everyone flips over at the same time is misery, and I spent a bunch of time updating services so they wouldn't have to do that. Great enhancement.

                                                                                                          I wonder what other datastores have dual (or more) password capabilities?
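                                                                                                          For reference, this is the MySQL 8.0 dual-password flow (account name and passwords here are made up):

```sql
-- Add a second password; the old one keeps working:
ALTER USER 'app'@'%' IDENTIFIED BY 'new_secret' RETAIN CURRENT PASSWORD;

-- Roll the new password out to clients at their own pace, then
-- retire the old one:
ALTER USER 'app'@'%' DISCARD OLD PASSWORD;
```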

                                                                                                          • johannes1234321 2 hours ago

                                                                                                            I can't give an overview of who has such a feature, but "every" system has a different way of doing this: rotate usernames as well, i.e. create a new user with the new password.

                                                                                                            This isn't 100% equivalent, as ownership (and thus permissions via DEFINER) in stored procedures etc. needs some thought, but bad access using an outdated username is simpler to trace, since usernames can be logged, unlike passwords. (MySQL also allows tracing via performance_schema logging, including user-defined connection attributes, which can ease finding the "bad" application.)
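                                                                                                            Sketched in MySQL terms (account names and grants are hypothetical):

```sql
-- "Rotate the username": create a parallel account with the new
-- credentials and identical grants, migrate clients, then drop
-- the old account.
CREATE USER 'app_v2'@'%' IDENTIFIED BY 'new_secret';
GRANT SELECT, INSERT, UPDATE, DELETE ON appdb.* TO 'app_v2'@'%';

-- Once every client has switched:
DROP USER 'app_v1'@'%';
```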

                                                                                                          • gregoriol 5 hours ago

                                                                                                            Wait until they find out they have to upgrade to 8.4 now

                                                                                                            • gregoriol 4 hours ago

                                                                                                              And also move all the passwords away from mysql_native_password.

                                                                                                              • johannes1234321 2 hours ago

                                                                                                                Which one should do anyway: mysql_native_password has been considered broken for about ten years (broken, that is, for anyone who can access the hashed form of the password on the server).

                                                                                                                • sgarland 3 hours ago

                                                                                                                  They've got until 9.0 for that, it just gives deprecation warnings in 8.4.

                                                                                                                  • gregoriol 27 minutes ago

                                                                                                                    Deprecation warnings are in 8.0. It's disabled in 8.4.

                                                                                                                    If you are up to date with all your libraries it should all go well, but if some project is stuck on old code - mostly old MySQL client libraries - you might get surprises when making the switch away.

                                                                                                                    • evanelias an hour ago

                                                                                                                      More specifically, mysql_native_password is disabled by default in 8.4, but can be re-enabled if needed: https://www.skeema.io/blog/2024/05/14/mysql84-surprises/#aut...
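                                                                                                                      Concretely (account name made up), the long-term fix is to move accounts onto the default plugin, with the 8.4 server option as a stopgap:

```sql
-- Preferred: migrate the account off the deprecated plugin.
ALTER USER 'legacy_app'@'%'
  IDENTIFIED WITH caching_sha2_password BY 'new_secret';

-- Stopgap on 8.4 (server option in my.cnf, per the linked post):
--   [mysqld]
--   mysql_native_password=ON
```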

                                                                                                                • dweekly 3 hours ago

                                                                                                                  Am I the only one who saw "delve" at the top of the article and immediately thought "ah, an AI generated piece"? Well, that and the over-structured components of the analysis with nearly uniform word count per point and high-complexity but low signal-to-noise vocabulary using phraseology not common to the domain being discussed. (The article doesn't scan as written by an SRE/DBA.)

                                                                                                                  • devbas 3 hours ago

                                                                                                                    The introduction seems to have AI sprinkled all over it: "…we embarked on a significant journey", "…in this monumental upgrade".

                                                                                                                  • rafram 4 hours ago

                                                                                                                    Did they have ChatGPT (re)write this? The writing style is very easy to identify, and it’s grating.

                                                                                                                    • OsrsNeedsf2P 4 hours ago

                                                                                                                      > The writing style is very easy to identify,

                                                                                                                      Really? At n=1 the rate seems to be 0

                                                                                                                    • paradite 4 hours ago

                                                                                                                      I can tell from a mile away that this is written by ChatGPT / Claude, at least partially.

                                                                                                                      "This distinction played a crucial role in our upgrade planning and execution strategy."

                                                                                                                      "Navigating Challenges in the MySQL Upgrade Journey"

                                                                                                                      "Finally, minimizing manual intervention during the upgrade process was crucial."

                                                                                                                      • traceroute66 4 hours ago

                                                                                                                        > I can tell from a mile away that this is written by ChatGPT / Claude, at least partially.

                                                                                                                        Whilst it may smell of ChatGPT/Claude, I think the answer is actually simpler.

                                                                                                                        Look at the authors of the blog, search LinkedIn. They are all based in India, mostly Bangalore.

                                                                                                                        It is therefore more likely to be Indian English.

                                                                                                                        To be absolutely clear, for absolute avoidance of doubt:

                                                                                                                        This is NOT intended as a racist comment. Indians clearly speak English fluently. But the style and flow of the English is different, just as it is for US English, Australian English, or any other English. I am not remotely saying one English is better than another!

                                                                                                                        If, like me, you have spent many hours on the phone to Bangalore call-centres, you will recognise many of the stylistic patterns present in the blog text.

                                                                                                                        • calmoo 4 hours ago

                                                                                                                          There's nothing that sticks out to me as obviously Indian English in this blog post. It's almost certainly entirely run through an LLM though.

                                                                                                                          • antisthenes 3 hours ago

                                                                                                                            If there are large amounts of Indian English in an LLM's training data, it stands to reason the LLM output will be very similar to Indian English, no?

                                                                                                                          • 620gelato 3 hours ago

                                                                                                                            (Speaking as an Indian engineer)

                                                                                                                            Hate to generalize, but this has less to do with "Indian style" and more to do with adding a lot of fluff to make a problem appear more complex than it is - or maybe someone set a template saying you must write such-and-such sections, despite there not being relevant content. [Half the sections of this article could be cut without losing anything.]

                                                                                                                            In this case, the _former_ really shouldn't have been the case. I for one would love to read a whole lot more about rollback planning, traffic shifting, which query patterns saw most improvements, hardware cost optimizations, if any, etc.

                                                                                                                            • brk 4 hours ago

                                                                                                                              I agree (I've posted a similar comment in the past and collected a handful of downvotes). Much like ChatGPT, you tend to see a slight over use of more formal and obscure words and a tone that tends to feel like the topic being discussed is being given just a touch too much focus or dedication relative to the grand scheme of things. It is hard to fully describe, more of a "you know it when you see it".

                                                                                                                              • excitive 3 hours ago

                                                                                                                                Can you elaborate on the last part? What are some stylistic patterns that differ when something is written by a US author vs. an Indian one?

                                                                                                                                • albert_e 3 hours ago

                                                                                                                                  I recently saw a tweet where someone pointed out that "today morning" was an Indian phrase.

                                                                                                                                  I had to really think hard why it is incorrect / not common elsewhere. Had to see comments to learn -- someone explained that a native English speaker would instead say "this morning" and not "today morning".

                                                                                                                                  As an Indian ESL speaker -- "today morning" sounded (and still sounds) perfectly fine to me -- since my brain grew up with Indian languages where the literal equivalent of "today morning" is not only very common but also the normal/correct way to convey the idea; if we instead try to say "this morning" it feels pretty contrived.

                                                                                                                                  • hodgesrm 3 hours ago

                                                                                                                                    Not exactly a stylistic difference, but there are real differences between the dialects. Here's an example from many moons ago: "Even I think that's a bad idea." That was an Indian colleague. It took me weeks to figure out that he was using "even" in place of "also."

                                                                                                                                    In a like vein when Australians say "goodeye" they usually aren't talking about your vision.

                                                                                                                                    • ssl-3 2 hours ago

                                                                                                                                      Perhaps.

                                                                                                                                      Or perhaps it was meant to specify that they themselves might have been presumed to be an outlier who would think it was a good idea, but have in fact come to think it is a bad idea.

                                                                                                                                      Examples of this kind of counter-presumptive use of the word "even":

                                                                                                                                      1: On animals and the weather: "It was so cold that even polar bears were suffering from frostbite and frozen digits."

                                                                                                                                      2: On politics, where one's general stance is well-known and one might rationally be presumed to support a particular thing: "Even I think that this issue is a total non-starter."

                                                                                                                                      Even if they may have meant something else, that doesn't rule out the words being intended in this counter-presumptive sense.

                                                                                                                                      • V-eHGsd_ 3 hours ago

                                                                                                                                        > In a like vein when Australians say "goodeye" they usually aren't talking about your vision.

                                                                                                                                        They aren’t saying goodeye, they’re saying g’day (good day)

                                                                                                                                  • brunocvcunha 4 hours ago

                                                                                                                                    I can tell just by the frequency of the word “delve”

                                                                                                                                    • mannyv 25 minutes ago

                                                                                                                                      Once ChatGPT puts in "we did the needful" we're all doomed.

                                                                                                                                      • godshatter 3 hours ago

                                                                                                                                        That sounds like regular old English to me. I could see myself saying all those things without thinking it's pushing any boundaries whatsoever. I'm starting to fear that LLMs are going to dumb down our language in the same way that people feared that calculators would remove our ability to calculate mentally.

                                                                                                                                        • aster0id 3 hours ago

                                                                                                                                          Because the authors are likely non-native English speakers. I'm one myself, and it is hard to write for a primarily native-English-speaking audience without linguistic artifacts that give you away or, worse, get you ridiculed.

                                                                                                                                          • notinmykernel 4 hours ago

                                                                                                                                            Agree. Repetition (e.g., crucial) in ChatGPT is an issue.

                                                                                                                                            • rand_r 3 hours ago

                                                                                                                                              I know what you mean, and you’re probably right, but there’s a deeper problem, which is the overuse of adjectives and overall wordiness. It’s quite jarring because it reads like someone trying to impress rather than get an important message across.

                                                                                                                                              Frankly, ChatGPT could have written this better with a simple “improve the style of this text” directive.

                                                                                                                                              Example from the start:

                                                                                                                                              > MySQL v8.0 offered a compelling proposition with its promise of substantial performance enhancements.

                                                                                                                                              That could have just been “MySQL v8.0 promised substantial performance improvements.”

                                                                                                                                              • kaeruct 3 hours ago

                                                                                                                                                ChatGPT says "While it's plausible that a human might write this content, the consistent tone, structure, and emphasis on fluency suggest it was either fully or partially generated by an LLM."

                                                                                                                                                • gurchik 3 hours ago

                                                                                                                                                  How would ChatGPT know?

                                                                                                                                              • jeffbee 2 hours ago

                                                                                                                                                File under "things you will never need to do if you use cloud services".

                                                                                                                                                • mannyv 26 minutes ago

                                                                                                                                                  That's not true. The RDS 5.7 instances are EOL so you have to upgrade them at some point.

                                                                                                                                  At least in RDS, that will be a one-way upgrade, i.e. no rollback will be possible. That said, you can upgrade one instance at a time in your cluster for a no-downtime rollout.
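                                                                                                                                  For anyone curious, that per-instance rollout can be kicked off with the AWS CLI's `modify-db-instance`; the instance identifier and target version below are placeholders, and in practice you'd upgrade replicas first and verify replication before touching the primary:

                                                                                                                                  ```shell
                                                                                                                                  # Upgrade a single replica in place. --allow-major-version-upgrade is
                                                                                                                                  # required for 5.7 -> 8.0, and --apply-immediately skips waiting for
                                                                                                                                  # the next maintenance window. Repeat per instance for a rolling upgrade.
                                                                                                                                  aws rds modify-db-instance \
                                                                                                                                    --db-instance-identifier my-cluster-replica-1 \
                                                                                                                                    --engine-version 8.0.36 \
                                                                                                                                    --allow-major-version-upgrade \
                                                                                                                                    --apply-immediately
                                                                                                                                  ```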

                                                                                                                                                  • jeffbee 17 minutes ago

                                                                                                                                    Hosted MySQL is not what I meant. That just means you're paying more to have all the same problems. The kind of cloud service I am alluding to is Cloud Spanner, Cloud Bigtable, DynamoDB.

                                                                                                                                                  • martinsnow 2 hours ago

                                                                                                                                    Nah. Random outages because the RDS instance you were on decided to faceplant, or the weird memory-to-bandwidth scaling AWS has chosen, will make you pull your hair out on a high-traffic day.

                                                                                                                                                    It's just different problems.

                                                                                                                                                    • jeffbee 2 hours ago

                                                                                                                                                      The company in the article is doing < 200qps per node. Unless they are returning a feature-length video file from every query, they are nowhere near any hardware resource limits.

                                                                                                                                                    • paxys 2 hours ago

                                                                                                                                                      At Uber's scale they are a cloud service.

                                                                                                                                                    • greenie_beans 4 hours ago

                                                                                                                                                      yall should prioritize your focus so you can do better at vetting drivers who don't almost kill me

                                                                                                                                                      • JamesSwift 3 hours ago

                                                                                                                                                        I'm not sure how effective the database engineers are going to be at solving this, but I guess we can ask them to try...

                                                                                                                                                        • greenie_beans 2 hours ago

                                                                                                                                                          thanks for your help

                                                                                                                                                        • lenerdenator 3 hours ago

                                                                                                                                                          Their focus is prioritized according to what returns maximum value to their shareholders.

                                                                                                                                                          • greenie_beans 3 hours ago

                                                                                                                                                            beep boop i'm a capitalist robot

                                                                                                                                            pretty sure safe travels is critical to maximum value to their shareholders (aka stfu or tell me how this blog post has anything to do with maximizing shareholder value https://www.uber.com/en-JO/blog/upgrading-ubers-mysql-fleet/... ... shareholder value is a dumb ass thing to prioritize over human life)

                                                                                                                                                            • Kennnan 3 hours ago

                                                                                                                                              Honest question: how do you (or anyone) propose to vet drivers? They require a driver's license and car insurance registration; anything like a CDL would make being a driver prohibitively expensive. Their rating system already works as a good signal the few times I've used Uber.

                                                                                                                                                              • greenie_beans 3 hours ago

                                                                                                                                                                i don't know, i don't work there. i'm just somebody who almost died because one of their drivers was a terrible driver. that sounds like a problem they should figure out. dude didn't even know how to change a tire, so start with "basic knowledge of car maintenance." and a basic ability to speak english would be a good bar to meet, too. they'll let anybody with a driver's license, car, and a heart beat drive on that app. there should be a higher barrier of entry. but idk, i don't work there. this is just my experience as consumer.

                                                                                                                                                                also, the US should be wayyyyy stricter on who we issue drivers license to. so many terrible drivers on the road driving these death machines.

                                                                                                                                                                • robertlagrant 2 hours ago

                                                                                                                                                                  If you have one big company with 10 bad drivers, you'll get a much worse impression of it than 100 companies each with one bad driver.

                                                                                                                                                                  • greenie_beans 2 hours ago

                                                                                                                                                                    and your point is?

                                                                                                                                                                    this just makes no sense bc the drivers are on all of the different apps. rework your formula.

                                                                                                                                                                  • croisillon 3 hours ago

                                                                                                                                                                    mandatory retest every 5 years

                                                                                                                                                            • photochemsyn 3 hours ago

                                                                                                                                                              It's undeniable that the worst drivers on the road are those working for ride-hailing services like Uber. It's a big point in Waymo's favor that their automated vehicles behave predictably - Uber drivers are typically Crazy Ivan types doing random u-turns, staring at their electronic devices while driving, blocking pedestrian walkways and bike lanes, etc.

                                                                                                                                                            • menaerus 4 hours ago

                                                                                                                                              Lately there's been a shitload of sponsored ($$$) anti-MySQL articles, so it's kinda entertaining that their authors are being slapped in the face by Uber, completely unintentionally.

                                                                                                                                                              • sinaptia_dev an hour ago

                                                                                                                                                                We're launching the second issue of This Week in #devs, a section of our blog that highlights discussions and articles from the #devs Slack channel. https://lnkd.in/gvmvKA49 We hope you find it useful!