• thepuppet33r 3 hours ago

    I have spent hours arguing with someone at my work that the issue we are experiencing at our remote locations is not due to bandwidth, but latency. These graphics are exactly what I've been looking for to help get my point across.

    People run a speedtest, see a low (sub-100 Mbps) number, and assume that's why their video call is failing. Never mind the fact that Zoom only needs 3 Mbps for 1080p video.

    • EvanAnderson an hour ago

      Latency is a cruel mistress. Had a Customer who was using an old Win32 app that did a ton of individual SELECT queries against the database server to render the UI. They tried to put it on the end of a VPN connection and it was excruciating. The old vendor kept trying to convince them to add bandwidth. Fortunately the Customer asked the question "Why does the app work fine across the 1.544Mbps T1 to our other office?" (The T1 had sub-5ms latency.)
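
      Back-of-the-envelope arithmetic (all numbers below are invented for illustration, not that Customer's actual figures) shows why the skinny T1 beat the fat VPN link:

        # With sequential queries, round trips dominate, not bits.
        # All numbers are illustrative assumptions, not measurements.
        queries = 2000            # individual SELECTs to render one screen
        row_bytes = 500           # tiny payload per query

        for name, rtt_ms, mbps in [("T1, ~5 ms RTT, 1.544 Mbps", 5, 1.544),
                                   ("VPN, ~60 ms RTT, 100 Mbps", 60, 100.0)]:
            transfer_s = queries * row_bytes * 8 / (mbps * 1e6)
            waiting_s = queries * rtt_ms / 1000
            print(f"{name}: data {transfer_s:.1f} s + round trips {waiting_s:.0f} s")
        # T1:  ~5 s of data + ~10 s of round trips   -> tolerable
        # VPN: ~0.1 s of data + ~120 s of round trips -> excruciating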

      • dtaht an hour ago

        Speedtest.net added support for tracking latency under load a few years ago; they now show ping during the upload and download phases. That's the number to show your colleague.

        However, they tend to report something like the 75th percentile and throw out the worst samples. The Waveform bufferbloat test reports the 95th percentile and supplies whisker charts; Cloudflare's test does as well.

        No web test exercises upload and download at the same time, which is the worst-case scenario. Crusader and flent.org's rrul test do.
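
        A quick illustration of why the percentile choice matters, using synthetic RTT samples rather than real test data:

          # Bufferbloat lives in the tail of the latency distribution.
          # Synthetic samples: mostly ~20 ms, plus ~8% spikes of 200-400 ms,
          # the kind a loaded queue produces.
          import random
          from statistics import quantiles

          random.seed(1)
          rtts = ([random.gauss(20, 3) for _ in range(920)]
                  + [random.uniform(200, 400) for _ in range(80)])

          p = quantiles(rtts, n=100)   # p[i] is the (i+1)-th percentile
          print(f"p50 {p[49]:.0f} ms  p75 {p[74]:.0f} ms  "
                f"p95 {p[94]:.0f} ms  max {max(rtts):.0f} ms")
          # The 75th percentile looks like a healthy ~20 ms line; the 95th
          # percentile (and a whisker chart) exposes the multi-hundred-ms spikes.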

        Rather than argue with your colleague, why not just slap an OpenWrt box inline as a transparent bridge and configure CAKE SQM?

        • thomasjudge 17 minutes ago

          Would you put the "OpenWrt box as a transparent bridge inline" between your home router and the cable modem, or on the house side of the home router?

          • dtaht a few seconds ago

            I would replace the home router with an OpenWrt router.

        • dtaht an hour ago

          BITAG published this a while back.

          https://www.bitag.org/latency-explained.php

          It's worth a read.

        • voidwtf 8 minutes ago

          These types of solutions don't scale to large ISPs and get costly to deploy at the edge. It's also not just about throughput in Gbps, but in Mpps.

          Also, this doesn't take into account that the congestion/queueing issue might be upstream. I could have 100G from the customer's local CO to my core routers, but if the route goes over a saturated 20G link to a local IX, having fq_codel at the edge toward the customer probably won't help.

          • cycomanic 2 hours ago

            I know it's common to say "bandwidth" casually, but I really wish a blog trying to explain the difference between data rate and latency would not conflate bandwidth and data rate (one could also say throughput or capacity, although the latter is also technically incorrect). The term bandwidth really denotes the spectral width occupied by a signal, and while it is related to the actual data rate, it is much less so nowadays, with advanced modulation, than back when everything was just OOK (on-off keying).
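
            For a concrete sense of the distinction, Shannon's capacity C = B * log2(1 + SNR) ties the two together: the data rate you get out of a given bandwidth in hertz depends on the SNR and hence the modulation you can run. A rough sketch with made-up numbers:

              # Bandwidth (Hz) vs. data rate (bit/s): C = B * log2(1 + SNR).
              # Illustrative numbers; real links add coding overhead, guard bands, etc.
              from math import log2

              def capacity_bps(bandwidth_hz, snr_db):
                  snr = 10 ** (snr_db / 10)
                  return bandwidth_hz * log2(1 + snr)

              # The same 50 MHz of spectrum, different SNR / modulation headroom:
              for snr_db in (10, 30):
                  print(f"50 MHz at {snr_db} dB SNR -> "
                        f"{capacity_bps(50e6, snr_db) / 1e6:.0f} Mbit/s")
              # ~173 Mbit/s vs ~498 Mbit/s out of identical "bandwidth".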

            Coincidentally, the difference between latency and data rate is also much clearer using these two terms.

            • declan_roberts an hour ago

              Working from home has really put a spotlight on the terrible asymmetric upload speeds of most cable internet.

              I can get 1 Gbps down but only 50 Mbps up. Certain tasks (like uploading a Docker image) I can't do at all from my personal computer.
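
              Rough numbers, assuming a 5 GB image (purely an illustrative size):

                # Transfer-time asymmetry; the 5 GB image size is an assumption.
                image_bits = 5 * 8e9

                for direction, mbps in (("download at 1000 Mbps", 1000),
                                        ("upload at 50 Mbps", 50)):
                    print(f"{direction}: ~{image_bits / (mbps * 1e6) / 60:.1f} min")
                # ~0.7 min down vs ~13 min up for the same bytes -- and that is
                # with the uplink fully saturated, which is exactly when latency
                # for everyone else in the house falls apart.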

              The layman has no idea of the difference, and even most legislators don't understand the issue ("isn't 1 gig fast enough?")

              • packetlost 14 minutes ago

                This. I've been fighting AT&T for a while because they told the FCC (via their broadband maps [0]) that they supply fiber to my condo, so I bought it expecting to get fiber. Well, when I finally went to set up internet service, they only offered 50/5 DSL. Fortunately I can get cable with usable down speeds, but the up is substantially less than 50 Mbps, with garbage routing.

                I'm not very happy.

                [0]: https://broadbandmap.fcc.gov/

                • __MatrixMan__ an hour ago

                  I got lucky and fiber became available in my neighborhood around the same time I noticed how painful pushing images over cable was. Hopefully you'll get that option soon too.

                  For the unlucky, maybe we can take advantage of the fact that most image pushes have a predecessor which they are 99% similar to. With some care about the image contents (nar instead of tar, gzip --rsyncable, etc) we ought to be able to save a lot of bandwidth by using rsync on top of the previous version instead of transferring each image independently.
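
                  A minimal sketch of that idea (the image name, paths, and remote host are hypothetical; it assumes docker, a gzip with --rsyncable support, and rsync on both ends):

                    # Ship a container image as an rsync-friendly archive so only
                    # the delta vs. the previous push crosses the slow uplink.
                    import subprocess

                    IMAGE = "myapp:latest"                       # hypothetical image
                    ARCHIVE = "/tmp/myapp.tar.gz"                # local staging path
                    REMOTE = "builder.example.com:/srv/images/"  # hypothetical rsync target

                    # Export the image as a tar stream, compressed with --rsyncable
                    # so small input changes don't scramble the whole output.
                    with open(ARCHIVE, "wb") as out:
                        save = subprocess.Popen(["docker", "save", IMAGE],
                                                stdout=subprocess.PIPE)
                        subprocess.run(["gzip", "--rsyncable"],
                                       stdin=save.stdout, stdout=out, check=True)
                        save.wait()

                    # rsync only sends the blocks that differ from the copy already
                    # on the remote side (i.e. the previous version of the image).
                    subprocess.run(["rsync", "--partial", "--progress", ARCHIVE, REMOTE],
                                   check=True)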

                  • LoganDark 38 minutes ago

                    > I can get 1 Gbps down but only 50 Mbps up. Certain tasks (like uploading a Docker image) I can't do at all from my personal computer.

                    As someone who used to work with LLMs, I feel this pain. It would take days for me to upload models. Other community members rent GPU servers to do the training on, just so that their data is already in the cloud, but that's not really a sustainable solution for me since I like tinkering at home.

                    I have around the same speeds, btw: 1 Gbps down and barely 40 Mbps up. A factor of 25!

                    • latency-guy2 5 minutes ago

                      I feel your pain, I haven't been in ML world directly for a few years now but I've done the same exercise multiple times.

                      The worst part is that block compression doesn't actually help unless it does a significantly good job of both compression AND decompression. My use case required deploying the models immediately across a few nodes in a live environment at customer sites. Cloud wasn't an option for us, and fiber was often unavailable as well.

                      The fastest transport protocol was someone's car and a workday of wages.
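
                      The classic sneakernet arithmetic bears that out; all the figures here are invented for illustration:

                        # "Never underestimate the bandwidth of a station wagon full of tapes."
                        # All numbers are invented for illustration.
                        payload_tb = 10          # drives in the trunk
                        drive_hours = 3          # door to door, including copy time at each end
                        uplink_mbps = 40         # the kind of uplink discussed above

                        car_gbps = payload_tb * 8e12 / (drive_hours * 3600) / 1e9
                        upload_days = payload_tb * 8e12 / (uplink_mbps * 1e6) / 86400
                        print(f"car: ~{car_gbps:.1f} Gbit/s effective; "
                              f"uplink: ~{upload_days:.0f} days for the same payload")
                        # ~7 Gbit/s by car vs. roughly 23 days over a 40 Mbit/s uplink.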

                  • kortilla 2 hours ago

                    > Now a company with bad performance can ask its ISP to fix it and point at the software and people who have already used it. If the ISP already knows it has a performance complaint, it can get ahead of the problem by proactively implementing LibreQoS.

                    The post was a pretty good explanation of a new distro ISPs can use to help with fair queuing, but this statement is laughably naive.

                    A distro existing is only a baby first step to an ISP adopting this. They need to train on how to monitor these, scale them, take them out for maintenance, and operate them in a highly available fashion.

                    It's a huge opex barrier and capex is not why ISPs didn’t bother to solve it in the first place.

                    • dtaht an hour ago

                      We have seen small ISPs get LibreQoS running in under an hour, and that includes installing Ubuntu. Configuring it right and getting it fully integrated with the customer management system takes longer.

                      We're pretty sure most of those ISPs see reduced opex from support calls.

                      Until the appearance of fq_codel middleboxes (Preseem, Bequant) and CAKE middleboxes (LibreQoS, Paraqum), the capex was essentially infinite. Now it's pennies per subscriber, and many just get a suitable box off of eBay.

                      I agree, btw, that knowing how to monitor and scale these is a learned thing. For example, many naive operators see "drops" as reported by CAKE as a bad thing, when dropping is actually needed for good congestion control.

                    • codesections 2 hours ago

                      How does OpenWrt fare on these metrics? Does it count as a "debloated router" in the sense used in TFA? Or is additional software above and beyond the core OpenWrt system needed to handle congestion properly?

                      • wmf 2 hours ago

                        OpenWRT has SQM but you have to enable it. https://openwrt.org/docs/guide-user/network/traffic-shaping/...

                        • dtaht an hour ago

                          OpenWrt deprecated pfifo_fast in favor of fq_codel in 2012 and hasn't looked back. It (along with BQL) is ever-present on all their Ethernet hardware and most of their WiFi, no configuration required. It's just there.

                          That said, many OpenWrt-supported chips have offloads that bypass it, and while those are speedier and lower power, they tend to be overbuffered.

                        • jiggawatts 3 hours ago

                          Even IT professionals can't tell the difference between latency and bandwidth, or capacity and speed.

                          A simple rule of thumb is: If a single user experiences poor performance with your otherwise idle cluster of servers, then adding more servers will not help.
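
                          A toy model with invented numbers makes the point: adding servers lowers utilization, but it can't shorten one request's own service time.

                            # Rough M/M/1-style intuition: response time is about
                            # service_time / (1 - utilization) per server. All numbers invented.
                            def response_ms(service_ms, offered_rps, servers):
                                utilization = (offered_rps / servers) * service_ms / 1000
                                return float("inf") if utilization >= 1 else service_ms / (1 - utilization)

                            service_ms = 40   # one user's request on an otherwise idle system

                            for load_rps in (1, 20):
                                for servers in (1, 2, 4):
                                    print(f"load {load_rps:>2} rps, {servers} server(s): "
                                          f"~{response_ms(service_ms, load_rps, servers):.0f} ms")
                            # At 1 rps (a single user) every row is ~40-42 ms: doubling the
                            # servers changes nothing. Only under heavy load does the extra
                            # capacity help (200 ms -> 67 ms -> 50 ms at 20 rps).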

                          You can't imagine how often I have to have this conversation with developers, devops people, architects, business owners, etc...

                          "Let's just double the cores and see what happens."

                          "Let's not, it'll just double your costs and do nothing to improve things."

                          Also another recent conversation:

                          "Your network security middlebox doesn't use a proper network stack and is adding 5 milliseconds of latency to all datacentre communications."

                          "We can scale out by adding more instances if capacity is a concern!"

                          "That's... not what I said."

                          • dtaht an hour ago

                            I share your pain. I really really really share your pain.

                            • JohnMakin 2 hours ago

                              It's astounding how many people who work in infrastructure should understand things like this but don't, particularly network bottlenecks, or bottlenecks in general. I've seen situations where someone wants to increase the number of replicas for a service because the frontend is 504'ing, but the real reason is that the database has become saturated by calls from that service. It is possible (a little unlikely, but possible, and the rule with infra at scale is that "unlikely" always becomes "certain") to actually make the problem worse by scaling up there. The number of blank stares I get when explaining things like this is demoralizing sometimes, especially in consulting situations where you have some pig-headed engineering manager who thinks he knows everything about everything.

                            • ipython 2 hours ago

                              As they say, if you're getting impatient for your baby to arrive, just get more pregnant ladies together! The cluster of pregnant women makes the process move along quicker!

                              /s

                          • panosv 2 hours ago

                            macOS now has a built-in dedicated tool called networkQuality that tries to capture these variables: https://netbeez.net/blog/measure-network-quality-on-macos/

                            Also take a look at the Measurement Swiss Army Knife (MSAK): https://netbeez.net/blog/msak/

                          • tonymet 3 hours ago

                            There are three parameters of concern with your ISP: bandwidth, latency (and jitter), and data caps.

                            Bandwidth is less of a concern for most people, as data rates are commonly 500 Mbps or more. That's enough to comfortably stream 5 concurrent 4K streams (at 20 Mbps each).

                            Latency and jitter have a bigger impact on real-time applications, particularly video conferencing, VoIP, and gaming, and to a lesser extent video streaming when you are scrubbing the feed. You can test yours at https://speed.cloudflare.com/. If your video is jittery or laggy and you are having trouble holding a natural conversation, latency/jitter is likely the issue.

                            Data caps are a real concern for most people. At 1 Gbps, most people are hitting their 1-1.5 TB data cap within an hour or so.

                            Assuming you are at around 500 Mbps or more, latency and data caps are the bigger concerns.

                            • lxgr 3 hours ago

                              > At 1 Gbps, most people are hitting their 1-1.5 TB data cap within an hour or so.

                              Assuming you're talking about consumers: How? All that data needs to go somewhere!

                              Even multiple 4K streams only take a fraction of one gigabit/s, and while downloads can often saturate a connection, the total transmitted amount of data is capped by storage capacities.
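
                              Rough arithmetic backs that up, assuming the line is 100% saturated the whole time, which almost nobody sustains:

                                # Time to move 1 TB at a given line rate, assuming
                                # continuous, full saturation (an unrealistic extreme).
                                cap_tb = 1.0
                                for gbps in (0.1, 0.5, 1.0):
                                    hours = cap_tb * 8e12 / (gbps * 1e9) / 3600
                                    print(f"{gbps:g} Gbit/s saturated: ~{hours:.1f} h to move {cap_tb} TB")
                                # Even flat out at 1 Gbit/s it takes ~2.2 hours to reach a 1 TB
                                # cap, and household traffic is nowhere near constant saturation.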

                              That's not to say that data caps are a good thing, but conversely it also doesn't mean that gigabit connections with terabyte-sized data caps are useless.

                              • Izkata 2 hours ago

                                > gaming

                                Gamers tend to have an intuitive understanding of latency; they just use the words "lag" and "ping" instead.

                              • readingnews 3 hours ago

                                ACM, come on, stop spreading disinformation. You know full well that nothing travels at the speed of light down the wire or fiber. We have converters on each end, and in glass the signal propagates at the speed of light divided by the refractive index of the glass. Even in the best of times, not c. I just hate it when a customer is yelling at me, insisting that the latency should be an absolute 0, and they start pointing at articles like this: "see, even the mighty ACM says it should be c".
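
                                For concreteness, a rough calculation (n of about 1.47 is a typical figure for silica fiber; exact values vary by fiber type):

                                  # Propagation delay in fiber: light travels at roughly c / n.
                                  # n is an assumed typical value; real routes also add
                                  # serialization, queueing, and extra cable length.
                                  C_KM_PER_S = 299_792      # speed of light in vacuum, km/s
                                  N_FIBER = 1.47            # assumed refractive index

                                  v = C_KM_PER_S / N_FIBER  # ~204,000 km/s in the glass
                                  for km in (100, 1000, 5000):
                                      one_way_ms = km / v * 1000
                                      print(f"{km:>5} km: ~{one_way_ms:.1f} ms one way, "
                                            f"~{2 * one_way_ms:.1f} ms RTT")
                                  # 5000 km of fiber is ~25 ms one way before a single queue is
                                  # involved -- far from zero, and nothing more bandwidth can fix.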

                                Ugh.

                                • anotherhue 3 hours ago

                                  And that's before you consider the actual cable length vs. the straight-line distance.

                                  • thowawatp302 2 hours ago

                                    I don't think you're going to have an issue with Cherenkov radiation in the fiber, and the fiber is not going to be a straight line over any non-trivial distance, so the approximation is close enough.

                                    • lxgr 3 hours ago

                                      It's a reasonable approximation for most calculations. It seems unfair to call that "disinformation".

                                      Serialization delay, queuing delay etc. often dominate, but these have little to do with the actual propagation delay, which also can't be neglected.

                                      > when a customer is yelling at me telling me that the latency should be absolute 0

                                      The speed of light isn't infinity, is it?