• kardianos 3 hours ago

    There was a talk about this prior. This was used in place of TCP, but where TCP is designed to run over unreliable networks, this protocol achieves speed and latency figures comparable to others, while still being able to retain commodity IP switches in the cluster. By having a fixed buffer, no lingers, faster opens, they increase the speed and latency, without going to dedicated vendors or other stacks.

    • vardump 3 hours ago

      > they increase the speed and latency

      I suppose you mean "increase the speed and decrease the latency"?

      • kardianos an hour ago

        Yes. Typo.

        • moomin 2 hours ago

          AnakinPadme.jpg

      • throw0101a 3 hours ago

        Previous discussion from ~month ago, "Tesla’s TTPoE at Hot Chips 2024: Replacing TCP for Low Latency Applications":

        * https://news.ycombinator.com/item?id=41374663

        * https://chipsandcheese.com/2024/08/27/teslas-ttpoe-at-hot-ch...

        • digitallis42 8 hours ago

          I did a skim and didn't see any explanation of why one would want it over TCP. Did I miss it, or is it non-obvious?

          • lloeki 8 hours ago

            From a cursory look:

            - looks dead simple

            - no IP layer (there's a ttpip folder in that repo though)

            - distributed congestion control (TCP has a "window" field + a bunch of tentative RFCs, this has a purposeful "congestion")

            - 100% implementable in hardware (TCP can, but it's complex)

            Not a general TCP replacement, but the README properly highlights a "many endpoints local link" use case:

            > the protocol executed entirely in hardware and deployed to a very large multi-ExaFlops (fp16) supercomputer with over 10s of thousands of concurrent endpoints. This protocol does not need a CPU or OS to be involved in any way to link and execute.
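
To make the "no IP layer" point concrete, here is a minimal sketch of a payload carried directly in an Ethernet II frame, with no IP or UDP/TCP header in between. It shows generic Ethernet framing only, not the TTPoE header layout (which is not described in this thread). The EtherType is the IEEE 802 "Local Experimental" value, used as a stand-in, and the helper names `build_frame` and `parse_frame` are hypothetical.

```python
import struct

# IEEE 802 "Local Experimental" EtherType, used here as a stand-in.
# Assumption: this is NOT claimed to be the EtherType TTPoE uses.
ETHERTYPE_EXPERIMENTAL = 0x88B5
MIN_PAYLOAD = 46    # Ethernet II minimum payload (60-byte frame before FCS)
MAX_PAYLOAD = 1500  # standard MTU, no jumbo frames


def mac_to_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    raw = bytes.fromhex(mac.replace(":", ""))
    if len(raw) != 6:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    return raw


def build_frame(dst: str, src: str, payload: bytes) -> bytes:
    """Build dst MAC + src MAC + EtherType + payload, padded to the minimum size."""
    if len(payload) > MAX_PAYLOAD:
        raise ValueError("payload exceeds standard Ethernet MTU")
    header = struct.pack(
        "!6s6sH", mac_to_bytes(dst), mac_to_bytes(src), ETHERTYPE_EXPERIMENTAL
    )
    return header + payload.ljust(MIN_PAYLOAD, b"\x00")


def parse_frame(frame: bytes):
    """Split a frame back into (dst, src, ethertype, payload incl. padding)."""
    dst, src, ethertype = struct.unpack("!6s6sH", frame[:14])
    return dst, src, ethertype, frame[14:]


frame = build_frame("02:00:00:00:00:02", "02:00:00:00:00:01", b"hello")
dst, src, ethertype, payload = parse_frame(frame)
```

The whole header is 14 bytes that a switch can forward on MAC address alone, which is part of why a protocol at this layer is easier to implement fully in hardware than a TCP/IP stack.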

            • cyp0633 5 hours ago

              In Tesla's presentation slides, "Tesla Transport Protocol Over Ethernet (TTPoE): A New Lossy, Exa-Scale Fabric for the Dojo AI Supercomputer", they mentioned that the network layer is optional (but not removed)

            • Beretta_Vexee 7 hours ago

              I think it's better to think of it as a fibre channel protocol rather than TCP. It's intended for use on managed internal data centre networks. It skips OSI layers to gain speed and probably do 100% hardware routing with FPGAs.

              It's of no interest on the internet or any small-scale network.

              • KaiserPro 3 hours ago

                > fibre channel protocol

                Apart from the fact that FC is explicitly lossless and ordered

                • bcrl 16 minutes ago

                  FC is not entirely lossless. One ticket I had the joy of dealing with involved a customer using a Fibre Channel network for their storage using multipathd for failover. In theory it was a fully redundant configuration with dual FC ports on the server with each one going to a different FC switch all the way back to the SAN. However, the system was generating I/O errors on large writes while small writes would succeed. Needless to say that ext4 failed horribly, and there were worries that it was a kernel bug in the FC driver.

                  After a good amount of back and forth with the customer, and several test programs run on the system in question, I eventually came up with a hypothesis that there was an error in the write path of the SAN as small writes succeeded while larger writes failed. The customer ultimately found there was a dirty fibre on one of the links in their FC fabric. It was dirty enough to corrupt large packets, but not so dirty that smaller writes and control packets were unable to get through. Since multipathd only checks to see if a given target can be read from, it would never fail over to the other path (which was fine). So much for trying to build a high availability system using an expensive SAN!

                  Lesson of the story: what you think is a lossless network is not always lossless. Using the IP stack has a lot of beneficial diagnostic tools that you really start missing when something goes awry in a non-IP network.

                  • stonogo 2 minutes ago

                    Broken hardware does not make the protocol lossy. I think you're misunderstanding what 'lossless' is intended to mean in this context; it does not mean that it is error-free. In a lossy protocol, missing data is not necessarily an error. In a lossless protocol, missing data is treated as an error, which is consistent with what you experienced.
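
The distinction drawn above can be sketched as two receivers that see the same sequence numbers. This is only an illustration of the semantics, not how FC or TTPoE is implemented; the function names and the `MissingData` exception are hypothetical.

```python
class MissingData(Exception):
    """Raised by the lossless receiver when a sequence number is skipped."""


def deliver_lossless(seq_numbers):
    """Lossless semantics: a gap in the sequence is an error."""
    expected = 0
    for seq in seq_numbers:
        if seq != expected:
            raise MissingData(f"expected {expected}, got {seq}")
        expected += 1
    return expected  # number of items delivered


def deliver_lossy(seq_numbers):
    """Lossy semantics: deliver what arrives and just count the gaps."""
    delivered, lost, expected = 0, 0, 0
    for seq in seq_numbers:
        if seq < expected:
            continue  # stale or duplicate item, ignore it
        lost += seq - expected
        delivered += 1
        expected = seq + 1
    return delivered, lost


arrived = [0, 1, 3, 4]  # item 2 never showed up
lossy_result = deliver_lossy(arrived)
try:
    deliver_lossless(arrived)
    lossless_failed = False
except MissingData:
    lossless_failed = True
```

With the same input, the lossy receiver carries on and reports one missing item, while the lossless receiver surfaces the gap as a failure, much like the I/O errors in the dirty-fibre story.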

                • delfinom 2 hours ago

                  Elon just doesn't want to pay Nvidia for Infiniband. Lol

                  • andix an hour ago

                    If it works and it's cheaper, this is a very reasonable thing to do.

              • FuriouslyAdrift 2 hours ago

                Be interesting to see how this stacks up to the dominant protocol in supercomputers/AI clusters: Infiniband.

                • nine_k 33 minutes ago

                  AFAICT this is very much about handling unreliable links and congestion control.

                  Infiniband instead makes the sides bargain to avoid packet loss, while the medium is supposed to be reliable.

                • elcritch 9 hours ago

                  Twice now I’ve been excited that this was for realtime ethernet used in Tesla's vehicles. Alas, it is not.

                  • sgu999 6 hours ago

                    Any reason to believe they don't use one of the standard industrial protocols like the poorly named EtherNet/IP?

                    • kvmet 3 hours ago

                      Licensing probably?

                      CAN (or one of its more modern variants) is historically more common in automotive. However, with 2-wire Ethernet connections becoming more commonplace I do think you're right that more and more cars will be moving to ethernet fieldbus.

                      EtherNet/IP is not as robust for many applications as its competitors (PROFINET, EtherCAT) since it is not fully deterministic. EtherCAT is my personal favorite.

                      • DannyBee 3 hours ago

                        +1 - ethercat and profinet are the way.

                        Random guessing - Ethercat seems more likely to take over for CAN because CoE (canopen over ethercat) is so common.

                        It's very easy to turn CAN devices into ethercat ones.

                        Harder to turn them into profinet ones.

                        Seems like a more incremental path for car makers.

                        Otherwise the main advantage of profinet is that you can treat it like regular ethernet (i.e. switches, etc.), but not sure anyone cares in a car.

                      • LeifCarrotson 2 hours ago

                        Of all the (current) industrial protocols they could have picked, Ethernet/IP would be the worst.

                        Its only advantage is that it can coexist with other TCP traffic and run over standard switches, but that just results in unreliable fieldbus performance.

                        • MisterTea 2 hours ago

                          Please no EIP, it's utter crap and designed by an OOP huffing committee. The only serious protocol is EtherCAT with honorable mentions for Sercos 3 and Ethernet Powerlink (CANopen over Ethernet).

                      • high_na_euv 5 hours ago

                        Really interesting

                        • thelastparadise 5 hours ago

                          Why?

                          • high_na_euv 5 hours ago

                            Recreating foundational infra doesn't seem so common, especially for a car company

                            • Cthulhu_ 3 hours ago

                              In a sense this wasn't from Tesla the car company, but Tesla the IT department with a supercomputer. I don't know what they do on it though, might be lots of physics simulations (aerodynamics etc) or deep learning for assisted driving tech.

                              • martindbp 20 minutes ago

                                They train an end-to-end model to drive based on 8 camera streams and recorded input from human drivers, training on tens (if not hundreds now) of millions of 30-second clips from their consumer fleet. That's why they've bought one of the largest GPU clusters and are making their own chips and transport protocols.

                                It's not widely known, but Tesla probably has one of the largest training clusters, because practically all the GPUs they buy go towards training, while most of the GPUs for e.g. OpenAI go towards inference. Tesla does inference in the car.

                                • literalAardvark 23 minutes ago

                                  In older interviews Musk said that the Dojo is intended for deep learning.

                                  So most likely that. I agree that this seems to have very little to do with cars.

                                • aeonik 5 hours ago

                                  CAN, MOST, Flexray, LIN, K-Line were all invented for automotive use.

                                  2 wire Ethernet is also a thing that they spearheaded.

                            • yobid20 5 hours ago

                              3 times I read this and kept thinking "what's this have to do with Power over Ethernet".

                              Nothing. The name is misleading. I thought maybe this was used for their supercharging protocols or something.

                              • iamleppert 2 hours ago

                                How is this better than UDP? Or for that matter, just plain old Ethernet MAC addressing? You can achieve lower latency and higher speed (than this) if you don't care about reliability in your transport layer.

                                This reeks of NIH.

                                • mannyv 28 minutes ago

                                  I worked with a company that wrote its own protocol for Ethernet and got almost wire speed. It was worth it at 10 Mbps, but not worth it at 100 Mbps.

                                  You can always beat general purpose solutions like the TCP/IP/UDP stack if you try. For most it isn’t worth it.

                                  • leetharris an hour ago

                                    Did you even try reading the README?

                                    - TTPoE is designed to be implemented at hardware level unlike UDP

                                    - UDP cannot guarantee transmission whereas this does

                                    - TTPoE is built for distributed resilience