Comments Page - Addition is all you need for energy-efficient language models

Addition is all you need for energy-efficient language models (arxiv.org)
Submitted by InvisibleUp a year ago

shrubble a year ago
I remember that many years ago, when floating point computation was expensive for Intel CPUs to do, there were multiple ways that programmers used integer trickery to work around this.
Chuck Moore of Forth fame demonstrated taking the values, say 1.6 multiplied by 4.1, doing all the intermediate calculations via integers (16 * 41), and then formatting the output by putting the decimal point back in the "right place". This worked as long as the scaled values stayed within the integer range, for instance as long as multiplying by 10 didn't push them past 65535 with 16-bit integers. For embedded chips where, for instance, you have an analog reading with 10 bits of precision that has to be processed many times per second, this worked well.
I also recall talking many years ago with a Microsoft engineer who had worked with the Microsoft Streets and Trips program (https://archive.org/details/3135521376_qq_CD1 for a screenshot) and that they too had managed to fit what would normally be floating point numbers and the needed calculations into some kind of packed integer format with only the precision that was actually needed, that was faster on the CPUs of the day as well as more easily compressed to fit on the CDROM.
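A minimal Python sketch of the scaled-integer multiplication described above. The function names, the one-decimal-digit scale, and the unsigned 16-bit limit are assumptions chosen for illustration; this is not Chuck Moore's code, and it only handles non-negative values.

```python
SCALE = 10           # one decimal digit per operand (assumption for this sketch)
UINT16_MAX = 65535   # largest value an unsigned 16-bit integer can hold


def to_scaled(text: str) -> int:
    """Parse a non-negative string like '1.6' into an integer count of tenths (16)."""
    whole, _, frac = text.partition(".")
    frac = (frac + "0")[:1]  # keep exactly one fractional digit
    return int(whole) * SCALE + int(frac)


def mul_scaled(a: int, b: int) -> int:
    """Multiply two values in tenths; the raw product is in hundredths."""
    product = a * b
    if product > UINT16_MAX:
        raise OverflowError("product does not fit in 16 bits")
    return product


def format_hundredths(value: int) -> str:
    """Put the decimal point back in the right place: 656 -> '6.56'."""
    return f"{value // 100}.{value % 100:02d}"
```

With these helpers, 1.6 becomes 16, 4.1 becomes 41, the integer product is 656, and formatting yields "6.56". The overflow check is what limits the usable input range on a 16-bit machine.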
- dajoh a year ago
  What you're describing is called fixed point arithmetic, a super cool technique I wish more programmers knew about.
  Proper finance related code should use it, but in my experience in that industry it doesn't seem very common unless you're running mainframes.
  Funnily enough, I've seen a lot more fixed point arithmetic in software rasterizers than anywhere else. FreeType, GDI, WPF, WARP (D3D11 reference rasterizer) all use it heavily.
  kccqzy a year ago
  I have worked on firmware that has plenty of fixed point arithmetic. The firmware usually runs on processors without hardware floating point units. For example certain Tesla ECUs use 32-bit integers where they divide it into four bits of integer part and 28 bits of fractional part. So values are scaled by 2^28.
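A sketch of the 4.28 layout described above, in Python. It assumes a signed 32-bit container in which the sign bit counts as one of the four integer bits (range -8 to just under 8); the comment does not say whether the format is signed, and the helper names are hypothetical.

```python
FRAC_BITS = 28
ONE = 1 << FRAC_BITS  # the raw integer that represents 1.0
INT32_MIN, INT32_MAX = -(1 << 31), (1 << 31) - 1


def _check_range(raw: int) -> int:
    """Reject raw values that would not fit in a signed 32-bit register."""
    if not INT32_MIN <= raw <= INT32_MAX:
        raise OverflowError("value does not fit in Q4.28")
    return raw


def to_q4_28(x: float) -> int:
    """Scale a real value by 2^28 and round to the nearest raw integer."""
    return _check_range(round(x * ONE))


def from_q4_28(raw: int) -> float:
    """Convert a raw Q4.28 integer back to a float for display."""
    return raw / ONE


def mul_q4_28(a: int, b: int) -> int:
    """Multiply two Q4.28 values: widen the product, then shift back down."""
    wide = a * b  # would be a 64-bit intermediate in C
    return _check_range(wide >> FRAC_BITS)
```

Addition and subtraction need no rescaling at all, which is part of why the format is attractive on cores without an FPU.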
  phkahler a year ago
  >> The firmware usually runs on processors without hardware floating point units.
  I'm working on control code on an ARM Cortex-M4F. I wrote it all in fixed point because I don't trust an FPU to be faster, and I also like to have a 32-bit accumulator instead of 24-bit. I recently converted it all to floating point since we have the M4F part (the F indicates an FPU), and it's a little slower now. I did get to remove some limit checking since I can rely on the calculations being inside the limits but it's still a little slower than my fixed point implementation.
  sitkack a year ago
  The other great thing about going fixed point is that it doesn't expose you to device specific floating point bugs, making your embedded code way more portable and easier to test.
  32b float on your embedded device doesn't necessarily match your 32b float running on your dev machine.
  bobmcnamara a year ago
  32b float can match your desktop. It really just takes a few compiler flags (like avoiding -funsafe-math-optimizations), setting rounding modes, and not using the 80-bit Intel mode (largely disused after the 64-bit transition).
  sitkack a year ago
  I understand what you are saying ...
  You aren't guaranteed that your microcontroller's float is going to match your desktop. Microcontrollers are riddled with bugs. Unless you need floats, and as long as fixed point is fast enough, my recommendation is still to use fixed point if the application is high reliability.
  Esp if your code needs to be portable across arm, risc-v, etc.
  bobmcnamara a year ago
  Many microcontrollers today, including ARM, RISC-V, and Xtensa have IEEE compliant FPUs or libms available. Same numeric format, same rounding, same result.
  Fixed point isn't bad at all, just often slower when a compliant FPU is available.
  Someone a year ago
  > IEEE compliant FPUs or libms available. Same numeric format, same rounding, same result.
  IEEE only mandates results within ½ ULP (= best possible) for basic operations such as addition, subtraction, multiplication, division, and reciprocal.
  For many other ones such as trigonometric functions, exponential and logarithms, results can (and do) vary between conforming implementations.
  https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.h...:
  “The IEEE standard does not require transcendental functions to be exactly rounded because of the table maker's dilemma. To illustrate, suppose you are making a table of the exponential function to 4 places. Then exp(1.626) = 5.0835. Should this be rounded to 5.083 or 5.084? If exp(1.626) is computed more carefully, it becomes 5.08350. And then 5.083500. And then 5.0835000. Since exp is transcendental, this could go on arbitrarily long before distinguishing whether exp(1.626) is 5.083500...0ddd or 5.0834999...9ddd. Thus it is not practical to specify that the precision of transcendental functions be the same as if they were computed to infinite precision and then rounded. Another approach would be to specify transcendental functions algorithmically. But there does not appear to be a single algorithm that works well across all hardware architectures. Rational approximation, CORDIC, and large tables are three different techniques that are used for computing transcendentals on contemporary machines. Each is appropriate for a different class of hardware, and at present no single algorithm works acceptably over the wide range of current hardware.”
  jcranmer a year ago
  IEEE 754-2019 says for the transcendental functions (the ones in §9.2):
  > A conforming operation shall return results correctly rounded for the applicable rounding direction for all operands in its domain.
  so all of them are supposed to be correctly rounded. I think IEEE 754-2008 also requires correct rounding, but I don't have that spec in front of me right now.
  In practice, they're not correctly rounded--the C specification explicitly disclaims the need for them to be (§F.3¶20), reserving the cr_ prefix for future mandatory correctly-rounded variants.
  Someone a year ago
  Thanks! Reading https://grouper.ieee.org/groups/msc/ANSI_IEEE-Std-754-2019/b... (“It is believed that any existing implementation of 754-2008 conforms to 754-2019”) it seems IEEE 754-2008 also required it.
  Even with that (and ignoring C’s “we don’t support that”), it still can be hard to write C code that provides identical results on all platforms. For example, I don’t think much code uses float_t or double_t or checks FLT_EVAL_METHOD (https://en.cppreference.com/w/c/types/limits/FLT_EVAL_METHOD)
  jcranmer a year ago
  So the things you mention aren't useful for getting consistent numerical results. You really have to start getting into obscure platforms like mainframes to find stuff where float and double aren't IEEE 754 single and double precision, respectively. FLT_EVAL_METHOD is largely only relevant if you're working on 32-bit x86 code, and even then, you can sidestep those problems if you're willing to require that hardware be newer than 20 years old or so.
  The actual thing you need to do for consistency is to be extremely vigilant in the command line options you use, and also bring your own math library implementations rather than using the standard library. You also need vigilance in your dependencies, for somebody deciding to enable denormal flushing screws everybody in the same process.
  bobmcnamara a year ago
  Ah, I've had a slightly different task many times: porting a high level algorithm from MATLAB or labview or keras to C.
  As part of this I construct a series of test inputs, and confirm that they are bitwise equivalent to the high level language. It's usually as simple as aligning the rounding mode, disabling fused MAC, and a few other compiler flags that shouldn't be project defaults.
  The other fun part is using the vector unit - for that we have to define IEEE arithmetic in the order the embedded device does it(usually 4x or 8x interleaved), port that back up, and verify.
  Never did use a whole lot of transcendentals - maybe due to the domains I worked in.
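A sketch of the bitwise comparison step in such a porting workflow, assuming the reference outputs and the port's outputs are both available as sequences of doubles. The helper names are hypothetical. Comparing raw bit patterns rather than using `==` is deliberate: it distinguishes +0.0 from -0.0 and treats identical NaN payloads as equal.

```python
import struct
from typing import Sequence


def float64_bits(x: float) -> int:
    """Return the raw IEEE 754 binary64 bit pattern of x as an integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def float32_bits(x: float) -> int:
    """Round x to binary32 and return that bit pattern as an integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bitwise_equal(expected: Sequence[float], actual: Sequence[float]) -> bool:
    """True only if both sequences have identical binary64 bit patterns."""
    return len(expected) == len(actual) and all(
        float64_bits(e) == float64_bits(a) for e, a in zip(expected, actual)
    )
```

A check like `bitwise_equal([0.3], [0.1 + 0.2])` fails, which is the kind of one-bit difference a changed rounding mode or a fused multiply-add tends to introduce.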
  cpeterso a year ago
  Why did you decide to convert your code to floating point if your fixed point implementation was faster and already written?
  gatane a year ago
  Are there any good benchmarks for float vs fixed point, especially for ARM systems?
  evoke4908 a year ago
  Just look at the instruction set for your particular CPU. Every CPU is different, but in most architectures I've seen, floating point operations are 2-3 times slower for the same word size.
  Single float adds are usually 2 or 3 CPU cycles while single-word integer adds are usually 1 cycle.
  Again, this is extremely dependent on the particular CPU you have. Some architectures do have single-cycle FPU operations, but it's not very common in microcontrollers as far as I can tell.
  EasyMark a year ago
  That would vary wildly with the ARM chip you are talking about. I would say figure out which ARM you’re interested in and go down the rabbit hole from there.
  aatd86 a year ago
  What do they use? Not float I hope. Plus given that some currencies have different precisions... Don't tell me it's rounding errors over trillion monies?! :o)
  Maxatar a year ago
  As I indicate in another post, I work in finance and I use binary floats. So do a lot of others who work in the industry. I sympathize with people who think that IEEE floating points are some weird or error prone representation and that fixed point arithmetic solves every problem, but in my professional experience that isn't true and systems that start by using fixed point arithmetic eventually end up making a half-assed error prone and slow version of floating point arithmetic as soon as they need to handle more sophisticated use cases like handling multiple currencies, doing calculations involving percentages such as interest rates, etc etc...
  The IEEE 754 floating point standard is a very well thought out standard that is suitable for representing money as-is. If you have requirements such as compliance/legal/regulatory needs that mandate a minimum precision, then you can either opt to use decimal floating point or use binary floating point where you adjust the decimal place up to whatever legally required precision you are required to handle.
  For example the common complaint about binary floating point is that $1.10 can't be represented exactly so you should instead use a fixed integer representation in terms of cents and represent it as 110. But if your requirement is to be able to represent values exactly to the penny, then you can simply do the same thing but using a floating point to represent cents and represent $1.10 as the floating point 110.0. The fixed integer representation conveys almost no benefit over the floating point representation, and once you need to work with and mix currencies that are significantly out of proportion to one another, you begin to really appreciate the nuances and work that went into IEEE 754 for taking into account a great deal of corner cases that a fixed integer representation will absolutely and spectacularly fail to handle.
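A small Python check of the representability claim above, using the standard `decimal` module to expose the exact value a binary double holds. The helper name is hypothetical. It only illustrates that whole cents are exact in a double; it says nothing about fractional cents produced by later arithmetic.

```python
from decimal import Decimal


def is_exact(value: float, text: str) -> bool:
    """True if the double `value` holds exactly the decimal number in `text`."""
    return Decimal(value) == Decimal(text)


# Dollars as a binary double: 1.10 is only approximated.
dollars_exact = is_exact(1.10, "1.10")

# Cents as a binary double: 110.0 is an integer, so it is stored exactly.
cents_exact = is_exact(110.0, "110")

# Sums of whole cents stay exact while they remain below 2**53.
total_cents = sum([110.0] * 1000)
```

Here `dollars_exact` is False, `cents_exact` is True, and `total_cents` equals 110000.0 exactly.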
  rstuart4133 a year ago
  > I work in finance and I use binary floats.
  I build cash registers, and I avoid floats like the plague.
  I think the difference is where you need an exact result. Auditors have forced me to go through a year's transactions to find a 1 cent error. They were right - at one point we weren't handling the fractional cents correctly. After we found that, the bug was fixed. Had we been using floating point our answer would have been "shrug, if it's a problem choose another vendor".
  You are working in finance so I suspect a 0.00001% error doesn't matter to you. Usually it doesn't. But occasionally, proofs of correctness are important. They can demonstrate for example that one of your programmers isn't ripping you off by rounding (0, 0.5) to zero instead of (0, 0.5] and stealing the resulting cents. People have gone to jail for doing exactly that. Which is why a good auditor can get very picky about finding a 1 cent error. He doesn't care about the value of that 1c any more than you do. What he cares about greatly is that a machine whose job is to add up numbers reliably apparently can't get basic arithmetic right.
  Programmers with battle scars from working in that environment are sick and tired of being told by others how much easier floats are to use 99.9999% of the time. Believe me, they know.
  kbolino a year ago
  There are more problems with using floating-point for exact monetary quantities than just the inexact representations of certain quantities which are exact in base 10. For example, integers have all of the following advantages over floats:
  Integer arithmetic will never return NaN or infinity.
  Integer (a*b)*c will always equal a*(b*c).
  Integer (a+b)%n will always equal (a%n+b%n)%n, i.e. low-order bits are always preserved.
  IEEE 754 is not bad and shouldn't be feared, but it is not a universal solution to every problem.
  It's also not hard to multiply by fractions in fixed-point. You do a widening multiplication by the numerator followed by a narrowing division by the denominator. For percentages and interest rates etc., you can represent them using percentage points, basis points, or even parts-per-million depending on the precision you need.
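A sketch of the widening-multiply, narrowing-divide step described above, with the rate held in basis points and the amount in integer cents. The function name and the round-half-up policy are assumptions; a real system would use whatever rounding rule its regulations specify. Python integers are unbounded, so the "widening" is implicit here; in a fixed-width language the intermediate would need twice the operand width.

```python
BPS_DENOMINATOR = 10_000  # 1 basis point = 1/10,000


def apply_rate_bps(amount_cents: int, rate_bps: int) -> int:
    """Return amount * rate in cents, rounded half up.

    Both inputs are assumed to be non-negative integers.
    """
    wide = amount_cents * rate_bps  # widening multiply by the numerator
    # Narrowing divide by the denominator, with half a unit added for rounding.
    return (wide + BPS_DENOMINATOR // 2) // BPS_DENOMINATOR
```

For example, 7.25% (725 basis points) of 110 cents is 7.975 cents, which this returns as 8.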
  Maxatar a year ago
  >Integer arithmetic will never return NaN or infinity.
  I use C++ and what integer arithmetic will do in situations where floating point returns NaN is undefined behavior.
  I prefer the NaN over undefined behavior.
  >Integer (a*b)*c will always equal a*(b*c).
  In every situation where an integer will do that, a floating point will do that as well. Floating point numbers behave like integers for integer values, the only question is what do you do for non-integer values. My argument is that in many if not most cases you can apply the same solution you would have applied using integers to floating points and get an even more robust, flexible, and still high performance solution.
  >For percentages and interest rates etc., you can represent them using percentage points, basis points, or even parts-per-million depending on the precision you need.
  And this is precisely when people end up reimplementing their own ad-hoc floating point representation. You end up deciding and hardcoding what degree of precision you need to use depending on assumptions you make beforehand and having to switch between different fixed point representations and it just ends up being a matter of time before someone somewhere makes a mistake and mixes two close fixed point representations and ends up causing headaches.
  With floating point values, I do hardcode a degree of precision I want to guarantee, which in my case is 6 decimal places, but in certain circumstances I might perform operations or work with data that needs more than 6 decimal places and using floating point values will still accommodate that to a very high degree whereas the fixed arithmetic solution will begin to fail catastrophically.
  kbolino a year ago
  C++ is no excuse; it has value types and operator overloading. You can write your own types and define your own behavior, or use those already provided by others. Even if you insist on using raw ints (or just want a safety net), there's compiler flags to define that undefined behavior.
  Putting everything into floats as integers defeats the purpose of using floats. Obviously you will want some fractions at some point and then you will have to deal with that issue, and the denominator of those fractions being a power of 2 and not a power of 10. Approximation is good enough for some things, but not others. Accounts and ledgers are definitely in the latter category, even if lots of other financial math isn't.
  You always need to be mindful of your operating precision and scale. Even double-precision floats have finite precision, though this won't be a huge issue until you've compounded the results of many operations. If you use fixed-point and have different denominators all over the place, then it's probably time to break out rational numbers or use the type system to your advantage. You will know the precision and scale of types called BasisPoints or PartsPerMillion or Fixed6 because it's in the name and is automatically handled as part of the operations between types.
  fluoridation a year ago
  >I use C++ and what integer arithmetic will do in situations where floating point returns NaN is undefined behavior. I prefer the NaN over undefined behavior.
  Really? IME it's much more difficult to debug where a NaN value came from, since it's irreversible and infectious. And although the standard defines which integer operations should have undefined behavior, usually the compiler just generates code that behaves reasonably. Like, you can take INT_MAX and then increment and decrement it and get INT_MAX back.
  (That does mean that you're left with a broken program that works by accident, but hey, the program works.)
  bee_rider a year ago
  Are there cases where float could return a NaN or infinity, where you instead prefer the integer result? That seems a little odd to me.
  estebarb a year ago
  Most people would love their bank accounts to underflow.
  kbolino a year ago
  Integer division by zero will raise an exception in most modern languages.
  Integer overflow is more problematic. While some languages in some situations will raise exceptions, most don't. While it's easier to detect overflow that has already occurred with floats (though you'll usually have lost low-order bits long before you get infinity), it's easier to avoid overflow in the first place with integers.
  vidarh a year ago
  It really depends on your need. In some countries e.g. VAT calculations used to specify rounding requirements that were a pain to guarantee with floats. I at one point had our CFO at the time breathing down my neck while I implemented the VAT calculations while clutching a printout of the relevant regulations on rounding because in theory he could end up a defendant in a court case if I got it wrong (in practice not so much, but it spooked him enough that it was the one time he paid attention to what I was up to). Many tax authorities are now more relaxed, as long as your results average out in their favour, but there's a reason for this advice.
  bobmcnamara a year ago
  > if your requirement is to be able to represent values exactly to the penny, then you can simply do the same thing but using a floating point to represent cents and represent $1.10 as the floating point 110.0.
  Not if you need to represent more than about 170 kilo dollars.
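The figure above matches the exact-integer limit of a 32-bit float, so this sketch assumes single precision is what was meant: 2^24 cents is 167,772.16 dollars. The helper name is hypothetical. With a 64-bit double the corresponding limit is 2^53 cents, which is far larger.

```python
import struct

FLOAT32_EXACT_INT_LIMIT = 2 ** 24  # every integer up to here is exact in binary32


def to_float32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# 16,777,216 cents survives the round trip; one cent more does not.
at_limit = to_float32(float(FLOAT32_EXACT_INT_LIMIT))
past_limit = to_float32(float(FLOAT32_EXACT_INT_LIMIT + 1))
limit_in_dollars = FLOAT32_EXACT_INT_LIMIT / 100
```

Past the limit, adjacent float32 values are two cents apart, so odd cent counts can no longer be stored.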
  fluoridation a year ago
  The industry standard in finance is decimal floating point. C# for example has 'decimal', a 128-bit decimal floating point type.
  On occasion I've seen people who didn't know any better use floats. One time I had to fix errors of single satoshis in a customer's database because their developer used 1.0 to represent 1 BTC.
  EGreg a year ago
  Smart contracts on EVM and other blockchains all use fixed point, for the simple reason that all machines have to get exactly the same result.
  myst a year ago
  Every half-competent software engineer knows about fixed point arithmetic, my friend.
  phkahler a year ago
  >> Every half-competent software engineer...
  You meant 8192/16384 right? I like q14.
- andrewla a year ago
  I recall playing with FRACTINT, which was a fractal generator that existed before floating point coprocessors were common, that used fixed point math to calculate and display fractals. That was back when fractals were super cool and everyone wanted to be in the business of fractals, and all the Nobel Prizes were given out to fractal researchers.
- touisteur a year ago
  Ozaki has been doing fp64 matrix-multiplication using int8 tensor cores
  https://arxiv.org/html/2306.11975v4
  Interesting AF.
- candiddevmike a year ago
  AFAIK this is still the best way to handle money/financial numbers.
  amanda99 a year ago
  That's got nothing to do with perf tho.
  Maxatar a year ago
  Nothing to do with perf is a strong claim. If you genuinely don't care about performance you can use an arbitrary-precision rational number representation.
  But performance often matters, so you trade off precision for performance. I think people are wrong to dismiss floating point numbers in favor of fixed point arithmetic, and I've seen plenty of fixed point arithmetic that has failed spectacularly because people think if you use it, it magically solves all your problems...
  Whatever approach you take other than going all in with arbitrary precision fractions, you will need to have a good fundamental understanding of your representation and its trade-offs. For me personally I use floating point binary and adjust the decimal point so I can exactly represent any value to 6 decimal places. It's a good trade-off between performance, flexibility, and precision.
  It's also what the main Bitcoin implementation uses.
  fluoridation a year ago
  Huh? Bitcoin uses integers. The maximum supply of BTC in satoshis fits in 64 bits. JS implementations that need to handle BTC amounts use doubles, but only by necessity, since JS doesn't have an integer type. They still use the units to represent satoshis, which works because the maximum supply also fits in 53 bits, so effectively they're also using integers.
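A quick arithmetic check of the 53-bit claim above, taking the widely cited cap of 21 million BTC and 100,000,000 satoshis per BTC as given. The helper name is hypothetical.

```python
MAX_BTC = 21_000_000
SATOSHIS_PER_BTC = 100_000_000
MAX_SATOSHIS = MAX_BTC * SATOSHIS_PER_BTC  # 2.1e15

DOUBLE_EXACT_INT_LIMIT = 2 ** 53  # doubles represent every integer up to this


def fits_exactly_in_double(n: int) -> bool:
    """True if the integer n survives a round trip through a binary64 float."""
    return abs(n) <= DOUBLE_EXACT_INT_LIMIT and int(float(n)) == n
```

The maximum supply is about 2.1e15 satoshis, while 2^53 is about 9.0e15, so every valid amount is an exactly representable integer even when stored in a double.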
  Anyone who uses binary floating point operations on monetary values doesn't know what they're doing and is asking for trouble.
  wbl a year ago
  So if I want to price a barrier in Bermudan rainbow via Monte Carlo I should take the speed hit for a few oddball double rounding problems that are pennies?
  fluoridation a year ago
  I mean, you do you. People generally don't complain if you're a couple hundred nanoseconds (if that) late. They do complain if your accounts don't add up by a single penny.
  wbl a year ago
  The quoting of something exotic like this is not well defined to the penny. It's transactions where people really care about pennies.
- dwattttt a year ago
  That particular trick is known as fixed point arithmetic (not to be confused with a fixed point of a function)
- asadalt a year ago
  This is still true for many embedded projects. For example, the Pi Pico (RP2040) uses a table.
- kragen a year ago
  Sure, FRACTINT is called FRACTINT because it uses fixed-point ("integer") math. And fixed-point math is still standard in Forth; you can do your example in GForth like this:
    : organize; gforth
    Gforth 0.7.3, Copyright (C) 1995-2008 Free Software Foundation, Inc.
    Gforth comes with ABSOLUTELY NO WARRANTY; for details type `license'
    Type `bye' to exit
    : %* d>s 10 m*/ ; : %. <# # [char] . hold #s #> type ;  ok
    1.6 4.1 %* %. 6.5 ok
  Note that the correct answer is 6.56, so the result 6.5 is incorrectly rounded. Here's how this works.
  (If you're not familiar with Forth, Forth's syntax is that words are separated by spaces. "ok" is the prompt, ":" defines a subroutine terminated with ";", and you use RPN, passing parameters and receiving results on a stack.)
  In standard Forth, putting a decimal point in a number makes it a double-precision number, occupying two cells on the stack, and in most Forths the number of digits after the decimal point is stored (until the next number) in the non-standardized variable dpl, decimal point location. Here I've just decided that all my numbers are going to have one decimal place. This means that after a multiplication I need to divide by 10, so I define a subroutine called %* to do this operation. (Addition and subtraction can use the standard d+ and d- subroutines; I didn't implement division, but it would need to pre-multiply the dividend by the scale factor 10.)
  "%*" is defined in terms of the standard subroutine m*/, which multiplies a double-precision number by a single-precision number and divides the result by a divisor, and the standard subroutine d>s, which converts a double-precision number to a single-precision number. (There's probably a better way to do %*. I'm no Forth expert.)
  I also need to define a way to print out such numbers, so I define a subroutine called "%.", using Forth's so-called "pictured numeric output", which prints out an unsigned double-precision number inserting a decimal point in the right place with "hold", after printing out the least significant digit. (In PNO we write the format backwards, starting from the least significant digit.) The call to "type" types out the formatted number from the hold space used by PNO.
  Then I invoked %* on 1.6 and 4.1 and %. on its result, and it printed out 6.5 before giving me the "ok" prompt.
  If you want to adapt this to use two decimal places:
    : %* d>s 100 m*/ ; : %. <# # # [char] . hold #s #> type ; redefined %*  redefined %.   ok
    1.60 4.10 %* %. 6.56 ok
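For readers who don't know Forth, here is a rough Python analogue of the two sessions above. It is a sketch for non-negative values only, with hypothetical function names, and it is not a translation of how Gforth implements m*/. The third function is an addition of mine that shows how adding half the scale before dividing turns truncation into rounding.

```python
def fixed_mul(a: int, b: int, scale: int) -> int:
    """Multiply two scaled integers and divide by the scale, truncating."""
    return (a * b) // scale


def fixed_mul_rounded(a: int, b: int, scale: int) -> int:
    """Same as fixed_mul, but rounds half up instead of truncating."""
    return (a * b + scale // 2) // scale


def fixed_format(value: int, decimals: int) -> str:
    """Insert the decimal point `decimals` digits from the right."""
    scale = 10 ** decimals
    return f"{value // scale}.{value % scale:0{decimals}d}"
```

With one decimal place, 16 times 41 divided by 10 truncates to 65 and prints as 6.5; the rounded variant gives 66, or 6.6. With two decimal places, 160 times 410 divided by 100 is exactly 656, which prints as 6.56.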
  Note, however, that a fixed-point multiplication still involves a multiplication, requiring potentially many additions, not just an addition. The paper, which I haven't read yet, is about how to approximate a floating-point multiplication by using an addition, presumably because in multiplication you add the mantissas, or maybe using a table of logarithms.
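As an illustration of the general idea of replacing a floating-point multiply with an integer add, here is the classic logarithmic trick usually attributed to Mitchell: adding the raw bit patterns of two floats adds their exponents and approximately adds their logarithms. This is not the algorithm from the paper under discussion. It is a sketch for positive, normal float32 inputs whose product stays in range, and the function names are hypothetical.

```python
import struct

FLOAT32_ONE_BITS = 0x3F800000  # bit pattern of 1.0f; removes the doubled bias


def f32_bits(x: float) -> int:
    """Bit pattern of x after rounding to binary32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_f32(bits: int) -> float:
    """Interpret a 32-bit pattern as a binary32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def approx_mul(a: float, b: float) -> float:
    """Approximate a * b with one integer addition and one subtraction.

    Only valid for positive normal inputs whose product is also a normal
    float32; other inputs give garbage or raise struct.error.
    """
    return bits_f32(f32_bits(a) + f32_bits(b) - FLOAT32_ONE_BITS)
```

The result is exact when either operand is a power of two and underestimates otherwise, by at most about 11% (for example 1.5 times 1.5 comes out as 2.0 instead of 2.25).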
  Forth's approach to decimal numbers was a clever hack for the 01970s and 01980s on sub-MIPS machines with 8-bit and 16-bit ALUs, where you didn't want to be invoking 32-bit arithmetic casually, and you didn't have floating-point hardware. Probably on 32-bit machines it was already the wrong approach (a double-precision number on a 32-bit Forth is 64 bits, which is about 19 decimal digits) and clearly it is on 64-bit machines, where you don't even get out of the first 64-bit word until that many digits:
    0 1 %. 184467440737095516.16 ok
  GForth and other modern standard Forths do support floating-point, but for backward compatibility, they treat input with decimal points as double-precision integers.
visarga a year ago
> can potentially reduce 95% energy cost by elementwise floating point tensor multiplications and 80% energy cost of dot products
If this were about convolutional nets then optimizing compute would be a much bigger deal. Transformers are lightweight on compute and heavy on memory. The weakest link in the chain is fetching the model weights into the cores. The 95% and 80% energy reductions cited are for the multiplication operations in isolation, not for the entire inference process.
- woadwarrior01 a year ago
  Pre-fill (even in the single batch case) and multi-batch decoding are still compute dominated. The oft repeated trope of "decoder only transformer inference is bottle-necked on memory bandwidth" is only strictly true in the single batch decoding case, because you're mostly doing vector matrix mults when the batch size is one.
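A back-of-the-envelope roofline sketch of why batch size decides which regime you are in. It counts only weight traffic for a single matrix multiply and ignores activations, KV cache, and attention, so it is a simplification. The function names are hypothetical and any hardware figures passed in are the caller's assumptions.

```python
def arithmetic_intensity(batch_size: int, bytes_per_weight: float) -> float:
    """FLOPs per byte of weights read for y = x @ W.

    For W of shape (d_in, d_out) and a batch of B input rows, the work is
    2 * B * d_in * d_out FLOPs and the weight traffic is
    d_in * d_out * bytes_per_weight bytes, so the matrix size cancels out.
    """
    return 2.0 * batch_size / bytes_per_weight


def is_compute_bound(
    batch_size: int,
    bytes_per_weight: float,
    peak_flops: float,
    peak_bytes_per_s: float,
) -> bool:
    """Compare the intensity with the hardware's FLOPs-per-byte ridge point."""
    ridge = peak_flops / peak_bytes_per_s
    return arithmetic_intensity(batch_size, bytes_per_weight) >= ridge
```

With a hypothetical accelerator at 100 TFLOP/s and 1 TB/s (ridge point 100 FLOPs per byte) and 2-byte weights, batch size 1 gives an intensity of 1 and is memory bound, while 256 rows per weight read gives 256 and is compute bound.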
  ein0p a year ago
  Not even single batch. If you want reasonable latency per token (TPOT) even larger batches do not give you high compute utilization during extend. It’s only when you don’t care about TPOT at all, and your model is small enough to leave space for a large batch on an 8 GPU host, that’s when you could get decent utilization. That’s extend only - it’s easy to get high utilization in prefill.
- SuchAnonMuchWow a year ago
  It's worse than that: the energy gains are when comparing computations made with fp32, but for fp8 the multipliers are really tiny and the adders/shifters represent a larger part of the operators (energy-wise and area-wise), so this paper will only have small gains.
  On fp8, the estimated gate count of fp8 multipliers is 296 vs. 157 with their technique, so the power gain on the multipliers will be much lower (50% would be a more reasonable estimation), but again for fp8 the additions in the dot products are a large part of the operations.
  Overall, it's really disingenuous to claim 80% power gain and small drop in accuracy, when the power gain is only for fp32 operations and the small drop in accuracy is only for fp8 operators. They don't analyze the accuracy drop in fp32, and don't present the power saved for fp8 dot product.
  bobsyourbuncle a year ago
  I’m new to neural nets, when should one use fp8 vs fp16 vs fp32?
  reissbaker a year ago
  Basically no one uses FP32 at inference time. BF16/FP16 is typically considered unquantized, whereas FP8 is lightly quantized. That being said there's pretty minimal quality loss at FP8 compared to 16-bit typically; Llama 3.1 405b, for example, only benchmarks around 1% worse when run at FP8: https://blog.vllm.ai/2024/07/23/llama31.html
  Every major inference provider other than Hyperbolic Labs runs Llama 3.1 405b at FP8, FWIW (e.g. Together, Fireworks, Lepton), so to compare against FP32 is misleading to say the least. Even Hyperbolic runs it at BF16.
  Pretraining is typically done in FP32, although some labs (e.g. Character AI, RIP) apparently train in INT8: https://research.character.ai/optimizing-inference/
  tarasglek a year ago
  SambaNova does bf16
  ericlewis a year ago
  The higher the precision, the better. Use what works within your memory constraints.
  jasonjmcghee a year ago
  With serious diminishing returns. At inference time, there's no reason to use fp64 and you should probably use fp8 or less. The accuracy loss is far less than you'd expect. AFAIK Llama 3.2 3B at fp4 will outperform Llama 3.2 1B at fp32 in accuracy and speed, despite the 1B model having 8x the bits per weight.
- lifthrasiir a year ago
  I'm also sure that fp8 is small enough that multiplication can really be done in a much simpler circuit than larger fp formats. Even smaller formats like fp4 would be able to just use a lookup table, and that makes them more like sort-of-standardized quantization schemes.
  tankenmate a year ago
  i suspect that you could do fp8 with log tables and interpolation if you really wanted to (compared to the memory required for the model it's peanuts), it just turns into a LUT (log table look up) and bit shift (interpolation). so again, memory bandwidth is the limiting factor for transformers (as far as energy is concerned).
  lifthrasiir a year ago
  This time though LUT exists in a circuit, which is much more efficient than typical memory lookup. Such LUT would have to exist per each ALU though, so it can't be too large.
  brilee a year ago
  fp4/fp8 for neural networks don't work the way you think they do - they are merely compression formats - a set of, say, 256 fp32 weights from 1 neuron are lossily turned into 1 max value (stored in fp32 precision) and 256 fp4/fp8 numbers. Those compressed numbers are multiplied by the fp32 number at runtime to restore the original weights and full fp32 multiplication + additions are executed.
  lifthrasiir a year ago
  You are correct that the accumulation (i.e. additions in dot products) has to be done in a higher precision, however the multiplication can still be done via LUT. (Source: I currently work at a hardware-accelerated ML hardware startup.)
  SuchAnonMuchWow a year ago
  The goal of this type of quantization is to move the multiplication by the fp32 rescale factor outside of the dot-product accumulation.
  So the multiplications+additions are done on fp8/int8/int4/whatever (when the hardware supports those operators of course) and accumulated in a fp32 or similar, and only the final accumulator is multiplied by the rescale factor in fp32.
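A sketch of the scheme described above, using symmetric int8 quantization with one scale per block. The function names, the int8 choice, and the max-abs scaling rule are assumptions for illustration; real kernels differ in block size, number format, and accumulator type.

```python
from typing import List, Sequence, Tuple


def quantize_block(values: Sequence[float]) -> Tuple[float, List[int]]:
    """Return (scale, codes) such that values[i] is roughly scale * codes[i]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    return scale, [round(v / scale) for v in values]


def quantized_dot(
    scale_w: float, codes_w: Sequence[int], scale_x: float, codes_x: Sequence[int]
) -> float:
    """Dot product with narrow multiplies and a single rescale at the end."""
    acc = 0  # wide accumulator; int32 or fp32 on real hardware
    for w, x in zip(codes_w, codes_x):
        acc += w * x  # low-precision multiply, high-precision add
    # The rescale factor is applied once, outside the accumulation loop.
    return acc * (scale_w * scale_x)
```

For weights [0.5, -1.0, 0.25] and inputs [1.0, 2.0, -4.0] the exact dot product is -2.5, and the quantized version lands within a few hundredths of that while performing only one high-precision multiplication.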
  rajnathani a year ago
  That's how Nvidia's mixed precision training worked with FP32-FP16, but it isn't the case for Bfloat16 on TPUs and maybe (I'm not sure) FP8 training on Nvidia Hopper GPUs.
  imjonse a year ago
  With w8a8 quantization the hw (>= hopper) can do the heavy math in fp8 twice as fast as fp16.
  bee_rider a year ago
  What is fp4? 3 bits of exponent and one of mantissa?
  wruza a year ago
  SEEM (sign, exp, mantissa)
  bee_rider a year ago
  Interesting… I guess it must be biased, m*2^ee would leave like half of the limited space wasted, so 1.m*2^ee?
  I always wonder with these tiny formats if 0 should even be represented…
  wruza a year ago
  I’m not a binary guy that much, but iirc all floats are 1.m*2^e — “1.” is always there except for subnormals. There’s also SEEE FP4 which is basically +-2^([u?]int3).
  https://medium.com/@harrietfiagbor/floating-points-and-deep-...
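As a sketch of how a SEEM code could decode, assuming an exponent bias of 1 and IEEE-style subnormals (this matches the E2M1 layout as I understand it; the thread itself does not specify the bias, and the function name is hypothetical):

```python
def decode_fp4_e2m1(code, bias=1):
    """Decode a 4-bit SEEM code: 1 sign bit, 2 exponent bits, 1 mantissa bit."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exponent = (code >> 1) & 0b11
    mantissa = code & 1
    if exponent == 0:
        # Subnormal: no implicit leading 1, which is what makes 0 representable.
        return sign * (mantissa / 2) * 2.0 ** (1 - bias)
    # Normal: implicit leading 1, i.e. 1.m * 2^(e - bias).
    return sign * (1 + mantissa / 2) * 2.0 ** (exponent - bias)


positive_values = [decode_fp4_e2m1(code) for code in range(8)]
```

Under these assumptions the eight non-negative codes decode to 0, 0.5, 1, 1.5, 2, 3, 4 and 6, so zero costs one code (two counting the negative zero) out of sixteen.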
- api a year ago
  Sounds like the awesome architecture for transformers would be colocation of memory and compute.
  Joker_vD a year ago
  Yes, that's why we generally run them on GPUs.
  phkahler a year ago
  That's why we need a row of ALUs in RAM chips. Read a row of DRAM and use it in a vector operation. With the speed of row reading, the ALU could take many cycles per operation to limit area.
  namibj a year ago
  The big problem is that DRAM is extremely secretive about their processes, and they largely don't do that well for logic.
  moffkalast a year ago
  GPUs that pull a kilowatt when running yes. This might actually work on an FPGA if the addition doesn't take too many clock cycles compared to matmuls which were too slow.
  api a year ago
  GPUs are better but I'm thinking of even tighter coupling, like an integrated architecture.
- imjonse a year ago
  That is true for single user/light inference only. For training and batch inference you can get compute bound fast enough.
  saagarjha a year ago
  That really depends on what you're doing. Trying to feed a tensor core is pretty hard–they're really fast.
- kendalf89 a year ago
  Maybe this technique can be used for training then since that is a lot more compute intensive?
- mikewarot a year ago
  Imagine if you had a systolic array large enough that all the weights would only have to be loaded once at startup. Eliminating the memory-compute bottleneck of the von Neumann architecture could make this quite a bit more efficient.
- h_tbob a year ago
  Bro... they are NOT lightweight on compute!
tantalor a year ago
[2023] GradIEEEnt half decent: The hidden power of imprecise lines
http://tom7.org/grad/murphy2023grad.pdf
Also in video form: https://www.youtube.com/watch?v=Ae9EKCyI1xU
- dang a year ago
  GradIEEEnt half decent: The hidden power of imprecise lines [video] - https://news.ycombinator.com/item?id=36806970 - July 2023 (9 comments)
  GradIEEEnt half decent - https://news.ycombinator.com/item?id=35780921 - May 2023 (32 comments)
- indrora a year ago
  I had hoped that they would reference this in their paper as some kind of "supporting previous exploration" but no, alas.
js8 a year ago
Haven't read it, but isn't this just logarithmic tables in some form?
I am asking not to dismiss it, I genuinely feel I don't understand logarithms on a fundamental level (of logic gates etc.). If multiplication can be replaced with table lookup and addition, then there has to be a circuit that gives you difficult addition and easy multiplication, or any combination of those tradeoffs.
- sabhiram a year ago
  Log space is nice, multiplication can be replaced by addition.
  This part is easy and anyone can implement hardware to do this. The tricky bit is always the staying in log space while doing accumulations, especially ones across a large range.
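A small sketch of why accumulation is the awkward part, assuming positive values only and using `math.log2` where hardware would use a lookup table (the `lns_*` names are invented for the example):

```python
import math


def lns_encode(x):
    """Represent a positive number by its base-2 logarithm."""
    return math.log2(x)


def lns_decode(log_x):
    return 2.0 ** log_x


def lns_mul(log_a, log_b):
    """Multiplication in log space is a plain addition."""
    return log_a + log_b


def lns_add(log_a, log_b):
    """Addition in log space needs a nonlinear correction term.

    log2(2^hi + 2^lo) = hi + log2(1 + 2^(lo - hi)); the second term is
    what a hardware design would have to tabulate or approximate.
    """
    hi, lo = max(log_a, log_b), min(log_a, log_b)
    return hi + math.log2(1.0 + 2.0 ** (lo - hi))


# Dot product 3*4 + 0.5*10 computed without leaving log space.
a, b, c, d = (lns_encode(v) for v in (3.0, 4.0, 0.5, 10.0))
dot = lns_decode(lns_add(lns_mul(a, b), lns_mul(c, d)))
```

The multiply is a single add, but every accumulation step pays for the correction term, and its useful range grows with the dynamic range of the values being summed.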
- pclmulqdq a year ago
  Yes, this is logarithmic number systems at work.
cpldcpu a year ago
It puzzles me that there does not seem to be a proper derivation and discussion of the error term in the paper. It's all treated indirectly by way of inference results.
- Lerc a year ago
  The paper has an odd feel about it to me too. Doing a gate estimation as a text explanation without a diagram makes it too easy to miss some required part. It wouldn't need to be a full gate level explanation but blocks labeled 'adder'.
  Seeing the name de Vries in the first paragraph didn't help my sense of confidence either.
  brcmthrowaway a year ago
  Because of the twisted mentat?
  Lerc a year ago
  No, more because of things like
  http://blog.zorinaq.com/bitcoin-electricity-consumption/
  It's a long read to go over multiple years worth of posts and comments but gives you a measure of the man.
pjc50 a year ago
"We recommend training and hosting L-Mul-based models on devices integrated with specialized architectural designs. Patent pending"
(from footnote in method section)
CGamesPlay a year ago
I believe this reduces the compute required, but still uses 8 bits per value, so it does not reduce the memory required to run inference, so it doesn’t particularly make the models more accessible for inference. Is this storage method suitable for training? That could potentially be an interesting application.
- Manabu-eo a year ago
  It actually is about 0.5 bits less efficient per weight in terms of precision/range, something the paper never highlights.
ein0p a year ago
More than 10x the amount of energy is spent moving bytes around. Compute efficiency is not as big of an issue as people think. It’s just that the compute is in the wrong place now - it needs to be right next to memory cells, bypassing the memory bus, at least in the initial aggregations that go into dot products.
- entropicdrifter a year ago
  This could still be useful for battery constrained devices, right?
  ein0p a year ago
  It’s even worse in battery constrained devices - they tend to also be memory constrained and run with batch size 1 during extend. IOW the entire model (or parts thereof, if the model is MoE) gets read for every generated token. Utilization of compute is truly abysmal in that case and almost all energy is spent pushing bytes through the memory bus, which on battery powered devices doesn’t have high throughput.
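A back-of-the-envelope sketch of the batch-size point, assuming a plain matrix-vector product where each weight is read once per batch and activation traffic is ignored (a simplification; the function name is hypothetical):

```python
def arithmetic_intensity(batch_size, bytes_per_weight):
    """FLOPs performed per byte of weights read from memory."""
    flops_per_weight = 2 * batch_size   # one multiply and one add per batch item
    return flops_per_weight / bytes_per_weight


single = arithmetic_intensity(batch_size=1, bytes_per_weight=1)    # 8-bit weights
batched = arithmetic_intensity(batch_size=64, bytes_per_weight=1)
```

At batch size 1 with 8-bit weights this gives only 2 operations per byte fetched, so making each operation cheaper does little while the weights still have to cross the memory bus for every token.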
presspot a year ago
From my experience, the absolute magicians in fixed point math were the 8-bit and 16-bit video game designers. I was in awe of the optimizations they did. They made it possible to calculate 3D matrix maths in real time, for example, in order to make the first flight simulators and first person shooter games.
- hinkley a year ago
  Redefining degrees to be 2pi = 256 was a pretty clever trick.
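A minimal sketch of the 256-steps-per-turn trick, assuming an 8-bit angle and a sine table scaled to the range -127..127 (the names and the scaling are illustrative, not taken from any particular game):

```python
import math

# One full turn is 256 steps, so an angle fits in a single byte.
SINE_TABLE = [round(127 * math.sin(2 * math.pi * i / 256)) for i in range(256)]


def bsin(angle):
    """Sine of a binary angle, as a fixed-point value in -127..127."""
    return SINE_TABLE[angle & 0xFF]


def bcos(angle):
    """Cosine is the same table read a quarter turn (64 steps) ahead."""
    return SINE_TABLE[(angle + 64) & 0xFF]


heading = 250
heading = (heading + 10) & 0xFF   # wraps past a full turn for free
```

On an 8-bit register the mask is implicit: overflow does the modulo, so no range check or division is ever needed.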
Buttons840 a year ago
Would using this neural network based on integer addition be faster? The paper does not claim it would be faster, so I'm assuming not?
What about over time? If this L-Mul (the matrix operation based on integer addition) operation proved to be much more energy efficient and became popular, would new hardware be created that was faster?
cpldcpu a year ago
Bill Dally from nvidia introduced a log representation that basically allows replacing a multiplication with an add, without loss of accuracy (in contrast to the proposal above)
https://youtu.be/gofI47kfD28?t=2248
- nickpsecurity a year ago
  Paper?
  https://research.nvidia.com/publication/2022-12_lns-madam-lo...
scotty79 a year ago
All You Need is Considered Harmful.
- TaurenHunter a year ago
  We will need a paper titled '"Considered Harmful" Articles is All You Need' to complete that cycle.
concrete_head a year ago
Just to add an alternative addition based architecture into the mix.
https://www.youtube.com/watch?v=VqXwmVpCyL0
dwrodri a year ago
7 years of the same title format is all you need.
md_rumpf a year ago
The return of the CPU?!
- anticensor a year ago
  The reign of Threadripper!
A4ET8a8uTh0 a year ago
Uhh.. I hate to be the one to ask this question, but shouldn't we be focused on making LLMs work well first and then focused on desired optimizations? Using everyone's car analogy, it is like making sure early cars are using lower amount of coal. It is a fool's errand.
- itishappy a year ago
  Coal (and even wood!) powered cars actually existed long before Ford, but didn't take off because they were too heavy and unwieldy. The Model T was the result of a century of optimization.
  https://en.wikipedia.org/wiki/Nicolas-Joseph_Cugnot
- lukev a year ago
  Also, making neural networks faster/cheaper is a big part of how they advance.
  We've known about neural architectures since the 70s, but we couldn't build them big enough to be actually useful until the advent of the GPU.
  Similarly, the LLM breakthrough was because someone decided it was worth spending millions of dollars to train one. Efficiency improvements lower that barrier for all future development (or alternatively, allow us to build even bigger models for the same cost.)
- spencerchubb a year ago
  Cheaper compute is basically a prerequisite to making better models. You can get some improvements on the margins by making algorithms better with current hardware, but not an order of magnitude improvement.
  When there is an order of magnitude improvement in hardware, the AI labs will figure out an algorithm to best take advantage of it.
- Maken a year ago
  The optimizations described could easily work on other models, not just transformers. Following your analogy, this is optimizing plumbing, pistons and valves on steam engines, it could be useful for whatever follows.
- fennecfoxy a year ago
  You're also welcome to contribute. There are many people doing many things at once in this space, I don't think experiments like this are a problem at all.
- andrewchambers a year ago
  What if working well means making them efficient enough to run more 'neurons' on our current hardware?
m3kw9 a year ago
So instead of say 2x3 you go 2+2+2?
ranguna a year ago
I've seen this claim a few times across the last couple years and I have a pet theory why this isn't explored a lot:
Nvidia funds most research around LLMs, and they also fund other companies that fund other research. If transformers were to use addition and remove all usage of floating point multiplication, there's a good chance the gpu would no longer be needed, or at the least, cheaper ones would be good enough. If that were to happen, no one would need nvidia anymore and their trillion dollar empire would start to crumble.
University labs get free gpus from nvidia -> University labs don't want to do research that would make said gpus obsolete because nvidia won't like that.
If this were to be true, it would mean that we are stuck on an inefficient research path due to corporate greed. Imagine if this really was the next best thing, and we just don't explore it more because the ruling corporation doesn't want to lose their market cap.
Hopefully I'm wrong.
- cpldcpu a year ago
  I have to disagree. Nvidia spent a lot of effort on researching improved numerical representations. You can see a summary in this talk:
  https://www.youtube.com/watch?v=gofI47kfD28
  A lot of their work was published but went by unnoticed. But in fact the majority of their performance increase in new architecture is resulting from this work.
  Reading between the lines, it seems that they came to the conclusion that a 4 bit representation with a group exponent ("FP4") is the most efficient representation of weights for inference. Reducing the number of bits in weights has the biggest impact on LLM inference, since they are mostly memory bound. At these low bit numbers, the impact of using multiplication or other approaches is not really significant anymore.
  (multiplying a 4 bit weight with a larger activation is effectively 4 additions, barely more than what the paper proposes)
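The "4 additions" remark can be sketched as a shift-and-add multiply, assuming an unsigned 4-bit integer weight and an integer activation (a simplification of what a real FP4 datapath would do; the function name is hypothetical):

```python
def mul_by_4bit_weight(activation, weight):
    """Multiply by an unsigned 4-bit weight using at most four shifted adds."""
    assert 0 <= weight < 16
    total = 0
    for bit in range(4):
        if (weight >> bit) & 1:
            total += activation << bit   # add the activation shifted by the bit position
    return total


product = mul_by_4bit_weight(1000, 0b1011)   # 1000*8 + 1000*2 + 1000*1
```

Each set bit in the weight contributes one shifted copy of the activation, so the multiplier collapses into a short chain of adders.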
- nayroclade a year ago
  "Good enough" for what? We're in the middle of an AI arms race. Why do you believe people would choose to run the same LLMs on cheaper equipment instead of using the greater efficiency to train and run even larger LLMs?
  Given LLM performance seems to scale with their size, this would result in more powerful models, which would grow the applicability, use and importance of AI, which would in turn grow the use and importance of Nvidia's hardware.
  So this theory doesn't really stack up for me.
- chpatrick a year ago
  It's still a massively parallel problem suited to GPUs, whether it's float or int, or addition or multiplication doesn't really matter.
- londons_explore a year ago
  If an addition-only LLM performed better, nvidia would probably still be the market leader.
  Next gen nvidia chips would have more adders and fewer multipliers.
- yunohn a year ago
  Google & Apple already run custom chips, Meta and MS are deploying their own soon too. Your theory is that none of them have researched non-matrix-multiplication solutions before investing billions?
  miohtama a year ago
  There are several patents on this topic so they have
- twoodfin a year ago
  I’d estimate that the fraction of Nvidia’s dominance that’s dependent on their distinctive advantages in kernel primitives (add vs. multiply) would be a rounding error in FP8.
  The CUDA tooling and ecosystem, VLSI architecture, organizational prowess… all matter at multiple orders of magnitude more.
- teaearlgraycold a year ago
  NVidia GPUs support integer operations specifically for use with deep learning models.
- WrongAssumption a year ago
  So let me get this straight. Universities don’t want to show that Nvidia gpus are obsolete, so they can receive a steady stream of obsolete gpus? For what possible reason, that doesn’t make sense.
- iamgopal a year ago
  No matter how fast CPUs, networks and browsers have become, websites are still slow. We will run out of data to train on much earlier than people will stop inventing even larger models.
- yieldcrv a year ago
  Alternatively, other people fund LLM research
- raincole a year ago
  > I have a pet theory
  You mean you have a conspiracy theory.
  Why wouldn't other companies that buy Nvidia GPUs fund this research? It would greatly cut their costs.