CuPy: NumPy and SciPy for GPU (github.com)
Submitted by tanelpoder 4 hours ago
  • gjstein 3 hours ago

    The idea that this is a drop-in replacement for numpy (e.g., `import cupy as np`) is quite nice, though I've gotten similar benefit out of using `pytorch` for this purpose. It's a very popular and well-supported library with a syntax that's similar to numpy.

    However, CuPy's AMD GPU compatibility is quite an attractive feature.

    • ogrisel 2 hours ago

      Note that NumPy, CuPy and PyTorch are all involved in defining a shared subset of their APIs:

      https://data-apis.org/array-api/

      So it's possible to write array API code that consumes arrays from any of those libraries and delegates computation to them, without having to explicitly import any of them in your source code.

      The only limitation for now is that PyTorch's (and, to a lesser extent, CuPy's) array API compliance is still incomplete, and in practice one needs to go through this compatibility layer (hopefully only temporarily):

      https://data-apis.org/array-api-compat/
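
      The dispatch pattern can be sketched in a few lines. `get_namespace` below is a hypothetical, simplified stand-in for what `array_api_compat.array_namespace` does (the real helper also validates inputs and wraps partially-compliant libraries):

```python
import sys

def get_namespace(x):
    # Simplified stand-in for array_api_compat.array_namespace:
    # look up the top-level module that defines the array's type.
    return sys.modules[type(x).__module__.split(".")[0]]

def softmax(x):
    # Works unchanged for NumPy, CuPy, or PyTorch arrays, since all
    # computation is delegated to whichever namespace x came from.
    xp = get_namespace(x)
    e = xp.exp(x - xp.max(x))
    return e / xp.sum(e)
```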

      • ethbr1 an hour ago

        It's interesting to see hardware/software/API co-development in practice again.

        The last time I think this happened at market scale was the early 3D accelerator APIs? Glide/OpenGL/DirectX. Which has been a minute! (To a lesser extent, CPU vectorization extensions.)

        Curious how much of Nvidia's successful strategy was driven by people who were there during that period.

        Powerful first mover flywheel: build high performing hardware that allows you to define an API -> people write useful software that targets your API, because you have the highest performance -> GOTO 10 (because now more software is standardized on your API, so you can build even more performant hardware to optimize its operations)

        • kmaehashi an hour ago

          An excellent example of Array API usage can be found in scikit-learn: estimators written against NumPy can now run on various backends, courtesy of Array API compatible libraries such as CuPy and PyTorch.

          https://scikit-learn.org/stable/modules/array_api.html

          Disclosure: I'm a CuPy maintainer.

          • kccqzy an hour ago

            And of course the native Python solution is memoryview. If you need to interoperate with libraries like numpy but cannot import numpy, use memoryview. It is built for fast low-level buffer access, which is why it has more C documentation than Python documentation: https://docs.python.org/3/c-api/memoryview.html
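
            A quick stdlib-only illustration of what the buffer protocol buys you:

```python
import array

# Any buffer-protocol object works: bytes, bytearray, array.array,
# numpy arrays, etc. Here, four C doubles in an array.array.
data = array.array("d", [1.0, 2.0, 3.0, 4.0])
view = memoryview(data)

# Slicing a memoryview does not copy the underlying buffer.
half = view[:2]
assert half[1] == 2.0

# Writes through the view are visible in the original object.
view[0] = 42.0
assert data[0] == 42.0

# .cast() reinterprets the same bytes, e.g. doubles as raw bytes.
raw = view.cast("B")
assert len(raw) == 4 * 8  # four 8-byte doubles
```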

          • KeplerBoy 3 hours ago

            One could also `import jax.numpy as jnp`. All those libraries have more or less complete implementations of NumPy and SciPy functionality (I believe CuPy has the most functions, especially when it comes to SciPy).

            Also: you can mix and match all those functions and tensors thanks to the `__cuda_array_interface__` protocol.
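
            For reference, the protocol itself is just an attribute returning a small dict. `FakeDeviceArray` below is a made-up toy whose "device pointer" is actually host memory, shown only to illustrate the dictionary's shape (a real producer would expose GPU memory, and consumers like `cupy.asarray` or Numba would wrap it zero-copy):

```python
import ctypes

class FakeDeviceArray:
    # Toy object advertising the __cuda_array_interface__ protocol.
    # A real implementation would point "data" at device memory.
    def __init__(self, n):
        self._buf = (ctypes.c_float * n)()
        self.n = n

    @property
    def __cuda_array_interface__(self):
        return {
            "shape": (self.n,),
            "typestr": "<f4",  # little-endian float32
            "data": (ctypes.addressof(self._buf), False),  # (pointer, read-only flag)
            "version": 3,
            "strides": None,  # None means C-contiguous
        }

iface = FakeDeviceArray(8).__cuda_array_interface__
assert iface["shape"] == (8,) and iface["typestr"] == "<f4"
```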

            • kmaehashi an hour ago

              For those interested in the NumPy/SciPy API coverage in CuPy, here is the comparison table:

              https://docs.cupy.dev/en/latest/reference/comparison.html

              • yobbo 2 hours ago

                Jax variables are immutable.

                Code written for CuPy looks similar to numpy but very different from Jax.

                • bbminner 2 hours ago

                  Ah, well, that's interesting! Does anyone know how cupy manages tensor mutability?

                  • kmaehashi an hour ago

                    CuPy tensors (or `ndarray`) provide the same semantics as NumPy. In-place operations are permitted.

              • amarcheschi 36 minutes ago

                I'm supposed to finish my undergraduate degree with an internship at the Italian national research center, where I'll have to use PyTorch to turn ML models from paper into code. I've tried looking at the tutorial, but I feel like there's a lot going on to grasp. Until now I've only used numpy (and pandas in combination with numpy). I'm quite excited, but a bit on edge because I can't know whether I'll be up to the task or not.

              • curvilinear_m 14 minutes ago

                I'm surprised to see pytorch and Jax mentioned as alternatives but not numba : https://github.com/numba/numba

                I've recently had to implement a few kernels to lower the memory footprint and runtime of some PyTorch functions: it's been really nice, because numba kernels have type-hint support (as opposed to raw CuPy kernels).

                • setopt 13 minutes ago

                  I’ve been using CuPy a bit and found it to be excellent.

                  It’s very easy to replace some slow NumPy/SciPy calls with appropriate CuPy calls, with sometimes literally a 1000x performance boost from like 10min work. It’s also easy to write “hybrid code” where you can switch between NumPy and CuPy depending on what’s available.
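
                  A minimal sketch of that hybrid pattern. It uses CuPy's real `get_array_module` helper; `xp_for` and `rms` are hypothetical names for this example, and the code falls back to plain NumPy when CuPy isn't installed:

```python
import numpy as np

try:
    import cupy as cp
except ImportError:
    cp = None  # GPU stack not available; stay on NumPy

def xp_for(a):
    # cupy.get_array_module returns the cupy or numpy module
    # depending on the argument's type.
    return cp.get_array_module(a) if cp is not None else np

def rms(a):
    # The same source works for both backends: xp is numpy or cupy.
    xp = xp_for(a)
    return xp.sqrt(xp.mean(a ** 2))

x = np.array([3.0, 4.0])
print(float(rms(x)))  # sqrt(12.5); runs on the GPU instead if x is a cupy array
```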

                  • meisel 3 hours ago

                    When building something that I want to run on both CPU and GPU, depending, I’ve found it much easier to use PyTorch than some combination of NumPy and CuPy. I don’t have to fiddle around with some global replacing of numpy.* with cupy.*, and PyTorch has very nearly all the functions that those libraries have.

                    • johndough 2 hours ago

                      CuPy is probably the easiest way to interface with custom CUDA kernels: https://docs.cupy.dev/en/stable/user_guide/kernel.html#raw-k...

                      And I recently learned that CuPy has a JIT compiler now if you prefer Python syntax over C++. https://docs.cupy.dev/en/stable/user_guide/kernel.html#jit-k...

                      • kunalgupta022 an hour ago

                        Is anyone aware of a pandas-like library that is based on something like CuPy instead of NumPy? It would be great to have the ease of use of pandas with the parallelism unlocked by the GPU.

                      • sdenton4 3 hours ago

                        Why not Jax?

                        • johndough 2 hours ago

                          > Why not Jax?

                          - JAX Windows support is lacking

                          - CuPy is much closer to CUDA than JAX, so you can get better performance

                          - CuPy is generally more mature than JAX (fewer bugs)

                          - CuPy is more flexible thanks to cp.RawKernel

                          - (For those familiar with NumPy) CuPy is closer to NumPy than jax.numpy

                          But CuPy does not support automatic gradient computation, so if you do deep learning, use JAX instead. Or PyTorch, if you do not trust Google to maintain a project for a prolonged period of time https://killedbygoogle.com/

                          • gnulinux an hour ago

                            What about CPU-only workloads? What if one wants to write code that will eventually run on both CPU and GPU, but in the short-to-mid term will only be used on CPU? Since JAX natively supports CPU (with a NumPy backend) but CuPy doesn't, this seems like a potential problem for some.

                            • nextaccountic an hour ago

                              Isn't there a way to dynamically select between numpy and cupy, depending on whether you want cpu or gpu code?

                              • kmaehashi 3 minutes ago

                                NumPy has a mechanism to dispatch execution to CuPy: https://numpy.org/neps/nep-0018-array-function-protocol.html

                                Just prepare the input as a NumPy or CuPy array, and then you can feed it to NumPy APIs. NumPy functions will handle the computation themselves if the input is a NumPy ndarray, or dispatch execution to CuPy if it is a CuPy ndarray.
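
                                A toy illustration of the hook behind this, NEP 18's `__array_function__`. `LoudArray` is made up for the example; CuPy's ndarray implements the same method, which is how plain `numpy` calls end up running on the GPU:

```python
import numpy as np

class LoudArray:
    # Toy container demonstrating the NEP 18 __array_function__ hook,
    # the same mechanism CuPy uses to intercept NumPy API calls.
    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        # Unwrap LoudArray arguments, run the real NumPy function,
        # and re-wrap the result. CuPy instead runs its GPU kernel here.
        unwrapped = [a.data if isinstance(a, LoudArray) else a for a in args]
        return LoudArray(func(*unwrapped, **kwargs))

out = np.sum(LoudArray([1, 2, 3]))  # dispatches to __array_function__
assert isinstance(out, LoudArray)
assert int(out.data) == 6
```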

                                • gnulinux 35 minutes ago

                                  There is, but then you're using two separate libraries, which seems like a fragile point of failure compared to just using JAX. But regardless, since JAX will use different backends anyway, it's arguably not any worse (though it ends up being your responsibility to ensure correctness, as opposed to the JAX team's).

                              • insane_dreamer 26 minutes ago

                                > CuPy does not support automatic gradient computation, so if you do deep learning, use JAX instead

                                DL is a major use case; is CuPy planning on adding automatic gradient computation?

                              • bee_rider 3 hours ago

                                Real answer: CuPy has a name that is very similar to SciPy. I don’t know GPU, that’s why I’m using this sort of library, haha. The branding for CuPy makes it obvious. Is Jax the same thing, but implemented better somehow?

                                • sdenton4 2 hours ago

                                  Yeah, Jax provides a one-to-one reimplementation of the Numpy interface, and a decent chunk of the scipy interface. Random number handling is a bit different, but Numpy random number handling seeeeems to be trending in the Jax direction (explicitly passed RNG objects).
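
                                  For comparison, NumPy's newer `Generator` API already passes the RNG around explicitly (unlike JAX keys, though, a `Generator` mutates its internal state as you draw):

```python
import numpy as np

def noisy_sample(rng, n):
    # The RNG is an explicit argument, not hidden global state.
    return rng.normal(size=n)

a = noisy_sample(np.random.default_rng(seed=42), 3)
b = noisy_sample(np.random.default_rng(seed=42), 3)
# A fresh generator with the same seed reproduces the draw exactly.
assert np.allclose(a, b)
```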

                                  Jax also provides back-propagation wherever possible, so you can optimize.

                                  • whimsicalism 3 hours ago

                                    yes

                                  • palmy 2 hours ago

                                    cupy came out long before Jax; I remember using it in a project for my BSc around 2015-2016.

                                    Cool to see that it's still kicking!

                                  • __mharrison__ 3 hours ago

                                    I taught my numpy class to a client who wanted to use GPUs. Installation (at that time) was a chore but afterwards it was really smooth using this library. Big gains with minimal to no code changes.

                                    • whimsicalism 3 hours ago

                                      I was just thinking we didn’t have enough CUDA-accelerated numpy libraries.

                                      Jax, pytorch, vanilla TF, triton. They just don’t cut it

                                      • hamilyon2 2 hours ago

                                        There is a somewhat similar project which supports Intel GPU offloading: https://github.com/intel/scikit-learn-intelex

                                        • bee_rider 3 hours ago

                                          As good a place as any to ask, I guess. Do any of these GPU libraries have a BiCGStab (or similar) solver that handles multiple right-hand sides? CuPy seems to have GMRES, which would be fine, but as far as I can tell it only does one right-hand side.

                                          • johndough 2 hours ago

                                            If you have many right hand sides, you could also compute an LU factorization and then solve the right hand sides via back-substitution.

                                            https://docs.cupy.dev/en/stable/reference/generated/cupyx.sc...

                                            or https://docs.cupy.dev/en/stable/reference/generated/cupyx.sc... if your linear system is sparse.

                                            But whether that works well depends on the problem you are trying to solve.
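
                                            A CPU sketch of the factor-once idea with SciPy's `lu_factor`/`lu_solve` (cupyx.scipy.linalg ships functions of the same names); the 2x2 system here is just a toy:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

# Factor A once (O(n^3)) ...
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
lu, piv = lu_factor(A)

# ... then each extra right-hand side costs only two triangular
# solves (O(n^2)). Columns of B are independent right-hand sides.
B = np.array([[1.0, 2.0],
              [0.0, 5.0]])
X = lu_solve((lu, piv), B)
assert np.allclose(A @ X, B)
```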

                                            • trostaft 2 hours ago

                                              IIRC jax's `scipy.sparse.linalg.bicgstab` does support multiple right hand sides.

                                              EDIT: Or rather, all the solvers under jax's `scipy.sparse.linalg` support multiple right hand sides.

                                              • bee_rider 2 hours ago

                                                Oh dang, that’s pretty awesome, thanks.

                                                “array or tree of arrays” sounds very general, probably even better than an old fashioned 2D array.

                                                • trostaft 2 hours ago

                                                  'tree of arrays'

                                                  Ahh, that's just Jax's concept of pytrees. It was something they invented to make it easier (this is how I view it, not the complete picture) to pass complex objects to functions while still being able to treat them as a flattened vector for AD etc. E.g. a common pattern is to pass parameters `p` to a function and then internally break them into their physical interpretations, e.g. `mass = p[0]`, `velocity = p[1]`. Pytrees let you instead use something like a dictionary `p = {'mass': 1.0, 'velocity': 1.0}`, which is a stylistically more natural structure to pass around, and jax is structured to understand, when AD'ing or otherwise, that you're working with respect to the 'leaves' of the tree, i.e. the values of the mass and velocity.

                                                  Hopefully someone corrects me if I'm not right about this. I'm hardly 100% on Jax's vision on PyTrees.

                                                  As an aside, just a list of right hand sides `[b1, b2, ..., bm]` is valid.
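
                                                  The flattening idea can be sketched in pure Python. This `tree_leaves` is a toy stand-in for `jax.tree_util.tree_leaves` (the real one also handles registered custom node types):

```python
def tree_leaves(tree):
    # Recursively flatten nested dicts/lists/tuples into a flat
    # list of leaf values, visiting dict keys in sorted order
    # (JAX also sorts dict keys for a deterministic leaf order).
    if isinstance(tree, dict):
        return [x for k in sorted(tree) for x in tree_leaves(tree[k])]
    if isinstance(tree, (list, tuple)):
        return [x for item in tree for x in tree_leaves(item)]
    return [tree]  # anything else is a leaf

p = {"mass": 1.0, "velocity": [2.0, 3.0]}
assert tree_leaves(p) == [1.0, 2.0, 3.0]
```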

                                            • lmeyerov 3 hours ago

                                              We are fans! We mostly use cudf/cuml/cugraph (GPU dataframes etc) in the pygraphistry ecosystem, and when things get a bit tricky, cupy is one of the main escape hatches

                                              • adancalderon 3 hours ago

                                                If it ran in the background it could be CuPyd

                                                • SubiculumCode 2 hours ago

                                                  As an aside, since I was trying to install CuPy the other day and was having issues.

                                                  Open projects on GitHub often (at least superficially) require specific versions of the CUDA Toolkit (and all the specialty NVIDIA packages, e.g. cuDNN), TensorFlow, etc., and changing the default versions of these for each little project, or each step in a processing chain, is ridiculous.

                                                  pyenv et al. have really made local, project-specific versions of Python packages much easier to manage. But I haven't seen a similar solution for the CUDA Toolkit and associated packages, and the ones I've encountered seem terribly hacky. I'm sure this is a common issue, though, so what do people do?

                                                  • whimsicalism 4 minutes ago

                                                    in real life everyone just uses containers, might not be the answer you want to hear though

                                                    • m_d_ 2 hours ago

                                                      conda provides cudatoolkit and associated packages. Does this solve the situation?

                                                      • nyrikki an hour ago

                                                          The conda 200-employee-threshold license change is problematic for some.

                                                        • boldlybold an hour ago

                                                          As long as you stay out of the "defaults" and "anaconda" repos, you're not subject to that license. For my needs conda-forge and bioconda have everything. I'm not sure about the nvidia repo but I assume it's similar.

                                                    • coeneedell 2 hours ago

                                                      Ugh… docker containers. I also wish there was a simpler way but I don’t think there is.

                                                      • SubiculumCode 2 hours ago

                                                        this is not what I wanted to hear. NOT AT ALL. Please whisper sweet lies into my ears.

                                                        • coeneedell 2 hours ago

                                                          At the moment I’m working on a system to quickly replicate academic deep learning repos (papers) at scale. At least Amazon has a catalogue of prebuilt containers with cuda/pytorch combos. I still occasionally have an issue where the container works on my 3090 test bench but not on the T4 cloud node…