Lazy Arrays with CUDA.jl (feature request) #214

Open
jenkspt opened this issue Apr 10, 2022 · 5 comments
@jenkspt commented Apr 10, 2022

This is a feature request to get LazyArrays to work with CuArrays (and potentially other GPUArray implementations).

Some of the features of LazyArrays work with CuArrays out of the box, which is awesome. For example:

using CUDA
using LazyArrays
using BenchmarkTools

a, b = CUDA.rand(1, 1000), CUDA.rand(1000, 1);

# lazy: the broadcast is fused into the reduction, so no intermediate array
bench1(a, b) = CUDA.@sync sum(@~ a .+ b; dims=1)
@btime bench1($a, $b)

# eager: a .+ b first materializes a 1000×1000 CuArray
bench2(a, b) = CUDA.@sync sum(a .+ b; dims=1)
@btime bench2($a, $b)

Getting 125 μs and 156 μs respectively on my GTX 1080 Ti.
So using LazyArrays.jl is faster and uses less memory!

However, reductions on a BroadcastArray (via the LazyArray constructor) revert to scalar indexing on the GPU:

CUDA.allowscalar(false)
sum(LazyArray(@~ a .+ b); dims=1)

ERROR: Scalar indexing is disallowed.
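
For reference, a workaround sketch (hedged: it assumes CUDA.jl's mapreduce accepts a bare Broadcasted, which is what the sum(@~ a .+ b; dims=1) example above suggests):

using CUDA, LazyArrays

a, b = CUDA.rand(1, 1000), CUDA.rand(1000, 1);

# Reducing the bare Broadcasted (no LazyArray wrapper) appears to dispatch
# to CUDA.jl's fused mapreduce instead of falling back to scalar getindex.
sum(@~ a .+ b; dims=1)

# Materializing first also avoids scalar indexing, at the cost of an
# intermediate 1000×1000 CuArray.
sum(Base.Broadcast.materialize(@~ a .+ b); dims=1)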

Also:

  • displaying a LazyArray of broadcasted CuArrays doesn't work
  • I tried using Adapt.jl on BroadcastArray, ApplyArray, and Base.Broadcast.Broadcasted, but it didn't seem to help with the above problems.

I think this would be a really powerful feature -- it would make writing fast, memory-efficient, non-trivial GPU functions much easier.

@dlfivefifty (Member)

I don’t have easy access to an Nvidia chip but would love to accept a PR.

I don’t quite see what you are proposing for sum. Do you want sum(a .+ b) to lower to sum(a) + sum(b)? That’s pretty easy to add.

@jenkspt (Author) commented Apr 21, 2022

Sorry, that's a bad example. More generally, I'm interested in reductions and accumulations over broadcasted functions, i.e. reduce(*, a .+ b) or accumulate(*, a .+ b). I see this as mapreduce, but with the ability to broadcast the 'mapped' function. Interestingly, I'm seeing that sum(@~ a .+ b; dims=1) already works with CuArrays but not with Array. I don't think this is intentional, since Broadcasted isn't an AbstractArray -- so maybe this is a peculiarity of CUDA.jl's sum.

using CUDA
a, b = CUDA.rand(1, 10), CUDA.rand(8, 1);
sum(@~ a .+ b; dims=1)    # works

a, b = Array(a), Array(b)
sum(@~ a .+ b; dims=1)    # doesn't work

and the opposite is true for LazyArray:

a, b = CUDA.rand(1, 10), CUDA.rand(8, 1);
sum(LazyArray(@~ a .+ b; dims=1))    # uses scalar indexing

a, b = Array(a), Array(b)
sum(LazyArray(@~ a .+ b; dims=1))    # works

With all that said -- I think my specific ask is to get LazyArray to play nicely with reduce and accumulate on CuArrays.
I'm happy to take a stab at this -- any ideas on where to focus my attention would be greatly appreciated.
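
One hypothetical starting point (a sketch, not existing package code; it assumes BroadcastArray stores its function and arguments in f and args fields, and that the GPU back end's mapreduce handles a Broadcasted natively):

using LazyArrays
using LazyArrays: BroadcastArray

# Sketch: rebuild the underlying Broadcasted and forward the reduction to
# it, so the array back end (e.g. CUDA.jl) can run its own fused mapreduce.
function Base.mapreduce(f, op, A::BroadcastArray; dims=:, kw...)
    bc = Base.Broadcast.instantiate(Base.Broadcast.broadcasted(A.f, A.args...))
    mapreduce(f, op, bc; dims=dims, kw...)
end

# Caveat: as observed above, mapreduce over a Broadcasted works for CuArray
# but not for plain Array, so a CPU fallback would still be needed.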

@dlfivefifty (Member)

I think you meant

sum(LazyArray(@~ a .+ b); dims=1)

I'm a bit confused about what you want to happen. We have

reduce(*, a .+ b) == (a[1] + b[1]) * ⋯ * (a[end] + b[end])

But what do you want to happen when a and b are on the GPU? If I'm not mistaken, calling getindex on a GPU array is very slow....

@RainerHeintzmann

The LazyArray defers the evaluation to the point of access in the reduce (e.g. sum) or collect operation. On the GPU, reductions are not performed sequentially but in a tree-like scheme, or (from a programming point of view) even completely in parallel for all pointwise (i.e. broadcasting) operations, to minimize computation (possibly with some copying to the CPU for the last 100 elements or so). The trouble seems to be that LazyArrays' broadcasting somehow has the sequential operation (iterating through the elements) baked into it when the broadcast is executed.
In my application I currently don't even need the reductions to work, but I would still like to use LazyArrays for the broadcast, e.g. to save memory (and memory-transfer time) when calculating outer products. But this also triggers individual scalar index accesses, which are to be avoided when operating on a GPU:

Warning: Performing scalar indexing on task Task (runnable) @0x000000000a2107d0.
│ Invocation of getindex resulted in scalar indexing of a GPU array.
│ This is typically caused by calling an iterating implementation of a method.
│ Such implementations *do not* execute on the GPU, but very slowly on the CPU,
│ and therefore are only permitted from the REPL for prototyping purposes.
│ If you did intend to index this array, annotate the caller with @allowscalar.
└ @ GPUArraysCore C:\Users\pi96doc\.julia\packages\GPUArraysCore\ZBmfM\src\GPUArraysCore.jl:90

Can the execution of the broadcast somehow be deferred to the standard broadcasting mechanism? If so, I would expect the package to also work with CUDA.
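
For what it's worth, the standard machinery does allocate on the right device once it is reached (a minimal sketch; the outer-product example here is illustrative):

using CUDA

a, b = CUDA.rand(1, 10), CUDA.rand(8, 1);

# A lazy outer product built directly with the standard Base.Broadcast machinery.
bc = Base.Broadcast.instantiate(Base.Broadcast.broadcasted(*, a, b))

# materialize dispatches on the broadcast style carried inside bc, so with
# CuArray inputs the kernel runs on the GPU and returns an 8×10 CuArray,
# with no scalar indexing.
dest = Base.Broadcast.materialize(bc)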

@RainerHeintzmann

... digging into the code, it seems that the issue with the loop is a minor display issue. However, when calling collect on an object created by LazyArray(@~ ...), the dest object in the materialize call (line 23 in lazybroadcasting.jl) has the wrong type: it is a plain-vanilla Matrix rather than a CuArray. So maybe the allocation of the destination goes wrong here? But if I do a broadcast assignment to a destination, I get an error about an assignment using a non-bits-type argument, which is a bit strange, since I cannot find LazyArrays anywhere in the call stack, yet it works fine with plain CuArray objects.
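
A quick way to see where the destination type should come from (hedged: the exact CuArrayStyle type and its parameters vary across CUDA.jl versions):

using CUDA

a, b = CUDA.rand(1, 10), CUDA.rand(8, 1);
bc = Base.Broadcast.instantiate(Base.Broadcast.broadcasted(+, a, b))

# copy(bc) allocates its destination via similar(bc, T), which dispatches on
# the style stored in the Broadcasted. If a wrapper rebuilds the Broadcasted
# with DefaultArrayStyle instead, similar returns a plain Matrix -- matching
# the symptom described above.
Base.Broadcast.combine_styles(a, b)   # a CuArrayStyle, not DefaultArrayStyle
typeof(similar(bc, Float32))          # CuArray, because the style survived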
