The problem was not related to the PowerPC architecture. The host-side compilation command needed the fatbin file passed to it with -Xclang -fcuda-include-gpubinary -Xclang axpy.fatbin, so that the split build replicates what a single clang CUDA compilation does: the device binary gets embedded in the host object.

Here is the corrected Makefile:

# Note: recipe lines must begin with a tab character, not spaces.
BIN_FILE=axpy
SRC_FILE=$(BIN_FILE).cu

.PHONY: main
main: $(BIN_FILE)

# Host Side (needs the fatbin first, so it can be embedded in the host IR)
$(BIN_FILE).ll: $(SRC_FILE) $(BIN_FILE).fatbin
	clang++ -stdlib=libc++ -Wall -Werror $(BIN_FILE).cu -march=ppc64le --cuda-host-only -relocatable-pch \
		-Xclang -fcuda-include-gpubinary -Xclang $(BIN_FILE).fatbin -S -g -c -emit-llvm

$(BIN_FILE).o: $(BIN_FILE).ll
	llc -march=ppc64le $(BIN_FILE).ll -o $(BIN_FILE).s
	clang++ -c -Wall $(BIN_FILE).s -o $(BIN_FILE).o

# GPU Side
$(BIN_FILE)-cuda-nvptx64-nvidia-cuda-sm_70.ll: $(SRC_FILE)
	clang++ -x cuda -stdlib=libc++ -Wall -Werror $(BIN_FILE).cu --cuda-device-only \
		--cuda-gpu-arch=sm_70 -S -g -emit-llvm

$(BIN_FILE).ptx: $(BIN_FILE)-cuda-nvptx64-nvidia-cuda-sm_70.ll
	llc -march=nvptx64 -mcpu=sm_70 -mattr=+ptx64 $(BIN_FILE)-cuda-nvptx64-nvidia-cuda-sm_70.ll -o $(BIN_FILE).ptx

$(BIN_FILE).ptx.o: $(BIN_FILE).ptx
	ptxas -m64 --gpu-name=sm_70 $(BIN_FILE).ptx -o $(BIN_FILE).ptx.o

$(BIN_FILE).fatbin: $(BIN_FILE).ptx.o
	fatbinary --64 --create $(BIN_FILE).fatbin --image=profile=sm_70,file=$(BIN_FILE).ptx.o \
		--image=profile=compute_70,file=$(BIN_FILE).ptx -link

$(BIN_FILE)_dlink.o: $(BIN_FILE).fatbin
	nvcc $(BIN_FILE).fatbin -gencode arch=compute_70,code=sm_70 \
		-dlink -o $(BIN_FILE)_dlink.o -lcudart -lcudart_static -lcudadevrt

# Link both object files together (either nvcc or clang works here):
$(BIN_FILE): $(BIN_FILE).o $(BIN_FILE)_dlink.o
	nvcc $(BIN_FILE).o $(BIN_FILE)_dlink.o -o $(BIN_FILE) -arch=sm_70 -lc++

Figure 1 in this link includes the creation steps of the fatbinary file.
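
One way to sanity-check that the fatbin really ended up inside the final executable is to list the embedded device images with cuobjdump from the CUDA toolkit. The following is only a minimal sketch, not part of the original build: the helper name check_embedded is made up for illustration, and it assumes cuobjdump is on PATH and that its --list-elf output mentions the architecture name (for example sm_70) in each embedded cubin's file name.

```shell
#!/bin/sh
# check_embedded BINARY ARCH
# Hypothetical helper: reports whether BINARY carries an embedded device ELF
# image whose name mentions ARCH (e.g. sm_70).
# Returns 0 if found, 1 if missing, 2 if cuobjdump is not available.
check_embedded() {
    bin="$1"
    arch="$2"

    # Assumption: cuobjdump from the CUDA toolkit is installed and on PATH.
    if ! command -v cuobjdump >/dev/null 2>&1; then
        echo "cuobjdump not found"
        return 2
    fi

    # --list-elf prints one line per embedded cubin; look for the arch name.
    if cuobjdump --list-elf "$bin" 2>/dev/null | grep -q "$arch"; then
        echo "found $arch"
    else
        echo "missing $arch"
        return 1
    fi
}
```

After building, running check_embedded axpy sm_70 should print "found sm_70" if the host object was compiled with the fatbin included. If it prints "missing sm_70", the host-side step most likely ran without -fcuda-include-gpubinary, which is the situation that tends to show up at run time as an invalid device function error.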

Answer from AmirSojoodi on Stack Overflow