The problem was not related to the PowerPC architecture. To replicate the full compilation behavior, I needed to pass the fatbin file to the host-side compilation command with -Xclang -fcuda-include-gpubinary -Xclang axpy.fatbin.
Here is the corrected Makefile:
BIN_FILE=axpy
SRC_FILE=$(BIN_FILE).cu

main: $(BIN_FILE)

# Host Side
$(BIN_FILE).ll: $(SRC_FILE) $(BIN_FILE).fatbin
	clang++ -stdlib=libc++ -Wall -Werror $(BIN_FILE).cu -march=ppc64le --cuda-host-only -relocatable-pch \
		-Xclang -fcuda-include-gpubinary -Xclang $(BIN_FILE).fatbin -S -g -c -emit-llvm

$(BIN_FILE).o: $(BIN_FILE).ll
	llc -march=ppc64le $(BIN_FILE).ll -o $(BIN_FILE).s
	clang++ -c -Wall $(BIN_FILE).s -o $(BIN_FILE).o

# GPU Side
$(BIN_FILE)-cuda-nvptx64-nvidia-cuda-sm_70.ll: $(SRC_FILE)
	clang++ -x cuda -stdlib=libc++ -Wall -Werror $(BIN_FILE).cu --cuda-device-only \
		--cuda-gpu-arch=sm_70 -S -g -emit-llvm

$(BIN_FILE).ptx: $(BIN_FILE)-cuda-nvptx64-nvidia-cuda-sm_70.ll
	llc -march=nvptx64 -mcpu=sm_70 -mattr=+ptx64 $(BIN_FILE)-cuda-nvptx64-nvidia-cuda-sm_70.ll -o $(BIN_FILE).ptx

$(BIN_FILE).ptx.o: $(BIN_FILE).ptx
	ptxas -m64 --gpu-name=sm_70 $(BIN_FILE).ptx -o $(BIN_FILE).ptx.o

$(BIN_FILE).fatbin: $(BIN_FILE).ptx.o
	fatbinary --64 --create $(BIN_FILE).fatbin --image=profile=sm_70,file=$(BIN_FILE).ptx.o \
		--image=profile=compute_70,file=$(BIN_FILE).ptx -link

$(BIN_FILE)_dlink.o: $(BIN_FILE).fatbin
	nvcc $(BIN_FILE).fatbin -gencode arch=compute_70,code=sm_70 \
		-dlink -o $(BIN_FILE)_dlink.o -lcudart -lcudart_static -lcudadevrt

# Link both object files together (either nvcc or clang works here):
$(BIN_FILE): $(BIN_FILE).o $(BIN_FILE)_dlink.o
	nvcc $(BIN_FILE).o $(BIN_FILE)_dlink.o -o $(BIN_FILE) -arch=sm_70 -lc++
Figure 1 in this link includes the creation steps of the fatbinary file.
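The key ordering constraint in the Makefile above is that the host-side compile embeds the fatbin, so the GPU-side steps have to finish first. Below is a minimal dry-run sketch of that check in shell. The helper name host_compile_cmd is hypothetical, and the printed command keeps only the core flags from the Makefile (the architecture and warning flags are left out for brevity). It prints the command rather than running it.

```shell
#!/bin/sh
# Dry-run sketch: print the host-side compile command only if the fatbin
# it embeds is already present. host_compile_cmd is a hypothetical helper
# name; the clang flags are the ones used in the Makefile above.
host_compile_cmd() {
    base="$1"
    fatbin="${base}.fatbin"
    # The host pass embeds the GPU binary, so it cannot run before the
    # device-side steps have produced a non-empty fatbin.
    if [ ! -s "$fatbin" ]; then
        echo "missing or empty $fatbin: build the GPU side first" >&2
        return 1
    fi
    printf '%s\n' "clang++ ${base}.cu --cuda-host-only -Xclang -fcuda-include-gpubinary -Xclang ${fatbin} -S -emit-llvm"
}
```

Calling host_compile_cmd axpy in a directory without axpy.fatbin returns a non-zero status, which mirrors what the Makefile expresses by listing $(BIN_FILE).fatbin as a prerequisite of the host-side rule.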
Answer from AmirSojoodi on Stack Overflow: Struggling with CUDA, Clang and LLVM IR, and getting: CUDA failure: 'Invalid device function'
compiler construction - how to compile CUDA to llvm IR? - Stack Overflow
Thanks to contributions from Google and others, Clang now supports building CUDA. Command line parameters are slightly different from nvcc, though. According to the official documentation, assuming your file is named axpy.cu, the basic usage is:
$ clang++ axpy.cu -o axpy --cuda-gpu-arch=<GPU arch> \
    -L<CUDA install path>/<lib64 or lib> \
    -lcudart_static -ldl -lrt -pthread
Note that using Clang for compiling CUDA still requires that you have the proprietary CUDA runtime from the NVIDIA CUDA toolkit installed.
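The <lib64 or lib> part of the command depends on how the toolkit is laid out on your machine. As a rough sketch, a small shell helper can pick whichever directory exists. The name cuda_lib_dir is hypothetical, and the install path you pass in is an assumption about your system.

```shell
#!/bin/sh
# Sketch: choose the library directory under a CUDA install path.
# cuda_lib_dir is a hypothetical helper, not part of clang or the toolkit.
cuda_lib_dir() {
    root="$1"
    if [ -d "$root/lib64" ]; then
        # Prefer lib64 when both directories exist.
        printf '%s\n' "$root/lib64"
    elif [ -d "$root/lib" ]; then
        printf '%s\n' "$root/lib"
    else
        echo "no lib64 or lib directory under $root" >&2
        return 1
    fi
}

# Example use (the path /usr/local/cuda and the arch sm_70 are assumptions):
#   clang++ axpy.cu -o axpy --cuda-gpu-arch=sm_70 \
#       -L"$(cuda_lib_dir /usr/local/cuda)" \
#       -lcudart_static -ldl -lrt -pthread
```

Preferring lib64 over lib is a choice made for this sketch. Adjust the order if your installation keeps the 64-bit runtime elsewhere.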
2016-05-01 Update: Clang now supports CUDA. See @rivanvx's answer.
The CUDA compiler is based on LLVM. Clang, though also based on LLVM, does not support CUDA.