TL;DR:
The up-to-date Nvidia tool to optimize a single compute kernel is Nsight Compute.
Details:
nvprof and nvvp are legacy profilers, while the Nsight profilers are newer and are regularly updated with new features. So as long as the Nsight profilers support your GPU architecture, you should probably use them.
I do not know how exactly Nvidia categorizes its software products into "Nsight" or not, but Nsight certainly is not a single product/piece of software and not everything called "Nsight" has something to do with profiling. As you noted, there are multiple IDE plugins under this moniker which give better syntax highlighting, a debugging GUI (wrapping cuda-gdb) etc.
The two available profilers for use in the compute context (vs 3D/Ray-Tracing with "Nsight Graphics") are Nsight Systems (nsys) and Nsight Compute (ncu). Both can be called in CLI mode for data-collection on a remote server, or with a GUI (nsys-ui and ncu-ui) to view the collected data or interactively collect data.
Nsight Systems gives you a timeline for the whole application, i.e. as OP described it, it "minimize[s] bottlenecks between multiple kernel invocations/data transfers, etc." and is therefore not what OP is searching for.
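As a rough sketch of the CLI data-collection mode mentioned above, the script below builds one command for each profiler. The binary name `./my_app` and the report names `timeline` and `kernel` are placeholders that do not come from the original answer, and the script only prints the commands when the profiler or the binary is missing.

```shell
#!/bin/sh
# Hypothetical application binary; replace with your own CUDA executable.
APP=./my_app

# Nsight Systems: whole-application timeline (kernel launches, memcpys, API calls).
# -o sets the report name; recent versions write timeline.nsys-rep.
NSYS_CMD="nsys profile -o timeline $APP"

# Nsight Compute: detailed per-kernel metrics.
# --set full collects the full section set; -o writes kernel.ncu-rep.
NCU_CMD="ncu --set full -o kernel $APP"

run_or_print() {
    # Run the command only if the tool is on PATH and the binary exists;
    # otherwise print what would have been run.
    tool=${1%% *}
    if command -v "$tool" >/dev/null 2>&1 && [ -x "$APP" ]; then
        $1
    else
        echo "would run: $1"
    fi
}

run_or_print "$NSYS_CMD"
run_or_print "$NCU_CMD"
```

The resulting report files can then be copied from the remote server and opened locally in nsys-ui and ncu-ui respectively.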
For more information on the relation between the legacy and Nsight profilers, see the Nvidia blog post "Migrating to NVIDIA Nsight Tools from NVVP and Nvprof".
Answer from paleonix on Stack Exchange
I've written a guide on using Nvidia tools (Nsight Systems, Nsight Compute, etc.) from zero to hero. Here is the content:
Fix-Bug
Chapter01: Introduction to Nsight Systems - Nsight Compute
Chapter02: CUDA Toolkit - CUDA Driver
Chapter03: NVIDIA Compute Sanitizer Part 1
Chapter04: NVIDIA Compute Sanitizer Part 2
Chapter05: Global Memory Coalescing
Chapter06: Warp Scheduler
Chapter07: Occupancy Part 1
Chapter08: Occupancy Part 2
Chapter09: Bandwidth - Throughput - Latency
Chapter10: Compute Bound - Memory Bound