Kontext
kontext.tech › home › blogs › spark & pyspark › debug pyspark code in visual studio code
Debug PySpark Code in Visual Studio Code - Kontext Labs
January 4, 2020 - The page summarizes the steps required to run and debug PySpark (Spark for Python) in Visual Studio Code.
Stack Overflow
stackoverflow.com › questions › 73051927 › debug-pyspark-in-vs-code
apache spark - Debug PySpark in VS Code - Stack Overflow
13 best practice for debugging python-spark code · 7 How to troubleshoot 'pyspark' is not recognized... error on Windows? 1 E0401:Unable to import 'pyspark in VSCode in Windows 10 · 3 pyspark dataframe methods (i.e. show()) can not be printed in vs code debug console ·
Stack Overflow
stackoverflow.com › questions › 65500267 › debug-pyspark-in-visual-studio-code
apache spark - Debug PySpark in Visual Studio Code - Stack Overflow
December 29, 2020 - Modifying Pyspark source code for debugging · 1 · E0401:Unable to import 'pyspark in VSCode in Windows 10 · 3 · pyspark dataframe methods (i.e. show()) can not be printed in vs code debug console · 3 · Visual studio code using pytest for Pyspark getting stuck at SparkSession Creation ·
GitHub
github.com › microsoft › vscode-python › issues › 15385
Impossible to debug pyspark script after upgrade vscode v1.53.1 · Issue #15385 · microsoft/vscode-python
1- Set debug configuration : { "name": "PySpark : test", "type": "python", "request": "launch", "stopOnEntry": true, "pythonPath": "${workspaceRoot}/sparkSubmit.sh", "program": "${workspaceRoot}/test_pyspark.py", "args": [], "cwd": "${workspaceRoot}/", "console": "integratedTerminal", "envFile": "${workspaceRoot}/.env", "env": {"PYSPARK_PYTHON":"python3"} } 2- sparkSubmit.sh call sparkSubmit standard tool 3- After upgrading vscode to 1.53.1, the debugger cannot start.
GitHub
github.com › microsoft › vscode-data-wrangler › issues › 255
Pyspark Dataframe Support · Issue #255 · microsoft/vscode-data-wrangler
August 10, 2024 - Error message "Could not retrieve variable df_config from the Jupyter extension. Please file an issue on the Data Wrangler GitHub repository." appears instead. Debug a pyspark application in VS Code, right click.
Author   0xbadidea
Microsoft Learn
learn.microsoft.com › en-us › fabric › data-engineering › author-sjd-with-vs-code
Create and manage Apache Spark job definitions in VS Code - Microsoft Fabric | Microsoft Learn
If the Spark job definition is created with PySpark (Python), you can download the .py script of the main definition file and the referenced file, and debug the source script in VS Code.
GitHub
github.com › microsoft › vscode-python › issues › 2531
Debugging pyspark applications no longer works after August update · Issue #2531 · microsoft/vscode-python
September 8, 2018 - The following command is displayed in the terminal after starting the debugger · cd /home/user/etl ; env "PYSPARK_PYTHON=python" "PYTHONPATH=/home/user/etl" "PYTHONIOENCODING=UTF-8" "PYTHONUNBUFFERED=1" /home/user/Spark/spark-2.2.1-bin-hadoop2.7/bin/spark-submit /home/user/.vscode/extensio...
Medium
nidhig631.medium.com › exploring-pyspark-setup-in-visual-studio-code-23f2bb778dde
Exploring PySpark Setup in Visual Studio Code | by Namaste Databricks | Medium
August 24, 2024 - This article provides a step-by-step guide to setting up your environment, leveraging the robust capabilities of PySpark, and seamlessly integrating it into the VS Code. Discover the efficiency and flexibility of developing, debugging, and optimizing your PySpark applications in a user-friendly and powerful IDE environment.
Stack Overflow
stackoverflow.com › questions › 64413494 › how-do-i-setup-pyspark-in-vs-code
How do I setup pyspark in VS Code? - Stack Overflow
export PYSPARK_PYTHON=python3.8 export PYSPARK_DRIVER_PYTHON=python3.8 · AND in vscode setting python interpreter to 3.8 too (you can set it from command palette and typing Python:Select Interpreter.
Microsoft SQL Server Blog
cloudblogs.microsoft.com › home › visual studio code: develop pyspark jobs for sql server 2019 big data clusters
Visual Studio Code: Develop PySpark jobs for SQL Server 2019 Big Data Clusters - Microsoft SQL Server Blog
January 9, 2024 - With the Visual Studio Code extension, you can enjoy native Python programming experiences such as linting, debugging support, language service, and so on. You can run current line, run selected lines of code, or run all for your PY file. You can import and export a .ipynb notebook and perform a notebook like query including Run Cell, Run Above, or Run Below.
Microsoft Community
community.fabric.microsoft.com › t5 › Data-Engineering › Debugging-in-VSCode-pyspark-errors-exceptions-captured › m-p › 4602186
Solved: Debugging in VSCode: pyspark.errors.exceptions.cap... - Microsoft Fabric Community
April 3, 2025 - # Install mamba and PySpark ENV SPARK_VERSION=3.4.1 RUN conda install -n base -c conda-forge mamba -y && \ mamba install -n base -c conda-forge pyspark==$SPARK_VERSION -y && \ conda clean --all -y · As the original conda solver had issue with resolving dependency's and was not able to build an image. The default checks I'm doing are: What's my JAVA_HOME? Am I on the latest (prerelease)version of the fabric data enginering extension · Then I've ran into issue's with: - A refresh token: Solved: Re: Debugging in VSCode: Failed to get refresh tok...
GitHub
github.com › microsoft › vscode-python › issues › 5921
Make spark-submit arguments come right after "spark-submit" when debugging · Issue #5921 · microsoft/vscode-python
June 5, 2019 - spark-submit --queue ds-others training.py That works fine when executed in my terminal · Based on the launch.json above, when I start the debugger, the Debug console executes: cd /hadoop/met_scripts/datascience/dnaanalytics_us_group/us_grp_mo_idi_fwa ; env PYTHONIOENCODING=UTF-8 PYTHONUNBUFFERED=1 /bin/spark-submit /home/nokyere/.vscode-server-insiders/extensions/ms-python.python-2019.5.17517/pythonFiles/ptvsd_launcher.py --default --client --host localhost --port 40845 /hadoop/met_scripts/datascience/dnaanalytics_us_group/us_grp_mo_idi_fwa/python/training.py --queue ds-others
Author   okyere
Microsoft Azure
azure.microsoft.com › blog home › developer tools › run your pyspark interactive query and batch job in visual studio code
Run your PySpark Interactive Query and batch job in Visual Studio Code | Microsoft Azure Blog
June 26, 2025 - We are excited to introduce the integration of HDInsight PySpark into Visual Studio Code (VSCode), which allows developers to easily edit Python scripts and submit PySpark statements to HDInsight clusters. This interactivity brings the best properties of Python and Spark to developers and empowers ...
Medium
bradyjiang.medium.com › introducing-pyspark-xray-a-diagnostic-tool-that-enables-local-debugging-of-pyspark-applications-17c06d6eca8d
Introducing pyspark_xray: a diagnostic tool that enables local debugging of PySpark applications using VSCode or PyCharm | by Brady Jiang | Medium
February 25, 2021 - pyspark_xray library enables developers to locally debug (step into) 100% of Spark application code, not only code that runs on master node, but also code that runs on slave nodes, using PyCharm and other popular IDE such as VSCode.
Medium
medium.com › @marcelopedronidasilva › how-to-install-and-run-pyspark-locally-integrated-with-vscode-via-jupyter-notebook-on-windows-ff209ac8621f
How to install and run Pyspark locally integrated with VSCode via Jupyter Notebook (on Windows). | by Marcelo Pedroni da Silva | Medium
July 18, 2024 - pyspark --version to see if everything is installed and configurated okay. If it did, you should see something like this: ... Now that all steps are done, you can work locally using PySpark, also can see the UI just typing pyspark on a cmd, ...
Top answer, 1 of 2 (score 29)

As far as I understand your intentions, what you want is not directly possible given Spark's architecture. Even without the subprocess call, the only part of your program that is accessible directly on the driver is the SparkContext. From the rest you are effectively isolated by different layers of communication, including at least one (in local mode) JVM instance. To illustrate that, let's use a diagram from the PySpark Internals documentation.

What is in the left box is the part that is accessible locally and could be used to attach a debugger. Since it is mostly limited to JVM calls, there is really nothing there that should be of interest to you, unless you're actually modifying PySpark itself.

What is on the right happens remotely and, depending on the cluster manager you use, is pretty much a black box from a user's perspective. Moreover, there are many situations where the Python code on the right does nothing more than call the JVM API.

That was the bad part. The good part is that most of the time there should be no need for remote debugging. Excluding access to objects like TaskContext, which can be easily mocked, every part of your code should be easily runnable / testable locally without using a Spark instance at all.
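For instance, code that reads the partition id from TaskContext can be written against a factory argument, so a trivial stub replaces the real context in tests. A minimal sketch; tag_with_partition and FakeTaskContext are illustrative names, not part of any PySpark API:

```python
# Sketch: testing code that touches TaskContext without a Spark instance.
def tag_with_partition(records, get_context):
    """get_context is pyspark.TaskContext.get in production, or any stub
    returning an object with a partitionId() method in tests."""
    pid = get_context().partitionId()
    return [(pid, rec) for rec in records]

class FakeTaskContext:
    def partitionId(self):
        return 0

# Unit test with the mock -- no SparkContext needed:
assert tag_with_partition(["a", "b"], FakeTaskContext) == [(0, "a"), (0, "b")]
```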

Functions you pass to actions / transformations take standard and predictable Python objects and are expected to return standard Python objects as well. What is also important, these should be free of side effects.

So at the end of the day you have two parts of your program: a thin layer that can be accessed interactively and tested based purely on inputs / outputs, and a "computational core" which doesn't require Spark for testing / debugging.
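A minimal sketch of that split (parse_record is a made-up example, not from the question):

```python
# "Computational core": a pure function that will later be passed to
# rdd.map(...), unit-tested here without any Spark instance.
def parse_record(line):
    name, value = line.split(",")
    return name.strip(), float(value)

# Tested purely on inputs / outputs:
assert parse_record("temp, 21.5") == ("temp", 21.5)

# The thin Spark layer is the only part that actually needs a cluster:
#   rdd = sc.textFile("data.csv").map(parse_record)
```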

Other options

That being said, you're not completely out of options here.

Local mode

(passively attach debugger to a running interpreter)

Both plain GDB and the PyCharm debugger can be attached to a running process. This can be done only once the PySpark daemon and/or worker processes have been started. In local mode you can force it by executing a dummy action, for example:

sc.parallelize([], n).count()

where n is the number of "cores" available in local mode (local[n]). Example procedure, step by step, on Unix-like systems:

  • Start the PySpark shell:

    $SPARK_HOME/bin/pyspark 
    
  • Use pgrep to check that no daemon process is running:

    ➜  spark-2.1.0-bin-hadoop2.7$ pgrep -f pyspark.daemon
    ➜  spark-2.1.0-bin-hadoop2.7$
    
  • The same thing can be determined in PyCharm by:

    alt+shift+a and choosing Attach to Local Process:

    or Run -> Attach to Local Process.

    At this point you should see only the PySpark shell (and possibly some unrelated processes).

  • Execute dummy action:

    sc.parallelize([], 1).count()

  • Now you should see both the daemon and a worker (here only one):

    ➜  spark-2.1.0-bin-hadoop2.7$ pgrep -f pyspark.daemon
    13990
    14046
    ➜  spark-2.1.0-bin-hadoop2.7$
    

    The process with the lower pid is the daemon; the one with the higher pid is a (possibly ephemeral) worker.

  • At this point you can attach a debugger to the process of interest:

    • In PyCharm, by choosing the process to connect to.
    • With plain GDB by calling:

      gdb python <pid of running process>
      

The biggest disadvantage of this approach is that you have to find the right interpreter at the right moment.

Distributed mode

(using an active component which connects to a debugger server)

With PyCharm

PyCharm provides Python Debug Server which can be used with PySpark jobs.

First of all, you should add a configuration for the remote debugger:

  • alt+shift+a and choose Edit Configurations or Run -> Edit Configurations.
  • Click on Add new configuration (the green plus) and choose Python Remote Debug.
  • Configure host and port according to your own setup (make sure that the port can be reached from the remote machine).

  • Start the debug server:

    shift+F9

    You should see the debugger console:

  • Make sure that pydevd is accessible on the worker nodes, either by installing it or distributing the egg file.

  • pydevd uses an active component which has to be included in your code:

    import pydevd
    pydevd.settrace(<host name>, port=<port number>)
    

    The tricky part is to find the right place to include it, and unless you debug batch operations (like functions passed to mapPartitions) it may require patching the PySpark source itself, for example pyspark.daemon.worker or RDD methods like RDD.mapPartitions. Let's say we are interested in debugging worker behavior. A possible patch can look like this:

    diff --git a/python/pyspark/daemon.py b/python/pyspark/daemon.py
    index 7f06d4288c..6cff353795 100644
    --- a/python/pyspark/daemon.py
    +++ b/python/pyspark/daemon.py
    @@ -44,6 +44,9 @@ def worker(sock):
         """
         Called by a worker process after the fork().
         """
    +    import pydevd
    +    pydevd.settrace('foobar', port=9999, stdoutToServer=True, stderrToServer=True)
    +
         signal.signal(SIGHUP, SIG_DFL)
         signal.signal(SIGCHLD, SIG_DFL)
         signal.signal(SIGTERM, SIG_DFL)
    

    If you decide to patch the Spark source, be sure to use the patched source, not the packaged version which is located in $SPARK_HOME/python/lib.

  • Execute PySpark code. Go back to the debugger console and have fun:
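For batch operations there is often no need to patch Spark at all: the settrace call can live inside the function you pass to mapPartitions. A sketch, where the host/port values and the PYSPARK_DEBUG environment-variable gate are assumptions of this example, not pydevd conventions:

```python
import os

def traced_partition(it):
    # Opt in via an environment variable so the same code runs untouched
    # when no debug server is listening (the variable name is made up):
    if os.environ.get("PYSPARK_DEBUG") == "1":
        import pydevd  # must be importable on every worker node
        pydevd.settrace("192.168.1.10", port=9999,   # your workstation
                        stdoutToServer=True, stderrToServer=True)
    for x in it:
        yield x * 2

# On the driver:
#   rdd.mapPartitions(traced_partition).collect()
```

With the gate left off, the function behaves like any ordinary partition function and stays testable with a plain iterator.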

Other tools

There are a number of tools, including python-manhole or pyrasite, which can be used, with some effort, to work with PySpark.

Note:

Of course, you can use "remote" (active) methods with local mode and, to some extent, "local" methods with distributed mode (you can connect to the worker node and follow the same steps as in the local mode).

Answer 2 of 2 (score 2)

Check out this tool called pyspark_xray; below is a high-level summary extracted from its docs.

pyspark_xray is a diagnostic tool, in the form of a Python library, for PySpark developers to debug and troubleshoot PySpark applications locally; specifically, it enables local debugging of PySpark RDD or DataFrame transformation functions that run on slave nodes.

The purpose of developing pyspark_xray is to create a development framework that enables PySpark application developers to debug and troubleshoot locally and do production runs remotely using the same code base. For the part of debugging Spark application code locally, pyspark_xray specifically provides the capability of locally debugging Spark application code that runs on slave nodes; the absence of this capability is an unfilled gap for Spark application developers right now.

Problem

For developers, it's very important to do step-by-step debugging of every part of an application locally in order to diagnose, troubleshoot and solve problems during development.

If you develop PySpark applications, you know that PySpark application code is made up of two categories:

  • code that runs on master node
  • code that runs on worker/slave nodes

While code on the master node can be accessed by a debugger locally, code on slave nodes is like a black box, not accessible locally by a debugger.

Plenty of tutorials on the web have covered the steps of debugging PySpark code that runs on the master node, but when it comes to debugging PySpark code that runs on slave nodes, no solution can be found; most people either treat this part of the code as a black box or see no need to debug it.

Spark code that runs on slave nodes includes, but is not limited to, lambda functions that are passed as input parameters to RDD transformation functions.
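To make the boundary concrete, here is a tiny sketch (double and the commented driver-side calls are illustrative): the driver-side lines can be stepped through locally, while the body of double executes inside worker processes, out of a local debugger's reach.

```python
def double(x):
    # This body executes on worker/slave nodes, so a breakpoint set here
    # is never hit by a debugger attached to the driver process.
    return x * 2

# Driver-side code, visible to a local debugger:
#   rdd = sc.parallelize([1, 2, 3, 4])
#   result = rdd.map(double).collect()

# The function itself is still trivially testable without Spark:
assert double(3) == 6
```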

Solution

pyspark_xray library enables developers to locally debug (step into) 100% of Spark application code, not only code that runs on master node, but also code that runs on slave nodes, using PyCharm and other popular IDE such as VSCode.

This library achieves these capabilities by using the following techniques:

  • wrapper functions of Spark code on slave nodes; check out the corresponding section of its docs to learn more details
  • the practice of sampling input data under local debugging mode in order to fit the application into the memory of your standalone local PC/Mac
    • For example, say your production input data has 1 million rows, which obviously cannot fit into one standalone PC/Mac's memory; in order to use pyspark_xray, you may take 100 sample rows as input to debug your application locally
  • usage of a flag to auto-detect local mode: CONST_BOOL_LOCAL_MODE from pyspark_xray's const.py auto-detects whether local mode is on or off based on the current OS, with values:
    • True: if the current OS is Mac or Windows
    • False: otherwise

With this in your Spark code base, you can locally debug and remotely execute your Spark application using the same code base.
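The flag-plus-sampling combination described above can be sketched as follows. CONST_BOOL_LOCAL_MODE is re-derived here from the description; sample_for_local_debug is an illustrative helper, not part of pyspark_xray's actual API:

```python
import platform

# Per the description above: True on Mac or Windows, False otherwise.
CONST_BOOL_LOCAL_MODE = platform.system() in ("Darwin", "Windows")

def sample_for_local_debug(rows, local_mode=None, sample_size=100):
    """In local debugging mode, shrink the input so the application fits
    into a single PC/Mac's memory; in production, pass it through."""
    if local_mode is None:
        local_mode = CONST_BOOL_LOCAL_MODE
    return rows[:sample_size] if local_mode else rows

# 1,000,000 production rows become a 100-row sample for local debugging:
assert len(sample_for_local_debug(list(range(1_000_000)), local_mode=True)) == 100
```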

Databricks
docs.databricks.com › developers › local development tools › databricks extension for visual studio code and cursor › tutorial
Tutorial: Run Python on a cluster and as a job using the Databricks extension for Visual Studio Code | Databricks on AWS
January 19, 2026 - Alternatively, right-click the demo.py file in the Explorer panel, then select Run on Databricks > Run File as Workflow. Now that you have successfully used the Databricks extension for Visual Studio Code to upload a local Python file and run it remotely, you can also: Explore Declarative Automation Bundles resources and variables using the extension UI. See Declarative Automation Bundles extension features. Run or debug Python code with Databricks Connect.
GitHub
github.com › jplane › pyspark-devcontainer
GitHub - jplane/pyspark-devcontainer: A simple VS Code devcontainer setup for local PySpark development · GitHub
Run the first cell... it will take a few seconds to initialize the kernel and complete. You should see a message to browse to the Spark UI...
Stack Overflow
stackoverflow.com › questions › 76399139 › spark-pyspark-configuration-in-visual-studio-code
python - Spark PySpark Configuration in Visual Studio Code - Stack Overflow
Traceback (most recent call last): File "c:\VScode workspace\spark_test\pyspark-test.py", line 1, in <module> from pyspark.sql import SparkSession ModuleNotFoundError: No module named 'pyspark'
Vikas Srivastava
vikassri.com › posts › setting-pyspark-dev
Settting up pyspark development on vscode (Mac) | Vikas Srivastava
July 25, 2020 - You need to set up environment variable in the vscode. Let’s add the variable in vscode · code -> preference -> setting -> {search for 'ENV: Osx'} -> edit the setting.json ... Once you add above lines restart the vscode and test it, Before writing code all you need to do is to download pyspark package