Kontext
kontext.tech › home › blogs › spark & pyspark › debug pyspark code in visual studio code
Debug PySpark Code in Visual Studio Code - Kontext Labs
January 4, 2020 - The page summarizes the steps required to run and debug PySpark (Spark for Python) in Visual Studio Code.
Stack Overflow
stackoverflow.com › questions › 73051927 › debug-pyspark-in-vs-code
apache spark - Debug PySpark in VS Code - Stack Overflow
13 best practice for debugging python-spark code · 7 How to troubleshoot 'pyspark' is not recognized... error on Windows? 1 E0401:Unable to import 'pyspark in VSCode in Windows 10 · 3 pyspark dataframe methods (i.e. show()) can not be printed in vs code debug console ·
Stack Overflow
stackoverflow.com › questions › 65500267 › debug-pyspark-in-visual-studio-code
apache spark - Debug PySpark in Visual Studio Code - Stack Overflow
December 29, 2020 - Modifying Pyspark source code for debugging · 1 · E0401:Unable to import 'pyspark in VSCode in Windows 10 · 3 · pyspark dataframe methods (i.e. show()) can not be printed in vs code debug console · 3 · Visual studio code using pytest for Pyspark getting stuck at SparkSession Creation ·
GitHub
github.com › microsoft › vscode-python › issues › 15385
Impossible to debug pyspark script after upgrade vscode v1.53.1 · Issue #15385 · microsoft/vscode-python
1- Set debug configuration : { "name": "PySpark : test", "type": "python", "request": "launch", "stopOnEntry": true, "pythonPath": "${workspaceRoot}/sparkSubmit.sh", "program": "${workspaceRoot}/test_pyspark.py", "args": [], "cwd": "${workspaceRoot}/", "console": "integratedTerminal", "envFile": "${workspaceRoot}/.env", "env": {"PYSPARK_PYTHON":"python3"} } 2- sparkSubmit.sh call sparkSubmit standard tool 3- After upgrading vscode to 1.53.1, the debugger cannot start.
GitHub
github.com › microsoft › vscode-data-wrangler › issues › 255
Pyspark Dataframe Support · Issue #255 · microsoft/vscode-data-wrangler
August 10, 2024 - Error message "Could not retrieve variable df_config from the Jupyter extension. Please file an issue on the Data Wrangler GitHub repository." appears instead. Debug a pyspark application in VS Code, right click.
Author   0xbadidea
Microsoft Learn
learn.microsoft.com › en-us › fabric › data-engineering › author-sjd-with-vs-code
Create and manage Apache Spark job definitions in VS Code - Microsoft Fabric | Microsoft Learn
If the Spark job definition is created with PySpark (Python), you can download the .py script of the main definition file and the referenced file, and debug the source script in VS Code.
GitHub
github.com › microsoft › vscode-python › issues › 2531
Debugging pyspark applications no longer works after August update · Issue #2531 · microsoft/vscode-python
September 8, 2018 - The following command is displayed in the terminal after starting the debugger · cd /home/user/etl ; env "PYSPARK_PYTHON=python" "PYTHONPATH=/home/user/etl" "PYTHONIOENCODING=UTF-8" "PYTHONUNBUFFERED=1" /home/user/Spark/spark-2.2.1-bin-hadoop2.7/bin/spark-submit /home/user/.vscode/extensio...
Medium
nidhig631.medium.com › exploring-pyspark-setup-in-visual-studio-code-23f2bb778dde
Exploring PySpark Setup in Visual Studio Code | by Namaste Databricks | Medium
August 24, 2024 - This article provides a step-by-step guide to setting up your environment, leveraging the robust capabilities of PySpark, and seamlessly integrating it into the VS Code. Discover the efficiency and flexibility of developing, debugging, and optimizing your PySpark applications in a user-friendly and powerful IDE environment.
Stack Overflow
stackoverflow.com › questions › 64413494 › how-do-i-setup-pyspark-in-vs-code
How do I setup pyspark in VS Code? - Stack Overflow
export PYSPARK_PYTHON=python3.8 export PYSPARK_DRIVER_PYTHON=python3.8 · AND in vscode setting python interpreter to 3.8 too (you can set it from command palette and typing Python:Select Interpreter.
Microsoft SQL Server Blog
cloudblogs.microsoft.com › home › visual studio code: develop pyspark jobs for sql server 2019 big data clusters
Visual Studio Code: Develop PySpark jobs for SQL Server 2019 Big Data Clusters - Microsoft SQL Server Blog
January 9, 2024 - With the Visual Studio Code extension, you can enjoy native Python programming experiences such as linting, debugging support, language service, and so on. You can run current line, run selected lines of code, or run all for your PY file. You can import and export a .ipynb notebook and perform a notebook like query including Run Cell, Run Above, or Run Below.
Microsoft Community
community.fabric.microsoft.com › t5 › Data-Engineering › Debugging-in-VSCode-pyspark-errors-exceptions-captured › m-p › 4602186
Solved: Debugging in VSCode: pyspark.errors.exceptions.cap... - Microsoft Fabric Community
April 3, 2025 - # Install mamba and PySpark ENV SPARK_VERSION=3.4.1 RUN conda install -n base -c conda-forge mamba -y && \ mamba install -n base -c conda-forge pyspark==$SPARK_VERSION -y && \ conda clean --all -y · As the original conda solver had issue with resolving dependency's and was not able to build an image. The default checks I'm doing are: What's my JAVA_HOME? Am I on the latest (prerelease)version of the fabric data enginering extension · Then I've ran into issue's with: - A refresh token: Solved: Re: Debugging in VSCode: Failed to get refresh tok...
GitHub
github.com › microsoft › vscode-python › issues › 5921
Make spark-submit arguments come right after "spark-submit" when debugging · Issue #5921 · microsoft/vscode-python
June 5, 2019 - spark-submit --queue ds-others training.py That works fine when executed in my terminal · Based on the launch.json above, when I start the debugger, the Debug console executes: cd /hadoop/met_scripts/datascience/dnaanalytics_us_group/us_grp_mo_idi_fwa ; env PYTHONIOENCODING=UTF-8 PYTHONUNBUFFERED=1 /bin/spark-submit /home/nokyere/.vscode-server-insiders/extensions/ms-python.python-2019.5.17517/pythonFiles/ptvsd_launcher.py --default --client --host localhost --port 40845 /hadoop/met_scripts/datascience/dnaanalytics_us_group/us_grp_mo_idi_fwa/python/training.py --queue ds-others
Author   okyere
Microsoft Azure
azure.microsoft.com › blog home › developer tools › run your pyspark interactive query and batch job in visual studio code
Run your PySpark Interactive Query and batch job in Visual Studio Code | Microsoft Azure Blog
June 26, 2025 - We are excited to introduce the integration of HDInsight PySpark into Visual Studio Code (VSCode), which allows developers to easily edit Python scripts and submit PySpark statements to HDInsight clusters. This interactivity brings the best properties of Python and Spark to developers and empowers ...
Medium
bradyjiang.medium.com › introducing-pyspark-xray-a-diagnostic-tool-that-enables-local-debugging-of-pyspark-applications-17c06d6eca8d
Introducing pyspark_xray: a diagnostic tool that enables local debugging of PySpark applications using VSCode or PyCharm | by Brady Jiang | Medium
February 25, 2021 - pyspark_xray library enables developers to locally debug (step into) 100% of Spark application code, not only code that runs on master node, but also code that runs on slave nodes, using PyCharm and other popular IDE such as VSCode.
Medium
medium.com › @marcelopedronidasilva › how-to-install-and-run-pyspark-locally-integrated-with-vscode-via-jupyter-notebook-on-windows-ff209ac8621f
How to install and run Pyspark locally integrated with VSCode via Jupyter Notebook (on Windows). | by Marcelo Pedroni da Silva | Medium
July 18, 2024 - pyspark --version to see if everything is installed and configurated okay. If it did, you should see something like this: ... Now that all steps are done, you can work locally using PySpark, also can see the UI just typing pyspark on a cmd, ...
Top answer, 1 of 2 (score 29)

As far as I understand your intentions, what you want is not directly possible given Spark's architecture. Even without the subprocess call, the only part of your program that is accessible directly on the driver is the SparkContext. From the rest you are effectively isolated by different layers of communication, including at least one (in local mode) JVM instance. To illustrate that, let's use a diagram from the PySpark Internals documentation.

What is in the left box is the part that is accessible locally and could be used to attach a debugger. Since it is mostly limited to JVM calls, there is really nothing there that should be of interest to you, unless you're actually modifying PySpark itself.

What is on the right happens remotely and, depending on the cluster manager you use, is pretty much a black box from a user's perspective. Moreover, there are many situations where the Python code on the right does nothing more than call the JVM API.

That was the bad part. The good part is that most of the time there should be no need for remote debugging. Excluding access to objects like TaskContext, which can be easily mocked, every part of your code should be easily runnable / testable locally without using a Spark instance at all.
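For instance, code that reads the partition id from TaskContext can be written against a factory argument, so a trivial stub replaces the real context in tests. A minimal sketch; tag_with_partition and FakeTaskContext are illustrative names, not part of any PySpark API:

```python
# Sketch: testing code that touches TaskContext without a Spark instance.
def tag_with_partition(records, get_context):
    """get_context is pyspark.TaskContext.get in production, or any stub
    returning an object with a partitionId() method in tests."""
    pid = get_context().partitionId()
    return [(pid, rec) for rec in records]

class FakeTaskContext:
    def partitionId(self):
        return 0

# Unit test with the mock -- no SparkContext needed:
assert tag_with_partition(["a", "b"], FakeTaskContext) == [(0, "a"), (0, "b")]
```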

Functions you pass to actions / transformations take standard and predictable Python objects and are expected to return standard Python objects as well. What is also important, these should be free of side effects.

So at the end of the day you have two parts of your program: a thin layer that can be accessed interactively and tested based purely on inputs / outputs, and a "computational core" which doesn't require Spark for testing / debugging.
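A minimal sketch of that split (parse_record is a made-up example, not from the question):

```python
# "Computational core": a pure function that will later be passed to
# rdd.map(...), unit-tested here without any Spark instance.
def parse_record(line):
    name, value = line.split(",")
    return name.strip(), float(value)

# Tested purely on inputs / outputs:
assert parse_record("temp, 21.5") == ("temp", 21.5)

# The thin Spark layer is the only part that actually needs a cluster:
#   rdd = sc.textFile("data.csv").map(parse_record)
```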

Other options

That being said, you're not completely out of options here.

Local mode

(passively attach debugger to a running interpreter)

Both plain GDB and the PyCharm debugger can be attached to a running process. This can be done only once the PySpark daemon and/or worker processes have been started. In local mode you can force it by executing a dummy action, for example:

sc.parallelize([], n).count()

where n is the number of "cores" available in local mode (local[n]). Example procedure, step by step, on Unix-like systems:

  • Start the PySpark shell:

    $SPARK_HOME/bin/pyspark 
    
  • Use pgrep to check that no daemon process is running:

    ➜  spark-2.1.0-bin-hadoop2.7$ pgrep -f pyspark.daemon
    ➜  spark-2.1.0-bin-hadoop2.7$
    
  • The same thing can be determined in PyCharm by:

    alt+shift+a and choosing Attach to Local Process:

    or Run -> Attach to Local Process.

    At this point you should see only the PySpark shell (and possibly some unrelated processes).

  • Execute dummy action:

    sc.parallelize([], 1).count()

  • Now you should see both the daemon and a worker (here only one):

    ➜  spark-2.1.0-bin-hadoop2.7$ pgrep -f pyspark.daemon
    13990
    14046
    ➜  spark-2.1.0-bin-hadoop2.7$
    

    The process with the lower pid is the daemon; the one with the higher pid is a (possibly ephemeral) worker.

  • At this point you can attach a debugger to the process of interest:

    • In PyCharm, by choosing the process to connect to.
    • With plain GDB by calling:

      gdb python <pid of running process>
      

The biggest disadvantage of this approach is that you have to find the right interpreter at the right moment.

Distributed mode

(using an active component which connects to a debugger server)

With PyCharm

PyCharm provides Python Debug Server which can be used with PySpark jobs.

First of all, you should add a configuration for the remote debugger:

  • alt+shift+a and choose Edit Configurations or Run -> Edit Configurations.
  • Click on Add new configuration (the green plus) and choose Python Remote Debug.
  • Configure host and port according to your own setup (make sure that the port can be reached from the remote machine).

  • Start the debug server:

    shift+F9

    You should see the debugger console:

  • Make sure that pydevd is accessible on the worker nodes, either by installing it or distributing the egg file.

  • pydevd uses an active component which has to be included in your code:

    import pydevd
    pydevd.settrace(<host name>, port=<port number>)
    

    The tricky part is to find the right place to include it, and unless you debug batch operations (like functions passed to mapPartitions) it may require patching the PySpark source itself, for example pyspark.daemon.worker or RDD methods like RDD.mapPartitions. Let's say we are interested in debugging worker behavior. A possible patch can look like this:

    diff --git a/python/pyspark/daemon.py b/python/pyspark/daemon.py
    index 7f06d4288c..6cff353795 100644
    --- a/python/pyspark/daemon.py
    +++ b/python/pyspark/daemon.py
    @@ -44,6 +44,9 @@ def worker(sock):
         """
         Called by a worker process after the fork().
         """
    +    import pydevd
    +    pydevd.settrace('foobar', port=9999, stdoutToServer=True, stderrToServer=True)
    +
         signal.signal(SIGHUP, SIG_DFL)
         signal.signal(SIGCHLD, SIG_DFL)
         signal.signal(SIGTERM, SIG_DFL)
    

    If you decide to patch the Spark source, be sure to use the patched source, not the packaged version which is located in $SPARK_HOME/python/lib.

  • Execute PySpark code. Go back to the debugger console and have fun:
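For batch operations there is often no need to patch Spark at all: the settrace call can live inside the function you pass to mapPartitions. A sketch, where the host/port values and the PYSPARK_DEBUG environment-variable gate are assumptions of this example, not pydevd conventions:

```python
import os

def traced_partition(it):
    # Opt in via an environment variable so the same code runs untouched
    # when no debug server is listening (the variable name is made up):
    if os.environ.get("PYSPARK_DEBUG") == "1":
        import pydevd  # must be importable on every worker node
        pydevd.settrace("192.168.1.10", port=9999,   # your workstation
                        stdoutToServer=True, stderrToServer=True)
    for x in it:
        yield x * 2

# On the driver:
#   rdd.mapPartitions(traced_partition).collect()
```

With the gate left off, the function behaves like any ordinary partition function and stays testable with a plain iterator.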

Other tools

There are a number of tools, including python-manhole or pyrasite, which can be used, with some effort, to work with PySpark.

Note:

Of course, you can use "remote" (active) methods with local mode and, to some extent, "local" methods with distributed mode (you can connect to the worker node and follow the same steps as in the local mode).

Answer 2 of 2 (score 2)

Check out this tool called pyspark_xray; below is a high-level summary extracted from its docs.

pyspark_xray is a diagnostic tool, in the form of a Python library, for PySpark developers to debug and troubleshoot PySpark applications locally; specifically, it enables local debugging of PySpark RDD or DataFrame transformation functions that run on slave nodes.

The purpose of developing pyspark_xray is to create a development framework that enables PySpark application developers to debug and troubleshoot locally and do production runs remotely using the same code base. For the part of debugging Spark application code locally, pyspark_xray specifically provides the capability of locally debugging Spark application code that runs on slave nodes; the absence of this capability is an unfilled gap for Spark application developers right now.

Problem

For developers, it's very important to do step-by-step debugging of every part of an application locally in order to diagnose, troubleshoot and solve problems during development.

If you develop PySpark applications, you know that PySpark application code is made up of two categories:

  • code that runs on master node
  • code that runs on worker/slave nodes

While code on the master node can be accessed by a debugger locally, code on slave nodes is like a black box, not accessible locally by a debugger.

Plenty of tutorials on the web have covered the steps of debugging PySpark code that runs on the master node, but when it comes to debugging PySpark code that runs on slave nodes, no solution can be found; most people either treat this part of the code as a black box or see no need to debug it.

Spark code that runs on slave nodes includes, but is not limited to, lambda functions that are passed as input parameters to RDD transformation functions.
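To make the boundary concrete, here is a tiny sketch (double and the commented driver-side calls are illustrative): the driver-side lines can be stepped through locally, while the body of double executes inside worker processes, out of a local debugger's reach.

```python
def double(x):
    # This body executes on worker/slave nodes, so a breakpoint set here
    # is never hit by a debugger attached to the driver process.
    return x * 2

# Driver-side code, visible to a local debugger:
#   rdd = sc.parallelize([1, 2, 3, 4])
#   result = rdd.map(double).collect()

# The function itself is still trivially testable without Spark:
assert double(3) == 6
```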

Solution

pyspark_xray library enables developers to locally debug (step into) 100% of Spark application code, not only code that runs on master node, but also code that runs on slave nodes, using PyCharm and other popular IDE such as VSCode.

This library achieves these capabilities by using the following techniques:

  • wrapper functions of Spark code on slave nodes; check out the corresponding section of its docs to learn more details
  • the practice of sampling input data under local debugging mode in order to fit the application into the memory of your standalone local PC/Mac
    • For example, say your production input data has 1 million rows, which obviously cannot fit into one standalone PC/Mac's memory; in order to use pyspark_xray, you may take 100 sample rows as input to debug your application locally
  • usage of a flag to auto-detect local mode: CONST_BOOL_LOCAL_MODE from pyspark_xray's const.py auto-detects whether local mode is on or off based on the current OS, with values:
    • True: if the current OS is Mac or Windows
    • False: otherwise

With this in your Spark code base, you can locally debug and remotely execute your Spark application using the same code base.
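The flag-plus-sampling combination described above can be sketched as follows. CONST_BOOL_LOCAL_MODE is re-derived here from the description; sample_for_local_debug is an illustrative helper, not part of pyspark_xray's actual API:

```python
import platform

# Per the description above: True on Mac or Windows, False otherwise.
CONST_BOOL_LOCAL_MODE = platform.system() in ("Darwin", "Windows")

def sample_for_local_debug(rows, local_mode=None, sample_size=100):
    """In local debugging mode, shrink the input so the application fits
    into a single PC/Mac's memory; in production, pass it through."""
    if local_mode is None:
        local_mode = CONST_BOOL_LOCAL_MODE
    return rows[:sample_size] if local_mode else rows

# 1,000,000 production rows become a 100-row sample for local debugging:
assert len(sample_for_local_debug(list(range(1_000_000)), local_mode=True)) == 100
```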

Databricks
docs.databricks.com › developers › local development tools › databricks extension for visual studio code and cursor › tutorial
Tutorial: Run Python on a cluster and as a job using the Databricks extension for Visual Studio Code | Databricks on AWS
January 19, 2026 - Alternatively, right-click the demo.py file in the Explorer panel, then select Run on Databricks > Run File as Workflow. Now that you have successfully used the Databricks extension for Visual Studio Code to upload a local Python file and run it remotely, you can also: Explore Declarative Automation Bundles resources and variables using the extension UI. See Declarative Automation Bundles extension features. Run or debug Python code with Databricks Connect.
GitHub
github.com › jplane › pyspark-devcontainer
GitHub - jplane/pyspark-devcontainer: A simple VS Code devcontainer setup for local PySpark development · GitHub
Run the first cell... it will take a few seconds to initialize the kernel and complete. You should see a message to browse to the Spark UI...
Stack Overflow
stackoverflow.com › questions › 76399139 › spark-pyspark-configuration-in-visual-studio-code
python - Spark PySpark Configuration in Visual Studio Code - Stack Overflow
Traceback (most recent call last): File "c:\VScode workspace\spark_test\pyspark-test.py", line 1, in <module> from pyspark.sql import SparkSession ModuleNotFoundError: No module named 'pyspark'
Vikas Srivastava
vikassri.com › posts › setting-pyspark-dev
Settting up pyspark development on vscode (Mac) | Vikas Srivastava
July 25, 2020 - You need to set up environment variable in the vscode. Let’s add the variable in vscode · code -> preference -> setting -> {search for 'ENV: Osx'} -> edit the setting.json ... Once you add above lines restart the vscode and test it, Before writing code all you need to do is to download pyspark package