In my environment (using Docker and the image sequenceiq/spark:1.1.0-ubuntu), I ran into this. If you look at the pyspark shell script, you'll see that you need a few things added to your PYTHONPATH:

export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH

That worked in IPython for me.

Update: as noted in the comments, the name of the py4j zip file changes with each Spark release, so look around for the right name.

Answer from nealmcb on Stack Overflow
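Since the name of the bundled py4j zip changes between Spark releases, it can also be located with a glob instead of hard-coded in the export. A minimal sketch (editor's addition, not part of the original answer), assuming the standard $SPARK_HOME/python/lib layout described above:

```python
import glob
import os

def pyspark_paths(spark_home):
    """Return the sys.path entries needed to import pyspark and its bundled py4j.

    Globs for the versioned py4j zip so the snippet survives Spark upgrades.
    """
    python_dir = os.path.join(spark_home, "python")
    py4j_zips = glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip"))
    if not py4j_zips:
        raise FileNotFoundError(f"no py4j-*-src.zip under {python_dir}/lib")
    return [python_dir, py4j_zips[0]]

# Typical use, before the first `import pyspark`:
#   import sys
#   sys.path.extend(pyspark_paths(os.environ["SPARK_HOME"]))
```

This does in Python what the two export lines do in the shell, without pinning a py4j version.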
PyPI
pypi.org › project › pyspark
pyspark · PyPI
At its core PySpark depends on Py4J, but some additional sub-packages have their own extra requirements for some features (including numpy, pandas, and pyarrow).
    pip install pyspark
Published   Jan 09, 2026
Version   4.1.1
Apache
spark.apache.org › docs › latest › api › python › getting_started › install.html
Installation — PySpark 4.1.1 documentation - Apache Spark
Ensure the SPARK_HOME environment variable points to the directory where the tar file has been extracted. Update PYTHONPATH environment variable such that it can find the PySpark and Py4J under SPARK_HOME/python/lib.
Discussions

python - Why can't PySpark find py4j.java_gateway? - Stack Overflow
I installed Spark, ran the sbt assembly, and can open bin/pyspark with no problem. However, I am running into problems loading the pyspark module into ipython. I'm getting the following error: In ...
Creating pyspark's spark context py4j java gateway object - Stack Overflow
I am trying to convert a java dataframe to a pyspark dataframe. For this I am creating a dataframe(or dataset of Row) in java process and starting a py4j.GatewayServer server process on java side. ...
python - Py4J error when creating a spark dataframe using pyspark - Stack Overflow
After many searches via Google, I found the correct way of setting the required environment variables: PYTHONPATH=$SPARK_HOME$\python;$SPARK_HOME$\python\lib\py4j--src.zip The version of the Py4J source package changes between the Spark versions, thus, check what you have in your Spark ...
What are compatible versions of pyspark and py4j packages in python - Stack Overflow
I am trying to setup pyspark locally I've initiated a spark session created a view named people tried to read the view via below command spark.sql("Select * From people") It throws the ...
Apache
spark.apache.org › docs › latest › api › python › development › debugging.html
Debugging PySpark — PySpark 4.1.1 documentation
PySpark uses Spark as an engine. PySpark uses Py4J to leverage Spark to submit and compute the jobs.
Databricks
databricks.com › glossary › pyspark
What is Pyspark? | Databricks
Py4J is a popular library which is integrated within PySpark and allows Python to dynamically interface with JVM objects. PySpark features quite a few libraries for writing efficient programs.
Medium
medium.com › @sivakumartoday › how-python-interacts-with-spark-using-py4j-pyspark-f93eb7e2c7c7
How Python Interacts with Spark Using Py4J (PySpark)? | by Sivakumar N | Medium
July 6, 2023 - PySpark uses Py4j, a Python library, to interact with the Java Virtual Machine (JVM) that runs Spark. Py4j enables seamless communication…
Stack Overflow
stackoverflow.com › questions › 66797382 › creating-pysparks-spark-context-py4j-java-gateway-object
Creating pyspark's spark context py4j java gateway object - Stack Overflow
File "path_to_virtual_environment/lib/site-packages/pyspark/conf.py", line 120, in __init__
    self._jconf = _jvm.SparkConf(loadDefaults)
TypeError: 'JavaPackage' object is not callable

Can someone please help? Below is the code I am using: ...

    import py4j.GatewayServer;

    public class TestJavaToPythonTransfer {
        Dataset<Row> df1;

        public TestJavaToPythonTransfer() {
            SparkSession spark = SparkSession.builder()
                    .appName("test1")
                    .config("spark.master", "local")
                    .getOrCreate();
            df1 = spark.read().json("path/to/local/json_file");
        }

        public Dataset<Row> getDf() {
            return df1;
        }

        public static void main(String args[]) {
            GatewayServer gatewayServer = new GatewayServer(new TestJavaToPythonTransfer());
            gatewayServer.start();
            System.out.println("Gateway server started");
        }
    }
Waiting for Code
waitingforcode.com › home › pyspark
PySpark and the JVM - introduction, part 1 on waitingforcode.com - articles about PySpark
Unfortunately, there is no native way to write a Python code and run it on the JVM. Instead, the operation requires a proxy able to take the code from Python, pass it to the JVM, and get the results back if needed. The proxy layer used for that in PySpark is the Py4J library.
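The proxy idea described above can be pictured in plain Python: an object that intercepts attribute access and forwards each method call to a "remote" side instead of running it locally, which is roughly how Py4J's Java object wrappers behave. This toy sketch (editor's illustration, not Py4J's actual API) forwards to a local dict of callables rather than a real JVM:

```python
class RemoteProxy:
    """Toy stand-in for a Py4J-style proxy: attribute access is intercepted
    and each call is forwarded to a 'remote' backend instead of running
    locally. Here the backend is just a dict of callables, not a JVM."""

    def __init__(self, backend):
        self._backend = backend

    def __getattr__(self, name):
        def call(*args):
            # In real Py4J this would serialize the call over a socket to
            # the JVM-side GatewayServer and deserialize the reply.
            return self._backend[name](*args)
        return call

# A fake "JVM side" exposing one method.
jvm_side = {"textLength": lambda s: len(s)}
proxy = RemoteProxy(jvm_side)
result = proxy.textLength("spark")  # forwarded to the backend, returns 5
```

The point is that `proxy.textLength(...)` looks like a local call but is actually resolved elsewhere; Py4J applies the same trick across a socket to the JVM.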
GitHub
github.com › apache › spark › blob › master › python › pyspark › java_gateway.py
spark/python/pyspark/java_gateway.py at master · apache/spark
conf : :py:class:`pyspark.SparkConf` ... JVM. This is a developer feature intended for use in customizing how pyspark interacts with the py4j JVM (e.g., capturing ...)
Spark By {Examples}
sparkbyexamples.com › home › pyspark › solved: py4j.protocol.py4jerror: org.apache.spark.api.python.pythonutils.getencryptionenabled does not exist in the jvm
SOLVED: py4j.protocol.Py4JError: org.apache.spark.api.python.PythonUtils.getEncryptionEnabled does not exist in the JVM - Spark By {Examples}
March 27, 2024 - Sometimes after changing/upgrading the Spark version, you may get this error due to the version incompatible between pyspark version and pyspark available at anaconda lib. In order to correct it do the following. Note: copy the specified folder from inside the zip files and make sure you have environment variables set right as mentioned in the beginning. Copy the py4j folder from C:\apps\opt\spark-3.0.0-bin-hadoop2.7\python\lib\py4j-0.10.9-src.zip\ to C:\Programdata\anaconda3\Lib\site-packages\.
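The manual copy described above can also be scripted. This sketch (editor's addition; the paths in the usage comment are examples from the quoted text, adjust them to your installation) extracts only the py4j package from the versioned zip using the standard library:

```python
import zipfile

def extract_py4j(zip_path, dest_dir):
    """Extract only the py4j/ package from a py4j-*-src.zip into dest_dir,
    mirroring the manual copy into site-packages described above."""
    with zipfile.ZipFile(zip_path) as zf:
        members = [m for m in zf.namelist() if m.startswith("py4j/")]
        zf.extractall(dest_dir, members=members)
    return members

# Example paths, matching the quoted instructions -- adjust to your setup:
# extract_py4j(r"C:\apps\opt\spark-3.0.0-bin-hadoop2.7\python\lib\py4j-0.10.9-src.zip",
#              r"C:\Programdata\anaconda3\Lib\site-packages")
```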
DEV Community
dev.to › steadbytes › python-spark-and-the-jvm-an-overview-of-the-pyspark-runtime-architecture-21gg
Python, Spark and the JVM: An overview of the PySpark Runtime Architecture - DEV Community
May 3, 2020 - Take a look at this visual1 "TL;DR" ... and transfer is handled by Spark JVM processes. The Python driver program communicates with a local JVM running Spark via Py4J2......
Medium
medium.com › @ketanvatsalya › a-scenic-route-through-pyspark-internals-feaf74ed660d
A Scenic Route through PySpark Internals | by Ketan Vatsalya | Medium
December 26, 2018 - Okay, so every SparkContext (the big white box in the diagram) has an associated gateway (the grey box marked Py4j), and that gateway is linked with a JVM. There can only be one SparkContext per JVM. And we somehow associate a JavaSparkContext (the inner grey box) with the JVM.
Apache
cwiki.apache.org › confluence › display › spark › pyspark+internals
PySpark Internals - Spark - Apache Software Foundation
PySpark is built on top of Spark's Java API. Data is processed in Python and cached / shuffled in the JVM: In the Python driver program, SparkContext uses Py4J to launch a JVM and create a JavaSparkContext.
Medium
medium.com › @saaayush646 › understanding-py4j-in-apache-spark-a4ee298f648f
Understanding Py4j in Apache Spark | by Aayush Singh | Medium
November 30, 2023 - Apache Spark, a versatile big data processing framework, harmonises the power of Java and Python through Py4J, fostering seamless integration and cross-language communication. In this guide, we’ll explore the workings of Py4J by dissecting ...
Reddit
reddit.com › r/apachespark › error with pyspark and py4j
r/apachespark on Reddit: Error with PySpark and Py4J
September 5, 2024 -

Hey everyone!

I recently started working with Apache Spark, and its PySpark implementation in a professional environment, thus I am by no means an expert, and I am facing an error with Py4J.

In more detail, I have installed Apache Spark and already set up the SPARK_HOME, HADOOP_HOME, and JAVA_HOME environment variables. As I want to run PySpark without using pip install pyspark, I have set up a PYTHONPATH environment variable with values pointing to the python folder of Apache Spark and to the py4j zip inside it.
My issue is that when I create a dataframe from scratch and use the command df.show() I get the Error

"Py4JJavaError: An error occurred while calling o143.showString. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 (TID 4) (xxx-yyyy.mshome.net executor driver): org.apache.spark.SparkException: Python worker failed to connect back."

However, the command works as it should when the dataframe is created, for example, by reading a csv file. Other commands that I have tried also work as they should.

The versions of the programs I use are:
Python 3.11.9 (always using venv, so Python is not in path)
Java 11
Apache Spark 3.5.1 (and Hadoop 3.3.6 for the winutils.exe file and hadoop.dll)
Visual Studio Code
Windows 11

I have tried other versions of Python (3.11.8, 3.12.4) and Apache Spark (3.5.2), with the same result.

Any help would be greatly appreciated!

The following two pictures just show an example of the issue that I am facing.

----------- UPDATED SOLUTION -----------

In the end, also thanks to the suggestions in the comments, I figured out a way to make PySpark work with the following implementation. After running this code in a cell, PySpark is recognized as it should and the code runs without issues even for the manually created dataframe. Hopefully it can also be helpful to others!

# Import the necessary libraries
import os, sys

# Make the Spark workers use the same Python interpreter as the driver
os.environ["PYSPARK_PYTHON"] = sys.executable

# Build the paths to Spark's Python sources and the bundled Py4J zip
# (os.path.join avoids the backslash-escaping pitfalls of hard-coded Windows paths)
spark_python_path = os.path.join(os.environ["SPARK_HOME"], "python")
py4j_zip_path = os.path.join(spark_python_path, "lib", "py4j-0.10.9.7-src.zip")

# Add the paths to sys.path
for path in [spark_python_path, py4j_zip_path]:
    if path not in sys.path:
        sys.path.append(path)

# Verify that the paths have been added to sys.path
print("sys.path:", sys.path)
Py4j
py4j.org
Welcome to Py4J — Py4J
Py4J enables Python programs running in a Python interpreter to dynamically access Java objects in a Java Virtual Machine. Methods are called as if the Java objects resided in the Python interpreter and Java collections can be accessed through standard Python collection methods.
Python.org
discuss.python.org › python help
Getting py4j.protocol.Py4JJavaError when running Spark job (pyspark version 3.5.1 and python version 3.11) - Python Help - Discussions on Python.org
April 17, 2024 - Hi, I am getting the following error when running Spark job (pySpark 3.5.1 is what my pip freeze shows) using Python 3.11. My colleague is using python 3.9 and he seems to have no problem. Could it be just because of higher Python version difference? py4j.protocol.Py4JJavaError: An error occurred while calling o60.javaToPython.