As it is right now, the result depends on the working directory from which you invoke the script.

If you're in the project root, this will add its parent instead. You should use a path relative to __file__ (see "What does the __file__ variable mean/do?"):

import os

parentPath = os.path.join(
    os.path.abspath(os.path.dirname(__file__)),
    os.path.pardir
)
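For illustration, the same computation wrapped in a helper (the path below is purely hypothetical); the resulting directory is what you would prepend to sys.path so the parent package becomes importable regardless of the working directory:

```python
import os
import sys

def parent_of(script_path):
    # Absolute parent of the directory containing script_path,
    # independent of the current working directory
    return os.path.abspath(
        os.path.join(os.path.dirname(os.path.abspath(script_path)), os.path.pardir)
    )

# In a real script you would pass __file__:
#     sys.path.insert(0, parent_of(__file__))
parent = parent_of("/home/user/project/scripts/job.py")
print(parent)  # /home/user/project
```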

but I'd recommend using a proper package structure instead.

Note:

This covers only local mode and the driver path; even in local mode, worker paths are not affected by the driver path.

To handle executor paths (after these changes you may still get executor-side exceptions) you should also distribute the modules to the workers, as described in "How to use custom classes with Apache Spark (pyspark)?":

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tests").getOrCreate()
spark.sparkContext.addPyFile("/path/to/cast_to_float.py")
Answer from Alper t. Turker on Stack Overflow
Related discussions:

- Py4JJavaError: An error occurred while calling o71.showString (jupyterlab/jupyterlab issue #16715)
- PySpark python issue: Py4JJavaError: An error occurred while calling o48.showString (Stack Overflow)
- An error occurred while calling o196.showString (Stack Overflow)
- Error while I am using DataFrame show method in Pyspark (Stack Overflow)
- I created a data frame but was not able to see the data (Databricks Community, June 17, 2023)
- Can't show dataframe (df.show() fails) (GoogleCloudDataproc/spark-bigquery-connector issue #225, July 31, 2020)
r/apachespark on Reddit: Connection Reset error on creating/showing DataFrame directly from data, but reading from CSV works (September 9, 2024)

Hello, I started learning PySpark a week back and faced some issues today, which I then narrowed to create a minimal example of the problem:

    from pyspark.sql import SparkSession
    spark = SparkSession.builder \
        .appName("CreateDataFrameExample") \
        .getOrCreate()
    
    columns = ["language", "users_count"]
    data = [("Java", 20000), ("Python", 10000), ("Scala", 3000)]
    
    # fails when df.show() is called [Connection Reset error]
    df = spark.createDataFrame(data, columns)
    
    #this works as expected
    #df = spark.read.csv("data.csv", header=True)
    
    df.show()

I get a connection reset error when I show the df created directly from the data, but I am able to show the dataframe created by reading the csv. As a sanity check I tried LLMs, which say the code is correct. I have also tried setting the timeout and heartbeat interval to high values, which hasn't helped.
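Not part of the original post, but a fix commonly suggested for this symptom (the Python worker dying mid-task while CSV reads still succeed) is to pin Spark's worker Python to the driver's interpreter before the session is created; a sketch:

```python
import os
import sys

# Make Spark launch its Python workers with the same interpreter that
# runs the driver, so driver and worker Python versions cannot diverge
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

# Build the session only after these variables are set, e.g.:
#     spark = SparkSession.builder.appName("CreateDataFrameExample").getOrCreate()
```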

Stacktrace:

An error occurred while calling o47.showString.  
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (DESKTOP-\*\*\*\* executor driver): java.net.SocketException: Connection reset  
at java.net.SocketInputStream.read(SocketInputStream.java:210)  
at java.net.SocketInputStream.read(SocketInputStream.java:141)  
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)  
at java.io.BufferedInputStream.read(BufferedInputStream.java:265)  
at java.io.DataInputStream.readInt(DataInputStream.java:387)  
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:774)     
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:766)     
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:525)  
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)    
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)  
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)  
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)  
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)  
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)  
at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)  
at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)  
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:893)          
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:893)  
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)  
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)  
at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)  
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)  
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)          
at org.apache.spark.scheduler.Task.run(Task.scala:141)  
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)  
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)  
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)  
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)  
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)  
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)   
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)   
at java.lang.Thread.run(Thread.java:748)  
  
Driver stacktrace:  
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2856)  
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2792)  
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2791)  
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)          
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)         
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)  
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2791)       
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1247)  
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1247)  
at scala.Option.foreach(Option.scala:407)  
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1247)  
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3060)  
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2994)  
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2983)  
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)  
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:989)  
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2398)  
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2419)  
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2438)  
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:530)         
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:483)         
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:61)    
at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:4332)  
at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:3314)  
at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:4322)  
at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:546)  
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:4320)  
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:125)  
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:201)  
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:108)  
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)  
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:66)  
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:4320)  
at org.apache.spark.sql.Dataset.head(Dataset.scala:3314)  
at org.apache.spark.sql.Dataset.take(Dataset.scala:3537)  
at org.apache.spark.sql.Dataset.getRows(Dataset.scala:280)  
at org.apache.spark.sql.Dataset.showString(Dataset.scala:315)  
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)  
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)     
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)  
at java.lang.reflect.Method.invoke(Method.java:498)  
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)  
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)  
at py4j.Gateway.invoke(Gateway.java:282)  
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)  
at py4j.commands.CallCommand.execute(CallCommand.java:79)  
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)      
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)  
at java.lang.Thread.run(Thread.java:748)  
Caused by: java.net.SocketException: Connection reset  
at java.net.SocketInputStream.read(SocketInputStream.java:210)  
at java.net.SocketInputStream.read(SocketInputStream.java:141)  
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)  
at java.io.BufferedInputStream.read(BufferedInputStream.java:265)  
at java.io.DataInputStream.readInt(DataInputStream.java:387)  
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:774)     
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:766)     
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:525)  
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)    
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)  
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)  
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)  
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)  
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)  
at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)  
at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)  
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:893)          
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:893)  
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)  
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)  
at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)  
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)  
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)          
at org.apache.spark.scheduler.Task.run(Task.scala:141)  
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)  
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)  
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)  
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)  
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)  
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)   
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)   
... 1 more
Related discussions:

- [SPARK-20086] issue with pyspark 2.1.0 window function (ASF JIRA)
- Solving 5 Mysterious Spark Errors (yhoztak, Medium)
- "An Error Occurred While Calling oxx.showString" in Spark (Gankrin)
- py4j.protocol.Py4JJavaError: An error occurred while calling showString (Stack Overflow)
- Py4JJavaError: An error occurred while calling o8484.showString (Stack Overflow)
r/apachespark on Reddit: Error with PySpark and Py4J (September 5, 2024)

Hey everyone!

I recently started working with Apache Spark and its PySpark implementation in a professional environment, so I am by no means an expert, and I am facing an error with Py4J.

In more detail, I have installed Apache Spark and already set up the SPARK_HOME, HADOOP_HOME, and JAVA_HOME environment variables. As I want to run PySpark without using pip install pyspark, I have set up a PYTHONPATH environment variable with values pointing to the python folder of Apache Spark and to the py4j.zip inside it.
My issue is that when I create a dataframe from scratch and use the command df.show() I get the Error

"Py4JJavaError: An error occurred while calling o143.showString. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 (TID 4) (xxx-yyyy.mshome.net executor driver): org.apache.spark.SparkException: Python worker failed to connect back".

However, the command works as it should when the dataframe is created, for example, by reading a csv file. Other commands that I have tried also work as they should.

The version of the programs that I use are:
Python 3.11.9 (always using venv, so Python is not in path)
Java 11
Apache Spark 3.5.1 (and Hadoop 3.3.6 for the winutils.exe file and hadoop.dll)
Visual Studio Code
Windows 11

I have tried other versions of Python (3.11.8, 3.12.4) and Apache Spark (3.5.2), with the same result.

Any help would be greatly appreciated!


----------- UPDATED SOLUTION -----------

In the end, also thanks to the suggestions in the comments, I figured out a way to make PySpark work with the following implementation. After running this code in a cell, PySpark is recognized as it should, and the code runs without issues even for the manually created dataframe. Hopefully it can also be helpful to others!

# Import the necessary libraries
import os, sys

# Make Spark workers use the same Python interpreter as the driver
os.environ["PYSPARK_PYTHON"] = sys.executable

# Locate the Spark Python sources and the bundled Py4J archive
# (os.path.join avoids the backslash-escaping pitfalls on Windows)
spark_python_path = os.path.join(os.environ["SPARK_HOME"], "python")
py4j_zip_path = os.path.join(spark_python_path, "lib", "py4j-0.10.9.7-src.zip")

# Add the paths to sys.path
for path in [spark_python_path, py4j_zip_path]:
    if path not in sys.path:
        sys.path.append(path)

# Verify that the paths have been added to sys.path
print("sys.path:", sys.path)
Related discussion: Py4JJavaError: An error occurred while calling o53.showString (Cloudera Community, February 10, 2021)