AWS Glue Pyspark job is not able to save a Dataframe as csv format into an S3 Bucket (error `py4j.protocol.Py4JJavaError: An error occurred while calling o1257.csv`)
Actually, I found a tricky solution to this problem.
Make sure py4j is installed correctly; it is best to install it from an official release. To do so:
download the latest official release from https://pypi.org/project/py4j/,
untar/unzip the file and change into the newly created directory, e.g. cd py4j-0.x,
then run
sudo python3 setup.py install
(or sudo python setup.py install for Python 2).
Then downgrade your Java to version 8 (I previously had version 10). To do so, first remove the current version of Java using:
sudo apt-get purge openjdk-\* icedtea-\* icedtea6-\*
and then install Java 8 using:
sudo apt install openjdk-8-jre-headless
Now the code works properly for me.
I also confirm that the solution works on Ubuntu 18.04 LTS.
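If you cannot remove the newer Java system-wide, a workaround is to point Spark at the Java 8 runtime explicitly via JAVA_HOME before the JVM is launched. A minimal sketch, assuming the default install path of the openjdk-8-jre-headless package on Ubuntu 18.04 (adjust the path for your system):

```python
import os

# Pin Spark to the Java 8 runtime before any SparkSession is created.
# "/usr/lib/jvm/java-8-openjdk-amd64" is the default location of the
# openjdk-8-jre-headless package on Ubuntu 18.04 -- an assumption here.
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
# Put Java 8's bin directory first so `java` resolves to it.
os.environ["PATH"] = (
    os.path.join(os.environ["JAVA_HOME"], "bin")
    + os.pathsep
    + os.environ["PATH"]
)

print(os.environ["JAVA_HOME"])
```

This must run before the first SparkSession/SparkContext is built, since py4j starts the JVM using whatever JAVA_HOME resolves to at that moment.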
I had Java 10 installed and tried to run the Python examples from http://spark.apache.org/docs/2.3.1/, i.e. something as simple as:
./bin/spark-submit examples/src/main/python/pi.py 10
It did not work!
After applying the suggested fix:
sudo apt-get purge openjdk-\* icedtea-\* icedtea6-\*
sudo apt autoremove
sudo apt install openjdk-8-jre-headless
the example eventually worked; that is, if you consider the right answer to be:
Pi is roughly 3.142000
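For context, pi.py estimates pi by Monte Carlo sampling: it scatters random points over the unit square and counts how many fall inside the quarter circle of radius 1. A minimal pure-Python sketch of the same idea (no Spark required), with a function name chosen here for illustration:

```python
import random

def estimate_pi(n: int, seed: int = 42) -> float:
    """Monte Carlo estimate of pi from n random points in the unit square."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        # A point lands inside the quarter circle when x^2 + y^2 <= 1;
        # that happens with probability pi/4, so pi ~= 4 * inside / n.
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n

print(estimate_pi(100_000))
```

The Spark version distributes the point-sampling across partitions, which is why the printed value varies slightly from run to run, just like the "Pi is roughly 3.142000" line above.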
Thanks for the solution,
Bagvian