There's no straightforward approach for this, because you cannot modify the Dataproc cluster that the pipeline uses for execution. So, if you really need to use the Python plug-in in Native mode, my suggestion is to create a cluster that has the py4j library and then connect it to Data Fusion using the "Remote Hadoop provisioner".
Keep in mind that to use this provisioner, you'll need to create a new Compute Profile, which is only available in the Data Fusion Enterprise edition.
To install the py4j library on your cluster, you can either create a custom image that includes the library, provide an initialization actions script that installs it, or SSH into the machines and run the pip install commands manually:
» pip install py4j
» pip install jtypes.py4j
Answer from Tlaquetzal on Stack Overflow
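As a rough sketch of the initialization-actions route (an illustration, not part of the answer above; the script name, cluster name, region, and bucket path are placeholders): save a small script that installs the library on every node, for example

#!/bin/bash
# install-py4j.sh: runs on each node while the cluster is being created
set -e
pip install py4j

upload it to a Cloud Storage bucket you own, and reference it when creating the cluster that Data Fusion will later reach through the Remote Hadoop provisioner:

» gcloud dataproc clusters create py4j-cluster --region=us-central1 --initialization-actions=gs://my-bucket/install-py4j.sh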
Yes, Tlaquetzal is right. Basically, you have two ways to achieve this:
1. Use the fixed cluster and set up the Remote Hadoop Provisioner in CDAP.
2. Create a custom image with the library:
- Follow the "Create a custom image with library" documentation.
- Use a customization script along these lines:
#!/bin/bash
apt-get update
apt -y --force-yes install python3.7
apt -y --force-yes install python3-pip
pip3 install py4j
- Set up the customized image in the CDAP compute profile.
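As a sketch of how the custom-image step could be run from the command line, assuming the generate_custom_image.py tool from the Dataproc custom-images repository (the image name, Dataproc version, zone, bucket, and the install_py4j.sh file name are placeholders; check the repository for the current flags):

» git clone https://github.com/GoogleCloudDataproc/custom-images.git
» cd custom-images
» python generate_custom_image.py --image-name=py4j-dataproc-image --dataproc-version=2.0-debian10 --customization-script=install_py4j.sh --zone=us-central1-a --gcs-bucket=gs://my-bucket

Here install_py4j.sh would contain the customization script shown above. The generated image name is what you then select in the Dataproc compute profile in CDAP.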