Running Apache Spark on Mac Pro M2 Ultra with GPU Acceleration - Need Guidance
Installing Apache Spark 3.5.0 on macOS - Stack Overflow
installing pyspark on my m1 mac, getting an env error
Practicing Spark on a local machine
Stupidly ambitious… please help … it’s too cool… Someone build this for me….
Just imagine, how much compute intelligence you could run on with a cool silver box on your desk…. That you can literally talk to….
Sooooo ChatGPT told me: '''
Hello r/apache_spark,
I’m planning to set up Apache Spark on my new Mac Pro M2 Ultra and want to leverage its GPUs and neural engines for accelerated computing. Given that the M2 Ultra uses Apple’s architecture (not NVIDIA CUDA), I’m looking for advice on the best way to configure and optimize Spark to run efficiently on this hardware.
Here’s What I’m Looking to Achieve:
1. Run Apache Spark with GPU Acceleration: Utilize the M2 Ultra’s GPUs for Spark jobs.
2. Maximize Utilization of Neural Engines: Offload machine learning tasks to the M2 Ultra’s neural engines.
3. Docker Configuration: Set up Docker containers optimized for the ARM architecture to run Spark.
Questions:
1. Metal Integration: How can I integrate Metal (Apple’s GPU framework) with Spark for GPU-accelerated tasks?
2. Core ML and Spark: Is there a way to offload ML tasks from Spark to Core ML, utilizing the neural engines?
3. Docker Setup: Are there specific Docker configurations or images for running Spark on ARM-based Macs?
4. Accelerated Libraries: Which libraries or frameworks are recommended for GPU-accelerated computing on Apple Silicon?
What I’ve Done So Far:
• Installed Spark via Homebrew.
• Investigated using Metal Performance Shaders (MPS) for GPU tasks.
• Explored converting machine learning models to Core ML.
Example Setup:
Here’s a basic Dockerfile I’m using to run Spark on ARM:
FROM openjdk:11-jre-slim
RUN apt-get update && apt-get install -y curl wget
RUN wget https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
RUN tar -xvf spark-3.1.2-bin-hadoop3.2.tgz && mv spark-3.1.2-bin-hadoop3.2 /usr/local/spark
ENV SPARK_HOME=/usr/local/spark
ENV PATH=$SPARK_HOME/bin:$PATH
WORKDIR $SPARK_HOME
Any suggestions, configurations, or pointers to relevant resources would be greatly appreciated. I’m eager to get the most out of my M2 Ultra for Spark workloads and would love to hear about any experiences or best practices from the community.
Thanks in advance for your help! ‘’’
What the AI said.
I wanted to learn Spark in both Scala and Python. I already had python3, so I started with the Scala installation first:
brew install coursier/formulas/coursier && cs setup
This installed a bunch of stuff: scala, scala-cli, etc.
Then I ran
brew install apache-spark
which installed a whole bunch of stuff as well. The apache-spark docs said pyspark comes along with it, and you can run it by just running "pyspark". I did that and got a Python-like REPL.
Whenever I imported it in a Jupyter notebook, however, it didn't work, so I had to do a pip3 install.
then I ran
from pyspark import SparkContext
sc = SparkContext()
n = sc.parallelize([4, 10, 9, 7])
n.take(3)
I got this error:
raise RuntimeError(("Python in worker has different version %s than that in " +
RuntimeError: Python in worker has different version 3.8 than that in driver 3.7, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.

I got the path using "which python3" and tried setting those two variables in my .zshrc and .zprofile, to no effect. I'm a bit of a noob with this stuff and now I'm lost.
Can someone help me out here? TIA!
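For what it's worth, the usual fix for that error is to point both environment variables at the same interpreter *before* the SparkContext is created, inside the notebook itself. A minimal sketch (assuming pyspark is installed for the interpreter running the notebook):

```python
import os
import sys

# Point the PySpark driver and its workers at the same Python binary.
# sys.executable is the interpreter running this script/notebook, so
# driver and workers are guaranteed to be the same minor version.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

# With these set before SparkContext() is created, the
# "Python in worker has different version ..." RuntimeError goes away.
interpreters_match = os.environ["PYSPARK_PYTHON"] == os.environ["PYSPARK_DRIVER_PYTHON"]
print(interpreters_match)
```

Setting the variables in .zshrc can work too, but Jupyter kernels don't always source your shell profile, which may be why the exports had no effect.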
Sorry for the probably stupid question. I'm totally new to distributed computing and I can't find an answer to the following question: can I install Spark on my local machine (just a regular MacBook Pro) and "simulate" a cluster so I can start practicing with Spark? If so, how do I do it? Any resources that explain how? I just don't have access to a physical cluster of machines (yet).