This worked for me:
import os
import sys
spark_path = r"D:\spark"  # raw string so backslashes are not treated as escapes
os.environ['SPARK_HOME'] = spark_path
os.environ['HADOOP_HOME'] = spark_path
sys.path.append(spark_path + "/bin")
sys.path.append(spark_path + "/python")
sys.path.append(spark_path + "/python/pyspark/")
sys.path.append(spark_path + "/python/lib")
sys.path.append(spark_path + "/python/lib/pyspark.zip")
sys.path.append(spark_path + "/python/lib/py4j-0.9-src.zip")
from pyspark import SparkContext
from pyspark import SparkConf
sc = SparkContext("local", "test")
To verify:
In [2]: sc
Out[2]: <pyspark.context.SparkContext at 0x707ccf8>
Answer from James Ma on Stack Overflow
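The six sys.path.append calls above can also be written as one loop over the Spark subdirectories. A minimal sketch, assuming the same hypothetical D:\spark install location:

```python
import os
import sys

spark_path = r"D:\spark"  # hypothetical install location; adjust to yours
os.environ['SPARK_HOME'] = spark_path
os.environ['HADOOP_HOME'] = spark_path

# Same directories as the answer above, appended in one pass
subdirs = ["bin", "python", "python/pyspark", "python/lib",
           "python/lib/pyspark.zip", "python/lib/py4j-0.9-src.zip"]
sys.path.extend(os.path.join(spark_path, d) for d in subdirs)
```

The py4j zip name (py4j-0.9-src.zip here) varies between Spark releases, so check what is actually in your python/lib folder.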
2018 version
INSTALL PYSPARK on Windows 10 JUPYTER-NOTEBOOK With ANACONDA NAVIGATOR
STEP 1
Download Packages
1) spark-2.2.0-bin-hadoop2.7.tgz
2) Java JDK 8
3) Anaconda v5.2
4) scala-2.12.6.msi
5) hadoop v2.7.1
STEP 2
MAKE A spark FOLDER IN THE C:\ DRIVE AND PUT EVERYTHING INSIDE IT
NOTE: DURING INSTALLATION OF SCALA, GIVE A PATH INSIDE THE SPARK FOLDER
STEP 3
NOW SET NEW WINDOWS ENVIRONMENT VARIABLES
HADOOP_HOME=C:\spark\hadoop
JAVA_HOME=C:\Program Files\Java\jdk1.8.0_151
SCALA_HOME=C:\spark\scala\bin
SPARK_HOME=C:\spark\spark\bin
PYSPARK_PYTHON=C:\Users\user\Anaconda3\python.exe
PYSPARK_DRIVER_PYTHON=C:\Users\user\Anaconda3\Scripts\jupyter.exe
PYSPARK_DRIVER_PYTHON_OPTS=notebook
NOW SELECT THE PATH OF SPARK:
Click on Edit and add New
Add "C:\spark\spark\bin" to the "Path" variable
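If you cannot (or prefer not to) change the system environment variables, the same STEP 3 values can be set per-session from Python before importing pyspark. A sketch using the hypothetical paths above; adjust them to your machine:

```python
import os

# Same values as STEP 3 (hypothetical install paths; adjust to yours)
env = {
    "HADOOP_HOME": r"C:\spark\hadoop",
    "JAVA_HOME": r"C:\Program Files\Java\jdk1.8.0_151",
    "SCALA_HOME": r"C:\spark\scala\bin",
    "SPARK_HOME": r"C:\spark\spark\bin",
    "PYSPARK_PYTHON": r"C:\Users\user\Anaconda3\python.exe",
    "PYSPARK_DRIVER_PYTHON": r"C:\Users\user\Anaconda3\Scripts\jupyter.exe",
    "PYSPARK_DRIVER_PYTHON_OPTS": "notebook",
}
os.environ.update(env)

# Equivalent of adding C:\spark\spark\bin to Path, for this process only
os.environ["PATH"] = r"C:\spark\spark\bin" + os.pathsep + os.environ.get("PATH", "")
```

These assignments only affect the current Python process, so they must run before pyspark is imported.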
STEP 4
- Make a folder where you want to store your Jupyter-Notebook outputs and files
- After that, open the Anaconda command prompt and cd into that folder
- Then enter pyspark
That's it: your browser will pop up with Jupyter on localhost
STEP 5
Check whether pyspark is working
Type this simple code and run it:
from pyspark.sql import Row
a = Row(name = 'Vinay' , age=22 , height=165)
print("a: ",a)
I'm assuming you already have spark and jupyter notebooks installed and they work flawlessly independent of each other.
If that is the case, then follow the steps below and you should be able to fire up a jupyter notebook with a (py)spark backend.
Go to your spark installation folder; there should be a bin directory there: /path/to/spark/bin
Create a file, let's call it start_pyspark.sh
Open start_pyspark.sh and write something like:
#!/bin/bash
export PYSPARK_PYTHON=/path/to/anaconda3/bin/python
export PYSPARK_DRIVER_PYTHON=/path/to/anaconda3/bin/jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --NotebookApp.open_browser=False --NotebookApp.ip='*' --NotebookApp.port=8880"
pyspark "$@"
Replace the /path/to ... with the path where you have installed your python and jupyter binaries respectively.
Most probably this step is already done, but just in case:
Modify your ~/.bashrc file by adding the following lines:
# Spark
export PATH="/path/to/spark/bin:/path/to/spark/sbin:$PATH"
export SPARK_HOME="/path/to/spark"
export SPARK_CONF_DIR="/path/to/spark/conf"
Run source ~/.bashrc and you are set.
Go ahead and try start_pyspark.sh.
You could also give arguments to the script, something like
start_pyspark.sh --packages dibbhatt:kafka-spark-consumer:1.0.14.
Hope it works out for you.

Assuming you have Spark installed wherever you are going to run Jupyter, I'd recommend you use findspark. Once you pip install findspark, you can just
import findspark
findspark.init()
import pyspark
sc = pyspark.SparkContext(appName="myAppName")
... and go
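What findspark.init() does under the hood is locate your Spark install (via SPARK_HOME or common install locations) and put Spark's bundled Python packages on sys.path before you import pyspark. A simplified sketch of that idea, with a hypothetical fallback path; the real library handles many more cases:

```python
import glob
import os
import sys

# Roughly what findspark.init() does: find SPARK_HOME and expose
# Spark's bundled Python packages to this interpreter.
spark_home = os.environ.get("SPARK_HOME", "/path/to/spark")  # hypothetical fallback
sys.path.insert(0, os.path.join(spark_home, "python"))

# The bundled py4j version differs between Spark releases, so glob for it
py4j_zips = glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))
if py4j_zips:
    sys.path.insert(0, py4j_zips[0])
```

This is why findspark is convenient: you don't hard-code the py4j version or the install path in every notebook.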