PySpark Installation Step by Step on Windows

Shiva Manhar
3 min read · Apr 24, 2023


Installing PySpark is not just a matter of running pip install; some prerequisites must be set up before PySpark will work.

There are basically four steps.

Step 1: Download and Install Java. Here is the download link for Java.

After downloading Java, create a Java directory on the C drive. If you do not create a separate directory, the installer uses the default directory, "Program Files", and Spark may then show an error because it does not accept a Java path containing spaces. Inside the java folder, also create another empty directory for the jar (the installer will ask for this path later).

Once you have created the directories, run the downloaded setup file and click Next. When Java asks for the installation path, give it the path you created. After some time, it will ask for the jar path; point it to your jar folder. Then simply keep clicking Next until the installation finishes.

After installation, you need to set two environment variables for Java:

JAVA_HOME = C:\java

and add the Java bin folder to your Path variable:

C:\java\bin
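
If you want to double-check this from Python, a minimal sketch like the following can help (it assumes JAVA_HOME was set exactly as shown above):

import os

# A minimal sanity check, assuming JAVA_HOME was set as described above.
java_home = os.environ.get("JAVA_HOME")

if java_home is None:
    print("JAVA_HOME is not set - open a new terminal after setting it.")
elif " " in java_home:
    print("JAVA_HOME contains spaces, which Spark may reject:", java_home)
elif not os.path.exists(os.path.join(java_home, "bin", "java.exe")):
    print("java.exe was not found under", java_home)
else:
    print("JAVA_HOME looks good:", java_home)

Remember to open a new terminal (or restart Jupyter) after changing environment variables, otherwise the old values are still in effect.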

After installing Java, you need to download Spark. You can go to the following link and download it.

I tried the "Download Spark" button shown on the right side of the website, but it did not work. Click the "Download Spark" link shown in step three instead.

Create an "Apps" folder on the C drive, paste your downloaded Spark file there, and unzip it in that location. Then set your HADOOP_HOME and SPARK_HOME paths:

HADOOP_HOME = C:\Apps\spark-3.4.0-bin-hadoop3\spark-3.4.0-bin-hadoop3
SPARK_HOME= C:\Apps\spark-3.4.0-bin-hadoop3\spark-3.4.0-bin-hadoop3

You also need to add the Spark bin folder to your Path variable.

Download winutils.exe and paste it into the Spark bin folder.
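
As with Java, you can verify the Spark-related variables from Python. This is just a small sketch, assuming HADOOP_HOME and SPARK_HOME were set as above and winutils.exe was copied into the Spark bin folder:

import os

# A minimal sanity check, assuming HADOOP_HOME and SPARK_HOME were set as above
# and winutils.exe was copied into the Spark bin folder.
for name in ("JAVA_HOME", "HADOOP_HOME", "SPARK_HOME"):
    print(name, "=", os.environ.get(name, "NOT SET"))

spark_home = os.environ.get("SPARK_HOME", "")
winutils = os.path.join(spark_home, "bin", "winutils.exe")
print("winutils.exe found:", os.path.exists(winutils))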

After completing all the steps, restart your system. Then open your terminal and type the command "spark-shell". If the installation was successful, the Spark shell starts and displays its welcome banner.

Finally, you can install PySpark. If you have already installed Anaconda Navigator, you only need to open your Jupyter Notebook and run "!pip install pyspark". If you have not installed Anaconda or Jupyter, I recommend installing Anaconda Navigator first. After that, you can easily import the PySpark library.

You can try this program.

from pyspark.sql import SparkSession

# Create (or reuse) a local Spark session and print it to confirm the setup works.
spark = SparkSession.builder.appName('SparkByExample').getOrCreate()
print(spark)
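
If the session prints without errors, you can go one step further and run a tiny job on it. The rows and column names below are just placeholder values for illustration:

# A quick follow-up check: run a small job on the session created above.
data = [("Alice", 1), ("Bob", 2)]
df = spark.createDataFrame(data, ["name", "id"])
df.show()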

Thank you
