PySpark — Create a DataFrame

Shiva Manhar
1 min read · Oct 25, 2023

Import the required libraries

from pyspark.sql import SparkSession
import pandas as pd
from pyspark.sql.functions import col
spark = SparkSession.builder.appName('Ivaan').getOrCreate()

First, we create a simple DataFrame using the pandas library.

data = pd.DataFrame({'first_name':['hari', 'ravi'], 'middle_name':['shankar', 'shankar'], 'age': [33, 30]})

Then we convert the pandas DataFrame into a PySpark DataFrame.

df = spark.createDataFrame(data=data)
df.show()
+----------+-----------+---+
|first_name|middle_name|age|
+----------+-----------+---+
| hari| shankar| 33|
| ravi| shankar| 30|
+----------+-----------+---+

If we want to create a DataFrame without using pandas, we first create a variable named data.

data = [("Ravi", "korba", 30),
        ("Hari", "bilaspur", 32)]

This data variable is passed as a parameter to PySpark's createDataFrame function, along with the column names as the schema.

df = spark.createDataFrame(data=data, schema=["name", "city", "age"])

We can print all column names using the following command.

df.columns

Output

['name', 'city', 'age']

Additionally, we can print the column names together with their types.

df.printSchema()
root
|-- name: string (nullable = true)
|-- city: string (nullable = true)
|-- age: long (nullable = true)

Thank you
