PySpark — Create a DataFrame
Oct 25, 2023
Import the required libraries:
from pyspark.sql import SparkSession
import pandas as pd
from pyspark.sql.functions import col
spark = SparkSession.builder.appName('Ivaan').getOrCreate()
First, we create a simple DataFrame using the pandas library.
data = pd.DataFrame({'first_name':['hari', 'ravi'], 'middle_name':['shankar', 'shankar'], 'age': [33, 30]})
Then we convert the pandas DataFrame into a PySpark DataFrame.
df = spark.createDataFrame(data=data)
df.show()
+----------+-----------+---+
|first_name|middle_name|age|
+----------+-----------+---+
| hari| shankar| 33|
| ravi| shankar| 30|
+----------+-----------+---+
If we want to create a DataFrame without using pandas, we first create a variable named data.
data = [("Ravi", "korba", 30),
("Hari", "bilaspur", 32)]
This data variable is passed as a parameter to the createDataFrame
function in PySpark, along with the column names.
df = spark.createDataFrame(data=data, schema=["name", "city", "age"])
We can print all column names using the following command:
df.columns
Output
['name', 'city', 'age']
Additionally, we can print the schema, which shows each column's name, data type, and nullability.
df.printSchema()
root
|-- name: string (nullable = true)
|-- city: string (nullable = true)
|-- age: long (nullable = true)
Thank you