PySpark DataFrame


PySpark DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R or Python. PySpark DataFrame is designed to handle large datasets and allows you to perform various data analysis tasks efficiently.

In this article, we will discuss PySpark DataFrame, its features, and how to create, manipulate, and transform a PySpark DataFrame.

Features of PySpark DataFrame

PySpark DataFrame has the following features:

  1. Schema: PySpark DataFrame has a defined schema, which is a blueprint of the data. A schema specifies the column names, data types, and nullability of each column.
  2. Immutability: PySpark DataFrame is immutable, meaning you cannot modify its content once it is created. Instead, you can apply transformations to it to create a new DataFrame.
  3. Lazy Evaluation: PySpark DataFrame uses lazy evaluation, which means that it does not execute a transformation until an action is called.
  4. Distributed Processing: PySpark DataFrame is distributed across a cluster of machines, allowing you to process large datasets in parallel.
  5. Optimization: PySpark DataFrame uses Catalyst Optimizer to optimize the execution plan and improve performance.

Creating a PySpark DataFrame

You can create a PySpark DataFrame from various data sources such as CSV, JSON, Parquet, or from an existing RDD. Here is an example of creating a PySpark DataFrame from a CSV file:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MyApp").getOrCreate()

df = spark.read.csv("path/to/file.csv", header=True, inferSchema=True)

In this example, we use the read.csv() method of the SparkSession object to read the CSV file. The header parameter is set to True to use the first line as the header, and the inferSchema parameter is set to True to infer the data types of each column.

Manipulating a PySpark DataFrame

Once you have created a PySpark DataFrame, you can perform various operations on it. Here are some common operations:


Selecting Columns

The select() method is used to select one or more columns from a DataFrame.

df.select("column1", "column2")

In this example, we select two columns named column1 and column2.


Filtering Rows

The filter() method is used to filter rows based on a condition.

df.filter(df.column1 > 10)

In this example, we filter the rows where the value in column1 is greater than 10.


Grouping Rows

The groupBy() method is used to group rows based on one or more columns.

df.groupBy("column1").agg({"column2": "sum"})

In this example, we group the rows by column1 and calculate the sum of column2 for each group.


Joining DataFrames

The join() method is used to join two DataFrames based on one or more columns.

df1.join(df2, ["column1"])

In this example, we join df1 and df2 based on column1.

Transforming a PySpark DataFrame

You can also transform a PySpark DataFrame to create a new DataFrame. Here are some common transformations:

Adding a Column

The withColumn() method is used to add a new column to a DataFrame.

df.withColumn("new_column", df.column1 + df.column2)

In this example, we add a new column named new_column that is the sum of column1 and column2.


Renaming a Column

The withColumnRenamed() method is used to rename a column.

df.withColumnRenamed("column1", "new_column1")

In this example, we rename column1 to new_column1.

Dropping a Column

The drop() method is used to drop one or more columns from a DataFrame.

df.drop("column1")

In this example, we drop column1 from the DataFrame.


Conclusion

PySpark DataFrame is a powerful tool for performing data analysis tasks on large datasets. It provides various features such as immutability, lazy evaluation, distributed processing, and optimization. PySpark DataFrame allows you to create, manipulate, and transform data efficiently. In this article, we discussed the features of PySpark DataFrame, how to create a PySpark DataFrame, and how to manipulate and transform it.