DataFrames in Apache Spark are immutable, so you cannot modify one in place. To delete rows from a DataFrame, filter out the rows you do not want and save the result as a new DataFrame.
Answer from koiralo on Stack Overflow: "How to delete rows efficiently in sparksql?"
You cannot delete rows from a DataFrame, but you can create a new DataFrame that excludes the unwanted records.
sql = """
Select a.* FROM adsquare a
INNER JOIN codepoint c ON a.grid_id = c.grid_explode
WHERE dis2 <= 1 """
sq.sql(sql)
In this way you create a new DataFrame. Note the inverted condition: instead of deleting rows where dis2 > 1, I keep the rows where dis2 <= 1.
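The inverted-condition idea (keep the complement of what you would delete) can be sketched in plain Python, independent of Spark; the rows and dis2 values here are made up for illustration:

```python
# Hypothetical rows; in Spark these would live in a DataFrame.
rows = [
    {"grid_id": 1, "dis2": 0.5},
    {"grid_id": 2, "dis2": 3.0},
    {"grid_id": 3, "dis2": 1.0},
]

# To "delete" rows where dis2 > 1, keep the rows where dis2 <= 1.
kept = [r for r in rows if r["dis2"] <= 1]

print([r["grid_id"] for r in kept])  # [1, 3]
```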
Instead of deleting the data in the SQL Server table before writing your dataframe, you can write the dataframe directly with .mode("overwrite") and .option("truncate", "true").
https://learn.microsoft.com/en-us/sql/big-data-cluster/spark-mssql-connector?view=sql-server-ver15
The Spark documentation says that dbtable is used to pass the table that should be read from or written to. A FROM clause can be used only when reading data with the JDBC connector (resource: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html).
My suggestion would be either to use overwrite write mode or to open a separate connection for the deletion. Spark is not required to delete data from a MySQL server; it is enough to use a Python MySQL connector or to open a separate JDBC connection.
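The "separate connection for deletion" pattern looks like the sketch below. To keep it runnable without a database server, sqlite3 from the standard library stands in for a MySQL connector; with MySQL you would obtain the connection from e.g. mysql-connector-python or pymysql instead, and the cursor/execute/commit pattern is the same:

```python
import sqlite3

# sqlite3 stands in for a MySQL connector here; the table and rows
# are made up for illustration.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (id INTEGER, email TEXT)")
cur.executemany("INSERT INTO users VALUES (?, ?)",
                [(1, "a@example.com"), (2, "b@example.com")])

# Delete unwanted rows directly in the database, outside Spark.
cur.execute("DELETE FROM users WHERE id = ?", (2,))
conn.commit()

remaining = cur.execute("SELECT id FROM users").fetchall()
print(remaining)  # [(1,)]
```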
I think the best way would be to simply use "filter".
df_filtered = df.filter(df.col1 > df.col2)
df_filtered.show()
+----+----+
|col1|col2|
+----+----+
|  22|12.2|
|  77|33.3|
+----+----+
Another possible way is to use the where function of the DataFrame. For example, this:
val output = df.where("col1>col2")
will give you the expected result:
+----+----+
|col1|col2|
+----+----+
| 22|12.2|
| 77|33.3|
+----+----+
You can load the dataframe and filter it:
import pyspark.sql.functions as f
df = spark.sql("SELECT * from users_by_email")
df_filtered = df.filter(f.col("email_address") == "[email protected]")
Then you can save the dataframe with the overwrite option, or save it to a new table.
Spark does not allow UPDATE or DELETE queries over a dataframe. You need to use an external Python API in your code for the deletion.
You can check the Python API below, which provides a .delete() function:
https://docs.datastax.com/en/developer/python-driver/3.18/api/cassandra/cqlengine/models/#cassandra.cqlengine.models.Model-methods
Using Spark SQL functions
from delta.tables import DeltaTable
from pyspark.sql.functions import col

dt_path = "/any/path/"
my_dt = DeltaTable.forPath(spark, dt_path)
seq_keys = ["id1", "id2", "id3"]
my_dt.delete(col("key_col_name").isin(seq_keys))
And in Scala:
import io.delta.tables.DeltaTable
import org.apache.spark.sql.functions.col

val dt_path = "/any/path/"
val my_dt: DeltaTable = DeltaTable.forPath(spark, dt_path)
val seq_keys = Seq("id1", "id2", "id3")
my_dt.delete(col("key_col_name").isin(seq_keys: _*))
Let's say you have two dataframes: one with your data, and one with just a column holding the IDs of the rows to delete. A left anti join can filter out the rows you want to delete.
df = df.join(dfWithIdsToDelete, "<idColumnName>", "left_anti")
This join gives you all the rows of df whose ID does not exist in dfWithIdsToDelete, thereby filtering out all the rows you want to delete.
If your list of IDs to delete is a python list, you can just convert it to a dataframe.
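The left anti join keeps exactly the rows whose ID is absent from the delete list. The same set-difference semantics in plain Python (a sketch of what the join computes, not Spark code; in PySpark you would build the second dataframe with spark.createDataFrame and use the join shown above):

```python
# Hypothetical data; in Spark this would be the df dataframe.
data = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
    {"id": 3, "name": "c"},
]
ids_to_delete = [2]  # would become dfWithIdsToDelete in Spark

# Left anti join semantics: keep rows whose id is NOT in the other side.
delete_set = set(ids_to_delete)
result = [row for row in data if row["id"] not in delete_set]

print([row["id"] for row in result])  # [1, 3]
```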
Take a look at this code:
from cassandra.cluster import Cluster
from pyspark.sql import SQLContext

def main_function():
    sql = SQLContext(sc)
    tests = sql.read.format("org.apache.spark.sql.cassandra") \
        .load(keyspace="your keyspace", table="test").where(...)
    # Spark SQL cannot run DELETE against Cassandra, so open a separate
    # Cassandra session (python cassandra-driver) for the deletes:
    session = Cluster().connect("your keyspace")
    for test in tests.collect():
        session.execute("DELETE FROM test_event WHERE id = %s", (test["id"],))
Be aware that deleting one row at a time is not a best practice in Spark; the code above is just an example to help you figure out your implementation.
The Spark Cassandra Connector (SCC) itself provides only a Dataframe API for Python, but there is a pyspark-cassandra package that provides an RDD API on top of the SCC, so deletion can be performed as follows.
Start pyspark shell with (I've tried with Spark 2.4.3):
bin/pyspark --conf spark.cassandra.connection.host=IPs \
  --packages anguenot:pyspark-cassandra:2.4.0
and inside it, read data from one table and perform the delete. The source data needs to have the columns corresponding to the primary key. It can be the full primary key, a partial primary key, or only the partition key; depending on which, Cassandra will use the corresponding tombstone type (row/range/partition tombstone).
In my example, the table's primary key consists of one column, which is why I specified only one element in the list:
rdd = sc.cassandraTable("test", "m1")
rdd.deleteFromCassandra("test","m1", keyColumns = ["id"])