Issue

I want to remove duplicates based on the combination of listings.product_id and listings.channel_listing_id

This simple query returns 400.000 rows (the id's of the rows I want to keep):

SELECT  id
    FROM  `listings`
    WHERE  is_verified = 0
    GROUP BY  product_id, channel_listing_id

While this variation returns 1.600.000 rows, which are all records on the table, not only is_verified = 0:

SELECT  *
    FROM (
        SELECT  id
            FROM  `listings`
            WHERE  is_verified = 0
            GROUP BY  product_id, channel_listing_id
         ) AS keepem

I'd expect them to return the same amount of rows. What's the reason for this? How can I avoid it (in order to use the subselect in the where condition of the DELETE statement)?

EDIT: I found that doing a SELECT DISTINCT in the outer SELECT "fixes" it (it returns 400.000 records as it should). I'm still not sure if I should trust this subquery, for there is no DISTINCT in the DELETE statement.

EDIT 2: Seems to be just a bug in the way phpMyAdmin reports the total count of the rows.

Solution

Your query as it stands is ambiguous. Suppose you have two listings with the same product_id and channel_id. Then what id is supposed to be returned? The first, the second? Or both, ignoring the GROUP request?

What if there is more than one id with different product and channel ids?

Try removing the ambiguity by selecting MAX(id) AS id and adding DISTINCT.

Are there any foreign keys to worry about? If not, you could pour the original table into a copy, empty the original and copy back in it the non-duplicates only. Messier, but you only do SELECTs or DELETEs guaranteed to succeed, and you also get to keep a backup.

Answered By - LSerni

Thursday, March 17, 2022

[FIXED] Subquery returns more rows than straight same query in MySQL

Issue

Solution

No comments:

Post a Comment