The case of duplicate rows

Trying to delete duplicate records in Excel using free software

A client came to me with a problem: he had a database in an Excel spreadsheet with more than a million records, a few of them duplicated. This is where my attempts to solve the case of the duplicate rows began.

Since I don't like wasting readers' time, I'll tell you up front that the issue ended up being resolved in the easiest way, using the Excel 365 function that exists precisely to eliminate duplicate rows. However, like a good Linux user, I tried the more complex methods first. That's what this article is about.

The case of duplicate rows

It has to be said that, although LibreOffice Calc had no problem opening the 136 MB file, which could be fully viewed and worked with, it was impossible to save it in either the native or the Microsoft format. I have a computer with 8 GB of memory and an AMD A6-9500 processor, but every time I tried to save, either the application closed or the computer crashed.

In general I am in favor of using the original application in which a document was created; that saves a lot of headaches (most of them the fault of Microsoft applications, which do not usually tolerate interference). But the Windows 10 partition that I usually use for these cases refused to work, so before reinstalling I decided to try other things.

Bears and snakes

In another article I already explained the steps to install programs with the pip package manager in the new versions of Ubuntu. Since the information about creating macros in LibreOffice Calc is sparse and out of date, I decided to use a Python script.

Actually, we cheat a little. The Python script takes the data from the Excel spreadsheet and imports it using a data analysis library called pandas. Once this library processes the data, a new Excel spreadsheet is created.

I decided to try something simple: asking it to tell me the number of rows with duplicate data. The procedure is as follows (remember to look first at the article I pointed you to):

pip3 install pandas openpyxl

This command installs the libraries that allow us to work with the data and read and write Excel spreadsheets.
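On the newest Ubuntu releases, pip refuses to install packages system-wide, which is precisely what that other article deals with. As a minimal sketch, in case you skipped it, you can work inside a virtual environment (the directory name here is just an example):

python3 -m venv ~/dedup-env
source ~/dedup-env/bin/activate
pip3 install pandas openpyxl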

Then I opened the text editor and wrote the following code:

# Import the pandas library
import pandas as pd

def detect_duplicate_rows(path):
    # Load the Excel file into a pandas DataFrame
    df = pd.read_excel(path)

    # Find duplicate rows
    duplicates = df.duplicated()

    # Display the index numbers of the duplicate rows
    duplicate_lines = duplicates[duplicates].index.tolist()
    if duplicate_lines:
        print("Duplicate rows were found at the following indexes:")
        for line in duplicate_lines:
            print(line)
    else:
        print("No duplicate rows found in the spreadsheet.")

# Excel file path
excel_file = "file.xlsx"

# Call the function to detect duplicate rows
detect_duplicate_rows(excel_file)

I saved it as excel1.py and ran it with python3 excel1.py.
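As an aside, pandas can also restrict the duplicate check to specific columns, which helps when only a key column matters. A minimal sketch, assuming a hypothetical column named id:

import pandas as pd

df = pd.read_excel("file.xlsx")

# Count rows that repeat an earlier value in the (hypothetical) "id" column
print(df.duplicated(subset=["id"]).sum(), "duplicates by id")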
Because the result showed that there were many duplicate rows, I decided to write a script that would delete them and remove the resulting blank rows. The code is this:
import pandas as pd

# Load the Excel spreadsheet
excel_file = 'file.xlsx'
df = pd.read_excel(excel_file, engine='openpyxl')

# Eliminate duplicate rows, keeping only the first occurrence of each
df_no_duplicates = df.drop_duplicates(keep='first')

# Write the deduplicated result to a new Excel file
df_no_duplicates.to_excel('file2.xlsx', index=False, engine='openpyxl')

I saved it under the name excel2.py and ran it with the command python3 excel2.py.

I don't know why, but it didn't work: the terminal ended up closing after a few minutes and the new file was never created either. In the end I reinstalled Windows and Office and used the simple function that Microsoft's office suite has built in, proving that Ockham's razor applies to the world of software as well. Still, the first script I wrote helped me do a quick check that all the duplicate data had been removed. I plan to continue investigating and will report the results.
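If I do get back to it, one avenue worth trying is to sidestep the DataFrame entirely: export the data to CSV and deduplicate it line by line, so the whole table never has to fit in memory at once. A minimal sketch, assuming a hypothetical file.csv export:

import hashlib

# Remember a fingerprint of each row seen so far instead of the whole table
seen = set()
with open("file.csv", encoding="utf-8") as src, open("file2.csv", "w", encoding="utf-8") as dst:
    for line in src:
        digest = hashlib.md5(line.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            dst.write(line)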

By the way, maybe it's due to a lack of documentation, but people's habit of using spreadsheets when there are database managers available is curious.



  1.   JF404 said

    Did you try to debug the script to find out where it was failing?
    Although when it comes to clients the ideal is to deliver results as soon as possible, out of pure curiosity you could debug the script and try to identify the error, saving yourself time and ending up with a script that could be useful to you in the future.

    1.    Diego German Gonzalez said

      No, it didn't occur to me.
      I put it on the to-do list.
      I suspect it is a memory problem; the file is too large.

      1.    Juan said

        Try using XlsxWriter instead of openpyxl
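        Something like this, if I remember right; you just swap the engine argument (and pip3 install xlsxwriter first):

        df_no_duplicates.to_excel('file2.xlsx', index=False, engine='xlsxwriter')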

        1.    Diego German Gonzalez said

          Noted. Thanks.

  2.   Angel said

    I would recommend measuring the script's RAM usage. Loading DataFrames into memory makes it easy for intensive tasks to run out of resources. You might be interested in keeping an eye on Polars.
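    For example, something along these lines (a rough sketch; Polars needs an extra engine package installed to read Excel files):

    import polars as pl

    # Read, deduplicate, and write back out, keeping the first occurrence of each row
    df = pl.read_excel("file.xlsx")
    df.unique(keep="first").write_excel("file2.xlsx")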

    1.    Diego German Gonzalez said

      Thank you. Noted.

  3.   louis said

    If the Excel file only contains data (no formulas, macros, graphics, etc.), a good option is to export it as CSV and process it with sort, like this:

    1- Open a console (cmd.exe)
    2- Write: sort /unique file.csv > output.csv
    3- Open it again in Excel and save it in XLSX format

    If you are using Linux:

    1- Open Terminal,
    2- Write: sort -u file.csv > output.csv
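    Keep in mind that sort will also mix the header row in with the data; if the file has one, something like this (untested) keeps it in place:

    head -1 file.csv > output.csv
    tail -n +2 file.csv | sort -u >> output.csv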

    1.    Diego German Gonzalez said

      Good idea. I'm going to try it.

  4.   Gregory ros said

    I second the last comment. I don't know why databases were created if people don't use them afterwards; a spreadsheet may be easy to use, but a database is even more so.