The case of duplicate rows

A client came to me with a problem. I had a database in an Excel spreadsheet with more than a million records, a few of them repeated. This is where my attempts to solve the case of duplicate rows began.

Since I don't like wasting readers' time, I tell you that the issue ended up being resolved in the easiest way, using the Excel 365 function which precisely serves to eliminate duplicate rows. However, as a good linuxer I tried the more complex methods first. That's what this article is about

The case of duplicate rows

It has to be said that, although LibreOffice Calc has no problem opening a 136 MB memory file and can be fully viewed and worked with, it was impossible to save, in both native and Microsoft formats. I have a computer with 8 GB of memory and an AMD A6-9500 processor, but you have to try to save it or the application would close or the computer would crash.

In general I am in favor of using the original application in which the document was created, that saves a lot of headaches (Most of them are the fault of Microsoft applications that do not usually accept interference) but the Windows 10 partition that I usually use for these cases it refused to work, so before reinstalling i decided to try other things.

bears and snakes

En Another item I already told you the steps to install programs from the Pip package manager in the new versions of Ubuntu Since the information about creating macros in LibreOffice Calc is sparse and out of date, I decided to use a Python script.

Actually, we cheat. The python script takes the data from the Excel spreadsheet, and imports it using a data analysis library. call pandas. Once this library processes the data, a new Excel spreadsheet is created.

I decided to try something simple, asking it to tell me the number of rows with duplicate data. The procedure is as follows (Remember to look first at the article that I indicated to you)

pip3 install pandas openpyxl This command installs the libraries that allow us to work with data and interact with Excel templates.

Then I opened the text editor and wrote the following code.
# Importa la biblioteca Pandas import pandas as pd

def detect_duplicate_rows(path):
# Load the Excel file into a pandas DataFrame
df = pd.read_excel(path)

# Find duplicate rows
duplicates = df.duplicated()

# Display the line numbers of the duplicate rows
duplicate_lines = duplicates[duplicates].index.tolist()
if duplicate_lines:
print("The following lines of duplicate rows were found:")
for line in duplicate_lines:
print(line)
else:
print("No duplicate rows found in the spreadsheet.")

# Excel file path
excel_file = "file.xlsx"

# Call the function to detect duplicate rows
detect_duplicate_rows(excel_file)

I saved it as excel1.py and ran it with python3 excel1.py
Because the result showed that there were many duplicate rows I decided to make a script that would delete them and remove the blank row. The code is this:
import pandas as pd

# Load the Excel spreadsheet
excel_file = 'file.xlsx'
df = pd.read_excel(excel_file, engine='openpyxl')

# Eliminate duplicate data leaving only the first row with the data
df_no_duplicates = df.drop_duplicates(keep='first')

# Write the result of deduplication processing to a new Excel file
df_without_duplicates.to_excel('file2.xlsx', index=False, engine='openpyxl')

I saved it under the name of excel2.py and ran it with the command python3 excel2.py

I don't know why it doesn't work because it ends up closing the terminal after a few minutes and the new file was never created.either. Anyway, I ended up reinstalling Windows and Office and running a simple command that brings Microsoft's office suite built in, proving that Ockham's razor applies to the world of software as well. Anyway, the first program I wrote helped me do a quick check that all the duplicate data had been removed. Anyway, I plan to continue investigating and reporting the results.

By the way, maybe it's due to the lack of documentation, but the habit of people using spreadsheets when there are database managers is curious.

LinuxAdictos

The case of duplicate rows

The case of duplicate rows

bears and snakes

Leave a Comment Cancel reply