Sunday, November 30, 2025

I Cleaned a Messy CSV File Utilizing Pandas .  Right here’s the Actual Course of I Comply with Each Time.


good. You’re going to come across plenty of information inconsistencies. Nulls, unfavourable values, string inconsistencies, and so on. If these aren’t dealt with early in your information evaluation workflow, querying and analysing your information could be a ache afterward.

Now, I’ve performed information cleansing earlier than utilizing SQL and Excel, probably not with Python. So, to find out about Pandas (certainly one of Python’s information evaluation libraries), I’ll be dabbling in some information cleansing. 

On this article, I’ll be sharing with you a repeatable, beginner-friendly information cleansing workflow. By the top of this text, you ought to be fairly assured in utilizing Python for information cleansing and evaluation.

The Dataset we’ll be working with

I’ll be working with an artificial, messy HR dataset containing typical real-world errors (inconsistent dates, combined information varieties, compound columns). This dataset is from Kaggle, and it’s designed for practising information cleansing, transformation, exploratory evaluation, and preprocessing for information visualisation and machine studying.

The dataset accommodates over 1,000 rows and 13 columns, together with worker data resembling names, department-region combos, contact particulars, standing, salaries, and efficiency scores. It consists of examples of:

  • Duplicates
  • Lacking values
  • Inconsistent date codecs
  • Inaccurate entries (e.g., non-numeric wage values)
  • Compound columns (e.g., “Department_Region” like “Cloud Tech-Texas” that may be cut up)

It accommodates columns like:

  • Employee_ID: Distinctive artificial ID (e.g., EMP1001)
  • First_Name, Last_Name: Randomly generated private names
  • Title: Full title (could also be redundant with first/final)
  • Age: Consists of lacking values
  • Department_Region: Compound column (e.g., “HR-Florida”)
  • Standing: Worker standing (Energetic, Inactive, Pending)
  • Join_Date: Inconsistent format (YYYY/MM/DD)
  • Wage: Consists of invalid entries (e.g., “N/A”)
  • E mail, Cellphone: Artificial contact data
  • Performance_Score: Categorical efficiency ranking
  • Remote_Work: Boolean flag (True/False)

You’ll be able to entry the dataset right here and mess around with it

The dataset is absolutely artificial. It doesn’t include any actual people’ information and is protected to make use of for public, tutorial, or business initiatives.

This dataset is within the public area underneath the CC0 1.0 Common license. You’re free to make use of, modify, and distribute it with out restriction.

Overview of the Cleansing Workflow

The info cleansing workflow I’ll be working with consists of 5 easy phases.

  1. Load
  2. Examine
  3. Clear
  4. Evaluation
  5. Export

Let’s dive deeper into every of those phases.

Step 1 — Load the CSV (And Deal with the First Hidden Points)

There are some issues to bear in mind earlier than loading your dataset. Nevertheless, that is an optionally available step, and we in all probability wouldn’t encounter most of those points in our dataset. However it doesn’t harm to know these items. Listed here are some key issues to think about whereas loading.

Encoding points (utf-8latin-1)

Encoding defines how characters are saved as bytes within the file. Python and Pandas normally default to UTF-8, which handles most trendy textual content and particular characters globally. Nevertheless, if the file was created in an older system or a non-English surroundings, it’d use a distinct encoding, mostly Latin-1

So in case you attempt to learn a Latin-1 file with UTF-8, Pandas will encounter bytes it doesn’t recognise as legitimate UTF-8 sequences. You’ll sometimes see a UnicodeDecodeError whenever you attempt to learn a CSV with encoding points.

If maybe the default load fails, you could possibly attempt to specify a distinct encoding:

# First try (the default)
strive:
df = pd.read_csv(‘messy_data.csv’)
besides UnicodeDecodeError:
# Second try with a standard different
df = pd.read_csv(‘messy_data.csv’, encoding=’latin-1')

Improper delimiters

CSV stands for “Comma Separated Values,” however in actuality, many recordsdata use different characters as separators, like semicolons (frequent in Europe), tabs, and even pipes (|). Pandas sometimes defaults to the comma (,).

So, in case your file makes use of a semicolon (;) however you load it with the default comma delimiter, Pandas will deal with all the row as a single column. The outcome could be a DataFrame with a single column containing total traces of information, making it inconceivable to work with.

The repair is fairly easy. You’ll be able to strive checking the uncooked file (opening it in a textual content editor like VS Code or Notepad++ is finest) to see what character separates the values. Then, move that character to the sep argument like so

# If the file makes use of semicolons
df = pd.read_csv('messy_data.csv', sep=';')

# If the file makes use of tabs (TSV)
df = pd.read_csv('messy_data.csv', sep='t')

Columns that import incorrectly

Typically, Pandas guesses the info kind for a column primarily based on the primary few rows, however later rows include surprising information (e.g., textual content combined right into a column that began with numbers).

As an example, Pandas might accurately determine 0.1, 0.2, 0.3 as floats, but when row 100 accommodates the worth N/A, Pandas may pressure all the column into an object (string) kind to accommodate the combined values. This sucks since you lose the flexibility to carry out quick, vectorised numeric operations on that column till you clear up the unhealthy values.

To repair this, I exploit the dtype argument to inform Pandas what information kind a column must be explicitly. This prevents silent kind casting.

df = pd.read_csv(‘messy_data.csv’, dtype={‘value’: float, ‘amount’: ‘Int64’})

Studying the primary few rows

You would save time by checking the primary few rows instantly throughout the loading course of utilizing the nrows parameter. That is nice, particularly whenever you’re working with massive datasets, because it means that you can take a look at encoding and delimiters with out loading all the 10 GB file.

# Load solely the primary 50 rows to verify encoding and delimiter
temp_df = pd.read_csv('large_messy_data.csv', nrows=50)
print(temp_df.head())

When you’ve confirmed the arguments are appropriate, you may load the total file.

Let’s load the Worker dataset. I don’t anticipate to see any points right here.

import pandas as pd
df = pd.read_csv(‘Messy_Employee_dataset.csv’)
df

Output:

1020 rows × 12 columns

Now we are able to transfer on to Step 2 : Inspection

Step 2 — Examine the Dataset

I deal with this part like a forensic audit. I’m searching for proof of chaos hidden beneath the floor. If I rush this step, I assure myself a world of ache and analytical errors down the road. I all the time run these 4 essential checks earlier than writing any transformation code.

The next strategies give me the total well being report on my 1,020 worker data:

1. df.head() and df.tail(): Understanding the Boundaries

I all the time begin with a visible verify. I exploit df.head() and df.tail() to see the primary and final 5 rows. That is my fast sanity verify to see if all columns look aligned and if the info visually is sensible.

My Discovering:

After I ran df.head(), I seen my Worker ID was sitting in a column, and the DataFrame was utilizing the default numerical index (0, 1, 2, …) as an alternative.

Whereas I do know I may set Worker ID because the index, for now, I’ll go away it. The larger fast visible threat I’m searching for right here is information misaligned within the improper column or apparent main/trailing areas on names that can trigger hassle later.

2. df.data(): Recognizing Datatype Issues and Missingness

That is probably the most crucial methodology. It tells me the column names, the info varieties (Dtype), and the precise variety of non-null values.

My Findings on 1,020 Rows:

  • Lacking Age: My whole entry depend is 1,020, however the Age column solely has 809 non-null values. That’s a major quantity of lacking information that I’ll should resolve methods to deal with later—do I impute it, or do I drop the rows?
  • Lacking Wage: The Wage column has 996 non-null values, which is barely a minor hole, however nonetheless one thing I have to resolve.
  • The ID Kind Test: The Worker ID is accurately listed as an object (string). This isn’t proper. IDs are identifiers, not numbers to be averaged, and utilizing the string kind prevents Pandas from unintentionally stripping main zeros.

3. Information Integrity Test: Duplicates and Distinctive Counts

After checking dtypes, I must know if I’ve duplicate data and the way constant my categorical information is.

  • Checking for Duplicates: I ran df.duplicated().sum() and acquired a results of 0. That is good! It means I don’t have equivalent rows cluttering up my dataset.
  • Checking Distinctive Values (df.nunique()): I exploit this to grasp the variety inside every column. Low counts in categorical columns are wonderful, however I search for columns that must be distinctive however aren’t, or columns which have too many distinctive values, suggesting typos.
  • Employee_ID have 1020 distinctive data. That is good. It means all data are distinctive.
  • The First_Name / Last_Name discipline has eight distinctive data. That’s just a little odd. This confirms the dataset’s artificial nature. My evaluation received’t be skewed by a big number of names, since I’ll deal with them as normal strings.
  • Department_Region has 36 distinctive data. There’s excessive potential for typos. 36 distinctive values for area/division is simply too many. I might want to verify this column for spelling variations (e.g., “HR” vs. “Human Assets”) within the subsequent step.
  • E mail (64 distinctive data). With 1,020 staff, having solely 64 distinctive emails suggests many staff share the identical placeholder e-mail. I’ll flag this for exclusion from evaluation, because it’s ineffective for figuring out people.
  • Cellphone (1020 distinctive data). That is good as a result of it confirms telephone numbers are distinctive identifiers.
  • Age / Efficiency Rating / Standing / Distant Work (2–4 distinctive data). These low counts are anticipated for categorical or ordinal information, which means they’re prepared for encoding.

4. df.describe(): Catching Odd and Unattainable Values

I exploit df.describe() to get a statistical abstract of all my numerical columns. That is the place the place really inconceivable values—the “crimson flags”—present up immediately. I principally concentrate on the min and max rows.

My Findings:

I instantly seen an issue in what I anticipated to be the Cellphone Quantity column, which Pandas mistakenly transformed to a numerical kind.

Imply
-4.942253 * 10⁹
Min
-9.994973 * 10⁹
Max
-3.896086 * 10⁶
25%
-7.341992e * 10⁹
50%
4.943997 * 10⁹
75%
-2.520391e * 10⁹

It seems all of the telephone quantity values have been huge unfavourable numbers! This confirms two issues:

Pandas incorrectly inferred this column as a quantity, although telephone numbers are strings.

There should be characters within the textual content that Pandas can not interpret (for instance, parentheses, dashes, or nation codes). I must convert this to an object kind and clear it up fully.

5. df.isnull().sum(): Quantifying Lacking Information

Whereas df.data() offers me non-null counts, df.isnull().sum() offers me the entire depend of nulls, which is a cleaner option to quantify my subsequent steps.

My Findings:

  • Age has 211 nulls (1020 – 809 = 211), and
  • Wage has 24 nulls (1020 – 996 = 24). This exact depend units the stage for Step 3.

This inspection course of is my security web. If I had missed the unfavourable telephone numbers, any analytical step that concerned numerical information would have failed or, worse, produced skewed outcomes with out warning.

By figuring out the necessity to deal with Cellphone Quantity as a string and the numerous lacking values in Age now, I’ve a concrete cleansing listing. This prevents runtime errors and, critically, ensures that my last evaluation relies on believable, non-corrupted information.

Step 3 — Standardise Column Names, Right Dtypes, and Deal with Lacking Values

With my listing of flaws in hand (lacking Age, lacking Wage, the horrible unfavourable Cellphone Numbers, and the messy categorical information), I transfer into the heavy lifting. I deal with this step in three sub-phases: making certain consistency, fixing corruption, and filling gaps.

1. Standardising Column Names and Setting the Index (The Consistency Rule)

Earlier than I do any severe information manipulation, I implement strict consistency on column names. Why? As a result of typing df['Employee ID '] unintentionally as an alternative of df['employee_id'] is a silent, irritating error. As soon as the names are clear, I set the index.

My golden rule is snake_case and lowercase all over the place, and ID columns must be the index.

I exploit a easy command to strip whitespace, change areas with underscores, and convert every thing to lowercase.

# The Standardization Command
df.columns = df.columns.str.decrease().str.change(' ', '_').str.strip()
# Earlier than: ['Employee_ID', 'First_Name', 'Phone']
# After: ['employee_id', 'first_name', 'phone']

Now that our columns are standardised. I can transfer on to set employee_id as an index.

# Set the Worker ID because the DataFrame Index
# That is essential for environment friendly lookups and clear merges later.
df.set_index('employee_id', inplace=True)

# Let’s evaluate it actual fast
print(df.index)

Output:

Index(['EMP1000', 'EMP1001', 'EMP1002', 'EMP1003', 'EMP1004', 'EMP1005',
'EMP1006', 'EMP1007', 'EMP1008', 'EMP1009',
...
'EMP2010', 'EMP2011', 'EMP2012', 'EMP2013', 'EMP2014', 'EMP2015',
'EMP2016', 'EMP2017', 'EMP2018', 'EMP2019'],
dtype='object', title='employee_id', size=1020)

Excellent, every thing is in place.

2. Fixing Information Sorts and Corruption (Tackling the Unfavorable Cellphone Numbers)

My df.describe() verify revealed probably the most pressing structural flaw: the Cellphone column, which was imported as a rubbish numerical kind. Since telephone numbers are identifiers (not portions), they should be strings.

On this part, I’ll convert all the column to a string kind, which is able to flip all these unfavourable scientific notation numbers into human-readable textual content (although nonetheless stuffed with non-digit characters). I’ll go away the precise textual content cleansing (eradicating parentheses, dashes, and so on.) for a devoted standardisation step (Step 4).

# Repair the Cellphone dtype instantly
# Be aware: The column title is now 'telephone' resulting from standardization in 3.1
df['phone'] = df['phone'].astype(str)

3. Dealing with Lacking Values (The Age & Wage Gaps)

Lastly, I handle the gaps revealed by df.data(): the 211 lacking Age values and the 24 lacking Wage values (out of 1,020 whole rows). My technique relies upon totally on the column’s function and the magnitude of the lacking information:

  • Wage (24 lacking values): On this case, eradicating or dropping all lacking values could be the most effective technique. Wage is a crucial metric for monetary evaluation. Imputing it dangers skewing conclusions. Since solely a small fraction (2.3%) is lacking, I select to drop the unfinished data.
  • Age (211 lacking values). Filling the lacking values is the most effective technique right here. Age is commonly a characteristic for predictive modelling (e.g., turnover). Dropping 20% of my information is simply too expensive. I’ll fill the lacking values utilizing the median age to keep away from skewing the distribution with the imply.

I execute this technique with two separate instructions:

# 1. Elimination: Drop rows lacking the crucial 'wage' information
df.dropna(subset=['salary'], inplace=True)

# 2. Imputation: Fill lacking 'age' with the median
median_age = df['age'].median()
df['age'].fillna(median_age, inplace=True)

After these instructions, I might run df.data() or isnull().sum() once more simply to verify that the non-null counts for wage and age now replicate a clear dataset.

# Rechecking the null counts for wage and age
df[‘salary’].isnull().sum())
df[‘age’].isnull().sum())

Output:

np.int64(0)

Up to now so good!

By addressing the structural and lacking information points right here, the following steps can focus totally on worth standardisation, such because the messy 36 distinctive values in department_region—which we sort out within the subsequent part.

Step 4 — Worth Standardization: Making Information Constant

My DataFrame now has the fitting construction, however the values inside are nonetheless soiled. This step is about consistency. If “IT,” “i.t,” and “Information. Tech” all imply the identical division, I must pressure them right into a single, clear worth (“IT”). This prevents errors in grouping, filtering, and any statistical evaluation primarily based on classes.

1. Cleansing Corrupted String Information (The Cellphone Quantity Repair)

Keep in mind the corrupted telephone column from Step 2? It’s at the moment a multitude of unfavourable scientific notation numbers that we transformed to strings in Step 3. Now, it’s time to extract the precise digits.

So, I’ll be eradicating each non-digit character (dashes, parentheses, dots, and so on.) and changing the outcome right into a clear, unified format. Common expressions (.str.change()) are good for this. I exploit D to match any non-digit character and change it with an empty string.

# The telephone column is at the moment a string like '-9.994973e+09'
# We use regex to take away every thing that is not a digit
df['phone'] = df['phone'].str.change(r'D', '', regex=True)

# We are able to additionally truncate or format the ensuing string if wanted
# For instance, retaining solely the final 10 digits:
df['phone'] = df['phone'].str.slice(-10)
print(df['phone'])

Output:

employee_id
EMP1000 1651623197
EMP1001 1898471390
EMP1002 5596363211
EMP1003 3476490784
EMP1004 1586734256
...
EMP2014 2470739200
EMP2016 2508261122
EMP2017 1261632487
EMP2018 8995729892
EMP2019 7629745492
Title: telephone, Size: 996, dtype: object

Seems to be significantly better now. That is all the time a superb follow to scrub identifiers that include noise (like IDs with main characters or zip codes with extensions).

2. Separating and Standardizing Categorical Information (Fixing the 36 Areas)

My df.nunique() verify revealed 36 distinctive values within the department_region column. After I reviewed all of the distinctive values within the column, the output revealed that they’re all neatly structured as department-region (e.g., devops-california, finance-texas, cloud tech-new york).

I assume one option to clear up that is to separate this single column into two devoted columns. I’ll cut up the column on the hyphen (-) and assign the components to new columns: division and area.

# 1. Break up the mixed column into two new, clear columns
df[['department', 'region']] = df['department_region'].str.cut up('-', increase=True)
Subsequent, I’ll drop the department_region column because it’s just about ineffective now
# 2. Drop the redundant mixed column
df.drop('department_region', axis=1, inplace=True)
Let’s evaluate our new columns
print(df[[‘department’, ‘region’]])

Output:

division area
employee_id
EMP1000 devops california
EMP1001 finance texas
EMP1002 admin nevada
EMP1003 admin nevada
EMP1004 cloud tech florida
... ... ...
EMP2014 finance nevada
EMP2016 cloud tech texas
EMP2017 finance big apple
EMP2018 hr florida
EMP2019 devops illinois

[996 rows x 2 columns]

After splitting, the brand new division column has solely 6 distinctive values (e.g., ‘devops’, ‘finance’, ‘admin’, and so on.). That is nice information. The values are already standardised and prepared for evaluation! I assume we may all the time map all related departments to at least one single class. However I’m gonna skip that. I don’t wish to get too superior on this article.

3. Changing Date Columns (The Join_Date Repair)

The Join_Date column is normally learn in as a string (object) kind, which makes time-series evaluation inconceivable. This implies now we have to transform it to a correct Pandas datetime object.

pd.to_datetime() is the core operate. I usually use errors='coerce' as a security web; if Pandas can’t parse a date, it converts that worth to NaT (Not a Time), which is a clear null worth, stopping the entire operation from crashing.

# Convert the join_date column to datetime objects
df['join_date'] = pd.to_datetime(df['join_date'], errors='coerce')

The conversion of dates permits highly effective time-series evaluation, like calculating common worker tenure or figuring out turnover charges by 12 months.

After this step, each worth within the dataset is clear, uniform, and accurately formatted. The specific columns (like division and area) are prepared for grouping and visualisation, and the numerical columns (like wage and age) are prepared for statistical modeling. The dataset is formally prepared for evaluation.

Step 5 — Remaining High quality Test and Export

Earlier than closing the pocket book, I all the time carry out one final audit to make sure every thing is ideal, after which I export the info so I can carry out evaluation on it later.

The Remaining Information High quality Test

That is fast. I re-run the 2 most important inspection strategies to verify that each one my cleansing instructions really labored:

  • df.data(): I affirm there are no extra lacking values within the crucial columns (age, wage) and that the info varieties are appropriate (telephone is a string, join_date is datetime).
  • df.describe(): I make sure the statistical abstract reveals believable numbers. The Cellphone column ought to now be absent from this output (because it’s a string), and Age and Wage ought to have logical minimal and most values.

If these checks move, I do know the info is dependable.

Exporting the Clear Dataset

The ultimate step is to save lots of this cleaned model of the info. I normally reserve it as a brand new CSV file to maintain the unique messy file intact for reference. I exploit index=False right here if I don’t need the employee_id (which is now the index) to be saved as a separate column, or index=True if I wish to save the index as the primary column within the new CSV.

# Exporting the clear DataFrame to a brand new CSV file
# We use index=True to maintain our major key (employee_id) within the exported file
df.to_csv('cleaned_employee_data.csv', index=True)

By exporting with a transparent, new filename (e.g., _clean.csv), you formally mark the top of the cleansing part and supply a clear slate for the subsequent part of the challenge.

Conclusion

Actually, I used to really feel overwhelmed by a messy dataset. The lacking values, the bizarre information varieties, the cryptic columns — it felt like going through the clean web page syndrome.

However this structured, repeatable workflow modified every thing. By specializing in Load, Examine, Clear, Evaluation, and Export, we established order immediately: standardizing column names, making the employee_id the index, and utilizing good methods for imputation and splitting messy columns.

Now, I can soar straight into the enjoyable evaluation half with out continually second-guessing my outcomes. In the event you battle with the preliminary information cleansing step, check out this workflow. I’d love to listen to the way it goes. If you wish to mess around with the dataset, you may obtain it right here.

Wanna join? Be happy to say hello on these platforms

LinkedIn

Twitter

YouTube

Medium

Related Articles

LEAVE A REPLY

Please enter your comment!
Please enter your name here

Latest Articles