Locked lesson.
About this lesson
Tidy up data by removing erroneous spaces, random line-breaks and weird non-printing characters by combining 3 useful Excel text functions.
Exercise files
Download this lesson’s related exercise files.
05-04-Data Cleaning Techniques-Start.xlsx433 KB 05-04-Data Cleaning Techniques-Complete.xlsx
456.1 KB 5.04 data-cleaning-techniques - Exercise.docx
53 KB Exercise - Data Cleaning Techniques.xlsx
79 KB 5.04 data-cleaning-techniques - Exercise solution.docx
150.5 KB Exercise Solution - Data Cleaning Techniques.xlsx
78.7 KB
Quick reference
Data Cleaning Techniques
Learn how to clean up a messy dataset so it's ready for analysis.
When to use
We clean our datasets whenever we have blank rows and cells, duplicates, strange formatting, and other inconsistencies in our data. Removing these ensures our data is ready for further analysis using Pivot Tables, Pivot Charts, and formulas.
Instructions
It's important to ensure that any dataset that we want to analyze is clean.
Datasets that contain blank rows and cells, duplicate values, inconsistent formatting, spelling errors, etc. need to be cleaned prior to analyzing using Pivot Tables and Charts.
If we do not clean up our dataset first, our analysis results could be harder to read and interpret and in many cases incorrect.
It's good to get into the habit of cleaning data particularly if we download datasets from external sources.
There are several techniques we can use to clean our data.
Remove Blank Rows
Blank rows can cause problems in datasets so it is best to remove them prior to analysis.
- Select the data.
- Press CTRL+G and then click Special.
- Choose Blanks.
This will select all blank rows in the dataset. We can now remove them in one go.
- From the Home tab, in the Cells group, click Delete and Delete Sheet Rows.
Clear Formatting
A dataset might contain rogue formatting that needs to be removed. We might have values highlighted in bold, background fills and borders around cells that we don't want.
- Press CTRL+A to select All.
- From the Home tab, in the Editing group, click Clear and Clear Formats.
Remove Duplicates
It's always worth checking that our datasets do not contain complete duplicates as this can throw off our analysis.
- Click anywhere in the dataset.
- From the Data tab, in the Data Tools group, click Remove Duplicates.
In this example, we are looking for exact duplicates where each item in each column is the same.
- Ensure all columns are selected.
- Ensure My data has headers is selected.
- Click OK.
Clean a Column with Text Functions
In this example, we need to tidy up column B. We can see that we have some erroneous spaces and inconsistent case. We can deal with both of these issues using text functions.
The PROPER function
The PROPER function changes the case of the selected text to proper case (first letter of each word is capitalized).
- Select column C.
- Press CTRL+SHIFT+(+) to insert a new column to the right.
The PROPER function only has one argument.
- Press Enter and copy the formula down.
All text in the column is converted to proper case.
The TRIM function
The TRIM function is used to remove any unwanted spaces in a cell. We can add the TRIM function to our PROPER formula.
- Click in cell C6.
- Press F2 to edit the cell.
- Press Enter and copy the formula down.
The erroneous spaces have been removed.
The CLEAN function
The CLEAN function removes any non-printing characters and manual line breaks from a cell. We can add this to our formula.
- Click in cell C6.
- Press F2 to edit the cell.
Remove the "dirty" column
We now have two columns that contain the same data, one is clean and the other contains the errors. We can't simply delete the "dirty" column as the formulas in the clean column rely on values in column B.
Instead, we can copy the clean data and use Paste Values to paste just the values over the top and remove the underlying formulas.
- Select column C.
- Press CTRL+C to Copy.
- From the Home tab, in the Clipboard group, click the lower half of the Paste button.
- Choose Paste Values.
- Select column B.
- Right-click and choose Delete from the menu.
Apply Number Formatting
It's important to ensure the data in each column has the correct number formatting applied.
- Select the 'Order Date' and 'Ship Date' columns.
- From the Home tab, in the Number group, click the drop-down next to General and select Short Date.
- Select the 'Units Sold' column.
- From the Home tab, in the Number group, click the drop-down next to General and select Number.
- Click the Decrease Decimal button twice.
- Select the 'Unit Price', 'Unit Cost', 'Total Revenue', 'Total Cost', and 'Total Profit' columns.
- From the Home tab, in the Number group, click the drop-down next to General and select Currency.
Format Data as a Table
The final thing to do is to format the dataset as an Excel table. Formatting as a table ensures that any new data added to the dataset will automatically be included in any calculations, PivotTables, or charts we've created.
- Click anywhere in the data.
- Press CTRL+T.
- Place a check next to My table has headers.
- Click OK.
Hints & tips
- When we clear formatting from cells it will also clear any number formatting we have applied.
- Depending on your keyboard, you might need to use CTRL+(+) instead of CTRL+SHIFT+(+) to insert a new column.
Lesson notes are only available for subscribers.