Locked lesson.
About this lesson
Tidy up data by removing erroneous spaces, random line-breaks and weird non-printing characters by combining 3 useful Excel text functions.
Exercise files
Download this lesson’s related exercise files.
05-04-Data Cleaning Techniques-Start.xlsx433 KB 05-04-Data Cleaning Techniques-Complete.xlsx
456.1 KB 5.04 data-cleaning-techniques - Exercise.docx
53 KB Exercise - Data Cleaning Techniques.xlsx
79 KB 5.04 data-cleaning-techniques - Exercise solution.docx
150.5 KB Exercise Solution - Data Cleaning Techniques.xlsx
78.7 KB
Quick reference
Data Cleaning Techniques
Learn how to clean up a messy dataset so it's ready for analysis.
When to use
We clean our datasets whenever we have blank rows and cells, duplicates, strange formatting, and other inconsistencies in our data. Removing these ensures our data is ready for further analysis using Pivot Tables, Pivot Charts, and formulas.
Instructions
It's important to ensure that any dataset that we want to analyze is clean.
Datasets that contain blank rows and cells, duplicate values, inconsistent formatting, spelling errors, etc. need to be cleaned prior to analyzing using Pivot Tables and Charts.
If we do not clean up our dataset first, our analysis results could be harder to read and interpret and in many cases incorrect.
It's good to get into the habit of cleaning data particularly if we download datasets from external sources.
There are several techniques we can use to clean our data.
Remove Blank Rows
Blank rows can cause problems in datasets so it is best to remove them prior to analysis.
- Select the data.
- Press CTRL+G and then click Special.
- Choose Blanks.
This will select all blank rows in the dataset. We can now remove them in one go.
- From the Home tab, in the Cells group, click Delete and Delete Sheet Rows.
Clear Formatting
A dataset might contain rogue formatting that needs to be removed. We might have values highlighted in bold, background fills and borders around cells that we don't want.
- Press CTRL+A to select All.
- From the Home tab, in the Editing group, click Clear and Clear Formats.
Remove Duplicates
It's always worth checking that our datasets do not contain complete duplicates as this can throw off our analysis.
- Click anywhere in the dataset.
- From the Data tab, in the Data Tools group, click Remove Duplicates.
In this example, we are looking for exact duplicates where each item in each column is the same.
- Ensure all columns are selected.
- Ensure My data has headers is selected.
- Click OK.
Clean a Column with Text Functions
In this example, we need to tidy up column B. We can see that we have some erroneous spaces and inconsistent case. We can deal with both of these issues using text functions.
The PROPER function
The PROPER function changes the case of the selected text to proper case (first letter of each word is capitalized).
- Select column C.
- Press CTRL+SHIFT+(+) to insert a new column to the right.
The PROPER function only has one argument.
- Press Enter and copy the formula down.
All text in the column is converted to proper case.
The TRIM function
The TRIM function is used to remove any unwanted spaces in a cell. We can add the TRIM function to our PROPER formula.
- Click in cell C6.
- Press F2 to edit the cell.
- Press Enter and copy the formula down.
The erroneous spaces have been removed.
The CLEAN function
The CLEAN function removes any non-printing characters and manual line breaks from a cell. We can add this to our formula.
- Click in cell C6.
- Press F2 to edit the cell.
Remove the "dirty" column
We now have two columns that contain the same data, one is clean and the other contains the errors. We can't simply delete the "dirty" column as the formulas in the clean column rely on values in column B.
Instead, we can copy the clean data and use Paste Values to paste just the values over the top and remove the underlying formulas.
- Select column C.
- Press CTRL+C to Copy.
- From the Home tab, in the Clipboard group, click the lower half of the Paste button.
- Choose Paste Values.
- Select column B.
- Right-click and choose Delete from the menu.
Apply Number Formatting
It's important to ensure the data in each column has the correct number formatting applied.
- Select the 'Order Date' and 'Ship Date' columns.
- From the Home tab, in the Number group, click the drop-down next to General and select Short Date.
- Select the 'Units Sold' column.
- From the Home tab, in the Number group, click the drop-down next to General and select Number.
- Click the Decrease Decimal button twice.
- Select the 'Unit Price', 'Unit Cost', 'Total Revenue', 'Total Cost', and 'Total Profit' columns.
- From the Home tab, in the Number group, click the drop-down next to General and select Currency.
Format Data as a Table
The final thing to do is to format the dataset as an Excel table. Formatting as a table ensures that any new data added to the dataset will automatically be included in any calculations, PivotTables, or charts we've created.
- Click anywhere in the data.
- Press CTRL+T.
- Place a check next to My table has headers.
- Click OK.
Hints & tips
- When we clear formatting from cells it will also clear any number formatting we have applied.
- Depending on your keyboard, you might need to use CTRL+(+) instead of CTRL+SHIFT+(+) to insert a new column.
- 00:04 When it comes to analyzing data in Excel,
- 00:06 one of the most important things you should be doing prior to even attempting
- 00:11 to analyze your data is making sure that you're starting out with a clean data set.
- 00:17 So what do we mean by a clean data set?
- 00:20 Well, a data set that is consistent in its format and
- 00:24 is effectively ready for analysis.
- 00:27 And once again, if you're importing data or
- 00:30 downloading data from third party software,
- 00:32 a lot of the time when you import it into Excel, it doesn't come across perfectly.
- 00:37 We could have things like blank cells, blank rows, weird formatting,
- 00:41 maybe the columns aren't formatted correctly.
- 00:44 So the whole process of tidying up our data is what we refer to as data
- 00:48 cleaning or data cleansing.
- 00:50 So I'm going to show you some of my favorite techniques to quickly tidy up
- 00:54 a data set.
- 00:55 Now if you take a look at my spreadsheet,
- 00:57 you can see I have a data set here that doesn't look particularly great.
- 01:01 It's definitely not ready to be putting into a pivot table or a pivot chart.
- 01:05 Now I will say this is quite a convoluted example.
- 01:09 I've basically just messed up this data set myself, but all of the techniques that
- 01:13 I'm going to show you are valid on any data set you download and need to tidy up.
- 01:17 So let's take a look at some of the issues that we have with our data set.
- 01:20 Well, I can see that I have some strange formatting.
- 01:23 Some cells have a gray background fill,
- 01:25 some cells have a border around the outside.
- 01:27 So I'm going to want to get rid of all of that so everything's consistent.
- 01:30 I can see I need to widen out my columns.
- 01:33 I can see that I have some blank rows that I should probably get rid of.
- 01:37 And if I take a look in column B,
- 01:38 I can see that I've got some weird things going on here.
- 01:41 I have inconsistent case.
- 01:43 I have some weird spacing issues.
- 01:46 So we're going to deal with all of this in our data cleansing process.
- 01:50 Now the first thing I'm going to do here is I'm going to widen out my column so
- 01:54 I can actually see all of my data.
- 01:55 So let's select all of our columns, and I'm just going to hover my mouse over any
- 02:01 of the column dividers and double-click, which is going to auto fit the cells.
- 02:06 Now notice because I have this title up here, that one's wider than it should be,
- 02:10 so I'm just going to simply drag this one in a little bit.
- 02:13 But you can see that all of the other columns are now perfectly sized.
- 02:17 The next thing I'm going to deal with is the blank rows that I have in this data.
- 02:21 You can see here, I have a blank row here, a blank row here, one just here, and
- 02:25 I might have this pattern continuing throughout my data set.
- 02:29 Now it is going to be quite tedious to go through and delete all of these manually.
- 02:34 I could right-click and go to Delete to get rid of the blank row.
- 02:38 But if I have 100,000 rows in my data set, I'm not going to want to do that manually
- 02:43 because it's going to take me quite a bit of time.
- 02:45 Now there is a much quicker way that you can do this using Go to Special.
- 02:49 Now we need to select all of our data, and the problem when you have blank rows in
- 02:54 your data, if I was to click somewhere here and press Ctrl+A,
- 02:57 it's only going to select the part of my data in between the two blank rows.
- 03:01 So this is one of the issues when you have blank rows in your data set.
- 03:04 So in general, what I would do is I would select all of the columns again.
- 03:09 Now, once again, because I have this heading in my spreadsheet, and
- 03:13 you might not have a heading in there,
- 03:15 I don't want to include this heading when I'm removing rows.
- 03:18 So I'm simply going to hold down Ctrl and
- 03:21 select the cells that I don't want to include.
- 03:24 So now that I have my dataset selected, I can press Ctrl+G,
- 03:28 click on Special, and then I can choose blanks.
- 03:32 And what that's going to do is it's going to
- 03:35 find all of the blank rows in my data set.
- 03:37 And now I can delete them all in one go simply by going up to the Home ribbon,
- 03:41 clicking the drop-down underneath Delete, and choosing Delete Sheet Rows.
- 03:46 And like magic, they are all gone.
- 03:48 The next thing I'm going to deal with is this inconsistent formatting.
- 03:52 So now I've deleted out those blanks,
- 03:54 I can press Ctrl+A to select all of my data set.
- 03:57 I'm going to go to the Home ribbon, click on Clear, and
- 04:00 I'm going to choose to Clear Formats.
- 04:02 And that's basically going to get rid of all of that background shading and
- 04:05 any borders we have applied.
- 04:07 Now bear in mind if you do have any currency formatting or
- 04:10 number formatting on any cells, it's going to get rid of those as well.
- 04:13 But we're going to go in and reapply those a bit later on.
- 04:16 The next thing I would probably check for
- 04:18 is I would make sure that I don't have any complete duplicate rows
- 04:22 in my data set because duplicates can really throw off your analysis.
- 04:26 Now when I'm talking about duplicates, I'm talking about where every single
- 04:31 column is duplicated, so exact duplicate records.
- 04:34 In the Data Tools group, we have a Remove Duplicates button, so let's click this.
- 04:40 We want to make sure we have a tick in all of the columns because we're looking for
- 04:44 exact duplicates.
- 04:45 Let's click on OK.
- 04:47 It's found one duplicate, and it's removed it for me.
- 04:50 What else do I want to do here?
- 04:52 Well, I need to sort out column B, because we have some inconsistent case.
- 04:56 And I can also see that for some of these, I have some weird spacing going on.
- 05:01 So what I'm going to do here is I'm going to add a new column,
- 05:04 Ctrl+Shift+Plus, and
- 05:05 this is going to be our helper column where we're going to clean this data.
- 05:09 Now, the first function that you can use to change the case is PROPER.
- 05:14 And proper case means that the first letter is capitalized.
- 05:16 So if you have things in lowercase or uppercase that you want to change to
- 05:20 proper case, this is the function you would use.
- 05:23 So if I select the first cell, and we only have one argument here, and
- 05:27 that is the text that we want to clean, hit Enter.
- 05:29 And if I double-click to copy this down,
- 05:32 you can see that it's now fixed any casing issues that we have.
- 05:36 Now what it hasn't fixed is this erroneous or this weird spacing.
- 05:40 So if you recall a couple of lessons ago, we used a function to remove
- 05:45 extra spaces from our data set, and that was the trim function.
- 05:49 So let's work in the formula bar.
- 05:51 I'm just going to simply add TRIM onto the front here.
- 05:56 That's all we need to do.
- 05:57 Let's close the bracket, hit Enter.
- 05:59 And now when we copy this formula down,
- 06:01 we should find that all of those weird spacing issues are now fixed.
- 06:06 Now another thing that you could add in here is the CLEAN function.
- 06:12 Now what CLEAN does is if you have any manual line breaks in cells you
- 06:16 want to get rid of, CLEAN will get rid of those.
- 06:19 It will also get rid of any weird non-printing characters that you have in
- 06:23 cells as well.
- 06:23 So sometimes when you import data you get sort of weird symbols in some cells,
- 06:28 CLEAN will get rid of those.
- 06:30 So you can always just add that into the formula, double-click to copy down, and
- 06:34 we've pretty much fixed all of the issues in this column.
- 06:37 Now the final thing you need to do here is now that we have this clean column,
- 06:41 we don't have a need for column B.
- 06:43 But if I just go in and try to delete column B, I get these reference errors,
- 06:48 and that is because we have formulas in column C that refer to column B.
- 06:53 So once I've cleaned my data set, what I'll do is I'll press
- 06:58 Ctrl+C to copy it, and then I'll simply go to Paste special and
- 07:02 I will paste just the values over the top.
- 07:05 And what that will do is it will get rid of those underlying formulas and
- 07:09 just leave the text.
- 07:11 So it means that we can safely delete out column B and
- 07:14 there we have our perfectly clean column.
- 07:17 The final thing to do here is just apply the correct number formatting.
- 07:20 So where we have some dates just here.
- 07:23 I'm going to change that to Short Date.
- 07:25 And then when we have currencies, I'm going to change these to,
- 07:29 let's just put currency format, like so.
- 07:32 Final thing to do, Ctrl+T to put everything into a table, and
- 07:37 we are ready to analyze our data.
Lesson notes are only available for subscribers.