About this lesson
You may encounter missing data while doing analysis. What should you do? In this video, we'll discuss some helpful alternatives.
Exercise files
Download this lesson’s related exercise files.
Dealing With Missing Data.docx (57.3 KB)
Dealing With Missing Data - Solution.docx (56.6 KB)
Quick reference
Dealing With Missing Data
Data sets often contain missing data. In this video, we'll learn how to drop missing data or replace it with something else.
When to use
Use these two methods to either drop null data or change it to something else.
Instructions
Given a DataFrame named df, to drop rows that contain null values (NaN):
df.dropna()
To drop columns that contain null values:
df.dropna(axis=1)
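As a minimal sketch of the two calls above, assuming the same three-column DataFrame built in the video (column A has one missing value):

```python
import numpy as np
import pandas as pd

# Sample data mirroring the lesson: column A has one missing value.
df = pd.DataFrame({"A": [1, np.nan, 3], "B": [4, 5, 6], "C": [7, 8, 9]})

rows_kept = df.dropna()        # drops row 1, the only row containing NaN
cols_kept = df.dropna(axis=1)  # drops column A, the only column containing NaN

print(rows_kept)
print(cols_kept)
```

Both calls return a new DataFrame, so df itself still holds the missing value afterwards.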
To drop a row or column that has fewer than a certain number of non-null values:
df.dropna(thresh=3) #keeps only rows with at least 3 non-null values
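The thresh parameter counts non-null values, not null ones, which is easy to misread. A small sketch, assuming the video's DataFrame with two missing values in column A:

```python
import numpy as np
import pandas as pd

# Column A now has two missing values and one real value.
df = pd.DataFrame({"A": [1, np.nan, np.nan], "B": [4, 5, 6], "C": [7, 8, 9]})

# thresh is the minimum number of non-null values needed to be kept.
keep_1 = df.dropna(axis=1, thresh=1)  # A has 1 non-null value, so it is kept
keep_2 = df.dropna(axis=1, thresh=2)  # A has fewer than 2, so it is dropped

# On rows (the default axis), thresh=3 keeps only fully populated rows here.
full_rows = df.dropna(thresh=3)

print(keep_1.columns.tolist())
print(keep_2.columns.tolist())
print(full_rows.index.tolist())
```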
To replace Null Objects:
df.fillna(value="Bob") #replaces NaN with "Bob"
To replace every null value in df with the average of column A:
df.fillna(value=df['A'].mean())
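Note that a single scalar fills every null value in the DataFrame, whichever column it sits in. The sketch below contrasts that with passing a Series of per-column means. The extra missing value in column B is an assumption added for illustration and is not part of the lesson's data:

```python
import numpy as np
import pandas as pd

# Hypothetical variant of the lesson data: B also has a missing value.
df = pd.DataFrame({"A": [1, np.nan, 3], "B": [4, 5, np.nan], "C": [7, 8, 9]})

# A scalar replaces every NaN, so B's gap also receives A's mean.
filled_a_mean = df.fillna(value=df["A"].mean())

# A Series indexed by column name fills each column with its own mean.
filled_own_mean = df.fillna(value=df.mean())

print(filled_a_mean)
print(filled_own_mean)
```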
Hints & tips
- To Drop Null Objects: df.dropna()
- To Replace Null Objects: df.fillna(value="Bob")
- 00:05 Okay, in this video I want to talk about missing data.
- 00:07 So up until now we've just been using random data for the most part and
- 00:11 all of our data frames have had full data.
- 00:15 There's been no missing things in there.
- 00:17 But a lot of times when you're dealing with data analysis you're going to have
- 00:20 missing data, right?
- 00:21 You get some dataset that's maybe not up to date or just not complete, and
- 00:26 there's a bunch of null values inside of it.
- 00:29 So what do you do when that happens?
- 00:30 Remember we said we can create data frames with dictionaries back at
- 00:33 the beginning of the course, so let's just create one real quick.
- 00:37 So let's just call this data and set this equal to a dictionary.
- 00:41 Using a dictionary is a quicker way to do it.
- 00:46 So since we want exact data we don't want to make random stuff.
- 00:51 Let's just create this.
- 00:52 So first column is A and it'll have values of 1, 2, and 3.
- 00:56 The second column will be B and it'll have values of 4, 5, and 6.
- 01:03 And then finally the last column is C and
- 01:07 it will have values of 7, 8 and 9.
- 01:11 So we can create a data frame with this, let's just call it df and
- 01:15 set that equal to a pd.DataFrame, and then just pass in that data.
- 01:22 So if we run this guy we can see, okay, we get these nice columns A,
- 01:27 B, C, rows 0, 1, 2.
- 01:29 We didn't provide any row index labels, so it just uses the index numbers of 0,
- 01:34 1, and 2, as we've seen in the past.
- 01:37 Now this is complete data, there's no missing stuff here, right?
- 01:40 So let's come up here and let's just add some missing data.
- 01:43 So instead of 2, let's use np.nan, and
- 01:48 this little guy, this constant, represents a null value.
- 01:52 We've seen null values before, these NaN objects, right?
- 01:56 So now we have incomplete data, right?
- 01:58 We have missing data, so what can we do?
- 02:00 Well, there's a couple of different things we want to look at,
- 02:02 we can drop the data and we can change the data.
- 02:05 So let's look at drop first, we can use the dropna function, na stands for
- 02:10 not available, right?
- 02:12 That's our NaN, right?
- 02:13 So we can call df.dropna, and this is a function.
- 02:19 And if we run this, we'll see any row that has a null value gets dropped.
- 02:25 Well, what if we didn't want rows?
- 02:27 Maybe we wanted to get rid of columns that had null values.
- 02:30 Well, we can Shift+Tab up here and
- 02:32 we can see that there are some parameters, and the first one is axis.
- 02:36 Remember when we looked at columns and row headings earlier,
- 02:42 remember the axis for columns is 1 and the axis for rows is 0, so it's 0 by default.
- 02:47 So we can just change this to axis=1, if we just want to do this for columns.
- 02:53 And now we'll notice our column A is gone because column A had that null value,
- 02:58 right?
- 02:58 So that's kind of cool.
- 03:00 One thing to note, if we call our data frame again, oops,
- 03:03 it's back, that's because we didn't specify this to be inplace.
- 03:07 Like we have to do with so many things,
- 03:09 if we want this to be permanent we have to pass inplace=True.
- 03:13 So okay, there's just one null value in this, right?
- 03:18 So what if we have more than one?
- 03:21 So let's come up here and instead of 3, let's add another null value.
- 03:25 So let's run this and this, and now we have two null values here, right?
- 03:29 So let's say we only want to keep columns that have enough real values.
- 03:34 Well, this one now has only one, and we can set a threshold to equal whatever we want.
- 03:40 So let's say we want a threshold of 1, when we specify that nothing gets removed.
- 03:45 That's because the threshold is the minimum number of non-null values a column needs
- 03:48 in order to be kept, and this column has 1.
- 03:51 So if we change this to 2, now it gets rid of that column because it has fewer than 2 non-null values.
- 03:57 So threshold and axis,
- 03:58 those are the main two things you want to remember with dropna.
- 04:01 You can always hit Shift+Tab, in order to pull this little guy up to see
- 04:05 the different things that you can play with and that's cool.
- 04:08 So that's how you drop things, right?
- 04:10 How do we replace them?
- 04:12 Well, we can do that with the fillna function.
- 04:15 So we can call df.fillna and it's a function,
- 04:20 and then we just pass in what value we want.
- 04:22 So we want to replace all of the null values and we can do anything we want.
- 04:27 We can change it to John, right?
- 04:28 So instead of these being null, now they're John,
- 04:32 we could change them to 41 if we wanted, right?
- 04:36 Whatever we want, that's pretty cool.
- 04:38 Now we can do some different math, we can set the value based on df and
- 04:44 let's say, we want to deal with column A.
- 04:49 Let's go back here and change this, let's put this back to 3 as it was.
- 04:54 So if we run this again, there's only 1 null value, we have 1, null, and 3.
- 05:01 So if we want to change this to the max of this column,
- 05:07 then the null gets changed to 3 because the max was 3.
- 05:11 Similarly we can do min, right?
- 05:14 Now it gets changed to 1, if we want to do the average we can call mean.
- 05:19 Now it gets changed to 2, because 3 plus 1 is 4,
- 05:21 divided by 2 is 2, and that's the average, 2.
- 05:25 Again, like before, if we call our data frame this gets changed back to
- 05:28 null because we didn't designate this to be inplace.
- 05:34 And so you can do that and that's really all there is to it.
- 05:37 Now, I'm not going to advise you on what you should change your objects to or
- 05:41 what you should change your null values to.
- 05:44 There's a whole class of statistical theory behind what you should change your
- 05:49 null values to, and it's going to depend on what you're trying to do,
- 05:53 the analysis you're trying to run.
- 05:55 So we're not going to get into anything like that,
- 05:57 that's a much more advanced topic.
- 05:58 I just want you to see exactly how you can change them and
- 06:01 then you could change them into whatever you want.
- 06:04 So that's how to deal with missing data.
- 06:05 In the next video, we'll look at grouping things.
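The fill steps from the walkthrough can be sketched as follows, assuming the same DataFrame with values 1, null, and 3 in column A. The final reassignment is shown as an alternative to inplace=True and is not something the video itself does:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, np.nan, 3], "B": [4, 5, 6], "C": [7, 8, 9]})

# dropna returns a new DataFrame; df itself is left unchanged.
df.dropna()
still_missing = int(df["A"].isna().sum())

# max, min, and mean skip NaN by default when computing the fill value.
fill_max = df.fillna(value=df["A"].max())    # null becomes 3.0
fill_min = df.fillna(value=df["A"].min())    # null becomes 1.0
fill_mean = df.fillna(value=df["A"].mean())  # null becomes 2.0

# To make a change stick, assign the result back to the name.
df = fill_mean
print(df)
```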