From the course: Machine Learning and AI: Advanced Decision Trees with SPSS

How C5.0 handles missing data

- [Instructor] Let's talk about how missing data is handled in both C4.5 and C5.0. This can become a complicated topic, and the reason is that the way Quinlan tackles missing data has a ripple effect through the information gain and information gain ratio calculations, making them more complicated. What he's trying to say is that if a particular case is missing a value for the test, it adds no information content. So its presence has to be partly subtracted from those calculations, even though the case is still on the tree. That can make things really complicated, but the basic idea is straightforward. As Quinlan himself says in the book that he wrote to release C4.5, "It's possible to get enmeshed in the details of calculations like these." But the punchline is as follows. "A case with an unknown test outcome is divided into fragments whose weights are proportional to the relative frequencies." He's referring to the known data. So if half the data goes down one branch, and the other half goes down…
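The fragment idea in the quote can be illustrated with a few lines of Python. This is only a minimal sketch, not Quinlan's code or anything from SPSS: the function name `split_into_fragments`, the branch labels, and the toy data are all hypothetical, and it assumes a case starts with a weight of 1.0 that is shared out according to how the cases with known outcomes are distributed across the branches.

```python
from collections import Counter


def split_into_fragments(known_outcomes, case_weight=1.0):
    """Share one case with an unknown test outcome across the branches.

    known_outcomes: branch labels of the cases whose outcome IS known.
    case_weight: weight the case carries when it reaches this node
                 (1.0 for a whole case, less if it was already split).
    Returns a dict mapping each branch to the fragment weight sent down it.
    """
    counts = Counter(known_outcomes)
    total_known = sum(counts.values())
    if total_known == 0:
        raise ValueError("need at least one case with a known outcome")
    # Each fragment's weight is proportional to the relative frequency
    # of that branch among the known cases.
    return {
        branch: case_weight * n / total_known
        for branch, n in counts.items()
    }


# Hypothetical toy data: half the known cases go down each branch.
even_split = split_into_fragments(["left", "left", "right", "right"])

# Hypothetical toy data: three quarters of the known cases go left.
uneven_split = split_into_fragments(["left", "left", "left", "right"])
```

In the first call the missing case is sent down both branches with a weight of one half each; in the second, three quarters of it goes left and one quarter goes right. The fragment weights always add up to the weight the case arrived with, so nothing is gained or lost by the split.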
