From the course: Machine Learning and AI: Advanced Decision Trees with SPSS
How C5.0 handles missing data
- [Instructor] Let's talk about how missing data is handled in both C4.5 and C5.0. This can become a complicated topic, because the way Quinlan tackles missing data has a ripple effect through the information gain and information gain ratio calculations, making them more complicated. What he's trying to say is that if a particular case is missing a value, it adds no information content. So its presence has to be partly subtracted from those calculations, even though the case is still on the tree. The details can get really complicated, but the basic idea is straightforward. As Quinlan himself says in the book that he wrote to release C4.5, "It's possible to get enmeshed in the details of calculations like these." But the punchline is as follows: "A case with an unknown test outcome is divided into fragments whose weights are proportional to the relative frequencies." He's referring to the known data. So if half the data goes down one branch, and the other half goes down…
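To make the fragment idea concrete, here is a minimal Python sketch of the rule quoted above. It is not Quinlan's code or the SPSS implementation; the function names `fragment_weights` and `split_case`, the branch labels, and the five-case sample are all assumptions made up for illustration. It only shows how a case with an unknown test outcome can be split into weighted fragments in proportion to where the known cases go.

```python
from collections import Counter


def fragment_weights(known_outcomes):
    """Return {branch: share of known cases that take that branch}."""
    counts = Counter(known_outcomes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one case with a known outcome")
    return {branch: n / total for branch, n in counts.items()}


def split_case(case_weight, value, proportions):
    """Route a case: whole if its value is known, fragmented if unknown."""
    if value is not None:
        # Known outcome: the full weight goes down a single branch.
        return {value: case_weight}
    # Unknown outcome: divide the weight in proportion to the known data.
    return {branch: case_weight * p for branch, p in proportions.items()}


# Hypothetical sample: four cases with known outcomes, one unknown (None).
outcomes = ["left", "left", "right", "right", None]
known = [v for v in outcomes if v is not None]

proportions = fragment_weights(known)
fragments = split_case(1.0, None, proportions)
print(fragments)
```

With half of the known cases on each branch, the unknown case contributes a weight of 0.5 to each side, so the fragment weights still sum to the original weight of the case.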