CART or C&RT

CART, for Classification And Regression Trees, is an exploratory method used to study the relationship between a dependent variable and a series of predictor variables. CART modeling is also called C&RT in some programs or statistical texts. CART modeling selects a set of predictors and their interactions that optimally predict the dependent measure. The developed model is a classification tree (or data partitioning tree) that shows how major "types" formed from the independent (predictor or splitter) variables differentially predict a criterion or dependent variable. The Measurement Group uses a second-generation CART algorithm coded in the Answer Tree program. CART is an alternative to Exhaustive CHAID analysis for developing a classification tree.


Overview: CART modeling is an exploratory data analysis method used to study the relationships between a dependent measure and a large series of possible predictor variables that may themselves interact. The dependent measure may be qualitative (nominal or ordinal) or quantitative. For the analyses conducted here, we typically use the Gini measure of impurity for categorical dependent measures and prune branches using a 1 SEM (one standard error) rule.
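As a rough illustration of the Gini measure mentioned in the Overview, the sketch below computes the Gini impurity of a group of cases and the size-weighted impurity of two subgroups. The function names are hypothetical and chosen only for this example; this is not code from the Answer Tree program, whose internals are not shown here.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def weighted_gini(left, right):
    """Impurity of a two-way split, weighting each subgroup by its size."""
    n = len(left) + len(right)
    if n == 0:
        return 0.0
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)
```

A group in which every case has the same outcome has an impurity of 0, and a group split evenly between two outcomes has an impurity of 0.5. A split is attractive to the extent that the weighted impurity of the two subgroups is lower than the impurity of the group being split.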

Reading a CART Diagram: CART diagrams should be thought of as a "tree trunk" with progressive splits into smaller and smaller "branches." The initial "tree trunk" is all of the participants in the study. A series of "predictor" variables are assessed to see if splitting the sample based on these predictors leads to better discrimination in the dependent measure. For instance, if our dependent measure is whether the patient has received medical case management services, we would first assess, for each predictor variable, whether the two groups formed by splitting on that variable differ in their rates of receiving this service. The "most significant" of these candidate splits defines the first split of the sample, or the first branching of the tree. Then, for each of the new groups formed, we ask whether the subgroup can be further split on one of the predictor variables so that there are significant differences in the dependent variable, and we repeat this question for every new subgroup. In CART modeling, all splits are binary (into two groups), and at each step the optimal binary split is made; the same variable may, however, be split repeatedly. The result at the end of the tree-building process is a series of groups that are maximally different from one another on the dependent variable. Different orientations of the same tree are sometimes useful to highlight different portions of the results.

[Also note that on the various CART diagrams in our work, certain boxes have thicker lines around them. This is a "quirk" of the Answer Tree program, which requires that at least one box be highlighted in the diagrams. In our work we usually highlight the top (base) node of the tree, but at other times might highlight the final or terminal nodes, or certain nodes of special interest given the topic addressed or hypothesis tested.]
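The search for a single split described under Reading a CART Diagram can be sketched as follows. This is a minimal illustration, assuming numeric predictors and the Gini criterion, and it scores candidate splits by the decrease in impurity rather than by a significance test. The function names and the data layout (a list of dictionaries) are hypothetical and are not taken from the Answer Tree program.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of outcome labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_binary_split(rows, target, predictors):
    """Return (predictor, threshold, impurity_decrease) for the best split.

    Returns None when no predictor has two distinct values to split on.
    """
    n = len(rows)
    parent_impurity = gini([row[target] for row in rows])
    best = None
    for name in predictors:
        values = sorted({row[name] for row in rows})
        # Candidate thresholds are midpoints between adjacent distinct values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [row[target] for row in rows if row[name] <= threshold]
            right = [row[target] for row in rows if row[name] > threshold]
            child_impurity = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
            decrease = parent_impurity - child_impurity
            if best is None or decrease > best[2]:
                best = (name, threshold, decrease)
    return best
```

Growing a full tree would apply the same search again to each of the two subgroups produced, which is why the same predictor can appear at more than one level of the tree.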

Advantages: The CART method has certain advantages as a way of looking for patterns in complicated datasets. First, the level of measurement for the dependent variable can be nominal or ordinal (categorical) or interval (a "scale"). Second, the level of measurement for the predictor variables can likewise be nominal, ordinal, or interval. Third, not all predictor variables need be measured at the same level. Fourth, missing values in predictor variables can be estimated from other predictor ("surrogate") variables so that partial data can be used whenever possible within the tree. Fifth, if an appropriately conservative set of statistical criteria is used to prune the tree after it is grown, the resulting models will primarily emphasize strong results without over-capitalizing on chance, a risk that arises because the relationships among many variables are being considered at once. On the other hand, it must always be remembered that CART modeling is essentially a "stepwise" statistical method and that there is always a potential for too much to be seen in the data even when very conservative statistical criteria are used. Nonetheless, in those cases in which there is no strong theory that would clearly indicate which variables are, and are not, likely predictors of some dependent measure, CART will be very useful in identifying major data trends.
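One conservative pruning criterion of the kind referred to above is the "1 SEM rule" named in the Overview, commonly described as the one-standard-error rule: among the candidate pruned trees, choose the smallest one whose estimated error is within one standard error of the lowest estimated error. The sketch below assumes the candidate trees have already been summarized as tuples of (number of terminal nodes, estimated error, standard error); the function name and that tuple layout are hypothetical.

```python
def choose_tree_one_se(candidates):
    """Pick the smallest tree whose error is within one SE of the minimum.

    candidates: list of (n_terminal_nodes, estimated_error, standard_error).
    """
    if not candidates:
        raise ValueError("at least one candidate tree is required")
    lowest = min(candidates, key=lambda c: c[1])
    cutoff = lowest[1] + lowest[2]  # minimum error plus one standard error
    eligible = [c for c in candidates if c[1] <= cutoff]
    return min(eligible, key=lambda c: c[0])
```

With made-up candidates such as (12, 0.30, 0.03), (7, 0.31, 0.03), (3, 0.32, 0.03), and (1, 0.45, 0.04), the rule would select the tree with 3 terminal nodes: its estimated error is not the lowest, but it is within one standard error of the lowest and the tree is much simpler.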

Programs: All analyses in this Knowledge Base use the Answer Tree computer program published by SPSS.

Technical Options: Typically the technical options used for the analyses include the following: Gini measure of impurity for categorical targets; a minimum parent node size of 10; a minimum child node size of 5; and the ability to continue splitting or combining the categories of predictor variables. In some cases, these technical options are adjusted for the sample size or based on prior knowledge about the variables. When sample sizes are large or the variables are fairly "coarse" ones, the minimum parent node size is sometimes set at 20 and the minimum child node size at 10. Look for a hyperlink to Technical Options in various Knowledge Items to show the exact way the program was set up for the analyses presented in that Knowledge Item.
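The node-size options above act as stopping rules: a node is split only if it is large enough to be a parent and both resulting groups are large enough to be children. A minimal sketch of that check follows, using the sizes stated in the text as defaults. The class and function names are hypothetical and do not correspond to settings or identifiers in the Answer Tree program.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GrowthOptions:
    """Node-size limits; defaults follow the typical values stated in the text."""
    min_parent_size: int = 10
    min_child_size: int = 5


# The alternative limits mentioned for large samples or "coarse" variables.
LARGE_SAMPLE_OPTIONS = GrowthOptions(min_parent_size=20, min_child_size=10)


def split_allowed(n_parent, n_left, n_right, options=GrowthOptions()):
    """Return True if a proposed binary split respects the node-size limits."""
    if n_left + n_right != n_parent:
        raise ValueError("child sizes must add up to the parent size")
    return (
        n_parent >= options.min_parent_size
        and n_left >= options.min_child_size
        and n_right >= options.min_child_size
    )
```

Under the default limits, a node of 12 cases could be split 6 and 6 but not 9 and 3. Under the larger limits, a node of 18 cases could not be split at all.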

Note on CART as a Modeling Mechanism: CART is a useful method of summarizing data, and can show major natural divisions of the clients by various defining variables. It must be recognized, however, that CART is analogous to a "forward" stepwise regression analysis and has all of the possible attendant difficulties of such stepwise regression. The models presented should be considered suggestive rather than definitive, because alternate models may also fit the data in a statistically or theoretically acceptable manner. Note that in most cases, fairly conservative modeling methods are used.

Technical Citation: Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and regression trees. Belmont, CA: Wadsworth. See also Kennedy, R. L., Lee, Y., Van Roy, B., Reed, C. D., and Lippmann, R. P. (1997). Solving data mining problems through pattern recognition. Upper Saddle River, NJ: Prentice Hall.

Also see: The implications of using alternate algorithms to develop classification trees for the types of data typically analyzed in studies by The Measurement Group.

Also see: General index of data mining applications (including those using CHAID) conducted by The Measurement Group.

[This is the same tree as that shown above with the orientation changed to show a slightly different way of looking at the same result.]





Copyright © 1999-2005 by The Measurement Group LLC. All rights reserved. This may not be current and will not be updated.