Overview: CART (classification and regression tree) modeling is an
exploratory data analysis method used to study the relationships between a
dependent measure and a large series of possible predictor variables that
themselves may interact. The dependent measure may be a qualitative (nominal
or ordinal) one or a quantitative indicator. For the analyses conducted here,
we typically use the Gini measure of impurity, and prune branches using a
one-standard-error (1 SE) rule.
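The pruning rule mentioned above is usually stated in the CART literature as a one-standard-error rule: among the candidate pruned trees, keep the smallest one whose cross-validated error is within one standard error of the lowest error found. The following is a minimal sketch of that selection step only. The function name, the tuple layout, and the numbers are hypothetical illustrations, not output from or the internals of the Answer Tree program.

```python
def select_tree_one_se(candidates):
    """Pick a pruned tree by the one-standard-error rule.

    candidates: list of (n_terminal_nodes, cv_error, cv_standard_error)
    tuples, one per candidate subtree (an assumed layout for illustration).
    """
    # Find the candidate with the lowest cross-validated error.
    best = min(candidates, key=lambda c: c[1])
    # Allow any tree whose error is within one standard error of that minimum.
    threshold = best[1] + best[2]
    eligible = [c for c in candidates if c[1] <= threshold]
    # Among eligible trees, prefer the one with the fewest terminal nodes.
    return min(eligible, key=lambda c: c[0])


# Hypothetical cross-validation results for five nested subtrees.
subtrees = [
    (1, 0.40, 0.03),
    (3, 0.31, 0.03),
    (5, 0.27, 0.03),
    (9, 0.25, 0.03),
    (14, 0.26, 0.03),
]
chosen = select_tree_one_se(subtrees)
print(chosen)
```

In this made-up example the 9-node tree has the lowest error, but the 5-node tree falls within one standard error of it and is therefore the one selected.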
Reading a CART Diagram:
CART diagrams should be thought of as a "tree trunk" with
progressive splits into smaller and smaller "branches." The initial
"tree trunk" is all of the participants in the study. A series
of "predictor" variables is assessed to see if splitting the
sample based on these predictors leads to better discrimination in the
dependent measure. For instance, if our dependent
measure is whether the patient has received medical case management
services, we would first assess
whether the rate of receiving this service differs between two groups
formed on the basis of one of the predictor variables. The
"most significant" of these predictors would define the first split of the
sample, or the first branching of the tree. Then, for each of the new
groups formed, we would ask whether the subgroup could be further split by
another of the predictor variables so that there are significant
differences in the dependent variable, and so on after each split. The
result at the end of the tree building process is that we have a series of
groups that are maximally different from one another on the dependent
variable. At each step, the optimal binary split is made. Different orientations of the same tree are
sometimes useful to highlight different portions of the results. [Also
note that on the various CART diagrams in our work, certain boxes have
thicker lines around them. This reflects a "quirk" of the Answer Tree
program: at least one box must be highlighted in each diagram. In our
work we usually highlight the top (base) node of the tree, but at other
times might highlight the final or terminal nodes, or certain nodes of
special interest given the topic addressed or hypothesis tested.] In
CART modeling, all splits are binary (or into two groups); the same
variable may be split repeatedly, however.
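To make the idea of an "optimal binary split" concrete, here is a minimal sketch of how one split on a single interval-level predictor could be chosen by the decrease in Gini impurity. The data, variable names, and helper functions are hypothetical, and the sketch leaves out everything else a full CART program does (multiple predictors, categorical predictors, stopping rules, pruning).

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_binary_split(values, labels):
    """Return (cut point, impurity decrease) for the best split 'value <= cut'."""
    n = len(labels)
    parent = gini(labels)
    best_cut, best_gain = None, 0.0
    distinct = sorted(set(values))
    # Try a cut midway between each pair of adjacent distinct values.
    for low, high in zip(distinct, distinct[1:]):
        cut = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= cut]
        right = [y for x, y in zip(values, labels) if x > cut]
        # Impurity after the split is the size-weighted mean of the children.
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        gain = parent - weighted
        if gain > best_gain:
            best_cut, best_gain = cut, gain
    return best_cut, best_gain


# Hypothetical data: age of eight clients and whether each received the service.
age = [22, 25, 31, 38, 44, 52, 60, 67]
served = ["no", "no", "no", "no", "yes", "no", "yes", "yes"]
cut, gain = best_binary_split(age, served)
print(cut, round(gain, 5))
```

On this made-up sample the best cut falls at age 41, separating a group in which no one received the service from a group in which most did. Growing a tree amounts to repeating the same search within each resulting subgroup, which is why the same variable can be split more than once.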

Advantages: The CART
method has certain advantages as a way of looking for patterns in
complicated datasets. First, the level of measurement for the dependent
variable can be nominal or ordinal (categorical) or interval (a
"scale"). Second, the level of measurement for the predictor
variables can likewise be nominal or ordinal or interval. Third, not all predictor
variables need be measured at the same level (nominal, ordinal, interval).
Fourth, missing values in predictor variables can be estimated from other
predictor ("surrogate") variables so that partial data can be used whenever
possible within the tree. Fifth, if an appropriately conservative set of
statistical criteria is used to prune the tree after it is grown, the
resulting models will primarily emphasize strong results without
over-capitalizing on chance, a risk that arises because the
relationships between many variables are being considered at once. On the
other hand, it must always be remembered that CART modeling is
essentially a "stepwise" statistical method and that there is
always a potential for too much to be seen in the data even when very
conservative statistical criteria are used. Nonetheless, in those cases in
which there is not a strong theory in an area that would clearly indicate
which variables are, and are not, likely predictors of some dependent
measure, CART can be very useful in identifying major data trends.
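The "surrogate" idea in the fourth advantage can be sketched as follows: a surrogate split on another predictor is judged by how often it sends cases to the same side as the primary split, and it is then used for cases whose primary predictor is missing. This is a simplified illustration under assumed names and made-up data. It does not reproduce the exact surrogate procedure in the Answer Tree program.

```python
def surrogate_agreement(primary_side, surrogate_side):
    """Share of cases sent the same way by both splits.

    Each argument lists "L", "R", or None (missing) per case. Only cases
    where both splits are known are counted.
    """
    pairs = [(p, s) for p, s in zip(primary_side, surrogate_side)
             if p is not None and s is not None]
    if not pairs:
        return 0.0
    return sum(p == s for p, s in pairs) / len(pairs)


def assign_with_surrogate(primary_side, surrogate_side):
    """Use the primary split where available, else fall back to the surrogate."""
    return [p if p is not None else s
            for p, s in zip(primary_side, surrogate_side)]


# Hypothetical sides for eight cases; None marks a missing primary predictor.
primary = ["L", "L", "R", "R", None, "L", None, "R"]
surrogate = ["L", "L", "R", "L", "R", "L", "L", "R"]

agreement = surrogate_agreement(primary, surrogate)
assigned = assign_with_surrogate(primary, surrogate)
print(round(agreement, 3), assigned)
```

Here the surrogate agrees with the primary split on five of the six cases where both are known, so the two cases with a missing primary value can still be placed in the tree instead of being dropped.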
Programs: All analyses in this Knowledge Base use
the Answer Tree computer program published by SPSS.
Technical Options: Typically the technical options used
for the analyses include the following: Gini measure of impurity for
categorical targets; a minimum parent node size of 10; a minimum child node
size of 5; and the ability to split or combine the categories of
predictor variables repeatedly. In some cases, these technical options are
adjusted for the sample size or based on prior knowledge about the
variables. Look for a hyperlink to Technical Options in various Knowledge
Items to show the exact way the program was set up for the analyses
presented in that Knowledge Item. When sample sizes are large or the
variables are fairly "coarse" ones, the minimum parent node size
is sometimes set at 20 and
the minimum child node size at 10.
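The effect of the two node-size settings can be shown with a small check: a node is only eligible for splitting if it has at least the minimum parent size, and a candidate split is only accepted if both resulting groups have at least the minimum child size. The function below is a hypothetical sketch of that logic, with defaults set to the typical values given above. It is not the Answer Tree program's own code or option names.

```python
def split_allowed(n_parent, n_left, n_right, min_parent=10, min_child=5):
    """Check a candidate binary split against minimum node-size settings."""
    if n_left + n_right != n_parent:
        raise ValueError("child sizes must sum to the parent size")
    # The parent must be large enough, and so must the smaller child.
    return n_parent >= min_parent and min(n_left, n_right) >= min_child


# Typical settings (parent 10, child 5).
print(split_allowed(12, 7, 5))   # both children reach the minimum of 5
print(split_allowed(12, 8, 4))   # one child has only 4 cases
print(split_allowed(9, 5, 4))    # parent itself is below 10

# Stricter settings sometimes used for large samples (parent 20, child 10).
print(split_allowed(30, 22, 8, min_parent=20, min_child=10))
```

Raising the minimums, as is sometimes done for large samples, rules out splits that would otherwise isolate very small groups.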
Note on CART as a Modeling Mechanism: CART is a useful method of
summarizing data, and can show major natural divisions of the clients by
various defining variables. It must be recognized, however, that CART is
analogous to a "forward" stepwise regression analysis and has
all of the possible attendant difficulties of such stepwise regression.
The models presented should be considered suggestive rather than
definitive, as alternate models may also fit the data in a statistically or
theoretically acceptable manner. Note that in most cases, fairly
conservative modeling methods are used.
Technical Citation: Breiman, L.,
Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification
and regression trees. Belmont, CA: Wadsworth. See also, Kennedy, R.
L., Lee, Y., Van Roy, B., Reed, C. D., and Lippmann, R. P. (1997). Solving
data mining problems through pattern recognition. Upper Saddle River,
NJ: Prentice Hall.
Also see: The implications of using alternate algorithms
to develop classification trees for the types of data typically analyzed
in studies by The Measurement Group.
Also see: General index of data mining applications (including those
using CHAID) conducted by The Measurement Group.
[This is the same tree as
that shown above with the orientation changed to show a slightly different
way of looking at the same result.]
