CHAID
CHAID, for Chi-square Automatic Interaction Detector (or Detection, depending upon the source consulted), is an
exploratory method used to study the relationship between a dependent variable and a
series of predictor variables. CHAID modeling selects a set of predictors
and their interactions that optimally predict the dependent measure. The
developed model is a classification tree (or data partitioning tree) that
shows how major "types" formed from the independent (predictor
or splitter) variables differentially predict a criterion or dependent
variable. The Measurement Group uses a second-generation CHAID algorithm
called Exhaustive CHAID analysis as implemented in the SPSS Answer Tree
Program (Versions 2.1 and 3.1).
Overview: CHAID modeling is an exploratory data
analysis method used to study the relationships between a dependent
measure and a large series of possible predictor variables that themselves
may interact. The dependent measure may be a qualitative (nominal or
ordinal) one or a quantitative indicator. For qualitative variables, a
series of chi-square analyses are conducted between the dependent and
predictor variables. For quantitative variables, analysis of variance
methods are used where intervals (splits) are determined optimally for the
independent variables so as to maximize the ability to explain a dependent
measure in terms of variance components.
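The chi-square step for a qualitative dependent measure can be sketched as follows. This is a minimal illustration using only the Python standard library, not the Answer Tree implementation: the function names (chi2_sf, chi_square), the variable names, and the data are all made up for the example. Each predictor is cross-tabulated against the dependent measure, and the predictor with the smallest probability is the candidate for the split. Category merging, Bonferroni adjustment, and the analysis-of-variance branch for quantitative measures are not shown.

```python
import math
from collections import Counter

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-y) * sum of y^k / k! for k = 0 .. df/2 - 1
        return math.exp(-y) * sum(y ** k / math.factorial(k) for k in range(df // 2))
    # Odd df: erfc(sqrt(y)) + exp(-y) * sum of y^(k - 1/2) / Gamma(k + 1/2)
    tail = sum(y ** (k - 0.5) / math.gamma(k + 0.5) for k in range(1, (df - 1) // 2 + 1))
    return math.erfc(math.sqrt(y)) + math.exp(-y) * tail

def chi_square(xs, ys):
    """Pearson chi-square test of independence for two categorical sequences."""
    n = len(xs)
    row, col, obs = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    stat = 0.0
    for r in row:
        for c in col:
            expected = row[r] * col[c] / n
            stat += (obs[(r, c)] - expected) ** 2 / expected
    df = (len(row) - 1) * (len(col) - 1)
    p = chi2_sf(stat, df) if df > 0 else 1.0
    return stat, df, p

# Made-up data: (housing, gender, quality of life, number of clients)
cells = [("stable", "F", "high", 4), ("stable", "M", "high", 4),
         ("unstable", "F", "high", 2), ("unstable", "M", "high", 1),
         ("stable", "F", "low", 1), ("stable", "M", "low", 1),
         ("unstable", "F", "low", 3), ("unstable", "M", "low", 4)]
rows = [{"housing": h, "gender": g, "qol": q} for h, g, q, k in cells for _ in range(k)]

outcome = [r["qol"] for r in rows]
results = {var: chi_square([r[var] for r in rows], outcome)
           for var in ("housing", "gender")}
best = min(results, key=lambda var: results[var][2])

for var, (stat, df, p) in results.items():
    print(f"{var}: chi-square={stat:.3f}, df={df}, p={p:.4f}")
print("most significant predictor:", best)
```

In this made-up sample, housing discriminates quality of life far more strongly than gender does, so housing would be the candidate for the first split.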
Reading a CHAID Diagram:
CHAID diagrams should be thought of as a "tree trunk" with
progressive splits into smaller and smaller "branches." The initial
"tree trunk" is all of the participants in the study. A series
of "predictor" variables are assessed to see if splitting the
sample based on these predictors leads to a statistically significant
discrimination in the dependent measure. For instance, if our dependent
measure is quality of life and our potential predictor variables (or
splitting variables) are client characteristics, we would first assess
whether there are different levels of quality of life for two or more
groups formed on the basis of one of the predictor variables. The
"most significant" of these would define the first split of the
sample, or the first branching of the tree. Then, for each of the new
groups formed, we would ask if the subgroup could be further significantly
split by another of the predictor variables. And so on. After each split,
we ask if the new subgroup can be further split on another variable so
that there are significant differences in the dependent variable. The
result at the end of the tree building process is that we have a series of
groups that are maximally different from one another on the dependent
variable. At each step, statistical tests are made to determine if a
significant split can be made (correcting very conservatively for the fact
that we are examining many possible ways of splitting the data at one
time). In the example, the ultimate result would be a series of groups
defined by one or more of the predictor variables, that are different from
one another in overall quality of life levels. Note that the tree can be
pictured in an orientation from "top-to-bottom" or
"left-to-right" or "right-to-left" and that the
results are identical. Different orientations of the same tree are
sometimes useful to highlight different portions of the results. [Also
note that on the various CHAID diagrams in our work, certain boxes have
thicker lines around them. This is a "quirk" of the Answer Tree
program that at least one box must be highlighted in the diagrams. In our
work we usually highlight the top (base) node of the tree, but at other
times might highlight the final or terminal nodes, or certain nodes of
special interest given the topic addressed or hypothesis tested.]
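The tree-building process described above can be sketched in a few lines. This is a deliberately simplified illustration, not Exhaustive CHAID and not the Answer Tree program: it assumes a nominal dependent measure, splits a node on every category of the chosen predictor (no category merging), and uses unadjusted chi-square probabilities (no Bonferroni correction). The function names (chi2_p, grow, show), the default thresholds, and the data are assumptions made for the example.

```python
import math
from collections import Counter

def chi2_p(xs, ys):
    """P-value of a Pearson chi-square test of independence between two sequences."""
    n, row, col, obs = len(xs), Counter(xs), Counter(ys), Counter(zip(xs, ys))
    df = (len(row) - 1) * (len(col) - 1)
    if df == 0:
        return 1.0  # a constant variable cannot discriminate
    stat = sum((obs[(r, c)] - row[r] * col[c] / n) ** 2 / (row[r] * col[c] / n)
               for r in row for c in col)
    y = stat / 2.0
    if df % 2 == 0:  # closed-form upper tail for even degrees of freedom
        return math.exp(-y) * sum(y ** k / math.factorial(k) for k in range(df // 2))
    # closed-form upper tail for odd degrees of freedom
    tail = sum(y ** (k - 0.5) / math.gamma(k + 0.5) for k in range(1, (df - 1) // 2 + 1))
    return math.erfc(math.sqrt(y)) + math.exp(-y) * tail

def grow(rows, target, predictors, alpha=0.05, min_parent=10, min_child=5):
    """Return a nested dict describing the tree grown from `rows`."""
    node = {"n": len(rows), "counts": dict(Counter(r[target] for r in rows)),
            "split": None, "children": {}}
    if len(rows) < min_parent:
        return node
    best_p, best_var = 1.0, None
    for var in predictors:
        sizes = Counter(r[var] for r in rows)
        if len(sizes) < 2 or min(sizes.values()) < min_child:
            continue  # no split possible, or a child node would be too small
        p = chi2_p([r[var] for r in rows], [r[target] for r in rows])
        if p < best_p:
            best_p, best_var = p, var
    if best_var is None or best_p >= alpha:
        return node  # no significant split: this is a terminal node
    node["split"] = best_var
    remaining = [v for v in predictors if v != best_var]
    for value in sorted({r[best_var] for r in rows}):
        subset = [r for r in rows if r[best_var] == value]
        node["children"][value] = grow(subset, target, remaining,
                                       alpha, min_parent, min_child)
    return node

def show(node, label="all participants", indent=0):
    """Print the tree from the trunk (top node) down to the terminal nodes."""
    print(" " * indent + f"{label}: n={node['n']} {node['counts']}")
    for value, child in node["children"].items():
        show(child, f"{node['split']}={value}", indent + 2)

# Made-up data: (housing, support, quality of life, number of clients)
cells = [("stable", "yes", "high", 9), ("stable", "yes", "low", 1),
         ("stable", "no", "high", 9), ("stable", "no", "low", 1),
         ("unstable", "yes", "high", 7), ("unstable", "yes", "low", 3),
         ("unstable", "no", "high", 1), ("unstable", "no", "low", 9)]
rows = [{"housing": h, "support": s, "qol": q} for h, s, q, k in cells for _ in range(k)]

tree = grow(rows, "qol", ["housing", "support"])
show(tree)
```

In this made-up sample the trunk splits first on housing. The unstable-housing branch splits again on support, while the stable-housing branch does not, which is the kind of interaction between predictors that a CHAID diagram is meant to reveal.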
[This is the same tree as the one shown a second time below, with the orientation changed to
show a slightly different way of looking at the same result. The model
shows the total health-related quality of life score for a group of
HIV/AIDS patients at the time of their enrollment into innovative service
programs for their HIV disease.]
Advantages: The CHAID
method has certain advantages as a way of looking for patterns in
complicated datasets. First, the level of measurement for the dependent
variable can be nominal (categorical), ordinal
(ordered categories ranked from small to large), or interval (a
"scale"). Second, the level of measurement for the predictor
variables can be nominal, ordinal, or interval. Third, not all predictor
variables need be measured at the same level (nominal, ordinal, interval).
Fourth, missing values in predictor variables can be treated as a
"floating category" so that partial data can be used whenever
possible within the tree. Fifth, if an appropriately conservative set of
statistical criteria is used, the resulting models will primarily
emphasize strong results without over-capitalizing on chance. On the
other hand, it must always be remembered that CHAID modeling is
essentially a "stepwise" statistical method and that there is
always a potential for too much to be seen in the data even when very
conservative statistical criteria are used. Nonetheless, in those cases in
which there is not a strong theory in an area that would clearly indicate
which variables are, and are not, probable predictors of some dependent
measure, CHAID will be very useful in identifying major data trends.
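The idea of keeping cases with missing predictor values can be illustrated with a short sketch. It assumes that a missing value is coded as None and simply gives such cases a category of their own in the cross-tabulation, so they are not dropped. The label "(missing)" and the function name are hypothetical. How the program later merges this floating category with other categories is not shown.

```python
from collections import Counter

MISSING = "(missing)"  # hypothetical label for the floating category

def crosstab_with_missing(predictor, outcome):
    """Cross-tabulate two sequences, keeping cases whose predictor value is None."""
    table = Counter()
    for x, y in zip(predictor, outcome):
        table[(MISSING if x is None else x, y)] += 1
    return table

# Made-up data: two clients have no recorded housing status
housing = ["stable", None, "unstable", "stable", None, "unstable"]
qol = ["high", "low", "low", "high", "high", "low"]

table = crosstab_with_missing(housing, qol)
for (category, level), count in sorted(table.items()):
    print(f"{category:10s} {level:5s} {count}")
```

All six made-up cases contribute to the table, including the two with no housing value.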
Known Issues/Problems with the Method: The program Answer Tree used here permits a
Bonferroni-type probability to be used to correct for the number of
different ways a single predictor variable can be split (see Biggs,
deVille, and Suen, 1991). The program does not permit one to correct for
the number of potential splitter (predictor) variables being considered.
Also, Monte Carlo studies have not established the implications of mixing
nominal, ordinal, and continuous indicators in the prediction of either a
nominal, ordinal, or continuous dependent variable. Monte Carlo studies
have also not been extensively used to study the implications of different
potential ways of handling missing observations. Additionally, CHAID is
primarily a step-forward model-fitting method. Known problems with
step-forward regression model fitting probably apply to this
method of analysis. Finally, CHAID is a sequential fitting algorithm and
its statistical tests are sequential with later effects being dependent
upon earlier ones, and not simultaneous as would be the case in a
regression model or analysis of variance where all effects are fit
simultaneously.
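The two levels of correction discussed above can be made concrete with a small arithmetic sketch. A Bonferroni-type adjustment multiplies a probability by the number of tests considered and caps the result at 1. The first adjustment below corresponds to the number of ways a single predictor can be split. The second, for the number of candidate predictors, is not something the program does; it is shown only as a correction an analyst could apply by hand. All numbers and names are made up for illustration.

```python
def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted probability: multiply by the number of tests, cap at 1."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return min(1.0, p_value * n_tests)

# Hypothetical values, for illustration only
raw_p = 0.004        # unadjusted probability of the best split on one predictor
ways_to_split = 6    # assumed number of ways that predictor could be split
n_predictors = 12    # assumed number of candidate predictors in the analysis

within_predictor = bonferroni(raw_p, ways_to_split)
across_predictors = bonferroni(within_predictor, n_predictors)

print(f"adjusted for ways of splitting: {within_predictor:.3f}")
print(f"also adjusted for predictors:   {across_predictors:.3f}")
```

With these made-up numbers the split passes a .05 criterion after the first adjustment but not after the second, which shows why the absence of a correction for the number of predictors matters when many predictors are screened.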
Programs: All analyses in this Knowledge Base use
the Answer Tree computer program published by SPSS.
Our analyses use the Exhaustive CHAID method, which tends to be more computationally
demanding than the original CHAID method, but which produces more accurate results. In most cases, in
addition to the analyses presented here, we have conducted alternate
analyses using alternate methods, or with alternate ways of setting up the
same problem, to confirm the general pattern of results presented is not
dependent upon the statistical analysis method.
Technical Options: Typically the technical options used
for the analyses include the following: Bonferroni .05 adjustment of the
probabilities; a minimum parent node size of 10; a minimum child node size
of 5; the ability to split or combine continuously the categories of
predictor variables. In some cases, these technical options are adjusted
for the sample size or based on prior knowledge about the variables. Look
for a hyperlink to Technical Options in various Knowledge Items to show the exact way the program
was set up for the analyses presented in that Knowledge Item. When sample
sizes are large or the variables are fairly "coarse" ones, the minimum parent node size
is sometimes set at 20 and the minimum child node size at 10.
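The typical settings above can be summarized as a small configuration sketch. The class and field names are hypothetical and do not correspond to Answer Tree option names. The rule encoded in allows_split reflects one reading of the settings: a node is split only if it holds at least the minimum parent size, every resulting child holds at least the minimum child size, and the adjusted probability is below the criterion.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ChaidOptions:
    """Hypothetical container for the technical options described in the text."""
    alpha: float = 0.05    # criterion applied to the Bonferroni-adjusted probability
    min_parent: int = 10   # smallest node that may be split
    min_child: int = 5     # smallest node that a split may create

    def allows_split(self, parent_n, child_ns, adjusted_p):
        """True if a proposed split satisfies the size and significance rules."""
        return (parent_n >= self.min_parent
                and min(child_ns) >= self.min_child
                and adjusted_p < self.alpha)

TYPICAL = ChaidOptions()
LARGE_SAMPLE = ChaidOptions(min_parent=20, min_child=10)

# A node of 12 clients split into two groups of 6, adjusted probability .01
print(TYPICAL.allows_split(12, [6, 6], 0.01))
print(LARGE_SAMPLE.allows_split(12, [6, 6], 0.01))
```

The same proposed split is accepted under the typical settings and rejected under the stricter settings used for large samples or coarse variables.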
Note on CHAID as a Modeling Mechanism: CHAID is a useful method of
summarizing data, and can show major natural divisions of the clients by
various defining variables. It must be recognized, however, that CHAID is
analogous to a "forward" stepwise regression analysis and has
all of the possible attendant difficulties of such stepwise regression.
The models presented should be considered suggestive but not absolutely definitive, as there may be
alternate models that also fit the data in a statistically or theoretically acceptable manner.
Note that in most cases, fairly conservative modeling methods
are used because Bonferroni-adjusted probabilities are used to correct the tests for
individual predictor variables. In virtually all cases here, we have used statistical
criteria which are fairly conservative, so we do not show every possible
"significant" relationship, but instead focus on those that are
"important" in a statistical sense, and presumably more likely
to be replicated in new samples.
Technical Citation: The analyses
conducted here use the algorithm discussed by D. Biggs, B. deVille, and E.
Suen (1991), A method of choosing multiway partitions for classification
and decision trees, Journal of Applied Statistics, 18(1), 49-62. Biggs,
deVille, and Suen show that their algorithm more correctly protects
standard statistical testing assumptions than earlier CHAID and AID
(Automatic Interaction Detection) algorithms. We thank SPSS for giving us
access to their internal statistical algorithms so that we could fully
understand the calculation steps made in this proprietary software.
[This is the same tree as
that shown above with the orientation changed to show a slightly different
way of looking at the same result. The model shows the total
health-related quality of life score for a group of HIV/AIDS patients at
the time of their enrollment into innovative service programs for their
HIV disease.]
Also see: The implications of using alternate algorithms
to develop classification trees for the types of data typically analyzed
in studies by The Measurement Group.
Also see: General index of data mining applications (including those
using CHAID) conducted by The Measurement Group.
Copyright © 1999-2005 by The Measurement Group LLC. All rights reserved. This may not be current and will not be updated.