

NUIT, Newcastle University

IBM SPSS STATISTICS for Windows, Intermediate / Advanced: A Training Manual for Intermediate / Experienced Users, Faculty of Medical Sciences. Dr S. T. Kometa

Table of Contents
Ordinary Regression
Repeated Measures Analysis
Data Analysis Using Crosstabulation Techniques
Types of Survival Analysis / Kaplan-Meier
Binary Logistic Regression
Multivariate Analysis of Variance (MANOVA)


Ordinary Linear Regression Model with Two Independent Variables
Why fit a regression model?
- To build a model for predicting the outcome variable for a new sample of data.
- To see how well the independent (explanatory) variables explain the dependent (response) variable.
- To identify which subset of many independent variables is most effective for estimating the dependent variable.
Open the data set called world95.sav. To do this, follow these instructions:
1. Select Start -> Programs -> Statistical Software -> IBM SPSS Statistics -> IBM SPSS Statistics 19.
2. From the SPSS menu bar select File -> Open -> Data… a dialogue box will appear.
3. In the text area for File name: type \\campus\software\dept\spss and then click on Open.
4. Select the file world95.sav and click on Open.
5. Spend some time studying the data file. How many cases and variables make up the data file? Cases:…….. Variables:………
6. Are there any missing values in the data? Yes No
Assumptions for Ordinary Linear Regression

- All observations should be independent.
- Your data should not suffer from multicollinearity. That is, the independent variables should not be highly related. To find out if your data suffer from multicollinearity, look at the tolerances for each of the independent variables in the model. These are printed if you select Collinearity Diagnostics in the Linear Regression Statistics dialogue box. If any of the tolerances are small (less than 0.1, for example), multicollinearity may be a problem.
- Residuals from the model fit should follow a normal distribution.
- Each of the independent (explanatory or predictor) continuous variables should have a linear relationship with the dependent (response or outcome) variable. It is always a good idea to check this assumption using scatterplots.
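The tolerance rule of thumb above can be illustrated outside SPSS. The following Python sketch (with made-up numbers, not the world95.sav values) computes the tolerance for one of two predictors as 1 - r², where r is the correlation between the predictors; with only two predictors this is exactly the tolerance SPSS would report.

```python
# Illustrative check of the tolerance rule described above (not SPSS output).
# Tolerance of a predictor is 1 - R^2 from regressing it on the other
# predictors; with just two predictors this reduces to 1 - r^2, where r is
# their correlation. The data below are invented for the sketch.
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x1 = [10, 20, 30, 40, 50]
x2 = [12, 19, 33, 38, 52]   # nearly a copy of x1, so highly collinear

r = pearson_r(x1, x2)
tolerance = 1 - r ** 2
print(round(tolerance, 3))  # well below 0.1, so multicollinearity is a problem
```

Here the tolerance falls far below the 0.1 cut-off quoted above, which is exactly the situation the Collinearity Diagnostics output is designed to flag.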

Simple Linear Regression
Is the female literacy of a country useful in predicting its life expectancy? We want to build a model of the form:

Average female life expectancy = b0 + b1 * female literacy + ε

where Average female life expectancy (lifeexpf) is the dependent (response, y, or outcome) variable, females who can read (%) (lit_fema) is the independent (explanatory or predictor) variable, b0 is the intercept of the line of best fit, b1 is its slope and ε is the error term. Is there a linear relationship between average female life expectancy and female literacy? Produce a scatter plot to help you answer this question.
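Before running the procedure in SPSS, it may help to see what the line of best fit is numerically. This Python sketch (invented data, NOT the world95.sav values) computes b0 and b1 by the usual least-squares formulas, which is what SPSS does behind the Coefficients table.

```python
# A minimal sketch of the least-squares line described above, on made-up data:
# slope b1 = Sxy / Sxx, intercept b0 = ybar - b1 * xbar.
x = [40.0, 55.0, 70.0, 85.0, 97.0]   # hypothetical female literacy (%)
y = [50.0, 57.0, 63.0, 71.0, 78.0]   # hypothetical female life expectancy

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((a - xbar) ** 2 for a in x)
sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
b1 = sxy / sxx
b0 = ybar - b1 * xbar
print(f"fitted line: lifeexpf = {b0:.2f} + {b1:.3f} * lit_fema")
```

The slope b1 is the expected change in life expectancy per one-point increase in literacy; compare this with the B column of your own Coefficients table.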


To produce the output for the regression model, from the menus choose: Analyze -> Regression -> Linear….
Dependent Variable: Average female life expectancy [lifeexpf]
Independent(s): females who can read (%) [lit_fema]
Statistics… select Descriptives. Make sure that Estimates and Model fit are selected. Select Collinearity diagnostics. Under Residuals select Casewise diagnostics and Outliers outside 1.0 standard deviations.
Plots…
Y: *ZRESID X: *ZPRED
Click Next
Y: *ZPRED X: Dependent
Select Histogram and Normal probability plot.
These steps will generate lots of output. Now examine the output and attempt to interpret it. Look at the table Descriptive Statistics. What will you conclude?

Look at the table Correlations. What are the hypotheses being tested? What will you conclude?

Look at the table Model Summary. What do you conclude?


Look at the table ANOVA. Explain what the Degrees of Freedom (DF), Sums of Squares (SS) and Mean Squares (MS) represent. How are they related?

State the hypotheses being tested in the ANOVA table. How is the test statistic calculated and what would your decision be?

Look at the table Coefficients. What do you conclude? Write an equation for the regression model and use it to predict the average female life expectancy of a country whose female literacy is 86%. What are the hypotheses being tested?


The last two columns of the Coefficient table give information about collinearity statistics. Looking at the Tolerance, can you say if there is any problem with multicollinearity?

The rest of the output deals with the residuals. This helps to find out if the assumptions required to run a linear regression are met and to identify any outliers or influential cases. Look at the table Casewise Diagnostics. What is a standardised residual? What do you conclude?
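As a rough illustration of what the Casewise Diagnostics table reports, the sketch below (hypothetical observed and predicted values, not SPSS output) computes standardised residuals by dividing each raw residual by the residual standard deviation.

```python
# Sketch of the standardised residuals mentioned above: raw residual divided
# by the residual standard deviation (SPSS's *ZRESID). Data are hypothetical.
import math

observed = [50.0, 57.0, 63.0, 71.0, 78.0]
predicted = [49.5, 58.2, 62.0, 72.1, 77.7]   # from some fitted model

residuals = [o - p for o, p in zip(observed, predicted)]
s = math.sqrt(sum(r ** 2 for r in residuals) / (len(residuals) - 2))  # df = n - 2
zresid = [r / s for r in residuals]
# cases with |z| > 1 would be flagged under "Outliers outside 1.0 standard deviations"
print([round(z, 2) for z in zresid])
```

With the 1.0 standard-deviation cut-off chosen in the dialogue above, any case whose standardised residual exceeds 1 in absolute value appears in the Casewise Diagnostics table.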

Look at the table Residual Statistics. What do you conclude?

Look at the Histogram and Normal P-P Plot. What do you conclude about the residuals?

Now look at the two Scatter Plots. What do you conclude?

Can you think of any restrictions when using your model to predict female life expectancy?

How would you validate a model like this?

Multiple Linear Regression
While a simple linear regression has just one independent variable, a multiple linear regression can have more than one. The following is a model with two independent variables:

Average female life expectancy = b0 + b1 * infant mortality + b2 * fertility + ε

where infant mortality (deaths per 1000 live births) [babymort] is the number of infant deaths during the first year per thousand live births and average number of kids [fertilty] is the average number of children per family. We found that literacy explained 67% of the variability of life expectancy. Now we examine a model using infant mortality (babymort) and fertility (fertilty) to predict life expectancy. To run the analysis, select Analyze -> Regression -> Linear…. Click on Reset.
Dependent Variable: Average female life expectancy [lifeexpf]
Independent(s): average number of kids [fertilty], infant mortality (deaths per 1000 live births) [babymort]
Case Labels: country
Statistics… select Descriptives. Make sure that Estimates and Model fit are selected. Select Collinearity diagnostics.
Plots… Produce all partial plots
Save… Predicted Values: Standardized
Look at the table Descriptive Statistics. What do you conclude?

Look at the table Correlations. What do you conclude?


Look at the table Model Summary. What do you conclude?

Look at the ANOVA table. What do you conclude?

Look at the table Coefficients. What do you conclude? Write an equation for the regression model and use it to predict the female life expectancy of a country whose fertility is 3 and infant mortality is 23 per 1000 live births.
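Once you have read the coefficients off the Coefficients table, the prediction asked for above is just arithmetic. The sketch below uses HYPOTHETICAL placeholder coefficients, not the ones SPSS will report for world95.sav; substitute the values from your own output.

```python
# Plugging into a multiple-regression equation as asked above. The
# coefficients are assumed for illustration only -- replace them with the
# B column from your Coefficients table.
b0, b_babymort, b_fertilty = 82.0, -0.25, -1.10   # hypothetical values

def predict_lifeexpf(babymort, fertilty):
    return b0 + b_babymort * babymort + b_fertilty * fertilty

print(round(predict_lifeexpf(babymort=23, fertilty=3), 1))
```

Whatever coefficients you obtain, the prediction for a country with fertility 3 and infant mortality 23 is found in exactly this way: multiply each predictor by its coefficient and add the constant.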


Repeated Measures Analysis of Variance
Does the anxiety rating of a person affect performance on a learning task? Twelve subjects were assigned to one of two anxiety groups on the basis of an anxiety test, and the number of errors made in four blocks of trials on a learning task was measured. We use the repeated measures analysis of variance technique to study the data. Open the SPSS data file called anxiety2. Notice that there is one case for each subject and four trial variables (trial1, trial2, trial3 and trial4). In the repeated measures analysis-of-variance technique, we distinguish two types of factors in the model: between-subjects factors and within-subjects factors. A between-subjects factor, as the name suggests, divides the subjects into discrete subgroups, for example anxiety in this data file. Anxiety divides the cases into two groups of high and low anxiety scores. A within-subjects factor is any factor that distinguishes measurements made on the same subject. For example, trial distinguishes the four measurements taken for each subject. To produce the output in this example, from the menus choose: Analyze -> General Linear Model -> Repeated Measures…
Within-Subject Factor Name: replace factor1 with trial
Number of Levels: 4
Click Add and click Define
Within-Subjects Variables (trial): trial1, trial2, trial3 and trial4
Between-Subjects Factor(s): anxiety
Options… select Homogeneity tests
Contrasts… Factors: trial; Contrast: Repeated (click Change)
Click on Continue and then OK. Examine the results and try to interpret them.
Between-Subjects Test
The test of between-subjects effects is shown in the table Tests of Between-Subjects Effects. Examine this table. What do you conclude?


Multivariate Tests The multivariate table contains tests of the within-subjects factor, trial, and the interaction of the within-subjects factor and the between-subjects factor, trial*anxiety.

Examine the Multivariate Tests table. What do you conclude?

Assumptions The vector of the dependent variables follows a normal distribution, and the variance-covariance matrices are equal across the cells formed by the between-subjects effects. The test of this assumption is shown in the table Box's Test of Equality of Covariance Matrices. Examine this table; what do you conclude?

It is assumed that the variance-covariance matrix of the dependent variables is circular (spherical). The test of this assumption is shown in the table Mauchly's Test of Sphericity. Examine this table; what do you conclude?

If the test of sphericity is not satisfied, use the Greenhouse-Geisser, Huynh-Feldt or Lower-bound correction to draw your conclusion. Now let us look at the table Tests of Within-Subjects Effects. Examine the table; what can you conclude?


Contrasts A repeated contrast compares one level of trial with the subsequent level. The first column (Source) indicates the effect being tested. For example, the label trial tests the hypothesis that, averaged over the two anxiety groups, the mean of the specified contrast is zero. The second column, trial, represents the contrasts. For example, Level 1 vs Level 2 represents the transformation trial1 – trial2. This compares the first level of trial with the second level, and so on. The label trial*anxiety tests the hypothesis that the mean of the specified contrast is the same for the two anxiety groups. Now look at the Tests of Within-Subjects Contrasts. What do you conclude?


Data Analysis Using Crosstabulation Techniques in SPSS
Introduction
Crosstabulation is a powerful technique that helps you to describe the relationships between categorical (nominal or ordinal) variables. With crosstabulation, we can produce the following statistics:
- Observed counts and percentages
- Expected counts and percentages
- Residuals
- Chi-Square
- Relative Risk and Odds Ratio for a 2 x 2 table
- Kappa measure of agreement for an R x R table
Examples will be used to demonstrate how to produce these statistics using SPSS. The data set used for the demonstration comes with SPSS and is called GSS_93.sav. It has 67 variables and 1500 cases (observations). Open this data file, which is located in the SPSS folder. Study the data file in order to understand it before performing the following exercises.
Exercise 1: An R x C Table with Chi-Square Test of Independence
Chi-Square tests the hypothesis that the row and column variables are independent, without indicating strength or direction of the relationship. Like most statistical tests, certain assumptions must be met to use the Chi-Square test successfully. They are:

No cell should have an expected value (count) less than 1, and no more than 20% of the cells should have expected values (counts) less than 5.

In the SPSS file, there is a variable called relig, short for religion (Protestant, Catholic, Jewish, None, Other), and another called region4 (Northeast, Midwest, South, West). In this example, we want to find out if religious preferences vary by region of the country. To produce the output, from the menu choose: Analyze -> Descriptive Statistics -> Crosstabs….
Row(s): Religious Preference [relig]
Column(s): Region [region4]
Statistics… select Chi-Square, click Continue then OK
In the SPSS output, the Pearson chi-square, likelihood-ratio chi-square, and linear-by-linear association chi-square are displayed. Fisher's exact test and Yates' corrected chi-square are computed for 2x2 tables. State the null and alternative hypotheses being tested.
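What the Pearson chi-square test computes can be sketched in a few lines of Python. The 2x3 table below is invented, not the GSS_93 relig-by-region table; the sketch also checks the expected-count assumption stated above.

```python
# A pure-Python sketch of what the Chi-Square test above computes: expected
# counts E = row_total * col_total / N, then X^2 = sum((O - E)^2 / E) over
# all cells. The table is invented for illustration.
observed = [[120, 90, 60],
            [ 80, 70, 80]]

row_tot = [sum(row) for row in observed]
col_tot = [sum(col) for col in zip(*observed)]
n = sum(row_tot)

expected = [[r * c / n for c in col_tot] for r in row_tot]
chi_sq = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
             for i in range(len(observed)) for j in range(len(observed[0])))

# assumption check from the text: no more than 20% of cells with E < 5
small_cells = sum(e < 5 for row in expected for e in row)
print(round(chi_sq, 2), small_cells)
```

SPSS reports the same statistic together with its degrees of freedom, (rows - 1)(columns - 1), and a p-value; the footnote under the Chi-Square Tests table is where the count of cells with expected value below 5 appears.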


Examine the output. What conclusion can you draw from the output?

However, you will notice that certain assumptions are not met. The results could be misleading. What should you do? We will discuss this further in example 2 below.

Example 2: Percentages, Expected Values, Residuals and Omitting Categories
From the last example, we noticed that 40% of the cells had expected counts less than 5, so this assumption was violated. Since Other and Jewish had just 15 cases each, we can drop them out of the analysis by using Select Cases. In other words, religious preference is restricted to Protestant, Catholic and None. To produce the output, use Select Cases from the Data menu to select cases with relig not equal to 3 and relig not equal to 5 (relig ~= 3 & relig ~= 5). Call up the dialogue box for Crosstabs. Reset it to default and select:
Row(s): Region [region4]
Column(s): Religious Preference [relig]
Statistics… select Chi-Square; under Nominal select Contingency coefficient, Phi and Cramer's V, Lambda, and Uncertainty coefficient; click Continue
Cells… Counts: select Expected; Percentages: select Row; Residuals: select Adjusted standardized; click Continue then OK
Now examine the output and try to interpret it. You can pivot the table so each group of statistics appears in its own panel: double-click the table and drag Region from the row tray to the right of Statistics. Look at the Region4 * Religious Preference Crosstabulation. What can you conclude?


Look at the Chi-Square Tests table. What can you conclude?

Look at the Symmetric Measures table. What can you conclude about the strength of the relationship between religion preference and region?

Examine the table Directional Measures what do you conclude?


Example 3: Tests Within Layers of a Multiway Table
A multiway table allows you to examine the relationship between two categorical variables within levels of a controlling variable. For example, is the relationship between marital status and view of life the same for males and females? This example shows you how to answer this type of question in SPSS. Use Select Cases from the Data menu to select cases with marital not equal to 4 (marital ~= 4). Can you think of any reason why we have decided to exclude cases where marital status is equal to 4 (i.e. separated)?

Call up the Crosstabs dialogue box. Click Reset to restore the dialogue box defaults. Then select:
Row(s): Marital status [marital]
Column(s): Is Life Exciting or Dull [life]
Layer 1 of 1: Respondent's Sex [sex]
Statistics… select Chi-Square
Cells… Counts: select Expected; Percentages: select Row; Residuals: select Standardized and Adjusted standardized
Examine the results and try to interpret them. Is there a relationship between marital status and view on life? Is this relationship the same for males and females?

Example 4: The Relative Risk and Odds Ratio for a 2 x 2 Table
The Relative Risk for 2 x 2 tables is a measure of the strength of the association between the presence of a factor and the occurrence of an event. If the confidence interval for the statistic includes the value 1, you cannot assume that the factor is associated with the event. The odds ratio can be used as an estimate of the relative risk when the occurrence of the event is rare.
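The two measures just defined can be sketched directly from a 2 x 2 table. The counts below are invented (rows: factor present/absent; columns: event / no event), not taken from GSS_93.

```python
# Sketch of the Relative Risk and Odds Ratio for a 2x2 table, as defined
# above. RR compares the event probability in the two rows; OR is the
# cross-product ratio. Counts are invented for illustration.
a, b = 30, 70    # factor present: event, no event
c, d = 10, 90    # factor absent:  event, no event

risk_exposed = a / (a + b)
risk_unexposed = c / (c + d)
relative_risk = risk_exposed / risk_unexposed
odds_ratio = (a * d) / (b * c)

print(round(relative_risk, 2), round(odds_ratio, 2))
```

Notice that here the event is not rare (probability 0.3 in the exposed row), so the odds ratio is noticeably larger than the relative risk; this is exactly why the text restricts the OR-as-RR approximation to rare events.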


In the GSS93 data file, there is a variable (dwelown) that measures home ownership (owner or renter) and another variable (vote92) that measures voting (voted or did not vote). We would like to find out whether home owners are more likely to vote than renters. Through the Variable View window, note all the codes that have been used for the two variables of interest. For example, dwelown uses code 3 for other and code 8 for don't know, while vote92 uses code 3 for not eligible and code 4 for refused. Select the cases with dwelown less than 3 and vote92 less than 3. From the menus choose: Data -> Select Cases… Select If condition is satisfied and click If. Enter dwelown < 3 & vote92 < 3 as the condition and click Continue then OK. In the Crosstabs dialogue box, click Reset to restore the dialogue box defaults, and then select:
Row(s): Homeowner or Renter [dwelown]
Column(s): Voting in 1992 Election [vote92]
Cells… Percentages: select Row, click Continue then OK
Examine and interpret the output. From the crosstabulation table, what can you conclude?

Recall the Crosstabs dialogue box and select:
Statistics… select Risk, click Continue then OK
Examine the output and interpret it. Look at the table called Risk Estimate; what can you conclude?


The odds ratio should be used as an approximation to the relative risk only when the following conditions are met:
- The probability of the event is small (<0.1). This condition guarantees that the odds ratio makes a good approximation to the relative risk.
- The design of the study is case-control.
These conditions are not met in the present example. In a smoking and lung cancer study the conditions are met, and you can use the odds ratio.
Example 5: The Kappa Measure of Agreement for an R x R Table
Cohen's kappa measures the agreement between the evaluations of two raters when both are rating the same object. A value of 1 indicates perfect agreement; a value of 0 indicates that agreement is no better than chance. Values of kappa greater than 0.75 indicate excellent agreement beyond chance; values between 0.40 and 0.75 indicate fair to good agreement; and values below 0.40 indicate poor agreement. Kappa is only available for tables in which both variables use the same category values and both variables have the same number of categories. The table structure for the kappa statistic is a square R x R with the same row and column categories, because each subject is classified or rated twice. For example, doctor A and doctor B diagnose the same patients as schizophrenic, manic depressive, or behaviour-disordered: do the two doctors agree or disagree in their diagnoses? Two teachers assess a class of 18-year-old students: do the teachers agree or disagree in their assessments? In the GSS93 subset data file, we have variables that record the educational level of the respondent's father (padeg) and mother (madeg). Is there any agreement between the father's and mother's educational levels? To produce the output, use Select Cases from the Data menu to select cases with madeg not equal to 2 and padeg not equal to 2 (madeg ~= 2 & padeg ~= 2).
In the Crosstabs dialogue box, click Reset to restore the dialogue box defaults, and then select: Row(s): Father’s Highest Degree [padeg] Column(s): Mother’s Highest Degree [madeg] Statistics… Select kappa, click Continue Cells… Percentages: select Total, click Continue then OK Examine and interpret the output. Look at the tables from the output. What can you conclude?
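The kappa statistic SPSS reports can be sketched from first principles: kappa = (Po - Pe) / (1 - Pe), where Po is the observed proportion of agreement (the diagonal of the square table) and Pe is the agreement expected by chance from the marginals. The agreement table below is invented, not the padeg-by-madeg table.

```python
# Sketch of Cohen's kappa as described above. Rows are rater A's
# classifications, columns rater B's; the counts are invented.
table = [[25,  5,  0],
         [ 4, 30,  6],
         [ 1,  5, 24]]

n = sum(sum(row) for row in table)
row_tot = [sum(row) for row in table]
col_tot = [sum(col) for col in zip(*table)]

p_obs = sum(table[i][i] for i in range(len(table))) / n       # observed agreement
p_exp = sum(r * c for r, c in zip(row_tot, col_tot)) / n ** 2  # chance agreement
kappa = (p_obs - p_exp) / (1 - p_exp)
print(round(kappa, 3))
```

For this invented table kappa lands in the 0.40–0.75 band, which the text above describes as fair to good agreement.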


Intraclass Correlation Coefficients (ICC)
We can use the ICC to assess inter-rater agreement when there are more than two raters. For example, the International Olympic Committee (IOC) trains judges to assess gymnastics competitions. How can we find out if the judges are in agreement? The ICC can help us to answer this question. Judges have to be trained to ensure that good performances receive higher scores than average performances, and average performances receive higher scores than poor performances, even though two judges may differ on the precise score that should be assigned to a particular performance. Use the data set judges.sav to illustrate how to use SPSS to calculate the ICC. Open the data set. From the menu select: Analyze -> Scale -> Reliability Analysis…
Items: judge1, judge2, judge3, judge4, judge5, judge6, judge7
Statistics… Under Descriptives for, check Item. Check Intraclass correlation coefficient.
Model: Two-Way Random
Type: Consistency
Confidence interval: 95%
Test value: 0
Examine and interpret the output. What would you conclude?

Types of Survival Analyses and When to Use Them in SPSS
Life Tables: Use life tables if cases can be classified into meaningful, equal time intervals. A life table can be used to calculate the probability of a terminal event during any interval under study.
Kaplan-Meier: Use this technique if cases cannot be classified into equal time intervals as above. This is common in many clinical and experimental studies.
Cox Regression: Use this technique if you want to see the relation between survival time and a predictor variable, for instance age or tumour type.


Using Kaplan-Meier Survival Analysis to Test Competing Pain Relief Treatments A pharmaceutical company is developing an anti-inflammatory medication for treating chronic arthritic pain. Of particular interest is the time it takes for the drug to take effect and how it compares to an existing medication. Shorter times to effect are considered better. The results of a clinical trial are collected in pain_medication.sav. This data file is stored in the following folder: \\campus\software\dept\spss. Open the file and study it. Use Kaplan-Meier Survival Analysis to examine the distribution of "time to effect" and compare the effectiveness of the two treatments.

To run a Kaplan-Meier Survival Analysis, from the menus choose:

Analyze -> Survival -> Kaplan-Meier…

Select Time to effect [time] as the Time variable. Select Effect status [status] as the Status variable. Click Define Event. Under Value(s) Indicating Event Has Occurred type 1 in the text area next to Single value:. Click Continue. Select Treatment [treatment] as a Factor. Click Compare Factor. Select Log rank, Breslow, and Tarone-Ware. Click Continue. Click Options in the Kaplan-Meier dialog box. Select Quartiles in the Statistics group and Survival in the Plots group. Click Continue. Click OK in the Kaplan-Meier dialog box.

Interpretation Survival Table The survival table is a descriptive table that details the time until the drug takes effect. The table is sectioned by each level of Treatment, and each observation occupies its own row in the table. Time: The time at which the event or censoring occurred. Status: Indicates whether the case experienced the terminal event or was censored. Cumulative Proportion Surviving at the Time: The proportion of cases surviving from the start of the table until this time. When multiple cases experience the terminal event at the


same time, these estimates are printed once for that time period and apply to all the cases whose drug took effect at that time. N of Cumulative Events: The number of cases that have experienced the terminal event from the start of the table until this time. N of Remaining Cases: The number of cases that, at this time, have yet to experience the terminal event or be censored. Survival Functions (Curves) The survival curves give a visual representation of the life tables. The horizontal axis shows the time to event. In this plot, drops in the survival curve occur whenever the medication takes effect in a patient. The vertical axis shows the probability of survival. Thus, any point on the survival curve shows the probability that a patient on a given treatment will not have experienced relief by that time. The plot for the New drug below that of the Existing drug throughout most of the trial, which suggests that the new drug may give faster relief than the old. To determine whether these differences are due to chance, look at the comparisons tables. Mean and Medians for Survival Time The means and medians for survival time table offers a quick numerical comparison of the "typical" times to effect for each of the medications. Since there is a lot of overlap in the confidence intervals, it is unlikely that there is much difference in the "average" survival time. Percentiles The percentiles table gives estimates of the first quartile, median, and third quartile of the survival distribution. The interpretation of percentiles for survival curves is that the 75th percentile is the latest time that at least 75 percent of the patients have yet to feel relief. Overall Comparisons This table provides overall tests of the equality of survival times across groups. Since the significance values of the tests are all greater than 0.05, you cannot determine a difference between the survival curves. 
Summary
With the Kaplan-Meier Survival Analysis procedure, you have examined the distribution of time to effect for two different medications. The comparison tests show that there is not a statistically significant difference between them.
Recommended Readings
1. Hosmer, D. W., and S. Lemeshow. 1999. Applied Survival Analysis. New York: John Wiley and Sons.
2. Kleinbaum, D. G. 1996. Survival Analysis: A Self-Learning Text. New York: Springer-Verlag.
3. Norusis, M. 2004. SPSS 13.0 Advanced Statistical Procedures Companion. Upper Saddle River, N.J.: Prentice Hall, Inc.


Binary Logistic Regression Model
In this type of model, you estimate the probability of an event occurring. The model can be written as:

Prob(event) = 1 / (1 + e^(-z))

For a single independent variable: z = b0 + b1*x1
For multiple independent variables: z = b0 + b1*x1 + b2*x2 + … + bn*xn
where b0, b1, b2, … are coefficients estimated from the data, x1, x2, … are the independent variables, n is the number of independent variables and e is the base of natural logarithms (approximately 2.718).
Exercise
The data held in the file cancer.sav are from a study reported by Brown (1980) and are commonly cited in texts covering binary logistic regression. The prognosis for prostate cancer is based upon whether or not the cancer has spread to the surrounding lymph nodes. In this classic study, Brown et al. (see Brown, 1980) explored the following separate indicators of lymph node involvement in a group of 53 men known to have prostate cancer. To open the data file, follow these instructions:
1. From the SPSS menu bar select File -> Open -> Data… a dialogue box will appear.
2. In the text area for File name: type \\campus\software\dept\spss and then click on Open.
3. Select the file cancer.sav and click on Open.
4. Spend some time studying the data file. How many cases and variables make up the data file? Cases:…….. Variables:………
5. Are there any missing values in the data? Yes No
The variables (corresponding to columns in the data file) are:
1) age - age of the patient in years
2) acid - level of serum acid phosphatase (in King-Armstrong units)
3) xray - x-ray result (0 = negative, 1 = positive)
4) size - size of tumour (0 = small, 1 = large)
5) stage - stage of tumour (0 = less serious, 1 = more serious)
6) nodes - nodal involvement (0 = not involved, 1 = involved)
Modelling
Carry out a Forward Conditional logistic regression analysis of the data using nodal involvement as the dependent variable and the other variables as independent variables (i.e. covariates). You do not need to define xray, size or stage as categorical variables, since

they are already binary variables. Follow these steps to carry out the Forward Conditional binary logistic regression: Analyze -> Regression -> Binary Logistic…. Dependent: Nodal involvement [nodes] Covariates: age acid xray size stage Method: Forward Conditional Use the output to answer the following questions. Look at the table Case Processing Summary. What do you conclude?

Now look at the three tables under Block 0: Beginning Block. What do you conclude?

Now look at the tables under Block 1: Method=Forward (stepwise) conditional. What do you conclude?

Give the logistic regression equation for the final model.
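Whatever final model you obtain, turning its equation into a probability uses the Prob(event) = 1 / (1 + e^(-z)) formula from the start of this section. The coefficients in the sketch below are HYPOTHETICAL stand-ins, not the values SPSS will estimate from cancer.sav; substitute the B column from your own Variables in the Equation table.

```python
# Sketch of converting a fitted logistic equation into a probability of
# nodal involvement. Coefficients are assumed for illustration only.
import math

b0, b_xray, b_stage = -2.0, 2.1, 1.5   # hypothetical coefficients

def prob_nodes(xray, stage):
    z = b0 + b_xray * xray + b_stage * stage
    return 1 / (1 + math.exp(-z))

# a patient with a positive x-ray and a more serious tumour stage
print(round(prob_nodes(xray=1, stage=1), 3))
```

This is the same calculation SPSS performs when you ask it to save predicted probabilities, as in the Predictions exercise below.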

Predictions Carry out another logistic regression analysis of the data using nodal involvement as the dependent variable, but this time include ALL the covariates in the model, i.e. use the Enter method. Also request the Odds Ratio (OR) and the 95% Confidence Interval (CI) of the OR. Follow these steps:


Analyze -> Regression -> Binary Logistic….
Dependent: Nodal involvement [nodes]
Covariates: age acid xray size stage
Save… under Predicted Values select Probabilities
Options… select CI for exp(B): 95%
Method: Enter
1. Give the coefficients for the full model, i.e. including all the variables. [Normally you would only consider the statistically significant variables].

2. Which coefficients are statistically significant and why?

3. What is the probability of nodal involvement for each man in the data set? Which case has the highest probability and which case the lowest probability of nodal involvement?

4. Select one significant variable and give the OR and its 95% CI. How would you interpret the OR and its 95% CI?

Reference
Brown, B. W., Jr, et al. 1980. Prediction Analyses for Binary Data. In Biostatistics Casebook. New York: John Wiley and Sons.

Multivariate Analysis of Variance (MANOVA) The GLM Multivariate procedure allows you to model the values of multiple dependent scale variables, based on their relationships to categorical and scale predictors. The GLM Multivariate procedure is based on the general linear model, in which factors and covariates are assumed to have linear relationships to the dependent variables. Fixed Factors: Categorical predictors should be selected as factors in the model. Each level of a factor can have a different linear effect on the value of the dependent variables. The GLM Multivariate procedure assumes that all the model factors are fixed; that is, they are generally thought of as variables whose values of interest are all represented in the data file, usually by design. Covariates: Scale predictors should be selected as covariates in the model. Within combinations of factor levels (or cells), values of covariates are assumed to be linearly correlated with values of the dependent variables. Interactions: By default, the GLM Multivariate procedure produces a model with all factorial interactions, which means that each combination of factor levels can have a different linear effect on the dependent variable. Additionally, you may specify factor-covariate interactions, if you believe that the linear relationship between a covariate and the dependent variables changes for different levels of a factor. For the purposes of testing hypotheses concerning parameter estimates, the GLM Multivariate procedure assumes: • The values of errors are independent of each other across observations and the independent variables in the model. Good study design generally avoids violation of this assumption. • The covariance of dependent variables is constant across cells. This can be particularly important when there are unequal cell sizes; that is, different numbers of observations across factor-level combinations. 
• Across the dependent variables, the errors have a multivariate normal distribution with a mean of 0.
As part of the initial treatment for myocardial infarction (MI, or "heart attack"), a thrombolytic, or "clot-busting", drug is sometimes administered to help clear the patient's arteries before surgery. Three of the available drugs are alteplase, reteplase, and streptokinase. Alteplase and reteplase are newer, more expensive drugs, and a regional health care system wants to determine whether they are cost-effective enough to adopt in place of streptokinase. One of the benefits of thrombolytic drugs is that surgery generally proceeds more smoothly, resulting in a shorter recovery period. If the newer drugs are effective, then patients given those drugs should have shorter lengths of stay in the hospital. Hopefully, the shorter lengths of stay will help to make up for the greater initial cost of the newer drugs.
Running the Analysis
1. From the SPSS menu bar select File -> Open -> Data… a dialogue box will appear.
2. In the text area for File name: type \\campus\software\dept\spss and then click on Open.

3. Select the file heart.sav and click on Open.
4. Spend some time studying the data file. How many cases and variables make up the data file? Cases:…….. Variables:………
5. Are there any missing values in the data? Yes No

To run a GLM Multivariate analysis, from the menus choose:
1. Analyze -> General Linear Model -> Multivariate...
2. Select Length of stay [los] and Treatment costs [cost] as dependent variables.
3. Select Clot-dissolving drugs [clotsolv] and Surgical treatment [proc] as fixed factors.
4. Click Contrasts.

5. Select clotsolv (None) as the contrast to change.
6. In the Change Contrast group, select Simple as the contrast type.
7. Select First as the reference category.

8. Click Change, then click Continue.
9. Click Options in the GLM Multivariate dialogue box.
10. Select Estimates of effect size, SSCP matrices, Homogeneity tests and Spread vs. level plots.
11. Click Continue, then OK in the GLM Multivariate dialogue box.


By default, a model is fit with Clot-dissolving drugs and Surgical treatment as main effects and their interaction as a two-way effect.

Interpretation of Results

SSCP Matrices and Multivariate Tests

This table displays the hypothesis and error sum-of-squares and cross-products (SSCP) matrices for testing model effects. Since there are two dependent variables, each matrix has two rows and two columns. For example, the 2x2 matrix associated with CLOTSOLV in the table is the hypothesis matrix for testing the significance of Clot-dissolving drugs.

The matrix associated with PROC in the table is the hypothesis matrix for testing the significance of Surgical treatment, and the matrix associated with PROC*CLOTSOLV is used for testing their interaction effect.
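To make the structure of an SSCP matrix concrete, here is a minimal numpy sketch. The residual values are invented for illustration and are not taken from heart.sav; only the construction E = R'R is the point.

```python
import numpy as np

# Illustrative residuals for two dependent variables (one per column);
# these numbers are made up, not taken from heart.sav.
R = np.array([[ 1.0, -0.5],
              [-2.0,  1.5],
              [ 0.5, -1.0],
              [ 0.5,  0.0]])

# The error SSCP matrix is E = R'R: sums of squares on the diagonal,
# cross-products off the diagonal. With two dependent variables it is 2x2.
E = R.T @ R
print(E)
```

The hypothesis SSCP matrix for an effect is built the same way from the fitted effect values rather than the residuals, which is why both matrices have one row and one column per dependent variable.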


The error matrix is used in testing each effect. By analogy with the test for models with one dependent variable, the “ratio” of the hypothesis SSCP matrix to the error matrix is used to evaluate the effect of interest.

The multivariate tests table displays four tests of significance for each model effect:

Pillai's trace is a positive-valued statistic. Increasing values of the statistic indicate effects that contribute more to the model.

Wilks' lambda is a positive-valued statistic that ranges from 0 to 1. Decreasing values of the statistic indicate effects that contribute more to the model.

Hotelling's trace is the sum of the eigenvalues of the test matrix. It is a positive-valued statistic for which increasing values indicate effects that contribute more to the model. Hotelling's trace is always larger than Pillai's trace, but when the eigenvalues of the test matrix are small, these two statistics will be nearly equal, which indicates that the effect probably does not contribute much to the model.

Roy's largest root is the largest eigenvalue of the test matrix. Thus, it is a positive-valued statistic for which increasing values indicate effects that contribute more to the model. Roy's largest root is always less than or equal to Hotelling's trace. When these two statistics are equal, either the effect is predominantly associated with just one of the dependent variables, there is a strong correlation between the dependent variables, or the effect does not contribute much to the model. There is evidence that Pillai's trace is more robust than the other statistics to violations of model assumptions (Olson, 1974).

Each multivariate statistic is transformed into a test statistic with an approximate or exact F distribution. The hypothesis (numerator) and error (denominator) degrees of freedom for that F distribution are shown. The significance values of the main effects, CLOTSOLV and PROC, are less than 0.05, indicating that these effects contribute to the model.
By contrast, their interaction effect does not contribute to the model.
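All four statistics are functions of the eigenvalues of the “ratio” matrix formed from the hypothesis and error SSCP matrices. A minimal numpy sketch, using made-up 2x2 matrices rather than output from this model:

```python
import numpy as np

# Hypothetical hypothesis (H) and error (E) SSCP matrices for two
# dependent variables; the values are illustrative only.
H = np.array([[30.0, 12.0],
              [12.0, 20.0]])
E = np.array([[200.0, 40.0],
              [ 40.0, 150.0]])

# All four multivariate statistics are functions of the eigenvalues
# of the "ratio" matrix H E^-1.
lam = np.linalg.eigvals(H @ np.linalg.inv(E)).real

pillai    = np.sum(lam / (1 + lam))   # larger => effect contributes more
wilks     = np.prod(1 / (1 + lam))    # smaller => effect contributes more
hotelling = np.sum(lam)               # sum of the eigenvalues
roy       = np.max(lam)               # largest eigenvalue
```

Note that the relationships described above fall out of the formulas: Roy's largest root can never exceed Hotelling's trace (a maximum never exceeds a sum of positive terms), and Pillai's trace is always below Hotelling's trace.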

However, although CLOTSOLV does contribute to the model, the value of Pillai's trace is close to Hotelling's trace, so it does not contribute very much.

The multivariate tests table

A more straightforward way to see this is to look at the partial eta squared. The partial eta squared statistic reports the ‘practical’ significance of each term, based upon the ‘ratio’ of the variation accounted for by the effect to the sum of the variation accounted for by the effect and the variation left to error. Larger values of partial eta squared indicate a greater amount of variation accounted for by the model effect, to a maximum of 1.

Since the partial eta squared is very small for CLOTSOLV, it does not contribute very much to the model. By contrast, the partial eta squared for PROC is quite large, which is to be expected: the surgical procedure a patient must undergo for MI treatment will have a much greater effect on the length of stay and final cost than the type of thrombolytic received.

In this case, it is enough for the multivariate tests to show that CLOTSOLV is significant, which means that the effect of at least one of the drugs is different from the others. The contrast results will show you where the differences are.

This table displays the results for each contrast. Simple contrasts using the first level of Clot-dissolving drugs as the reference category were specified.
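The partial eta squared formula described above is simple enough to sketch directly. The sums of squares below are invented for illustration, not taken from this model's output:

```python
# Partial eta squared: SS_effect / (SS_effect + SS_error).
def partial_eta_squared(ss_effect: float, ss_error: float) -> float:
    return ss_effect / (ss_effect + ss_error)

# A small effect next to a large error variation gives a value near 0,
# while an effect that dominates the error gives a value near 1.
# These numbers are illustrative only.
small = partial_eta_squared(5.0, 495.0)    # 5 / 500  = 0.01
large = partial_eta_squared(400.0, 100.0)  # 400 / 500 = 0.8
```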


Thus, one contrast compares the second level to the first level; that is, the effect of reteplase to the effect of streptokinase. The contrast estimates show that, on average, patients given reteplase spend 0.382 fewer days in the hospital and incur almost 600 dollars more in treatment costs than patients given streptokinase. Since the significance value for Length of stay is less than 0.05, you can conclude this difference is not due to chance. The significance value for Treatment costs is greater than 0.10, so this difference may be entirely due to chance variation.

The second contrast compares the third level to the first level; that is, the effect of alteplase to the effect of streptokinase. The contrast estimates show that, on average, patients given alteplase spend about half a day less in the hospital and incur slightly over 700 dollars more in treatment costs. Since the significance value for Length of stay is less than 0.05, you can conclude this difference is not due to chance. The significance value for Treatment costs is greater than 0.10, so this difference may be entirely due to chance variation.

The contrast results show that alteplase and reteplase seem to reduce patient length of stay. Moreover, the reduction is enough to equalize the treatment costs, or at least bring the difference within the random variation. Thus, the model suggests that alteplase and reteplase should be used in place of streptokinase. However, before adopting this plan, you should check some tests of the model assumptions.

The assumption for the multivariate approach is that the vector of the dependent variables follows a multivariate normal distribution, and the variance-covariance matrices are equal across the cells formed by the between-subjects effects.


Box's M tests the null hypothesis that the observed covariance matrices of the dependent variables are equal across groups.

The Box's M test statistic is transformed to an F statistic with df1 and df2 degrees of freedom. Here, the significance value of the test is less than 0.05, suggesting that the assumptions are not met, and thus the model results are suspect.

Box's M is sensitive to large data files, meaning that when there are a large number of cases, it can detect even small departures from homogeneity. Moreover, it can be sensitive to departures from the assumption of normality.

As an additional check of the diagonals of the covariance matrices, look at Levene's tests. This table tests the equality of the error variances across the cells defined by the combination of factor levels.
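Box's M is computed from the log determinants of the per-cell and pooled covariance matrices. The following is a rough numpy sketch of the uncorrected statistic, not SPSS's implementation, and the simulated data are arbitrary:

```python
import numpy as np

def box_m(groups):
    """Box's M for equality of covariance matrices across groups.

    groups: list of (n_i x p) data arrays, one per cell.
    Returns the (uncorrected) M statistic and its degrees of freedom.
    """
    k = len(groups)
    p = groups[0].shape[1]
    ns = np.array([g.shape[0] for g in groups])
    covs = [np.cov(g, rowvar=False) for g in groups]
    pooled = sum((n - 1) * S for n, S in zip(ns, covs)) / (ns.sum() - k)
    M = (ns.sum() - k) * np.log(np.linalg.det(pooled))
    M -= sum((n - 1) * np.log(np.linalg.det(S)) for n, S in zip(ns, covs))
    df = p * (p + 1) * (k - 1) // 2
    return M, df

rng = np.random.default_rng(0)
# Three cells drawn from the same bivariate distribution, so the observed
# covariance matrices differ only by sampling error and M should be small.
groups = [rng.normal(size=(50, 2)) for _ in range(3)]
M, df = box_m(groups)
```

Because the log determinant is concave, M is always non-negative; it is zero only when all the sample covariance matrices are identical.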

A separate test is performed for each dependent variable. The significance value for Length of stay is greater than 0.10, so there is no reason to believe that the equal variances assumption is violated for this variable. However, the significance value for the test of Treatment costs is less than 0.05, indicating that the equal variances assumption is violated for this variable.

Like Box's M, Levene's test can be sensitive to large data files, so look at the spread vs. level plot for Treatment costs for visual confirmation. The spread-versus-level plot is a scatterplot of the cell means and standard deviations.
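Levene's test can also be run outside SPSS; here is a sketch using scipy's implementation on simulated cost data. The group means and spreads are invented to mimic unequal variances, not taken from heart.sav:

```python
import numpy as np
from scipy.stats import levene

rng = np.random.default_rng(1)
# Simulated treatment costs for three drug groups with deliberately
# unequal spreads; purely illustrative data.
strepto = rng.normal(24000, 1500, 60)
rete    = rng.normal(29000, 4000, 60)
alte    = rng.normal(30000, 5000, 60)

# scipy defaults to center='median' (the Brown-Forsythe variant).
stat, p = levene(strepto, rete, alte)
```

With spreads this different, the test should reject the equal variances hypothesis, mirroring the result reported for Treatment costs above.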


It provides a visual test of the equal variances assumption, with the added benefit of helping you to assess whether violations of the assumption are due to a relationship between the cell means and standard deviations. This plot agrees with the result of Levene's test: the equal variances assumption is violated for Treatment costs. There is also a clear positive relationship in the scatterplot, showing that as the cell mean increases, so does the variability.

This relationship suggests a possible solution to the problem. Since Treatment costs is a positive-valued variable, you could propose that the error term has a multiplicative, rather than additive, effect on cost. Instead of modeling Treatment costs, you will analyze Log-cost.

To run an analysis using log-transformed costs, click the Dialog Recall tool and select GLM Multivariate (or select Analyze -> General Linear Model -> Multivariate...).
1. Deselect Treatment costs as a dependent variable.
2. Select Log-cost as a dependent variable.
3. Click OK in the GLM Multivariate dialogue box.
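The multiplicative-error idea can be checked with a quick simulation: when the error multiplies the cell mean, the raw-scale standard deviation grows with the mean, while the log-scale standard deviation stays roughly constant. The cell means and error size below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
# Cell means that grow, combined with a multiplicative lognormal error term.
cell_means = [20000.0, 25000.0, 30000.0]
cells = [m * rng.lognormal(0.0, 0.15, 1000) for m in cell_means]

raw_sds = [c.std() for c in cells]          # grows with the cell mean
log_sds = [np.log(c).std() for c in cells]  # roughly constant (about 0.15)
```

This is exactly the pattern the spread-versus-level plot revealed, and it is why replacing Treatment costs with Log-cost can restore the equal variances assumption.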

Box's M is significant, while Levene's test is not. This can happen for several reasons:
• The covariance between Length of stay and Log-cost is not constant across cells, and thus the model results are suspect.
• The covariances are unequal, though not by much, but the large size of the data file causes Box's M to be overly sensitive to this departure from homogeneity.
• The covariances are equal, but the test procedure for computing Box's M, a multivariate test, simply comes up with a different result than the univariate test.
• The distribution of Length of stay and Log-cost is different enough from a multivariate normal distribution to cause Box's M to be significant.

In order to help decide whether you should be concerned about the significance of Box's M, some exploratory data analysis is in order. You can use the Explore procedure to check the assumption of normality. With the data file split by the cells, you can use the Bivariate Correlations procedure to see whether the correlations are constant across cells.


The results for Length of stay are identical to the results from the previous model. However, the results for Log-cost are different from those for Treatment costs. The significance values for both contrasts are less than 0.05, suggesting that the differences in costs between the newer drugs and streptokinase are not due to chance.

The contrast estimate for the difference between reteplase and streptokinase is 0.0217. Since you are looking at differences in log-transformed cost, this means that the ratio of costs is exp(0.0217) = 1.0219. That is, the costs incurred by patients given reteplase are approximately 2.19 percent higher than the costs incurred by patients given streptokinase. If the typical MI patient incurs 25,000 to 35,000 dollars in treatment costs, that means reteplase patients incur, roughly, an extra 550 to 770 dollars in costs.

The contrast estimate for the difference between alteplase and streptokinase is 0.0243. Since you are looking at differences in log-transformed cost, this means that the ratio of costs is exp(0.0243) = 1.0246. That is, the costs incurred by patients given alteplase are approximately 2.46 percent higher than the costs incurred by patients given streptokinase. If the typical MI patient incurs 25,000 to 35,000 dollars in treatment costs, that means alteplase patients incur, roughly, an extra 600 to 860 dollars in costs.

These contrast results show that while alteplase and reteplase do seem to reduce patient length of stay, the reduction is not enough to equalize the treatment costs. Thus, determining whether alteplase and reteplase should be used in place of streptokinase will require further study of the cost of these drugs versus their effectiveness at increasing the success of surgery.
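The back-transformation used above can be reproduced directly. The contrast estimates come from the model output quoted in the text, and the 25,000-35,000 dollar range is the assumed typical cost:

```python
import math

# Contrast estimates on the log-cost scale (from the model output).
rete_vs_strepto = 0.0217
alte_vs_strepto = 0.0243

# exp() turns a difference in log-costs into a ratio of costs.
ratio_rete = math.exp(rete_vs_strepto)  # about 1.0219
ratio_alte = math.exp(alte_vs_strepto)  # about 1.0246

# Extra cost implied for a typical 25,000-35,000 dollar MI treatment.
extra_rete = [round(base * (ratio_rete - 1)) for base in (25000, 35000)]
extra_alte = [round(base * (ratio_alte - 1)) for base in (25000, 35000)]
```

This is why a log-scale contrast of only 0.02 can still matter financially: a small difference in logs is a percentage difference on the raw scale, and a couple of percent of a large treatment bill is hundreds of dollars.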
Using the GLM Multivariate procedure, you have performed a multivariate analysis of variance on patient lengths of stay and treatment costs, using the surgical procedure performed and the thrombolytic administered as fixed factors.

Your initial model indicated that the final treatment costs for reteplase and alteplase are not significantly different from those for streptokinase. However, that model violated the equal variances assumption. The spread vs. level plot showed that a log transformation of Treatment costs might be appropriate, so the model was re-run with Log-cost replacing Treatment costs as a dependent variable. This second model passed Levene's test, but now showed a significant difference in the final costs for thrombolytics. The new difference in costs translates to an extra 550 to 860 dollars for the "average" MI patient, so further study of the cost-effectiveness of the new drugs is necessary.

What happened? The differences in Treatment costs in the original model fall in the range of 550 to 860 dollars, but that model did not find the difference to be significant. Why should it matter now? Since Treatment costs is a positive-valued variable, its distribution is probably right-skewed, so it is likely that there are patients who incurred unusually high costs, inflating the error variation in the first model. Log-transforming Treatment costs reduces the influence of these high-cost patients; in this case, it was enough to make the differences in costs statistically significant.

Once satisfied with Log-cost as a dependent variable, you should fit a "final" model without the interaction term, because it has not contributed to either of the first two models.

Recommended Reading

See the following texts for more information on multivariate linear models:
1. Bray, J. H., and S. E. Maxwell. 1985. Multivariate Analysis of Variance. Thousand Oaks, Calif.: Sage Publications, Inc.
2. Norusis, M. 2004. SPSS 13.0 Statistical Procedures Companion. Upper Saddle River, N.J.: Prentice Hall, Inc.
3. Olson, C. L. 1974. Comparative Robustness of Six Tests in Multivariate Analysis of Variance. Journal of the American Statistical Association, 69:348, 894-908.
